Sunday, January 26, 2014

It all starts with data.

Let us dig a bit deeper into the first step of big data - the creation of bytes by most of us as we go about our days. This is by no means comprehensive, but you should get an idea of the deluge we are responsible for.

The Eyes

The biggest source of bytes is the CCD chip: a small device that captures images inside a device or feature called a camera.

Why do we use this device so much? Well, someone spent a lot of money to find out in 2005, which makes for interesting reading. These instruments are everywhere and very easy to use, and we might regret not using them later, so we use them. I have so many great memories that I wish I had on film or in digital form; but that is a different story.

The point is that we make lots of memories in digital form, and as it gets easier to make crisper memories, we go right ahead in burst mode. Burst mode is taking hundreds of pictures in the hope that one of them will be perfect - emotion, angle, light and no nosy noise in the background. Ahem, we then conveniently forget to delete the nosy ones and now have 99 nosies along with one perfect shot.

Combine just these habits with ever-increasing megapixel counts, and we have managed to create a monster album.

The Multipliers

Now take those huge piles of pictures and videos and sync them with iCloud, Box, SkyDrive, Google Drive, Instagram, Facebook and the bunch of other sync-and-share services popping up all over the place, and we have just multiplied the data.

Imagine you are an Apple fan with an iPad and an iPhone (die-hard fanboys have minis, Retinas, iMacs, etc.). You take one picture of a soaring eagle on a crisp sunny day, and the picture is replicated across all your iCloud-connected devices. You can conveniently show off your great luck in witnessing the glories of nature on whichever device is closest to hand.

The price of convenience? A whole bunch of hard disks that keep multiple copies of your pictures on iCloud or any other sharing service you subscribe to. I have heard they keep anywhere from 2 to 6 copies. Folks who know for sure, ping me on this.

So the size of every picture taken * the number of copies (one on each connected device, plus however many the storage service keeps) = bytes now in this world.
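To put rough numbers on that rule of thumb, here is a minimal sketch of one way to read it. It assumes the copies simply add up: one per synced device, plus the replicas the storage service keeps. The 5 MB picture size and the function name are made up for illustration, and the 2 to 6 replica range is the hearsay figure mentioned above, not a confirmed number.

```python
# Rough sketch: how many bytes one photo turns into once it is synced.
# All numbers below are illustrative assumptions, not measured values.

def bytes_for_one_picture(picture_bytes, devices, service_copies):
    """Total bytes stored for a single picture.

    picture_bytes  -- size of one picture in bytes (assumed)
    devices        -- synced devices, each holding one copy
    service_copies -- copies kept by the storage service (2 to 6 is hearsay)
    """
    total_copies = devices + service_copies
    return picture_bytes * total_copies

PICTURE_BYTES = 5 * 1000**2  # assume a 5 MB photo

# An iPad and an iPhone, with the service keeping 2 or 6 copies.
low = bytes_for_one_picture(PICTURE_BYTES, devices=2, service_copies=2)
high = bytes_for_one_picture(PICTURE_BYTES, devices=2, service_copies=6)
print(f"One 5 MB picture becomes {low / 1000**2:.0f} to {high / 1000**2:.0f} MB")
```

Under these assumptions a single eagle photo occupies four to eight times its own size, before counting any second or third sharing service.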

The Act

With the ubiquitous internet, we act a lot: we visit websites, click on a bunch of interesting items and so on, just because we are browsing or surfing. In the simplest form, a webpage has some text, a logo and some pictures, and is hosted on a complex system of network equipment, computers, databases and software that makes sure you get the page you asked for 90% of the time or more. One web page view can generate anywhere from 4 to 40 requests to fetch text, stylesheets, templates, logos, pictures, etc. Each request touches at least three electronic devices, and each device makes one line of note saying you asked for something and either got it or did not.

The Multipliers

In the simplest form, one page view generates 4 requests x 3 devices, that is, 12 lines of notes (also known as logs) at a minimum on a simple but professional system. If you add on analytics, caching, reporting and tracking, each added feature multiplies those 4 requests further, and you can see this adding up rather quickly.

Consider that Google alone handles over 100 billion searches per month, and each search carries at least 27 requests (on Chrome, per the netmon on my computer). That translates to over 2.7 trillion (2,700 billion) requests a month; logged on at least three devices each, that is over eight trillion lines of logs. And that is assuming no other shady tracking.
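These multiplications are easy to check. The sketch below uses only the figures quoted above (4 requests and 3 devices for a simple page, 100 billion searches a month, 27 requests per search) and assumes every device a request touches writes exactly one log line. Since the number of devices is the shakiest input, it is left as a parameter; the function name is just for illustration.

```python
# Back-of-the-envelope log volume, using the figures quoted in the text.
# Assumption: every device a request touches writes exactly one log line.

def log_lines(requests, devices):
    """Log lines written when each request is noted once per device."""
    return requests * devices

# One simple page view: 4 requests touching 3 devices.
per_page_view = log_lines(requests=4, devices=3)
print(per_page_view)  # 12

SEARCHES_PER_MONTH = 100 * 10**9   # "over 100 billion", per the text
REQUESTS_PER_SEARCH = 27           # the author's own measurement

monthly_requests = SEARCHES_PER_MONTH * REQUESTS_PER_SEARCH
for devices in (1, 2, 3):
    lines = log_lines(monthly_requests, devices)
    print(f"{devices} device(s): {lines / 10**12:.1f} trillion log lines per month")
```

With one, two and three logging devices this prints 2.7, 5.4 and 8.1 trillion lines respectively, so the total swings widely depending on how many devices you assume keep a note of each request.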

The Shadows

Many internet sites install trackers on your computer that watch where you go and what you do, and build profiles of where you live, how much you earn and where you spend money. This adds requests to every click and scroll of your internet browsing.

The Multipliers

I don't want to spook you; that is a separate discussion.

Bottom line

Big data is created intentionally or systemically. Call it machine data, human-generated data or behavioral data; it is the digital footprint we leave just by being online. I wonder what the carbon footprint of all this is, as the bytes are stored for all eternity?

Friday, January 24, 2014

What is all the fuss about big data?

So we have managed to gather a huge amount of data: according to some sources at Gartner/Intel, some 2.7 zettabytes by 2012, another mind-bogglingly huge number (2013 numbers are still pending tally). So are we as humans spending all of our waking hours typing away at our keyboards? Are we creating knowledge at a faster rate than ever? Do we need some massive systems to help process the gigantic number of bytes we are creating?

Bytes follow a pattern

  1. Bytes, or zettabytes in total, are first created in the form of pictures, videos, sounds and of course text. Other, more questionable bytes come from our behaviors like browsing, sharing, liking, poking and the new e-verbs popping up daily. Also, some of us are in the habit of snapping pictures in burst mode for that one perfect picture and never deleting the bad ones.
  2. The bytes need to be stored, shared, communicated, linked, #tagged or acted upon, or be available for acting upon at a more convenient time. You want to store them so that the minor hiccups of life, like dropping your camera/phone in coffee or under a bus, or unfriendly hands making off with it, do not take away the precious bytes.
  3. The bytes need to be made available, processed, organized, structured and generally made ready for presentation in the right context. All these technical terms mean that applications like Instagram, Twitter, Facebook, YouTube, etc. can get the right picture to the right place and wow you with their ability. Also in this bucket are more dubious creatures that predict your preferences and make suggestions based on you, your peers and what someone is paying them. It's always "them"; "they" spend billions of dollars to make you pay more into their pockets.
  4. The bytes are then shown off with the newer tools that make stuff pretty - visualizations, dashboards, infographics, etc.

The purpose of all of these activities is to help us understand ourselves and others, particularly around events, behaviors and actions that had some consequences in the past that we might want to avoid or mimic.

It's all for profit. A lot of folks have gone ahead and promised much more than profit, and have the cases to back their statements. I remain skeptical so far - pending further investigation.

Thursday, January 23, 2014

What is Big data?

Curiosity is a great thing. One thing leading to the next in a domino effect makes a lot of noise and sometimes fantastic music. Having come to a point of developing expertise in Big Data, I am attempting to tell a story.

Let me frame why everyone is thinking about big data. Since we started converting our thoughts, ideas and experiences into digital form, information, and therefore storage capacity, has grown by leaps and bounds. If you don't believe the very nice graphic here, look at the capacity of any electronic device you have.

The average cost per gigabyte (1000^3 bytes) of storage has kept going down, from over $6 million in 1980 to just under $6 in 2013. Now, a gigabyte can store over 600,000 pages of text. An exabyte is a giga of gigabytes, that is, a billion gigabytes (1000^6 bytes), and can fit about 18,000 times all the books ever written (by 2006), according to an IDC paper sponsored by EMC Corporation in 2006.
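For anyone who wants to sanity-check the units, here is a small sketch that takes a gigabyte as 1000^3 bytes and an exabyte as 1000^6 bytes (decimal units) and reuses only the figures quoted above. The dollar amounts are rounded to $6 million and $6, so the ratio is approximate.

```python
# Decimal (SI) storage units, as used in the text: powers of 1000 bytes.
GIGABYTE = 1000**3   # 10^9 bytes
EXABYTE = 1000**6    # 10^18 bytes

# "An exabyte is a giga of a gigabyte": a billion gigabytes.
gigabytes_per_exabyte = EXABYTE // GIGABYTE
print(gigabytes_per_exabyte)  # 1000000000

# Cost figures quoted in the text, rounded, in dollars per gigabyte.
COST_1980 = 6_000_000
COST_2013 = 6
cost_drop = COST_1980 // COST_2013
print(cost_drop)  # 1000000, roughly a million-fold drop

# Implied bytes per page if a gigabyte holds 600,000 pages of text.
bytes_per_page = round(GIGABYTE / 600_000)
print(bytes_per_page)  # 1667
```

So the quoted figures imply a page of text of roughly 1,700 bytes and a price per gigabyte that fell by about six orders of magnitude in 33 years.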

Now that we have some context on what we have and what we can store, it does not take a math wizard to tell that there is a lot of duplication/noise in the data that we need to separate from the information/signal. I could not resist the electronics reference, but we have been building techniques to separate the signal from the noise, mentally and scientifically, for a long, long time, as humans have evolved to create such capacities.

All of this has touched on just one aspect of big data: Volume, the sheer amount of data, as distinct from information (see the best definition I have found here). Add the growth seen in the chart above and we get the second factor, Velocity: storage is cheaper, so we will just store more. And we store it in different ways, ergo Variety: literally, as a picture is worth 1,000 words, a 10-second movie is worth 100,000 words, in the language best available at the time. With that, we have parroted the 2001 research report[20].

When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information. This is the promise of Big Data. Over the course of my exploration, I will write about what this means.