Thursday, April 17, 2014

How does big data work

Simplistically speaking, big data starts with data that is already collected, is being generated, or a combination of both. That data is collected and brought together in ways not possible earlier, then analyzed to extract some learnings.


This is an iterative process that is repeated several times to get finer and deeper insights that can actually make a difference in a business.

Big data vendors approach this process from various angles. IBM, Oracle and Microsoft have a good handle on the data, analysis and processing model, so they push solutions that apply there. Hadoop vendors are good at the collect and converge steps and are building newer tools for analysis, so they like to come at this pie from that view. Others start with data specific to certain applications, like machine data, web, mobile, geo, financial, geological, remote sensing, weather... you name it and they know how to handle it. Build a few collection, convergence and analysis modules and you have a workable big data solution that at least provides directional guidance, if not actionable insight.

IDC, ESG, Forrester and Booz Allen Hamilton put their own spin on this with big data workflows, converged infrastructure, the third platform, the data lake or a bunch of other catchy terms. This is basic science, and my friends the data scientists would say we have been doing this for a while in our own ways: first on note pads, then on spreadsheets, then on databases, then on data warehouses, and more recently on massively parallel processing data warehouses or data appliances.

The only difference is that the scale, and the tools to interpret the variety of data, have expanded in scope.

In the next set of blogs I will explore four strategies and tactics that help you work your way through the big data minefield, where we straighten the iterative process into

  • Acquisition
  • Storage
  • Analysis
  • Application
Strategy 1. Start with systems that acquire the data and move towards application. A path suggested by Oracle, Microsoft, SAP, Teradata etc.

Strategy 2. Stitch together a system that connects the four components by partnering with different component vendors. A path taken by various consortia like EMC, AWS and others.

Strategy 3. Start with the application, through data exploration and contextualization applications like Splunk for machine data and Palantir for fraud detection.

Strategy 4. Build the acquisition, storage and analysis engine with Hadoop and connect it to the applications through custom tools. A path recommended by Cloudera, Pivotal, Hortonworks and MapR.


Wednesday, March 26, 2014

What can you really do with big data?

If any analyst, guru or visionary were asked the question, we would get an answer similar to

 "transforms a business from a data-aware organization to data-driven organization"

This is management speak for good science: an organization does things, that generates data, and the data gets analyzed so you do things better, differently, more or less, based on what the effects are. This paradigm does not change, call it data-driven, analytics-driven or a whole lot of other names; you just have more tools to work with, so you can see a bigger rearview mirror better.

Does that make big data less useful? No, just make sure you know where you are and what you see.

Monday, March 17, 2014

The Big data pie

Following the money patterns is interesting. After much thought and staring at spreadsheets, the breakdown is

  1. Professional services (35%): no wonder, since Apache Hadoop is free; it's the vendors' support that is the biggest slice of the pie.
  2. Compute (17%): this is, after all, commodity hardware.
  3. Storage (14%) was a surprise. It's the data that is growing, yet with falling storage costs this comes in a distant third.
  4. Apps and Analytics (13%): the Splunks and Tableaus of the world are very small compared to the large and growing pie.
Did I miss the professional services boat? Probably, but the real work is yet to be done.

I have heard the toy elephant story many times, but seriously, the numbers say that the real elephant is IBM. They took the elephant-sized slice of the pie; the rest seem like crumbs.

Sunday, March 2, 2014

Big data is big money

We are creating the data, we are storing, slicing, dicing and presenting it, but where is the money?

Big data is big business: $18 billion last year, expected to grow to $28 billion this year and to keep on growing up to $50 billion in 2017. And who takes the largest slice of the pie?

Professional services take the biggest bite of the big apple. All the Hadoop vendors are slightly under #5 on the list, Teradata, which is less than half of #1 IBM.

Making the noise is one business and money is another business.
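
To put the market figures quoted in this post on a per-year footing, here is a minimal sketch that turns them into compound annual growth rates. The calendar years (2013 for "last year", 2014 for "this year") are an assumption based on the date of the post, and the helper name `annual_growth_rate` is made up for illustration.

```python
def annual_growth_rate(start, end, years):
    """Compound annual growth rate needed to go from start to end in `years` years."""
    return (end / start) ** (1 / years) - 1


# Market size in billions of dollars, as quoted in the post.
# Assumption: "last year" is 2013 and "this year" is 2014.
market = {2013: 18, 2014: 28, 2017: 50}

first_year = annual_growth_rate(market[2013], market[2014], 1)
later_years = annual_growth_rate(market[2014], market[2017], 3)
overall = annual_growth_rate(market[2013], market[2017], 4)

print(f"2013 to 2014: {first_year:.1%} per year")
print(f"2014 to 2017: {later_years:.1%} per year")
print(f"2013 to 2017: {overall:.1%} per year")
```

Read this way, the forecast implies a big jump in the first year followed by slower, though still brisk, growth through 2017.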


Friday, February 28, 2014

Data needs to be processed

When you have a lot of stuff, you need to be able to know what you are looking for and then find it. That's where processing comes into the picture, and it's a huge business.

In the big data world, "tools" make a huge difference in getting some use out of the large banks of information we are collecting, creating, duplicating and throwing in one huge heap. Tools fall into a few buckets, but broadly

  1. Organizing tools that help you put some structure to the madness. These force you to make a plan and stick to it.
  2. Search tools that help you locate something specific in a huge pile. These work great if you know what's in the pile and how to look for it.
  3. Presentation tools that organize stuff, sort it and make it look like it's all neatly stacked.
We would love services like that for the physical world, but the apps and systems that do just these basic items focus on specifics and don't cross boundaries. For example, there are business card managers, calendar managers, mail managers, news feed managers, photo managers/organizers and video managers/organizers. Try to mix any of these up and you have a fruit salad.

This is the big data opportunity and tons of companies are jumping on the bandwagon!

Monday, February 3, 2014

Data needs to be stored

We create bytes to revisit, relive or make sense of events or experiences at a later date. Real, hopeful or anticipated, once we digitize something, it's there to stay.

Stay

To keep from losing bytes, we employ ingenious strategies. We make copies, tag-em, bag-em, sort-em, order-em, file-em, or in many ways archive-em. We send ourselves e-mails, burn DVDs, copy to USB drives, or store them safely on the cloud so that if we need them in the future, we will have them.

Just by trying not to lose them, we compound the problem of the large number of bytes we create.

Speed

We are a hugely impatient lot when it comes to browsing. When we keep, say, 1000 photos on Instagram or iCloud or Facebook, we cannot wait a couple of seconds for the poor site to bring them up. To give us our pictures really fast, these sites do something ingenious: they break down the pictures into smaller blurry images and throw them at you so you have something to work with.

You guessed it right: that added more bytes. Now you have large crisp files to store and not lose, and to keep them coming your way fast you multiplied them up.

Pry-proof

Some of us don't like others watching our pictures, especially while we are moving them over the internet, so we do something called encryption. This in effect adds a dark screen over your pictures while moving your bytes from one point to another. Even though this is not a huge addition, you temporarily create a mini monster that is created, stored and transmitted in a pry-proof manner.

So now you see: we have built a capacity to create digital monsters; fearing losses, we pad them to weather time and prying eyes; to overcome our impatience, we miniaturize them. Now our already big and growing data got bigger and grew some more.

Sunday, January 26, 2014

It all starts with data.

Let us dig a bit deeper into the first step of big data - the creation of the bytes by most of us going about our ways. This is by no means comprehensive but you should get an idea of the deluge we are responsible for.

The Eyes

The biggest source of bytes is the CCD chip, a small device that captures images in a device/feature called a camera.

Why do we use this device so much? Well, someone spent a lot of money to find out in 2005, which makes for interesting reading. These instruments are everywhere and very easy to use, and we might regret not using them later, so we use them. I have so many great memories not on film or in digital form that I wish I had, but that is a different story.

The point is we just make lots of memories in digital form, and as it gets easier to make crisper memories, we go right ahead in burst mode. Burst mode is taking hundreds of pictures with the hope that one of them will be perfect: emotion, angle, light and no nosy noise in the background. Ahem, we then conveniently forget to delete the nosy ones and now have 99 nosies along with one perfect picture.

Add these habits to an ever increasing megapixel capability and we have just managed to create a monster album.

The Multipliers

Now take the huge bits of pictures and videos, and sync them with iCloud, Box, SkyDrive, Google Drive, Instagram, Facebook and a bunch of other places popping up all over the place with sync and share capability, and we have just multiplied the data.

Imagine you are an Apple fan with an iPad and an iPhone (die-hard fanboys have minis, Retinas, iMacs etc). You take one picture of a soaring eagle on a crisp sunny day and the picture is replicated across all your iCloud-connected devices. You can conveniently show off your great luck in witnessing the glories of nature with whichever device is closest to your reach at all times.

The price of convenience? A whole bunch of hard disks that keep multiple copies of your pictures on iCloud or any other sharing service you subscribe to. I have heard they keep anywhere from 2-6 copies. Folks that know for sure, ping me on this.

So every picture taken * the number of devices connected * the number of copies kept by the iCloud storage service = bytes now in this world.
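
As a rough illustration of that multiplication, here is a minimal sketch. It follows the formula exactly as written above and reads the storage service factor as the number of copies the service keeps (the hearsay 2-6 range). The 3 MB picture size and the function name `bytes_in_the_world` are hypothetical, picked only to make the arithmetic concrete.

```python
def bytes_in_the_world(pictures, picture_bytes, devices, service_copies):
    """Pictures taken * devices connected * copies kept by the storage service."""
    return pictures * picture_bytes * devices * service_copies


PICTURE_BYTES = 3_000_000  # hypothetical size of one picture, not from the post
DEVICES = 2                # the iPad and the iPhone from the example

# The post quotes a hearsay range of 2-6 copies kept by the service.
low = bytes_in_the_world(1, PICTURE_BYTES, DEVICES, service_copies=2)
high = bytes_in_the_world(1, PICTURE_BYTES, DEVICES, service_copies=6)

print(f"One picture becomes {low:,} to {high:,} bytes")
```

Under these assumed numbers, a single 3 MB eagle ends up as somewhere between 12 and 36 million bytes.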

The Act

With the ubiquitous internet, we act a lot: we visit websites, click on a bunch of interesting items and so on, just because we are browsing or surfing. In its simplest form a webpage has some text, a logo and some pictures, and is hosted on a complex system consisting of network equipment, computers, databases and software that makes sure you get the page you asked for 90% of the time or more. One web page view can generate anywhere from 4-40 requests to fetch text, stylesheets, templates, logos, pictures, etc. Each request touches at a minimum three electronic devices, and each device makes one line of note saying you asked for something and you either got it or did not.

The Multipliers

In its simplest form, one page view generates 4 requests x 3 devices, that is, 12 lines of notes (also known as logs) at a minimum on a simple but professional system. If you add on analytics, caching, reporting and tracking, each added feature multiplies the request count further, and you can see this adding up rather quickly.

Consider that Google alone handles over 100 billion searches per month and each search carries at least 27 requests (on Chrome, per the netmon for my computer). That translates to 2,700 billion requests and, at two log lines per request, over five trillion (5,400 billion) lines of logs every month. That is assuming no other shady tracking.
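
The log arithmetic above can be sketched in a few lines. The function name `log_lines` is made up for illustration. Note one inference that is not stated in the post: the 5,400 billion figure matches two log lines per request, whereas the three-device minimum mentioned earlier would give 8,100 billion.

```python
def log_lines(page_views, requests_per_view, lines_per_request):
    """Log lines written when every request leaves one line on each device it touches."""
    return page_views * requests_per_view * lines_per_request


# The simple page: 4 requests, each touching 3 devices.
simple_page = log_lines(1, 4, 3)

# Google searches per month, with 27 requests per search as measured in the post.
searches_per_month = 100 * 10**9
requests_per_search = 27

# Inferred from the post's 5,400 billion figure: two log lines per request.
two_lines = log_lines(searches_per_month, requests_per_search, 2)
# Using the three-device minimum from the simple page instead.
three_lines = log_lines(searches_per_month, requests_per_search, 3)

print(f"Simple page: {simple_page} log lines")
print(f"Searches, 2 lines per request: {two_lines:,} log lines per month")
print(f"Searches, 3 lines per request: {three_lines:,} log lines per month")
```

Either way the total lands in the trillions per month before any analytics or tracking is added on top.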

The Shadows

Many internet sites install trackers on your computer that watch where you go and what you do, and build profiles on where you live, how much you earn and where you spend money. This adds requests to every click and scroll of internet browsing you do.

Multipliers

I don't want to spook you as that is another separate discussion.

Bottom-line

Big data is created intentionally or systemically. Call it machine data, human generated data or behavioral data, this is a digital footprint we leave just by being online. I wonder what the carbon footprint of all this is, as the bytes are stored for all eternity?

Friday, January 24, 2014

What is all the fuss about big data?

So we have managed to gather a huge amount of data: based on some sources at Gartner/Intel, some 2.7 zettabytes, another mind-bogglingly huge number, by 2012 (2013 numbers are still pending tally). So are we as humans spending all of our waking hours typing away at our keyboards? Are we creating knowledge at an ever faster rate? Do we need some massive systems' help to process the gigantic number of bytes we are creating?

Bytes follow a pattern

  1. Bytes, or zettabytes in whole, are first created in the form of pictures, videos, sounds and of course texts. Other more questionable bytes come from our behaviors like browsing, sharing, liking, poking and the new e-verbs popping up daily. Also, some of us are in the habit of snapping pictures in burst mode for that one perfect picture and never deleting the bad ones.
  2. The bytes need to be stored, shared, communicated, linked, #tagged or acted upon, or be available for acting upon at a more convenient time. You want to store them so that the minor hiccups of life, like dropping your camera/phone in coffee or under a bus, or just unfriendly hands making off with it, do not take away the precious bytes.
  3. The bytes need to be processed, organized, structured and generally made ready for presentation around the right context. All these technical terms mean that applications like Instagram, Twitter, Facebook, YouTube etc can get to the right picture in the right place and wow you with their ability. Also in this bucket are more dubious creatures that predict your preferences and make suggestions based on you, your peers and what someone is paying them. It's always "them"; "they" spend billions of dollars to make you pay more towards their pockets.
  4. The newer tools that make stuff pretty: visualizations, dashboards, infographics etc.
The purpose of all of these activities is to help us understand ourselves and others, particularly around events, behaviors and actions that had some consequences in the past that we might want to avoid or mimic.

It's all for profit. A lot of folks have gone ahead and promised much more than profit and have the cases to back their statements. I remain skeptical so far, pending further investigation.

Thursday, January 23, 2014

What is Big data?

Curiosity is a great thing. One thing leading to the next in a domino effect makes a lot of noise and sometimes fantastic music. Having come to a point of developing expertise in Big Data, I am attempting to tell a story.

Let me frame why everyone thinks about big data. Since we started converting our thoughts, ideas and experiences into digital form, information and therefore capacity has grown in leaps and bounds. If you don't believe the very nice graphic here, look at the capacity of any electronic device you have.

The average cost per gigabyte (1000^3 bytes) of storage keeps going down, from over $6 million in 1980 to just under $6 in 2013. Now a gigabyte can store over 600,000 pages of text. An exabyte is a giga of a gigabyte (1000^6 bytes) and can fit about 18,000 times all the books ever written (by 2006), according to an IDC paper sponsored by EMC Corporation in 2006.
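
To make those units concrete, here is a minimal sketch of the arithmetic, taking a gigabyte as 1000^3 bytes and an exabyte as 1000^6 bytes. The dollar figures are rounded from the "over $6 million" and "just under $6" quoted above, so the results are orders of magnitude rather than exact values, and the constant names are only illustrative.

```python
GIGABYTE = 1000 ** 3   # bytes
EXABYTE = 1000 ** 6    # bytes, "a giga of a gigabyte"

PAGES_PER_GIGABYTE = 600_000   # "over 600,000 pages of text", from the post
COST_PER_GB_1980 = 6_000_000   # dollars, rounded from "over $6 million"
COST_PER_GB_2013 = 6           # dollars, rounded from "just under $6"

gigabytes_per_exabyte = EXABYTE // GIGABYTE
pages_per_exabyte = PAGES_PER_GIGABYTE * gigabytes_per_exabyte
cost_drop_factor = COST_PER_GB_1980 / COST_PER_GB_2013

print(f"Gigabytes in an exabyte: {gigabytes_per_exabyte:,}")
print(f"Pages of text in an exabyte: {pages_per_exabyte:,}")
print(f"Storage got roughly {cost_drop_factor:,.0f} times cheaper from 1980 to 2013")
```

With these rounded inputs, an exabyte holds a billion gigabytes and the price per gigabyte fell by a factor of about a million.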

Now that we have some context of what we have and what we can store, it does not take a math wizard to tell that there is a lot of duplication/noise in the data that we need to separate from the information/signal. I could not resist the electronic reference, but to separate the signal from the noise we have been building techniques, mentally and scientifically, over a long, long time as humans have evolved to create such capacities.


All of this has just touched on one aspect of big data, Volume: the sheer amount of data, separate from information (see the best definition I have found here). Add on the growth as seen in the chart above, leading us to the second factor, Velocity: it's cheaper, so we will just store more, and in different ways, ergo Variety: literally, as a picture is worth 1000 words and a 10 second movie 100,000 words, in a language best available at the time. We have parroted the 2001 research report[20].


When data is processed, organized, structured or presented in a given context so as to make it useful, it is called Information. This is the promise of Big Data. Over the course of my exploration, I will write about what this means.