Thursday, April 17, 2014

How does big data work?

Simply put, big data starts with data that has already been collected, is being generated, or a combination of both. That data is brought together in ways not possible before and analyzed to extract learnings.


This is an iterative process, repeated several times to get finer and deeper insights that can actually make a difference in a business.

Big data vendors approach this process from various angles. IBM, Oracle, and Microsoft have a good handle on the data, analysis, and processing model, so they push solutions that apply there. Hadoop vendors are good at collection and convergence and are building newer tools for analysis, so they like to come at this pie from that view. Others start with data specific to certain applications: machine data, web, mobile, geo, financial, geological, remote sensing, weather... you name it and they know how to handle it. Build a few collection, convergence, and analysis modules, and you have a workable big data solution that at least provides directional guidance, if not actionable insight.

IDC, ESG, Forrester, and Booz Allen Hamilton put their own spin on this with big data workflows, converged infrastructure, the third platform, the data lake, or a bunch of other catchy terms. This is basic science, and my friends the data scientists would say we have been doing it in our own ways for a while: first on notepads, then on spreadsheets, then on databases, then on data warehouses, and more recently on massively parallel processing data warehouses or data appliances.

The only difference is that the scale has grown and the tools to interpret the variety of data have expanded in scope.

In the next set of blogs I will explore four strategies and tactics that help you work your way through the big data minefield, where we straighten the iterative process into four stages (a minimal sketch of the resulting pipeline follows the list):

  • Acquisition
  • Storage
  • Analysis
  • Application
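
To make these four stages concrete, here is a minimal sketch in Python. Everything in it is a hypothetical stand-in (the function names, the toy sensor records, the threshold); the point is only to show how acquisition, storage, analysis, and application chain together and iterate, not any vendor's actual API.

```python
# A toy sketch of the four-stage big data loop.
# All names and records here are hypothetical illustrations.

def acquire():
    # Stand-in for collecting data from logs, sensors, feeds, etc.
    return [{"sensor": "A", "temp": 71}, {"sensor": "B", "temp": 98}]

def store(records, warehouse):
    # Stand-in for landing raw records in a store (HDFS, a warehouse, ...).
    warehouse.extend(records)
    return warehouse

def analyze(warehouse):
    # Stand-in for analysis: flag readings above an assumed threshold.
    return [r for r in warehouse if r["temp"] > 90]

def apply_insights(insights):
    # Stand-in for acting on the findings in the business.
    for r in insights:
        print(f"Alert: sensor {r['sensor']} running hot at {r['temp']}")

warehouse = []
for _ in range(2):  # the loop is iterative; each pass can refine the questions
    warehouse = store(acquire(), warehouse)
    apply_insights(analyze(warehouse))
```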
Strategy 1. Start with systems that acquire the data and move towards application. A path suggested by Oracle, Microsoft, SAP, Teradata, and others.

Strategy 2. Stitch together a system that connects the four components by partnering with different component vendors. A path taken by various consortia like EMC, AWS, and others.

Strategy 3. Start with the application, through data exploration and contextualization applications like Splunk for machine data and Palantir for fraud detection.

Strategy 4. Build the acquisition, storage, and analysis engine with Hadoop and connect it to the applications through custom tools. A path recommended by Cloudera, Pivotal, Hortonworks, and MapR; a small sketch of this style follows.
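
As a taste of Strategy 4, here is a hedged sketch of a Hadoop Streaming style job in Python. Hadoop Streaming wires arbitrary scripts into its map-shuffle-reduce engine through stdin and stdout; the log format below is a made-up example, and a real Cloudera, Hortonworks, or MapR deployment would add the custom tools that connect the results to applications.

```python
from itertools import groupby

# Hadoop Streaming convention: the mapper emits "key<TAB>value" lines,
# the framework sorts them by key, and the reducer aggregates per key.
# The "timestamp user action" log format is a hypothetical example.

def mapper(lines):
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            _, _, action = parts
            yield f"{action}\t1"

def reducer(sorted_lines):
    pairs = (line.split("\t") for line in sorted_lines)
    for action, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{action}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Local simulation of the map -> sort -> reduce flow.
    logs = ["t1 alice login", "t2 bob login", "t3 alice purchase"]
    for line in reducer(sorted(mapper(logs))):
        print(line)
```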

