At last week's Big Data Show we were lucky enough to speak to Lauren Walker, Sales Leader at IBM Big Data Solutions, who gave us a great message from her real-time analytics talk: Babies, Brains, and Buses.
This case study focused on big data's ability to improve the survival rate of premature babies by combining machine information and human content in real time.
I want to tackle Hadoop, but before we get there, we're going to need to explore MapReduce. MapReduce is a programming model for processing large datasets, and the clue to its function is in its name.
When you want to pull certain information from your datasets, the "map" step picks out the records relevant to your query.
Then the "reduce" step boils that information down, groups and sorts it based on any rules you've applied, and gives you just the data you were after.
An example:
Virginia is a medical researcher looking to carry out research on diabetes patients. For the purposes of her study, she wants to see any geographical concentrations of diabetes patients who are male, between the ages of 40 and 50, and who smoke.
The map step in the MapReduce model finds the records which fit Virginia's criteria.
Then the reduce step begins: it aggregates those records by location and provides a list of cities ordered by how many matching patients each one has. This simple process has allowed Virginia to identify areas of concentration for further study.
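To make Virginia's example a little more concrete, here is a minimal sketch of the same idea in plain Python. It is not Hadoop code, and everything in it is an assumption made for illustration: the patient records, the field names, the city names, and the helper functions `map_patient`, `shuffle`, `reduce_city` and `run_job` are all hypothetical. It also assumes "between the ages of 40 and 50" includes both 40 and 50.

```python
from collections import defaultdict

# Hypothetical patient records; field names and values are invented for this sketch.
PATIENTS = [
    {"city": "Leeds", "sex": "M", "age": 45, "smoker": True, "diabetic": True},
    {"city": "Leeds", "sex": "M", "age": 42, "smoker": True, "diabetic": True},
    {"city": "Bristol", "sex": "M", "age": 48, "smoker": True, "diabetic": True},
    {"city": "Bristol", "sex": "F", "age": 44, "smoker": True, "diabetic": True},
    {"city": "York", "sex": "M", "age": 61, "smoker": True, "diabetic": True},
    {"city": "Leeds", "sex": "M", "age": 40, "smoker": False, "diabetic": True},
]


def map_patient(record):
    """Map step: emit a (city, 1) pair for each record matching the study criteria."""
    if (
        record["diabetic"]
        and record["sex"] == "M"
        and 40 <= record["age"] <= 50  # assumed inclusive age range
        and record["smoker"]
    ):
        yield record["city"], 1


def shuffle(pairs):
    """Group the mapped values by key, so each city gets a list of its 1s."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_city(city, counts):
    """Reduce step: collapse one city's list of 1s into a single total."""
    return city, sum(counts)


def run_job(records):
    """Run map, shuffle and reduce, then order cities by matching patients."""
    mapped = (pair for record in records for pair in map_patient(record))
    reduced = [reduce_city(city, counts) for city, counts in shuffle(mapped).items()]
    # Highest count first; ties broken alphabetically by city name.
    return sorted(reduced, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for city, count in run_job(PATIENTS):
        print(city, count)
```

With this made-up data, only three records pass the map step, so the output lists Leeds with 2 matching patients and Bristol with 1. On a single machine this is just a filter and a count, which is why the model only really earns its keep once the data is too big for one computer.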
MapReduce itself is pretty straightforward, but once we start ramping up the amount and types of data used we will need Hadoop's help -- which is where things get a bit more complex.