Christian Prokopp, Data Scientist, Rangespan, 12/12/2013
Big data architecture paradigms are commonly separated into two diametrically opposed models: traditional batch processing and (near) real-time stream processing. The most popular technologies representing the two are Hadoop with MapReduce and Storm.
However, a hybrid solution, the Lambda architecture, challenges the idea that these approaches have to ...
John Edwards, Technology Journalist & Author, 12/10/2013
An Ovum analyst has peered toward 2014 and predicts that both big data and fast data are set to transition from the early-adopter phase to maturity.
According to Tony Baer, principal analyst of software-enterprise solutions for the market research firm, and lead author of Ovum's "Big Data 2014 Trends to Watch" report, big data trends in 2014 will ...
Our one-to-one conversations with data scientists continue as we get to know Josh Wills, senior director of data science at Cloudera.
Cloudera is among a handful of organisations essentially synonymous with big data -- so this chat presented us with a great opportunity to get under the skin of one of the guys driving innovation in big data tool ...
Daniel D. Gutierrez, Data Scientist, 12/3/2013
If your company is planning to hire a shiny new data scientist in the near future, grab a number. There's a long line of firms looking to do the same. Data scientists are all the rage, and you need to plan the acquisition well. There are definite strategies for setting the stage for data science in your organization to streamline the process. Let's take ...
Robert Mullins, Technology Journalist, 12/2/2013
Big data analytics is all the rage these days, particularly the growth of apps built on the open-source Hadoop framework. But at least one industry executive says Hadoop has its limitations and wonders why venture capital firms are investing so much money in Hadoop-based startups.
Always keeping an eye on the horizon
While Hadoop is touted as the ...
The nub of the big news from Cloudera at this week's Strata/Hadoop World event in New York is the beta release of Cloudera Enterprise 5, the latest version of the vendor's Hadoop platform.
The bigger picture is what Cloudera calls the Enterprise Data Hub, which is how the company says Hadoop is now being used by advanced practitioners and how it will ...
Robert Plant, Associate Professor, School of Business Administration, University of Miami, 10/30/2013
Analysts are trying hard to establish the value of Twitter before its IPO, yet one key aspect is causing more problems than the company's leadership probably expected: its patent inventory.
Twitter's apparent lack of patents
Twitter has only nine patents, compared to Facebook's 774 prior to its IPO. Central to this relative scarcity ...
Christian Prokopp, Data Scientist, Rangespan, 10/29/2013
Last year, GraphChi, a spin-off of GraphLab (a distributed, graph-based, high-performance computation framework), did something remarkable.
GraphChi outperformed a 1,636 node Hadoop cluster processing a Twitter graph (dataset from 2010) with 1.5 billion edges -- using a single Mac Mini. The task was triangle counting and the Hadoop cluster required ...
Much has been made of Hadoop 2.0's ability to utilise YARN -- but what on earth does that even mean?
YARN stands for Yet Another Resource Negotiator -- so let's have a look at its impact with an example.
Sarianne is working on the biggest group of the biggest data sets she has had access to since starting work at her financial services provider. Previously, Hadoop and MapReduce would have limited how much she could do at the same time, because everything was batch-oriented.
Now, MapReduce 2.0 (or YARN) allows more to happen simultaneously by splitting the old MapReduce JobTracker into a global ResourceManager and per-application ApplicationMasters -- meaning less of a queue, and more everything-at-once functionality.
Long story short, with YARN and Hadoop 2.0, Sarianne has more power than ever.
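To make the split concrete, here's a toy sketch in plain Python -- nothing here is a real YARN API, and the class, application names, and container counts are invented -- showing a single ResourceManager granting containers to more than one application at a time:

```python
# Toy sketch (not real YARN): a ResourceManager hands out containers
# to several applications at once, instead of one batch job
# monopolising the cluster as in classic MapReduce.

class ResourceManager:
    def __init__(self, total_containers):
        self.free = total_containers
        self.allocations = {}  # application name -> containers granted

    def request(self, app, containers):
        """Grant as many of the requested containers as are free."""
        granted = min(containers, self.free)
        self.free -= granted
        self.allocations[app] = self.allocations.get(app, 0) + granted
        return granted

    def release(self, app):
        """An application finishes and returns its containers."""
        self.free += self.allocations.pop(app, 0)

rm = ResourceManager(total_containers=10)
rm.request("batch-etl", 6)      # a MapReduce job
rm.request("ad-hoc-query", 4)   # runs at the same time
rm.release("ad-hoc-query")
print(rm.free)  # 4 containers free again
```

The point is the bookkeeping: because the ResourceManager only tracks capacity, any number of applications -- MapReduce or otherwise -- can negotiate for a share of the cluster simultaneously.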
We all know Hadoop by now, but what is new with Hadoop 2.0 and beyond?
Major components of Hadoop have been rewritten for 2.0 -- meaning Hadoop can now burst through the 4,000-machines-per-cluster barrier. It also supports YARN, the more scalable and flexible successor to the original MapReduce.
At this stage, though, 2.0 isn't as stable as the original Hadoop. So before you jump in, you have to ask yourself: do you need stability now, or can you not wait for that extra flexibility and scalability?
Hue is the Hadoop User Experience, but what can it do for you?
At its core, Hue provides a user interface which makes Hadoop (and Hadoop services) easier to use.
As an example, let's take Sarianne -- she's hard at work looking for insight in the millions of transactions her financial services employer has collected.
It's time to open the workload out to the wider team -- but rather than having everyone learn to code the hard way like she did, she makes use of Hue. It puts an easy-to-navigate graphical interface on top of many Hadoop tasks -- so the people around her can access the data without worrying about the inherent complexities of the Hadoop and HDFS world.
Hive allows users to take advantage of Hadoop using a language similar to SQL, something most relational database developers have in their toolkit.
Let's examine how it helps with an example.
Michael is a medical researcher who has experience running relational databases, but knows that real insight could be found by accessing more data. After years of lobbying, he's managed to create a project which combines data from a variety of hospitals.
The upside is he has much more data to experiment with; the downside is that to get results quickly, he's having to use Hadoop, something he is unfamiliar with.
However, by leveraging Hive, he can write instructions in Hive Query Language (HQL), which isn't a huge leap from the SQL he knows so well. That means less time learning a new language, and more time looking for correlations that can help patients recover quicker.
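As a rough illustration of how small that leap is, here is a standard-SQL version of the kind of aggregate Michael might write, run against Python's built-in sqlite3. The table, columns, and figures are invented for the example; an HQL version for Hive would read almost identically.

```python
# Illustrative only: this is standard SQL via sqlite3, but the HQL
# Michael would write for Hive looks almost the same. Table and
# column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE admissions (hospital TEXT, condition TEXT, days_to_recover INTEGER)"
)
conn.executemany(
    "INSERT INTO admissions VALUES (?, ?, ?)",
    [("St Mary's", "diabetes", 12),
     ("St Mary's", "asthma", 3),
     ("City General", "diabetes", 9)],
)

# Average recovery time per condition -- the kind of question Michael
# can now ask without learning MapReduce.
rows = conn.execute(
    "SELECT condition, AVG(days_to_recover) FROM admissions "
    "GROUP BY condition ORDER BY condition"
).fetchall()
print(rows)  # [('asthma', 3.0), ('diabetes', 10.5)]
```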
OpenStack is a cloud computing project that aims to provide scalable, open-source solutions to businesses. Let's look at how this applies to big data with an example.
An educational collective is looking to get insights on its syllabus. Sara is in charge and has had some early successes. As a result, many other educational institutions have become interested and want to get involved.
Because it runs on OpenStack, Sara's project can scale robustly, absorbing the data sets and computational needs of the added schools. It's a dynamic system, with schools adding and removing data, and OpenStack allows virtual machines to be created and destroyed as needed.
Hadoop and Cassandra integrate well with the established OpenStack setup. All this cloud-stored data can be mined and analyzed until the data teams come up with new syllabus hypotheses they can test and adjust -- creating a more seamless year of study for prospective students.
At The Big Data Show we caught up with Mark Young, from the Big Data Insight Group, who talked us through the wealth of opportunities now available thanks to the accessibility of big data tools, and pointed to Shazam and McLaren F1 as examples of organisations that have seen real benefit from big data.
At The Big Data Show, we caught up with James Robinson of OpenSignal, who encourages a team approach to visualizations. One of the reasons is that it sometimes takes a graphic designer or project manager to get the technically minded visualization producer to go the extra distance.
Big data is awash with acronyms at the moment, none more widely used than HDFS. Let's cut to the chase... it stands for Hadoop Distributed File System.
This is the system of distributing files that allows Hadoop to work on huge data sets at speed. It splits files into blocks, spreads those blocks across different servers, and stores duplicate copies of each block on separate machines.
Let's see why with an example.
Sarianne works in the financial markets, and runs a lot of predictive models to keep the risk on her investments to a minimum.
Utilising HDFS, her queries through Hadoop can run quickly because the data blocks are stored separately -- meaning the computation can happen all at once, rather than queuing up behind a single machine.
As an added benefit, if one server fails (as one is bound to, given the number of servers and disk drives needed to run big data projects) it won't stop Sarianne's models from pulling the data they need, because HDFS duplicated those blocks -- so Hadoop can still return her results without missing a beat.
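The split-replicate-recover idea can be sketched in plain Python. This is not real HDFS -- the tiny block size, replication factor, and round-robin placement scheme are invented purely for illustration:

```python
# Toy sketch of the HDFS idea: split data into blocks, store each
# block on several servers, and survive a server failure on read.
# Block size and replication factor are made-up, miniature values.

BLOCK_SIZE = 4
REPLICATION = 2

def store(data, num_servers):
    """Split data into blocks and place each block on REPLICATION servers."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    servers = {s: {} for s in range(num_servers)}
    for idx, block in enumerate(blocks):
        for r in range(REPLICATION):
            servers[(idx + r) % num_servers][idx] = block
    return servers, len(blocks)

def read(servers, num_blocks, failed=()):
    """Reassemble the file, skipping any failed servers."""
    out = []
    for idx in range(num_blocks):
        # Find any surviving server holding a replica of this block.
        replica = next(s for s in servers if idx in servers[s] and s not in failed)
        out.append(servers[replica][idx])
    return "".join(out)

servers, n = store("SARIANNES-RISK-MODEL-DATA", num_servers=3)
print(read(servers, n, failed={1}))  # full data back despite server 1 failing
```

Because every block lives on two servers, losing any single server still leaves at least one replica of each block to read from.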
In the first of a series of interviews with business leaders who leverage big data, we talk to James Robinson, CTO and co-founder of OpenSignal.
OpenSignal combines big data technologies and sensor data from mobile phones to give insight to both mobile consumers and telecommunications giants. Robinson is also a contributing writer on Big Data Republic.
Hadoop is the open-source software framework that quickly became almost synonymous with big data. But what does it actually do?
Whereas traditional data queries were run on one server, Hadoop enables you to run data queries across a large number of machines. By spreading the computational load across many servers, Hadoop enables you to deal with big data in a timely fashion.
Tobias runs an online DVD store -- and he wants to increase sales by recommending products to customers as they check out. But he doesn't just want to recommend bestsellers, he wants a smart system that recommends based on the buyer's demographics and taste.
That's where Hadoop helps out. Hadoop enables Tobias to spot patterns across all of his customers' data -- age, sex, genre preference, actor preference, period of production, and many other defining elements. He can get this information quickly, because different elements of the search are carried out individually and simultaneously, instead of having to take place on a single machine.
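The divide-and-combine idea can be sketched in plain Python. This is not Hadoop itself -- the partitions, fields, and thread pool simply stand in for data blocks and worker nodes, and the customer records are invented:

```python
# Sketch of the idea: customer data is split into partitions that are
# scanned simultaneously, then the partial counts are combined --
# the way Hadoop spreads a query across many servers.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

partitions = [
    [{"age": 34, "genre": "sci-fi"}, {"age": 61, "genre": "drama"}],
    [{"age": 29, "genre": "sci-fi"}, {"age": 45, "genre": "sci-fi"}],
]

def count_genres(partition):
    """Scan one partition independently -- like one worker node."""
    return Counter(c["genre"] for c in partition)

with ThreadPoolExecutor() as pool:
    partial_counts = pool.map(count_genres, partitions)

totals = sum(partial_counts, Counter())
print(totals.most_common(1))  # [('sci-fi', 3)]
```

Each partition is scanned without reference to the others, which is exactly what lets the work spread across as many machines as the data needs.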
I want to tackle Hadoop, but before we get there, we're going to need to explore MapReduce. MapReduce is a programming model for processing large datasets, and the clue to its function is in its name.
When you want to pull certain information from your datasets, it "maps" out the relevant information for your query.
Then it "reduces" the information down, sorts it based on any rules you've applied, and gives you just the data you were after.
Virginia is a medical researcher looking to carry out research on diabetes patients. For the purposes of her study, she wants to see any geographical concentrations of diabetes patients who are male, between the ages of 40 and 50, and who smoke.
The map step in the MapReduce model finds the records that fit Virginia's criteria.
Then begins the reduce function -- aggregating geographical data of these records and providing an ordered list of cities with the highest population of the defined type. This simple process has allowed Virginia to identify areas of concentration for further study.
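Virginia's query can be sketched as a pure-Python map and reduce. The patient records and cities are invented for the example:

```python
# A pure-Python sketch of the two MapReduce phases for Virginia's query.
from collections import Counter

patients = [
    {"city": "Leeds", "sex": "M", "age": 45, "smoker": True,  "condition": "diabetes"},
    {"city": "Leeds", "sex": "M", "age": 48, "smoker": True,  "condition": "diabetes"},
    {"city": "York",  "sex": "M", "age": 44, "smoker": True,  "condition": "diabetes"},
    {"city": "York",  "sex": "F", "age": 46, "smoker": True,  "condition": "diabetes"},
]

def map_phase(record):
    """Emit (city, 1) for each record matching Virginia's criteria."""
    if (record["condition"] == "diabetes" and record["sex"] == "M"
            and 40 <= record["age"] <= 50 and record["smoker"]):
        yield record["city"], 1

def reduce_phase(pairs):
    """Sum the counts per city and order cities by concentration."""
    totals = Counter()
    for city, count in pairs:
        totals[city] += count
    return totals.most_common()

pairs = (pair for record in patients for pair in map_phase(record))
print(reduce_phase(pairs))  # [('Leeds', 2), ('York', 1)]
```

In real MapReduce the mapped pairs would be shuffled across many machines before the reducers sum them, but the map-then-reduce shape is the same.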
MapReduce itself is pretty straightforward, but once we start ramping up the amount and types of data used we will need Hadoop's help -- which is where things get a bit more complex.
Today we're going to take a look at the V that makes big data big: Volume.
It's no secret we're inundated with data these days, from mobile devices, machines, social media, transactions, satellites… pretty much everything is throwing data out. And technology has reached a point that allows us to capture and keep everything, too.
Why would we bother?
Because controlling such a vast quantity of data can reveal information and patterns about the people and objects that we otherwise can't see.
John runs a tradeshow and wants to make it a really unique and repeatable experience for all his attendees.
Tim is an attendee at the show, and has been for five years. John's company has been tracking every data point of Tim's interaction with the show for that whole time -- from his online activity before the show, to checking in to his hotel, to scanning his ticket as he enters, to the stands and sessions he attended in previous years, even down to what he had for his lunch.
Keeping hold of all of this data on Tim means John can present him with a really personalized experience -- with a dedicated map and timetable guiding Tim to the content he has a history of making a beeline for, and even a voucher for his favorite vegetarian lunch!
That's a lot of data, but what makes this really big data is that John's company has been collecting this information from every one of the attendees at every one of its shows – allowing it to offer this personalized and highly valued experience to everyone.
With good management of the volume of data, big data allows organizations to grow and experiment based on previous encounters.