Christian Prokopp, Data Scientist, Rangespan, 12/12/2013
Big data architecture paradigms are commonly separated into two diametrically opposed models: the more traditional batch processing and (nearly) real-time processing. The most popular technologies representing the two are Hadoop with MapReduce and Storm.
However, a hybrid solution, the Lambda architecture, challenges the idea that these approaches have to ...
Olivier Janus, Global Data Director, Havas EHS, 12/11/2013
It goes without saying that those who work with data are dealing with an amazing asset. However, although insight is quantifiable it isn't something that you can engage with in a tactile way.
Even though data translates into practicable insight that deals with human behaviour, its nebulous nature could possibly lead to a level of disassociation for ...
Paul McCormack, Associate, IP, Technology & Sourcing, DLA Piper, 12/10/2013
Cloud computing has grown exponentially over the past few years. With its ability to provide one-to-many solutions, bringing cost efficiencies, ease of access, and global scale, it is an attractive option for many businesses.
Put your data law questions to Paul during our Data Protection in the Cloud online talk.
John Edwards, Technology Journalist & Author, 12/10/2013
An Ovum analyst has peered toward 2014 and predicts that both big data and fast data are set to transition from the early-adopter phase to maturity.
According to Tony Baer, principal analyst of software-enterprise solutions for the market research firm, and lead author of Ovum's "Big Data 2014 Trends to Watch" report, big data trends in 2014 will ...
Deepinder Singh Dhingra, Head of Products & Strategy, Mu Sigma, 12/9/2013
Harvard Business Review recently described data analytics as the "sexiest job" of the 21st century, and for good reason. Data is the new oil, and there is an ever-growing need for people to refine it.
A report by e-Skills last week showed that demand for people with big data skills is expected to rise by 243% in the UK over the next five years. While ...
Susan Fourtané, Science & Technology Writer, 12/6/2013
On November 7 Imperial College London launched a tool that ranks the most influential tweeters on any topic. It has now published how the tool works. During the launch, the team behind the T-index revealed the 10 most influential tweeters on five topics.
Our one-to-one conversations with data scientists continue as we get to know Josh Wills, senior director of data science at Cloudera.
Cloudera is among a handful of organisations essentially synonymous with big data -- so this chat presented us with a great opportunity to get under the skin of one of the guys driving innovation in big data tool ...
Frank Lo, Senior Manager of Data Science, Wayfair, 12/5/2013
Data in its raw form is a mess. Data scientists need to dedicate effort towards latent variable engineering to surface the valuable parts to the top.
Of the hyped up V's of big data, "variety" is one that is often only superficially understood. It is easy to say that we want lots of different types of data to analyze -- e.g., transaction data, ...
I've spent most of my career working with new technology, most recently helping companies make sense of mountains of incoming data. This means, as I like to tell people, that I have the sexiest job in the 21st century.
That's not me talking. SmartData Collective, an online community moderated by Social Media Today, has called big data scientist "the ...
Robert Plant, Associate Professor, School of Business Administration, University of Miami, 12/4/2013
For many computer science students, especially those who were at University in the 1980s, the pinnacle of their programming experience was developing some code that would actually run on a supercomputer, but preferably "the" supercomputer, a Cray.
A precursor to in-memory
Often jokingly referred to as the world's most expensive love seat, the X-MP ...
Cassandra is an open-source distributed database system that is designed for storing and managing large amounts of data across commodity servers.
There are many benefits to using Cassandra, but the most obvious in a big data context is its performance in a real-time environment.
For instance, the speed at which Sarianne can gain insight when placing financial trades can mean the difference between a big gain and a huge loss. Because of the caching techniques at work within Cassandra, this data can be served to her quickly, opening up great opportunities for analytics.
It's the ideal distributed system when Velocity is the main V on your agenda.
Much has been made of Hadoop 2.0's ability to utilise YARN -- but what on earth does that even mean?
YARN stands for Yet Another Resource Negotiator -- so let's have a look at its impact with an example.
Sarianne is working on the biggest group of the biggest data sets she's had access to since starting work at the Financial Services provider. Previously, the versions of Hadoop and MapReduce would have limited how much could be done at the same time because everything was batch-oriented.
Now, MapReduce 2.0 (running on YARN) allows more to happen simultaneously by splitting the old MapReduce JobTracker into a ResourceManager and per-application ApplicationMasters, meaning less of a queue and more everything-at-once functionality.
Long story short, with YARN and Hadoop 2.0, Sarianne has more power than ever.
We all know Hadoop by now, but what is new with Hadoop 2.0 and beyond?
Major components of Hadoop have been rewritten for 2.0 – meaning Hadoop can now burst through the 4,000 machines per cluster barrier. It also now includes YARN, a more scalable and flexible resource management layer that takes over cluster scheduling from the original MapReduce.
At this stage, though, 2.0 isn't as stable as the original Hadoop. So before you jump in you have to think: Do you want stability, or can't you wait for that extended level of flexibility and scalability?
Hue is the Hadoop User Experience, but what can it do for you?
At its core, Hue provides a user interface which makes Hadoop (and Hadoop services) easier to use.
As an example, let's take Sarianne -- she's hard at work looking for insight in the millions of transactions her financial services employer has collected.
It's time to open the workload out to the wider team -- but rather than having everyone learn to code the hard way like she did, she makes use of Hue. This puts an easy-to-navigate graphical interface onto many Hadoop tasks -- so that the people around her can have access to the data without worrying about the inherent complexities of the Hadoop and HDFS world.
We've put together this short video to help you show your team just what you mean by "data warehouse."
A data warehouse is a collection of data, specifically designed to let you run reports and analytics.
Tobias has brought in a data specialist to get his online movie retail business into the big time, but he's confused about what the specialist keeps referring to as a data warehouse. He's got his database set up. Isn't that his data warehouse?
Technically, no. A database is simply the way we store any kind of data, whereas a data warehouse is specifically set up for running reports and analysis. It's data, but it has specific, definable business value.
This will allow Tobias to tailor his storage solution. His data warehouse takes pride of place on faster, more expensive disks in more accessible forms. The rest of his data is collected for potential use or out of legal obligation on more standard and cheaper disks.
Data scraping is the act of getting hold of data which lacks inherent structure, usually because it wasn't meant to be collected in a way that would bring value to your business.
Let's explore this concept with an example.
Tobias is looking to expand his online DVD shop's selection of classic cult films. As part of his research to select stock, he wants to pull in a load of data from online review spaces like Amazon and Rotten Tomatoes.
All this data, from star ratings to personal opinion, was added for human consumption. There's no easy "export as XML" function on Amazon review lists. That is where data scraping comes in. In essence, it's about Tobias identifying the elements he needs and running a program to export them into a database of his own. Of course, he'll need to match up the ratings across stars, percentages, scores, etc., as well as figure out what kind of sentiment analysis he wants to carry out on written comments -- but that's up to him.
Now, this "meant for human eyes" data is computer readable, and Tobias can begin munging and rearranging that data until it suits his analytical needs.
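Here is a minimal sketch of what Tobias's scraping program might look like, using only Python's standard library. The HTML snippet, the `review`, `stars`, and `comment` class names, and the `ReviewScraper` class are all hypothetical; a real review page will have its own markup, and a site's terms of use may restrict scraping.

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a review page; real sites differ.
SAMPLE_HTML = """
<div class="review"><span class="stars">4</span><p class="comment">A cult classic.</p></div>
<div class="review"><span class="stars">2</span><p class="comment">Overrated.</p></div>
"""


class ReviewScraper(HTMLParser):
    """Collects the text of elements whose class is 'stars' or 'comment'."""

    def __init__(self):
        super().__init__()
        self.reviews = []     # finished rows: {"stars": int, "comment": str}
        self._field = None    # field currently being read, if any
        self._current = {}    # row being assembled

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in ("stars", "comment"):
            self._field = css_class

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None
        # A closing div ends one review, so store the row and start a new one.
        if tag == "div" and "stars" in self._current:
            self._current["stars"] = int(self._current["stars"])
            self.reviews.append(self._current)
            self._current = {}


scraper = ReviewScraper()
scraper.feed(SAMPLE_HTML)
rows = scraper.reviews  # ready to be written to Tobias's own database
```

Each row is now a plain dictionary, so matching up stars, percentages, and scores becomes an ordinary data transformation step.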
We've shown you the role data cleansing plays within data quality, so let's take a minute to explore data cleansing methods with an example.
Sarianne is still slogging away on her financial data and has thrown herself into fixing up the poor-quality data her company has spent years ignorantly collecting.
She uses parsing to find syntax errors in her data collections. For instance, if a financial transaction record's time format doesn't comply with what she would expect to find in the system, parsing will quickly flag it up as erroneous -- after all, you can't fix what you don't know is broken.
Once found, she can use data transformation in some cases to map this data into the format the business needs.
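As a rough sketch of parsing followed by transformation, the snippet below checks a timestamp against an expected format and, where it recognises a known alternative, maps it into that format. The format strings and the `check_timestamp` helper are assumptions made for illustration, not something taken from Sarianne's actual systems.

```python
from datetime import datetime

EXPECTED_FORMAT = "%Y-%m-%d %H:%M:%S"                      # assumed house format
KNOWN_ALTERNATIVES = ["%d/%m/%Y %H:%M", "%Y%m%dT%H%M%S"]   # assumed legacy formats


def check_timestamp(raw):
    """Return (status, value) where status is 'ok', 'transformed' or 'error'."""
    try:
        datetime.strptime(raw, EXPECTED_FORMAT)
        return "ok", raw
    except ValueError:
        pass
    # Parsing failed, so try to transform from a format we know about.
    for fmt in KNOWN_ALTERNATIVES:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        return "transformed", parsed.strftime(EXPECTED_FORMAT)
    # Nothing matched: flag the record for a human to look at.
    return "error", raw
```

Records that come back as "error" are the ones that cannot be fixed automatically and need a closer look.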
It's not new to any kind of data work, but duplicate elimination plays a huge role in Sarianne's data cleansing routine. She can use an algorithm to find those suspected duplicate representations and then get to work deduping.
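One simple way to find suspected duplicates is fuzzy string matching. The sketch below uses `difflib.SequenceMatcher` from Python's standard library; the account names, the `normalise` helper, and the 0.9 threshold are hypothetical choices, and a real deduplication routine would likely compare several fields rather than one.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name):
    """Lower-case, drop simple punctuation and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())


def suspected_duplicates(names, threshold=0.9):
    """Return pairs of names whose normalised forms are very similar."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
        if score >= threshold:
            pairs.append((a, b))
    return pairs


# Hypothetical account names with a punctuation variant and a typo.
accounts = ["Acme Holdings Ltd.", "ACME Holdings Ltd", "Borealis Capital", "Acme Holdngs Ltd"]
pairs = suspected_duplicates(accounts)
```

The output is a list of candidate pairs for review, not a list of confirmed duplicates; comparing every pair is also quadratic, so larger data sets would need some form of blocking first.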
She can also use statistical methods, such as clustering or the mean and standard deviation, to compare the data to itself and find outlying elements. This could flag up erroneous data, but she needs to take care not to simply "correct" outliers that are in fact genuine.
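The mean and standard deviation check can be sketched as a z-score test. The transaction amounts and the threshold of 2.0 below are made up for illustration; with very small samples a single point can never reach a large z-score, which is why the threshold here is lower than the commonly quoted 3.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=2.0):
    """Return (index, value) pairs lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; the last one looks suspicious.
amounts = [102.0, 98.5, 101.2, 99.8, 100.4, 97.9, 100.9, 1000.0]
flagged = flag_outliers(amounts)
```

The function only flags entries for review. Whether 1000.0 is a keying error or a genuinely large trade is still a judgement call.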
These tools and more are at Sarianne's disposal, but they will never replace her ability to spot deviations and use reason and hard work to increase the data quality overall.
Data quality can be measured and assessed in a number of ways.
Let's explore some of the main criteria with an example.
Tobias has decided to create a data governance plan for his wildly successful online DVD shop. Before he does so, he'll need to decide how to define quality data.
There's validity, which is all about making sure new data conforms to the defined rules of your business. We all know how irritating it can be to get incomplete phone numbers, postcodes where names should be, or prices in the unique identifier fields.
There's consistency, which Tobias can use to make sure he has the right information on his customers. If one entry says a customer lives in Perth, Australia but a ZIP code in Beverly Hills appears elsewhere, an alarm bell should be ringing.
There's accuracy, which is incredibly hard for Tobias to achieve, because there isn't really a go-to, faultless version of his data out there. Having said that, if he had an external database of postal codes and their relevant geographic locations, he'd have a better chance of figuring out if that customer really lived in WA 6000 or 90210.
There's completeness. In Tobias's case, a problem there would look like blank cells in his records. That's very difficult to fix, short of just making stuff up.
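The criteria above can be sketched as simple rule checks on a customer record. Everything specific here is an assumption for illustration: the field names, the phone number rule, and the two-entry `POSTCODE_CITY` table, which stands in for the external postal database mentioned above.

```python
import re

# Hypothetical reference table; a real one would come from an external postal database.
POSTCODE_CITY = {"6000": "Perth", "90210": "Beverly Hills"}
REQUIRED_FIELDS = ("name", "phone", "city", "postcode")


def quality_issues(record):
    """Return a list of human-readable problems found in one customer record."""
    issues = []
    # Completeness: every required field is present and non-blank.
    for field in REQUIRED_FIELDS:
        if not str(record.get(field) or "").strip():
            issues.append(f"completeness: {field} is blank")
    # Validity: assumed rule that a phone number is 7 to 15 digits, optionally prefixed with '+'.
    phone = record.get("phone") or ""
    if phone and not re.fullmatch(r"\+?\d{7,15}", phone.replace(" ", "")):
        issues.append("validity: phone is malformed")
    # Consistency and accuracy: the city should agree with the reference table.
    expected_city = POSTCODE_CITY.get(record.get("postcode") or "")
    if expected_city and record.get("city") and expected_city != record["city"]:
        issues.append("consistency: city does not match postcode")
    return issues
```

Checks like these are easiest to apply at the point of entry, which is where a data governance plan would put them.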
There are many more criteria for checking data quality. Tobias can put a data governance plan in place to improve data input, but there are no easy fixes once bad data has been collected and stored.
Data cleansing is the act of taking data and improving its overall quality.
Let's explore this with an example.
Sarianne is still hard at work making the most of her financial market data, massaging some data quality into her life. We've seen her approach to data governance, which is a key aspect of maintaining data quality, but she's going to need to run some data cleansing on everything that was collected before data governance was put in place.
It's a pain because it doesn't just involve fixing the data. It also involves identifying and locating poor data in a nebulous collection. It's like looking for a slightly bent needle in a stack of needles.
Luckily, Sarianne and her team don't have to do all this manually. Statistical and database methods can be employed to automate some of these processes. For instance, statistical models can be used to compare every entry against a mean, thus exposing potentially erroneous entries that deviate from the norm.
And there's the rub, because we all know those outliers could easily be correct data points that signify some insight that can be found. Tools or no tools, Sarianne has got her work cut out for her.
Data governance is the set of rules a business creates to ensure data is maintained in a meaningful and responsible way. While the main motivator for this is usually legal compliance, governance also plays an important role in maintaining data quality.
Let's explore this with an example.
Sarianne is hard at work in the financial markets. After some exploratory queries of her stored data, she's realised it's a mess. She needs a system to make sure the errors and lazy management stop as soon as possible.
Creating a data governance plan allows her to dictate how certain data will be collected and initiate plans to monitor its quality as it is entered. Crucially, a well drawn governance plan will also allow her to hold departments and team members responsible for their own areas of data. It will put a stop to data issues being brushed under the carpet or dismissed as being a wider company problem.
These rules will help Sarianne going forward, but what about all that rotten data she's already got?
Entity disambiguation can be used to make sure that products, people, or topics that share a similar or exact name are properly differentiated from one another.
Let's explore this with a simple example.
Last time we saw Fatima, she was using simple sentiment analysis technology to discover how people felt about her game Disgruntled Dogs.
The results weren't great, but when discussing her findings with the team, it was pointed out that their launch coincided with a Twitter campaign from residents in a localized area who were fed up listening to their neighbors' dogs yapping. The terms "disgruntled" and "dogs" turned up a lot in the campaign, meaning Fatima's sentiment analysis was not just focused on how people felt about her product.
Entity disambiguation can help define the search so it only pulls information relative to her needs. This can be done via establishing rules or through machine learning. In Fatima's case, she is using supervised learning on a training set of data to define terms which do or don't define the comment as being useful to her, by analyzing the words surrounding her key terms. She can use that training set to create an algorithm, which will allow her to run sentiment analysis on a smaller but more relevant group of results. Only time will tell if Disgruntled Dogs is better received than her first study revealed.
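A minimal sketch of the supervised approach described above: a tiny naive Bayes classifier that looks at the words surrounding the key terms to decide whether a mention is about the game. The training mentions are invented, and naive Bayes is just one reasonable choice of algorithm; nothing here is known to be what Fatima's team actually used.

```python
import math
from collections import Counter

# Hypothetical hand-labelled mentions: True means "about the game".
TRAINING = [
    ("just downloaded disgruntled dogs the levels are great", True),
    ("disgruntled dogs game keeps crashing on my phone", True),
    ("new high score on disgruntled dogs level ten", True),
    ("disgruntled neighbours say the dogs bark all night", False),
    ("residents disgruntled about dogs yapping on our street", False),
    ("council petition about noisy dogs from disgruntled residents", False),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {True: Counter(), False: Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, label_counts, vocab


def is_about_game(text, model):
    """Naive Bayes with add-one smoothing; words never seen in training are ignored."""
    word_counts, label_counts, vocab = model
    scores = {}
    for label in (True, False):
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / sum(label_counts.values()))
        for word in text.split():
            if word in vocab:
                score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]


model = train(TRAINING)
```

Mentions that pass this filter can then be handed to the sentiment analysis step, giving a smaller but more relevant pool of results.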
Semi-supervised learning sits halfway between supervised and unsupervised learning by combining both labelled and unlabelled data in the learning process.
Here's a simple example of how it might work for a business.
We're back with Tobias and his online DVD business. We've already seen him use unsupervised learning to split his mailing list into two groups so that each can receive messaging more likely to result in a sale. Now, he wants to define these groups to a higher degree of accuracy. Semi-supervised learning allows him to do this by combining labelled data (which can be expensive or time consuming to come by) with unlabelled data (which is more readily available and generally cheaper) to train the algorithm.
Using his labelled data as an anchor point, semi-supervised learning can then spot where clusters of the unlabelled data points can fit around it. What this means is he essentially has access to a bigger training set of data... and more data means more accuracy. Adjusting his groupings, he will be able to more accurately split his data into clusters in the hope that those marketing newsletters will yield even more sales.
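One common way to do this is self-training, sketched below with a nearest-centroid classifier: the labelled points anchor each group, the unlabelled points are given provisional labels, and the group centres are recomputed with those points included. The two features (blockbuster purchases, world cinema purchases) and all the data points are hypothetical.

```python
def centroid(points):
    """Average position of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def self_train(labelled, unlabelled, rounds=5):
    """labelled maps each label to its points; returns one pseudo-label per unlabelled point."""
    groups = {label: list(points) for label, points in labelled.items()}
    assignments = None
    for _ in range(rounds):
        centres = {label: centroid(points) for label, points in groups.items()}
        new = [min(centres, key=lambda label: sq_dist(p, centres[label])) for p in unlabelled]
        if new == assignments:
            break  # labels have stopped changing
        assignments = new
        # Rebuild each group from the labelled anchors plus the pseudo-labelled points.
        groups = {label: list(points) for label, points in labelled.items()}
        for p, label in zip(unlabelled, assignments):
            groups[label].append(p)
    return assignments


# Hypothetical customers as (blockbuster purchases, world cinema purchases).
labelled = {"blockbuster": [(9, 1), (8, 0)], "world": [(1, 7), (0, 9)]}
unlabelled = [(7, 2), (6, 1), (2, 6), (1, 8), (5, 4)]
pseudo_labels = self_train(labelled, unlabelled)
```

The handful of labelled customers does the anchoring, while the cheaper unlabelled data refines where the boundary between the two groups falls.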
Supervised learning's main difference from its unsupervised counterpart is the presence of a "training" set of data used to prepare an algorithm before it is unleashed on a "live" data set.
Let's explore this with a simple example.
Remember Michael, the medical researcher?
He now needs a set of patients to test a new treatment on. To find the right patients for the test, he uses a smaller training set of data (in which he has already identified the correct patients for his test) and creates an algorithm that can pick these candidates out based on the data held on them -- a combination of age, weight, addictions, previous ailments, genetic dispositions, and existing medication needs.
This is possible because the data set already contains all of these classifications. If it didn't, he'd need to use unsupervised learning to identify the inherent classifications in the data.
Once he is happy with the results returned from the training data set, he can unleash it on the wider set of patients, as well as applying it to the data from any patients to be captured in the future -- and go about trying to improve their lives.
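The train, check, then apply workflow can be sketched with a k-nearest-neighbours classifier. This is only one of many possible algorithms, and the two scaled features, the patient records, and the held-out check are all invented for illustration.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training rows."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical patients as ((age, weight), suitable), with features scaled to 0..1.
TRAIN = [
    ((0.30, 0.40), True), ((0.35, 0.45), True), ((0.28, 0.50), True),
    ((0.80, 0.85), False), ((0.75, 0.90), False), ((0.85, 0.80), False),
]
# Labelled examples kept back to check the results before going "live".
HELD_OUT = [((0.32, 0.42), True), ((0.78, 0.88), False)]

accuracy = sum(knn_predict(TRAIN, x) == y for x, y in HELD_OUT) / len(HELD_OUT)

# Once the check looks acceptable, apply the same model to unlabelled patients.
new_patients = [(0.33, 0.48), (0.82, 0.79)]
predictions = [knn_predict(TRAIN, p) for p in new_patients]
```

Keeping some labelled records aside, as `HELD_OUT` does here, is what lets Michael judge the results before trusting the algorithm on the wider set.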
We're unearthing more insight from the Big Data Show earlier this year -- today featuring Amanda Kahlow, CEO & Founder at 6Sense Insights Inc.
Amanda's approach to big data veers away from the vendor trap of getting overly invested in the storage, processing, and querying steps. She's more keen for businesses to bring together their interactive channels (and maybe some third party data) to see who is going to buy, when, and how much.
Amanda's pragmatic message sets a great goal for businesses looking to leverage their big data -- understand your customers before they have to tell you who they are.
Unsupervised learning allows us to find structure in, and apply labels to, data that was previously unlabelled.
Let's explore this with an example.
The last time we saw Tobias, he was using machine learning on a suggestion engine to get more products sold based on which films his customers were viewing. He also uses machine learning -- more specifically, unsupervised learning -- in his marketing campaigns. In this simple example, he'll send two versions of his newsletter: one featuring full-price blockbuster Hollywood films and one featuring sale-price world cinema.
It would take him months to sort through his customer list manually and determine who should get which version. To divide his customer data quickly into two coherent sets, he can use unsupervised learning algorithms, which will define the two groups for him by sorting them into clusters. This will allow him to increase the chances of his newsletters finding a receptive audience.
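Splitting customers into two clusters can be sketched with k-means where k is 2. The two features (share of full-price purchases, share of world cinema views) and the customer values are hypothetical, and the deterministic choice of starting centres is a simplification to keep the example repeatable.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Average position of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def two_means(points, max_iter=20):
    """k-means with k=2; returns a cluster index (0 or 1) for each point."""
    # Start from the first point and the point farthest from it.
    centres = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    labels = None
    for _ in range(max_iter):
        new_labels = [min((0, 1), key=lambda c: sq_dist(p, centres[c])) for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for c in (0, 1):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centres[c] = mean_point(members)
    return labels


# Hypothetical customers as (share of full-price purchases, share of world cinema views).
customers = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.05), (0.2, 0.9), (0.1, 0.8), (0.15, 0.95)]
groups = two_means(customers)
```

The algorithm only produces cluster 0 and cluster 1; it is still up to Tobias to look at each group and decide which newsletter it should receive.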
Recently, we caught up with Ben Pottier, Search Technology Specialist at Funnelback UK, to pick his brain on how SMEs can leverage big data.
He acknowledges that it's easy for smaller organizations to feel isolated from real big data benefits by the jargon and hype currently in circulation. However, he also maintains that SMEs are almost in a better position to leverage big data technologies, because of the ability to push big data technologies to grow their business.
He concludes: "Don't do big data with a big bang approach." Are you an SME leveraging big data, or looking to do so? Do you agree with Ben's advice? Let us know below.
Data Munging is boring but necessary, as it involves getting raw data ready so that the exciting work of analysis can be done.
Let's look at this with an example.
Sara's hard at work on data from her local schools, but oh boy, what a mess! Attendance registers are pretty clean, but each school uses different formatting for keeping track of grades and discipline, some keep dates in different formats -- not to mention all the unstructured data sitting there in the form of comments and social media interactions.
Before Sara can get anywhere with this, she needs to do some serious munging, or wrangling. That means getting the different sources to match up with each other -- so she'll need to extract all of this raw data, run algorithms on it to fit it into her preferred columns and rows, and deposit the finished dataset in her data store so she can start running queries.
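A small sketch of what that munging might involve: rows from different schools are mapped onto one set of columns, with dates and grades converted to a common form. The date formats, the letter-to-percentage mapping, and the field names are all assumptions; Sara's real sources would dictate their own.

```python
from datetime import datetime

DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y"]   # assumed formats in use
LETTER_TO_PERCENT = {"A": 90.0, "B": 80.0, "C": 70.0, "D": 60.0, "F": 40.0}  # assumed mapping


def parse_date(raw):
    """Return an ISO date string, or None if no known format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_grade(raw):
    """Return a percentage from either a letter grade or a value like '72%'."""
    cleaned = str(raw).strip().upper().rstrip("%")
    if cleaned in LETTER_TO_PERCENT:
        return LETTER_TO_PERCENT[cleaned]
    try:
        return float(cleaned)
    except ValueError:
        return None


def munge(row):
    """Map one raw row onto the preferred columns."""
    return {
        "school": row["school"],
        "date": parse_date(row["date"]),
        "grade_percent": parse_grade(row["grade"]),
    }
```

Values that come back as `None` mark the rows that still need manual attention before the dataset goes into the data store.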
Machine Learning is a form of artificial intelligence that can be used to automate a lot of big data processes.
Here's an example.
By using machine learning technology, customers of Tobias's online movie store get a more personalized, evolving service. Based on pages and products viewed, a visitor to the site is presented with films he or she might like to purchase. This is based on the machine learning engine spotting correlations in the data of customers with similar demographics who have viewed similar pages, and recommending potential purchases from their purchase history.
As this is an automated system that "learns," Tobias doesn't need to be constantly tweaking the algorithms, and the machine learning tool continues to learn, so when a new natural purchasing trend emerges among customers, it makes recommendations based on having recognized these new patterns.
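A very reduced sketch of such a suggestion engine: customers whose viewing overlaps most with the current visitor contribute their purchases as recommendations, weighted by Jaccard similarity. The customer data is hypothetical and demographics are left out entirely to keep the example short; a production recommender would be considerably more involved.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap between two sets, from 0.0 (nothing shared) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(viewed, customers, top_n=2):
    """Suggest films bought by customers whose viewing resembles `viewed`."""
    scores = Counter()
    for other in customers:
        similarity = jaccard(viewed, other["viewed"])
        for film in other["purchased"] - viewed:
            scores[film] += similarity
    return [film for film, score in scores.most_common(top_n) if score > 0]


# Hypothetical viewing and purchase histories.
CUSTOMERS = [
    {"viewed": {"Blade Runner", "Alien", "Brazil"}, "purchased": {"Brazil", "Dark Star"}},
    {"viewed": {"Blade Runner", "Alien"}, "purchased": {"Alien", "Dark Star"}},
    {"viewed": {"Amelie", "Delicatessen"}, "purchased": {"Delicatessen"}},
]
suggestions = recommend({"Blade Runner", "Alien"}, CUSTOMERS)
```

Because the scores are recomputed from whatever histories are passed in, new purchasing patterns show up in the suggestions without anyone rewriting the rules.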
Sentiment analysis is one way data miners can take the legwork out of understanding the meanings and feelings behind statements made in social media and other forums.
Let's take a look at how it works with an example.
Fatima works on the marketing team for a company developing casual games. It launched its flagship game, Disgruntled Dogs, seven days ago -- and Fatima wants to know how the game has been received.
The sales team can pull sales data, which is a good indicator of success, but analyzing conversations will give Fatima a more nuanced idea of what people do and don't like. Gathering together all mentions of her company, the game's title, and mentions of a few key characters and elements of the game results in a pool of over 1,000,000 mentions -- in the first week. This is a well talked-about game!
The technology Fatima has chosen is effective but relatively simple. It aims to "read" the words and the way they are used, based on a set of semantic rules, to determine the feelings expressed -- in this case via the polarities "Positive" or "Negative."
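A rule-based approach of this kind can be sketched with word lists and a single negation rule. The word lists are invented for illustration, and the "Neutral" result for texts with no clear signal is an addition of this sketch rather than something Fatima's tool is known to report.

```python
# Hypothetical sentiment lexicon; a real tool would use far larger lists.
POSITIVE = {"love", "great", "fun", "brilliant", "addictive"}
NEGATIVE = {"hate", "boring", "buggy", "crashes", "awful"}
NEGATORS = {"not", "never", "no"}


def polarity(text):
    """Score a text by counting lexicon words; a negator flips the word right after it."""
    score = 0
    negate = False
    cleaned = text.lower().replace(",", " ").replace(".", " ").replace("!", " ")
    for word in cleaned.split():
        if word in NEGATORS:
            negate = True
            continue
        value = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1 or 0
        score += -value if negate else value
        negate = False
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"
```

Rules this simple miss sarcasm, contractions such as "isn't", and anything outside the word lists, which is part of why the results need a sanity check before anyone acts on them.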
Oh dear, it seems people have an overwhelmingly negative view of Disgruntled Dogs. Still, it's better that Fatima can find this out quickly, and it gives her company the chance to alter the way the game works or change messaging to make sure this negative sentiment doesn't result in poor sales.