Doug Miles, Director of Market Intelligence, AIIM International, 3/6/2014
During the research we carried out in our recent AIIM Big Data and Content Analytics: Measuring the ROI report, we discovered that while big data analysis is increasingly accepted as an essential core competence, security is becoming a major big data adoption challenge.
Zubin Dowlaty, Head of Innovation & Development, 3/3/2014
Big data analytics is making big waves across all facets of industry, with adoption stories and use cases reaching new heights. This pace can be attributed to the information explosion, which is driving unprecedented focus on the ability to store, manage, and analyze data.
Robert Mullins, Technology Journalist, 2/17/2014
A recently released report reveals "a stark disconnect" between the understanding by executives of the value of big-data analytics and their grasp of where to start on big-data projects in their organizations.
James M. Connolly, US Correspondent, 2/7/2014
Seeing the Gartner Inc. prediction that 25% of large global companies will have adopted big data analytics for at least one security or fraud-detection application by 2016, I can't bring myself to take an optimistic glass-is-a-quarter-full viewpoint. Why only 25%? What are the other 75% thinking?
As we move from a busy 2013 for big data and into what is shaping up to be an even bigger year -- with IDC telling us that the big data technology and services market will grow at a massive 27% compound annual growth rate (CAGR) in the next five years, hitting $32.4 billion by 2017 -- it seems timely to remind ourselves that technology won't do this by ...
John Edwards, Technology Journalist & Author, 2/3/2014
Mathematicians at Rice University, Baylor College of Medicine, and the University of Texas at Austin have united to design new analytical tools that promise to help researchers uncover clues about cancer hidden inside gigantic haystacks of raw data.
Susan Fourtané, Science & Technology Writer, 1/30/2014
Adoption of big data solutions in Asia Pacific continues to grow, with more and more enterprises implementing big data systems. According to Forrester Research, China and India currently lead big data adoption in APAC, each with a reported 21% implementation rate.
James M. Connolly, US Correspondent, 1/29/2014
For all of the positives that big data analytics technology offers in terms of identifying prime market segments, increasing profitability, and controlling costs, it just might be that it also fosters discrimination.
Doug Miles, Director of Market Intelligence, AIIM International, 1/29/2014
The big data learning curve is going to be steep for any business. Many observers have suggested a mixed team of IT and business people might be the best way of building internal competence, while group learning and shared expertise can also play a role.
We've put together this short video to help you show your team just what you mean by "data warehouse."
A data warehouse is a collection of data, specifically designed to let you run reports and analytics.
Tobias has brought in a data specialist to get his online movie retail business into the big time, but he's confused about what the specialist keeps referring to as a data warehouse. He's got his database set up. Isn't that his data warehouse?
Technically, no. A database is simply the way we store any kind of data, whereas a data warehouse is specifically set up for running reports and analysis. It's data, but it has specific, definable business value.
This distinction allows Tobias to tailor his storage solution. His data warehouse takes pride of place on faster, more expensive disks in more accessible forms, while the rest of his data is kept, for potential future use or out of legal obligation, on cheaper, more standard disks.
Data quality can be measured and assessed in a number of ways.
Let's explore some of the main criteria with an example.
Tobias has decided to create a data governance plan for his wildly successful online DVD shop. Before he does so, he'll need to decide how to define quality data.
There's validity, which is all about making sure new data conforms to the defined rules of your business. We all know how irritating it can be to get incomplete phone numbers, postcodes where names should be, or prices in the unique identifier fields.
There's consistency, which Tobias can use to make sure he has the right information on his customers. If one entry says a customer lives in Perth, Australia but a ZIP code in Beverly Hills appears elsewhere, an alarm bell should be ringing.
There's accuracy, which is incredibly hard for Tobias to achieve, because there isn't really a go-to, faultless version of his data out there. Having said that, if he had an external database of postal codes and their relevant geographic locations, he'd have a better chance of figuring out if that customer really lived in WA6000 or 90210.
There's completeness. In Tobias's case, a problem there would look like blank cells in his records. That's very difficult to fix, short of just making stuff up.
There are many more criteria for checking data quality. Tobias can put a data governance plan in place to improve data input, but there are no easy fixes once bad data has been collected and stored.
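As a rough sketch of how some of those criteria can be automated -- the field names and rules below are invented for Tobias's shop, not taken from any real plan:

```python
import re

# Hypothetical validity rules for Tobias's customer records -- a real
# governance plan would encode the business's own rules.
RULES = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "postcode": lambda v: v.strip().isalnum(),
    "price": lambda v: v.replace(".", "", 1).isdigit(),
}

def invalid_fields(record):
    """Validity: return (field, value) pairs that break a rule."""
    return [(f, record.get(f, "")) for f, check in RULES.items()
            if not check(record.get(f, ""))]

def missing_fields(record):
    """Completeness: blank cells are easy to spot, hard to fix."""
    return [f for f in RULES if not record.get(f, "").strip()]

customer = {"email": "tobias@example", "postcode": "6000", "price": "12.99"}
print(invalid_fields(customer))  # the email breaks the validity rule
print(missing_fields(customer))  # no blanks in this record
```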
Data cleansing is the act of taking data and improving its overall quality.
Let's explore this with an example.
Sarianne is still hard at work making the most of her financial market data, massaging some data quality into her life. We've seen her approach to data governance, which is a key aspect of maintaining data quality, but she's going to need to run some data cleansing on everything that was collected before data governance was put in place.
It's a pain because it doesn't just involve fixing the data. It also involves identifying and locating poor data in a nebulous collection. It's like looking for a slightly bent needle in a stack of needles.
Luckily, Sarianne and her team don't have to do all this manually. Statistical and database methods can be employed to automate some of these processes. For instance, statistical models can be used to compare every entry against a mean, thus exposing potentially erroneous entries that deviate from the norm.
And there's the rub, because we all know those outliers could easily be correct data points signifying a real insight waiting to be found. Tools or no tools, Sarianne has got her work cut out for her.
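Here's a minimal sketch of that statistical screen, assuming a simple z-score rule and an invented list of prices:

```python
from statistics import mean, stdev

def flag_outliers(values, z_cutoff=2.0):
    """Flag entries more than z_cutoff standard deviations from the mean.
    Flagged values are only candidates for cleansing -- as noted above,
    an outlier can just as easily be a genuine insight."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if sigma and abs(v - mu) / sigma > z_cutoff]

prices = [101.2, 99.8, 100.5, 98.9, 100.1, 9999.0]  # one suspect entry
print(flag_outliers(prices))  # [9999.0]
```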
Data governance is the set of rules a business creates to ensure data is maintained in a meaningful and responsible way. While the main motivator for this is usually legal compliance, governance also plays an important role in maintaining data quality.
Let's explore this with an example.
Sarianne is hard at work in the financial markets. After some exploratory queries of her stored data, she's realized it's a mess. She needs a system to make sure the errors and lazy management stop as soon as possible.
Creating a data governance plan allows her to dictate how certain data will be collected and to initiate plans to monitor its quality as it is entered. Crucially, a well-drawn governance plan will also allow her to hold departments and team members responsible for their own areas of data. It will put a stop to data issues being brushed under the carpet or dismissed as a wider company problem.
These rules will help Sarianne going forward, but what about all that rotten data she's already got?
Entity disambiguation can be used to make sure that products, people, or topics that share a similar or exact name are properly differentiated from one another.
Let's explore this with a simple example.
Last time we saw Fatima, she was using simple sentiment analysis technology to discover how people felt about her game Disgruntled Dogs.
The results weren't great, but when discussing her findings with the team, it was pointed out that the launch coincided with a Twitter campaign by residents of one area who were fed up with listening to their neighbors' dogs yapping. The terms "disgruntled" and "dogs" turned up a lot in that campaign, meaning Fatima's sentiment analysis was not focused solely on how people felt about her product.
Entity disambiguation can help refine the search so it only pulls information relevant to her needs. This can be done by establishing rules or through machine learning. In Fatima's case, she is using supervised learning on a training set of data, analyzing the words surrounding her key terms to determine which comments are useful to her. From that training set she can create an algorithm that lets her run sentiment analysis on a smaller but more relevant group of results. Only time will tell if Disgruntled Dogs is better received than her first study revealed.
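A minimal sketch of that approach, assuming scikit-learn as the toolkit (the tools Fatima actually uses aren't named) and a tiny invented training set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-labelled training set: 1 = about the game, 0 = about noisy dogs.
mentions = [
    "just finished level three of disgruntled dogs, the puzzles are great",
    "disgruntled dogs keeps crashing on my phone after the update",
    "disgruntled residents say the dogs next door bark all night",
    "neighbours' dogs yapping again, the whole street is disgruntled",
]
labels = [1, 1, 0, 0]

# The classifier learns from the words *around* the shared key terms.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(mentions, labels)

# With this toy data, the new mention should come out as 1 (about the game).
print(model.predict(["loving the new update to disgruntled dogs"]))
```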
Supervised learning's main difference from its unsupervised counterpart is the presence of a "training" set of data used to prepare an algorithm before it is unleashed on a "live" data set.
Let's explore this with a simple example.
Remember Michael, the medical researcher?
He now needs a set of patients to test a new treatment on. To find the right patients for the test, he uses a smaller training set of data (in which he has already identified the correct patients for his test) and creates an algorithm that can pick these candidates out based on the data held on them -- a combination of age, weight, addictions, previous ailments, genetic predispositions, and existing medication needs.
This is possible because the data set already contains all of these classifications. If it didn't, he'd need to use unsupervised learning to identify the inherent classifications in the data.
Once he is happy with the results returned from the training data set, he can unleash it on the wider set of patients, as well as applying it to the data from any patients to be captured in the future -- and go about trying to improve their lives.
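Here's a minimal sketch of that train-then-apply workflow, with invented patient features and scikit-learn assumed as the toolkit:

```python
from sklearn.tree import DecisionTreeClassifier

# Training set Michael has already labelled by hand.
# Invented features: [age, weight_kg, smoker, prior_ailments]
X_train = [
    [34, 70, 0, 1],
    [45, 82, 0, 0],
    [58, 95, 1, 3],
    [61, 77, 1, 2],
]
y_train = [1, 1, 0, 0]  # 1 = suitable candidate for the trial

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Once happy with performance on the training set, apply the model
# to the wider patient pool (and to future patients as they arrive).
print(clf.predict([[29, 64, 0, 0], [66, 90, 1, 4]]))
```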
We're unearthing more insight from the Big Data Show earlier this year -- today featuring Amanda Kahlow, CEO & Founder at 6Sense Insights Inc.
Amanda's approach to big data veers away from the vendor trap of getting overly invested in the storage, processing, and querying steps. She's keener for businesses to bring together their interactive channels (and maybe some third-party data) to see who is going to buy, when, and how much.
Amanda's pragmatic message sets a great goal for businesses looking to leverage their big data -- understand your customers before they have to tell you who they are.
Machine Learning is a form of artificial intelligence that can be used to automate a lot of big data processes.
Here's an example.
By using machine learning technology, customers of Tobias's online movie store get a more personalized, evolving service. Based on pages and products viewed, a visitor to the site is presented with films he or she might like to purchase. The machine learning engine spots correlations in the data of customers with similar demographics who have viewed similar pages, and recommends potential purchases from their purchase history.
Because this automated system "learns," Tobias doesn't need to constantly tweak the algorithms: when a new purchasing trend emerges among customers, the tool recognizes the new pattern and adjusts its recommendations accordingly.
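The engine itself is a black box to Tobias, but a toy sketch of the co-occurrence idea behind such recommendations might look like this (viewing histories invented; real recommenders are far more sophisticated):

```python
from collections import Counter
from itertools import combinations

# Invented browsing histories: film pages viewed per customer.
histories = [
    {"Alien", "Blade Runner", "Dune"},
    {"Alien", "Blade Runner", "Solaris"},
    {"Dune", "Solaris", "Arrival"},
    {"Alien", "Dune", "Arrival"},
]

# Count how often each pair of films is viewed by the same customer.
pair_counts = Counter()
for viewed in histories:
    pair_counts.update(combinations(sorted(viewed), 2))

def recommend(film, top_n=3):
    """Suggest the films most often co-viewed with `film`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if film in (a, b):
            scores[b if a == film else a] += n
    return [f for f, _ in scores.most_common(top_n)]

print(recommend("Alien"))  # Blade Runner and Dune top the list
```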
Sentiment analysis is one way data miners can take the legwork out of understanding the meanings and feelings behind statements made in social media and other forums.
Let's take a look at how it works with an example.
Fatima works on the marketing team for a company developing casual games. It launched its flagship game, Disgruntled Dogs, seven days ago -- and Fatima wants to know how the game has been received.
The sales team can pull sales data, which is a good indicator of success, but analyzing conversations will give Fatima a more nuanced idea of what people do and don't like. Gathering together all mentions of her company, the game's title, and a few key characters and elements of the game results in a pool of over 1,000,000 mentions in the first week alone. This is a well talked-about game!
The technology Fatima has chosen is effective but relatively simple. It aims to "read" the words and the way they are used, based on a set of semantic rules, to determine the feelings expressed -- in this case via the polarities "Positive" or "Negative."
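A toy sketch of that rule-based scoring, with hypothetical word lists standing in for the tool's semantic rules:

```python
import re

# Hypothetical word lists standing in for the tool's semantic rules.
POSITIVE = {"love", "loving", "great", "fun", "brilliant", "addictive"}
NEGATIVE = {"hate", "boring", "broken", "crashes", "annoying"}

def polarity(mention):
    """Classify a mention by counting positive vs. negative words."""
    words = re.findall(r"[a-z]+", mention.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"

print(polarity("Loving Disgruntled Dogs, the art is brilliant!"))  # Positive
print(polarity("It crashes constantly. Boring and broken."))       # Negative
```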
Oh dear, it seems people have an overwhelmingly negative view of Disgruntled Dogs. Still, it's better that Fatima can find this out quickly, and it gives her company the chance to alter the way the game works or change its messaging to make sure this negative sentiment doesn't result in poor sales.
Unlike structured data, unstructured data isn't neatly arranged in columns and rows, properly titled and identified. This makes it much harder for data workers to gain insight.
You remember David, a Data Scientist with a lending firm? He's extracted all he can from the structured data available to him -- built some pretty good models -- but he's looking to get an edge on his competitors.
His team has found a load of data on his customers, stretching from years and years of scanned documents to uploaded photos and email. It's very information-rich and could tell him a lot about his customers, but it won't fit neatly into any of his columns or rows.
Luckily for David, he has a series of more advanced tools at his disposal, which allow him to categorize, tag, pull key comments, and make correlations between different sets of this data.
It's harder work for him, but David knows this data is more likely to reveal actionable insight on his customer base and their habits -- so he can target and sort them based on the loans they are more likely to take.
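As a minimal sketch of the simplest of those tools -- keyword tagging that lets raw text be sorted and correlated -- with invented categories and snippets:

```python
# Hypothetical tag rules for sorting David's pile of raw text.
TAGS = {
    "mortgage": {"mortgage", "property", "house"},
    "business_loan": {"business", "startup", "invoice"},
    "hardship": {"arrears", "missed", "redundancy"},
}

def tag_document(text):
    """Attach every tag whose keywords appear in the text."""
    words = set(text.lower().split())
    return sorted(tag for tag, keywords in TAGS.items() if words & keywords)

emails = [
    "Hi, asking about a mortgage on a house in Leeds",
    "our startup needs a business loan to cover an invoice",
]
for email in emails:
    print(tag_document(email), "<-", email)
```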
It's an interesting phrase -- data capable. I find it a far more realistic measure than a lot of the hyperbole we splash around on big data at the moment. Here we have Christine Andrews of DQM Group talking about her data capable concept at the Big Data Show earlier this year. It's about learning to walk before you try to run -- a key lesson from many failed big data projects.
They've even developed a model to test your data capability rating. So, being honest now, on a scale of 0-5, how data capable are you?
At the Big Data Show earlier this year, we caught up with Duncan Ross, director of Data Science at Teradata International.
He was excited to talk to us about the launch of DataKind in the UK. DataKind is a community of brilliant data scientists who have taken on the challenge of implementing data-led projects "in the service of humanity."
ETL stands for Extract, Transform, and Load, and it's central to a lot of big data work. But what does that mean? Let's explain it with an example:
Lauren is a data scientist working at a university, looking to bring together different datasets to make sure students are offered courses which best suit their profiles. To do this, she needs to pull data from lots of places into a centralized data warehouse.
First, she needs to extract data from the original sources, which can include existing university databases, as well as web crawling for social media information on students.
Next, Lauren has to transform this extracted data so that it fits in a way the centralized data warehouse can use it. For this, she can use a series of rules or functions to get the data into shape -- for instance, changing DOBs to reflect age, deriving aggregated values, deduplicating records, or joining data from multiple sources, depending on what the final data warehouse needs.
Finally, Lauren can load this data into the data warehouse, giving her a way to gain new insight on students by mining for patterns in this collected data.
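Here's a compact sketch of that pipeline, assuming a CSV source file and a SQLite warehouse as stand-ins for Lauren's real systems (the file and column names are invented):

```python
import csv
import sqlite3
from datetime import date

def extract(path):
    """Extract: read raw student records from one source system (a CSV here)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: derive age from DOB and deduplicate on student id."""
    seen, cleaned = set(), []
    for row in rows:
        if row["student_id"] in seen:
            continue  # drop duplicate records
        seen.add(row["student_id"])
        birth_year = int(row["dob"][:4])  # dob stored as YYYY-MM-DD
        row["age"] = date.today().year - birth_year
        cleaned.append(row)
    return cleaned

def load(rows, db="warehouse.db"):
    """Load: write the cleaned rows into the central data warehouse."""
    con = sqlite3.connect(db)
    con.execute("CREATE TABLE IF NOT EXISTS students (id TEXT, age INTEGER)")
    con.executemany("INSERT INTO students VALUES (?, ?)",
                    [(r["student_id"], r["age"]) for r in rows])
    con.commit()
    con.close()

load(transform(extract("students.csv")))
```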
At last week's Big Data Show we were lucky enough to speak to Lauren Walker, Sales Leader at IBM Big Data Solutions, who gave us a great message from her real-time analytics talk: Babies, Brains, and Buses.
This case study focused on big data's ability to help improve the survival rate of premature babies by combining machine information and human content in real time.
Continuing our series of interviews with businesses leveraging big data, we talk to James Gill, CEO of GoSquared.
GoSquared offers real-time web analytics, using big data technologies to surface the analytical data that counts. Marketing managers and IT departments benefit from GoSquared's ability to pick out the most actionable insights as they happen.
Pig simplifies the processes needed to run analytics on your big data sets through Hadoop.
Like the animal, Pig is not a fussy eater, getting its name from its ability to crunch through data, no matter what form it takes. It acts as a scripting interface to Hadoop, meaning a lack of MapReduce programming experience won't hold you back.
Example: Harvey works in a government office, looking to formulate new solutions for his city's parking problems. He knows how to use data, but writing his own map and reduce functions is a little beyond him.
Luckily, he's been set up with access to the databases through Pig, meaning he can draw on sources like parking ticket records and population density maps. Taking advantage of Pig's eat-anything attitude, he can also mine topics from a call for email suggestions his department sent to local residents, as well as sensor information about the amount of traffic on the roads. Despite his limited programming capabilities, Harvey can use Pig to query these data sets and sketch out draft suggestions to alleviate the local parking problems.
In the first of a series of interviews with business leaders who leverage big data, we talk to James Robinson, CTO and co-founder of OpenSignal.
OpenSignal combines big data technologies and sensor data from mobile phones to give insight to both mobile consumers and telecommunications giants. Robinson is also a contributing writer on Big Data Republic.