Imagine talking to various big data vendors, asking, “Once I’ve collected all my big data sources, where should I put it?” Hadoop vendors will say “put it in Hadoop.” NoSQL vendors or appliance vendors will say “No, put it here.” At which point you’re ready to say, “I’ll tell you where to put it.”
Start with the end in mind
The big problem is not where you put it. If you succumb to the “put it here” way of thinking, you’re not solving the problem. Leave your data where it is and start by asking, “What do I want to get out of my data? What problem am I trying to solve?” In other words, use that tried and true standard: Start with the end in mind.
We’ve always had a big data problem; data grows constantly and new sources of data emerge as time goes on. It’s just that today’s data -- given the advent of embedded software and social media -- is more diverse, more extensive, and, well, bigger than ever.
Data drives the problem, which in turn drives the technology to solve the problem. But what is driving your rationale for collecting the data? Our recent research says that most organizations want to understand their customer behavior far more than they want to understand the uptime of their computers (sorry, Splunk).
The themes in data
We are starting to see themes emerge: customers, competitors, process improvement, cost control, and risk mitigation. What is odd -- and we are seeing this firsthand with our recent vendor panels -- is that vendors are having difficulty identifying successes. Why? Competitive advantage is why. Success with customer behavior analysis can make such a huge difference that those who are indeed succeeding are giddy with excitement while they withhold this secret from the rest of us. We have seen this for years in the financial services space; our customers are very reluctant to reveal how they succeed with Actuate, precisely because they succeed!
No one has ever said “slow down my data.” Speed of access has driven the database industry for decades, from regular RDBMS to warehouse appliances, and now to in-memory and columnar technologies, as well as distributed multi-parallel processing systems like Hadoop. Faster data. Well, there are still problems with this. Moving data across a network is expensive -- just look at your cellphone bill. Bandwidth is and will continue to be the bottleneck. So what is the answer?
Some clarifying advice: Pick the problem first
Know what you are trying to solve and work specifically on that problem rather than looking at the entire universe of data you could collect or are already collecting. If you want to know what your customers will do, collect customer transactions, demographic data, and perhaps tweets or posts on Twitter or Facebook profiles (if that’s where your customers hang out). Then decide where to put that data, if anywhere.
Demographic and geolocation data can stay on the Internet; it doesn’t change very often, and you can pick what you need when you need it. Some data may need to be collected in the Hadoop Distributed File System (HDFS) and processed by programs in order to find what’s relevant. Some data needs to be collected, parsed, enhanced, and indexed, like Twitter data, in order to be meaningful. Set your hooks in the right places, and don’t try to move seas of data from the Pacific Ocean to the Atlantic.
Knowing how all of this data behaves and what you are trying to discover from it can help you identify which technologies might be appropriate to mix together, mine for intelligence, or visualize for communication and process improvement. The tasks of collecting and consolidating disparate sources are a big part of the big data problem, but there are lots of valuable products that can help here, like Cloudera, Hortonworks, EMC, Cassandra, VoltDB, and MongoDB. Used in the right combination, they can give you the insights you need to run your business better.
The goal is to help you find what big data matters to your goals, shrink it into a meaningful interactive visualization, and deliver it to everyone across your organization. This requires flexibility not often found in our industry. It’s like exercise; you don’t have to do it, but you should.
You want all the flexibility available, and you don’t want to be forced into dumping all your big data into one location or pushing it across one network -- all things that traditional vendors will ask you to do. By starting with the end in mind, and by using a combination of products (recent research shows organizations looking to manage big data are considering an average of 3.5 software products to do so), you will get the insights you need from corralled big data, and the problem will become a solution.
Re: the big data crystal ball @Jeff, a brilliant overview. Inspiring, yet I can see that for companies at the start of this journey it might be slightly daunting. Are there two or three questions/data sets all of these journeys should start from to inspire the next set of queries?
the big data crystal ball I do like the blue sky assumption that with all these emerging technologies we will begin to understand what we don't know. But as we keep polling the market, the questions around the big data crystal ball are themselves crystallizing. "What don't I understand about my customers, my prospects or my audience?" is the root. From that question, a million others spawn, but its answers are starting to shape and hone the competitive edge of organisations who ask it. This is why there is both urgency and secrecy as big data projects move forward: competitiveness is shaping the need for these answers. What is handy here, of course, is that as soon as you start asking these questions yourself, you eventually land on the path to implementation, when you finally ponder, "What data will help me find my answers?" Now you have both a purpose and the beginnings of a plan, one that all these technologies might help you carry out.
User Rank: Blogger 1/28/2013 | 11:36:30 AM
Re: Spot on. @MDMConsult - it seems it's a complex area people like to make even more complex for themselves. Those early questions are crucial, sure, but only so long as they aren't stringing out the rest of the team.
User Rank: Exabyte Executive 1/28/2013 | 6:25:02 AM
First things first I assume you are saying companies should ask all the questions you raised before they start collecting the data rather than afterwards. Where a company stores data is important but setting early goals about the data is even more important.
User Rank: Exabyte Executive 1/27/2013 | 1:46:36 PM
Re: Spot on. Determining which big data is valuable and how to take advantage of that value is important. Organizations should understand how to determine what is most useful and valuable in their big data sets and optimize it for better decision making.
Re: Spot on. @Saul, the important notion in your comment is "part of the appeal" as in finding unknown knowledge. Unsupervised learning is just half of the calculus of data science - through techniques like clustering you can discover previously unknown gems in your big data. The other side of the equation of course is prediction - supervised learning. With this form of machine learning you absolutely need to know what problem you're solving. And it is important to understand that machine learning is not magic; the data must support the answers to the questions.
User Rank: Exabyte Executive 1/26/2013 | 1:35:26 PM
Re: to the point data Good point, but that's why it is important to have people who can think outside the box and add new perspectives on interpreting the data that could potentially drive real ROI for big data projects.
User Rank: Exabyte Executive 1/25/2013 | 3:00:44 PM
Re: Spot on. Those are good questions to ask. Also, who owns big data in the organization? Questions like these are important, especially in the initial phases, because big data is a complex area. We have to define its objectives and identify the challenges and the analytics to measure. Applying questions in the early phases is crucial.
User Rank: Petabyte Pathfinder 1/25/2013 | 2:33:55 PM
Re: Spot on. I agree with you. The point you mentioned, "The vast potential of Big Data is in making it "small" data that will add value to organizations.", is what all must achieve. We don't have to keep all the data. We need just the accurate data for our analysis, be it demographic data, geographic locations, or whatever suits our business requirement.
User Rank: Petabyte Pathfinder 1/25/2013 | 2:30:05 PM
to the point data We can all gather data and data about data, but it will only be useful when we know what sort of data we are looking for and filter out the rest. Otherwise it will just be costly for us to store and maintain all that data, which is of no real use. Therefore one should be focused and have in mind exactly what one is looking for; only then will big data prove to be helpful and successful.