Big Data Explained: What Is Volume?

Today we're going to take a look at the V that makes big data big: Volume.

It's no secret we're inundated with data these days, from mobile devices, machines, social media, transactions, satellites… pretty much everything is throwing data out. And technology has reached a point that allows us to capture and keep everything, too.

Why would we bother?

Because managing such a vast quantity of data can reveal information and patterns about people and objects that we otherwise couldn't see.

An example:

John runs a tradeshow and wants to make it a really unique and repeatable experience for all his attendees.

Tim is an attendee at the show, and has been for five years. John's company has been tracking his every data point with the show for that whole time – from his online activity before the show, checking in to his hotel, scanning his ticket as he enters the show, the stands and sessions he has attended in previous years, even down to what he has had for his lunch.

Keeping hold of all of this data on Tim means John can present him with a really personalized experience – with a dedicated map and timetable guiding Tim to the content he has a history of making a beeline for, and even getting him a voucher for his favorite vegetarian lunch!

That's a lot of data, but what makes this really big data is that John's company has been collecting this information from every one of the attendees at every one of its shows – allowing it to offer this personalized and highly valued experience to everyone.
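As a rough illustration of the idea, here is a minimal Python sketch. It assumes a hypothetical visit log (the attendee names, years, and topics are invented for this example and are not John's actual data) and simply counts which topics each attendee keeps coming back to.

```python
from collections import Counter, defaultdict

# Hypothetical visit log: (attendee, year, topic of the stand or session visited).
VISITS = [
    ("tim", 2008, "storage"),
    ("tim", 2009, "storage"),
    ("tim", 2010, "analytics"),
    ("tim", 2011, "storage"),
    ("tim", 2012, "analytics"),
    ("ann", 2011, "security"),
    ("ann", 2012, "security"),
    ("ann", 2012, "storage"),
]


def build_profiles(visits):
    """Count how often each attendee visited each topic across all shows."""
    profiles = defaultdict(Counter)
    for attendee, _year, topic in visits:
        profiles[attendee][topic] += 1
    return profiles


def top_topics(profiles, attendee, n=2):
    """Return the attendee's n most-visited topics, most frequent first."""
    return [topic for topic, _count in profiles[attendee].most_common(n)]


profiles = build_profiles(VISITS)
print(top_topics(profiles, "tim"))
```

The more years and attendees the log covers, the more useful these counts become, which is the point about volume.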

With good management of the volume of data, big data allows organizations to grow and experiment based on previous encounters.

Part of a 9 part series
11/28/2012 | 23 comments
kiran, User Rank: Megabyte Messenger
1/26/2013 | 3:49:35 AM


Re: Good Video
The video is indeed simple and to the point, and explains a very important property of big data: its amount and size. Data is produced by every machine, computer, and electronic device that we use, which accounts for the big volume of the data. However, the intelligent work still to be done is to filter the data we need out of the pile and analyze it for our requirements.

netcrawl, User Rank: Exabyte Executive
1/18/2013 | 12:26:37 AM


Re: He who has the most data wins!

The term "big data" has been used to describe the massive volumes of data analyzed by huge corporations like Google or Microsoft. It's all about finding new value within and outside conventional data sources, analyzing the data at scale and transforming the results into something useful. Data analysis on big data goes beyond providing us with ever-increasing information about everything.


netcrawl, User Rank: Exabyte Executive
1/18/2013 | 12:16:04 AM


Re: He who has the most data wins!
Big data is all about volume, tons of information that can make a huge impact on the company's bottom line. The only question here is how do we store that data, and how do we analyze such a massive amount of it?

Susan Fourtané , User Rank: Blogger
1/1/2013 | 11:11:58 AM


Re: He who has the most data wins!
Thanks, Anna.

This means, then, we could have a Big Data storage bank for future reference, in case we need to go back and use some of our old data which was not useful before but may be useful at some other point. Of course with the cloud this becomes simplified, and easier to manage, too.

I love your criminal investigations example. Very clear.

-Susan  

 

Susan Fourtané , User Rank: Blogger
1/1/2013 | 11:00:05 AM


Re: He who has the most data wins!
Thanks, Daniel. 

This is a very clear example of how to keep your data clean and in order.

-Susan

deastman, User Rank: Bit Player
12/31/2012 | 10:32:19 PM


Re: He who has the most data wins!
I agree with this training theory. The basic premise is that you never know what kind of situation the future may hold. So rather than just disregarding a piece of data as insignificant, store it, and perhaps it can be useful in the future. Hmm... brings to mind a movie... Terminator, anyone?

 

I'll be backk....

netcrawl, User Rank: Exabyte Executive
12/29/2012 | 7:22:56 AM


Re: He who has the most data wins!
If you know how to tame the data deluge, then you have the potential to harness its power and turn that into new profit. But before you do that, you need to acquire four critical capabilities:

Fast, cheap capture and storage of unstructured data

Cost-effective organization of data

Trend spotting using analysis tools

Engineered systems that turn data into action

Anna Young, User Rank: Exabyte Executive
12/28/2012 | 1:40:23 PM


Re: He who has the most data wins!
Susan, Great question! Let me give a somewhat convoluted answer. Our ability to appreciate the data we collate is often restricted by the analytical tools we have today. The Big Data we think we've analyzed or that we think is not useful could prove useful in future based on the application of new tools.

It's like in criminal investigations. Remember all the data (evidence, blood samples, and other items) collected from crime scenes in the 50s, 60s, and even in the last few decades, which was thought useless at the time but has more recently been used to confirm or reject previous conclusions? This has been possible only because of recent technology developments, DNA analysis, for instance. The same applies to the evaluation of Big Data and the question of what to do with data that is no longer "valuable."

Daniel Gutierrez, User Rank: Blogger
12/25/2012 | 3:44:29 PM


Re: He who has the most data wins!
Right ... during the data cleansing phase, the goal is to yield a clean data set for the purpose of analytics. This data set is simply a clean version of potentially dirty transactional data. We don't touch the original data.

Then during the feature engineering phase, it is necessary to select features (data elements) that may contribute to the success of the machine learning process. If a specific feature is determined to be non-valuable, it is not deleted from the data, just removed from the feature set (maybe temporarily, since the feature could be reinstated at some later time depending on requirements).

Susan Fourtané , User Rank: Blogger
12/25/2012 | 2:45:57 AM


Re: He who has the most data wins!
Thanks, Daniel. 

So we agree on the importance of quality data.

When analyzing the results, do you discard the data which will not have an effect on the purpose of your data collection, and filter it out to get better quality data, or do you keep it separately as extracted knowledge anyway?

-Susan

Saul Sherry
Big Data Explained: What Is Variety?

11|7|12   |   2:10   |   (22) comments


There's plenty of talk about big data's three V's: volume, velocity, and variety. But what exactly do these terms mean?

We're going to take a quick trip through one of these today: Variety.

This exciting concept within big data gives you the opportunity to gain insight by combining a variety of data sets that would not traditionally sit together. By enabling you to link up your traditional analytical data sets with many different types of information, a new world of analytical possibilities is opened.

So what's so exciting about this?

Well, it allows you to collate data sets that don't obviously relate to each other. Data experts can then analyse this collated data, to spot patterns or create new insights you would previously have been blind to. Variety, when tackled well in big data, allows you to see new revelations in the data your organization already produces.

An example: Judith is a brand manager. She loves her job and is very good at it, but knows she would benefit from being able to listen even more closely to the voice of her customer.

Taking traditional financial information, Judith can already see the performance of her brand. It doesn't take a data scientist to see which week did well, and which week did badly. But it won't tell her why.

Harnessing variety in data, Judith's data team can create relations between this data and what's being said on social media about her brand, as well as in text-input fields on customer satisfaction surveys. These disparate sets of data can be brought together, contextualized, and visualized in a way that gives Judith clues as to what her brand has done to influence customer behavior.
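To make that concrete, here is a minimal Python sketch of combining two unlike data sets. The weekly revenue figures and sentiment scores are hypothetical, and a real project would need some way of scoring sentiment in the first place, which is assumed away here.

```python
# Hypothetical weekly sales figures (week number -> revenue).
WEEKLY_SALES = {1: 10500, 2: 9800, 3: 6200, 4: 11000}

# Hypothetical social media mentions: (week number, sentiment score in [-1, 1]).
MENTIONS = [(1, 0.6), (1, 0.4), (2, 0.2), (3, -0.7), (3, -0.5), (4, 0.8)]


def average_sentiment(mentions):
    """Average the sentiment scores of all mentions in each week."""
    totals = {}
    for week, score in mentions:
        total, count = totals.get(week, (0.0, 0))
        totals[week] = (total + score, count + 1)
    return {week: total / count for week, (total, count) in totals.items()}


def combine(sales, mentions):
    """Join the two data sets on week so they can be read side by side."""
    sentiment = average_sentiment(mentions)
    return [(week, sales[week], sentiment.get(week)) for week in sorted(sales)]


for week, revenue, mood in combine(WEEKLY_SALES, MENTIONS):
    print(week, revenue, mood)
```

In this invented data, the worst sales week is also the week with the most negative chatter. That is the kind of clue, not proof, that the combined view can offer.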

Suddenly, Judith has the vision to generate hypotheses on ways to amplify positive results and mitigate negative trends.

Most importantly, she can take action.

Saul Sherry
Big Data Explained: What Is Volume?

11|28|12   |   1:44   |   (23) comments


Saul Sherry
Big Data Explained: What Is Velocity?

1|18|13   |   1:53   |   (13) comments


Today we're going to take a look at the V that allows big data to be immediate and reactive: Velocity.

As well as having to master the sheer volume and variety of information within big data, organizations also have to be able to contend with the speed at which all of this data is generated. Real benefit can be gained by pouncing on this data in real time -- affecting outcomes while they are still forming.

What kind of benefit?

Well, as we've already established, data can take many different forms. How working on this stream of real-time big data will benefit you will depend on your industry. For this example I'll focus on the financial services sector.

Andy is in charge of online security for a big bank, trying to make sure his customers' money is safe. Detecting fraud after the event is of little use to him, but spotting it as it happens can be priceless. If a malicious computerized attack is launched on Andy's bank, it will generate thousands of events every second -- but Andy has put the right system in place to detect these events by comparing them to the way actual, normal customers behave. And it happens in real time, so alarms go off to let him know.

As Frank Bria told us in his Big Data Republic article, Big Data Tackles Fraud:

"Many fraudsters will access online banking and go directly to the transfer section of a Website without first checking balances and transactions. That clickstream is foreign and unfamiliar to the complex event processing engine and thus gets flagged."

In this way the bank can stamp down on the illegal activity as it happens, rather than chasing up after the event.
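The rule in that quote can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not a real complex event processing engine: the page names and session IDs are hypothetical, and a production system would compare against learned customer behavior rather than a single hard-coded rule.

```python
def flag_suspicious(events):
    """Yield session IDs that reach the transfer page without first viewing
    balances or transactions. Events are processed one at a time, in arrival
    order, as they would be on a live stream."""
    seen_review = set()   # sessions that have looked at balances/transactions
    flagged = set()       # sessions already reported, to avoid duplicates
    for session_id, page in events:
        if page in ("balances", "transactions"):
            seen_review.add(session_id)
        elif page == "transfer":
            if session_id not in seen_review and session_id not in flagged:
                flagged.add(session_id)
                yield session_id


# Hypothetical clickstream: (session ID, page visited), in arrival order.
STREAM = [
    ("s1", "login"), ("s2", "login"),
    ("s1", "balances"), ("s2", "transfer"),
    ("s1", "transfer"), ("s3", "login"), ("s3", "transfer"),
]

print(list(flag_suspicious(STREAM)))
```

Because the function is a generator, each suspicious session is reported the moment its transfer click arrives, rather than after the whole stream has been read.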

Saul Sherry
Big Data Explained: What Is MapReduce?

2|26|13   |   1:16   |   (7) comments


I want to tackle Hadoop, but before we get there, we're going to need to explore MapReduce. MapReduce is a programming model for processing large datasets, and the clue to its function is in its name.

When you want to pull certain information from your datasets, it "maps" out the relevant information for your query.

Then it "reduces" the information down, sorts it based on any rules you've applied, and gives you just the data you were after.

An example:

Virginia is a medical researcher looking to carry out research on diabetes patients. For the purposes of her study, she wants to see any geographical concentrations of diabetes patients who are male, between the ages of 40 and 50, and who smoke.

The map step in the MapReduce model scans the patient records and picks out the ones which fit Virginia's needs, noting the city for each.

Then begins the reduce function -- aggregating geographical data of these records and providing an ordered list of cities with the highest population of the defined type. This simple process has allowed Virginia to identify areas of concentration for further study.
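Here is a minimal single-machine sketch of the pattern in Python, assuming a handful of invented patient records. It is not Hadoop's API. The `shuffle` helper stands in for the grouping a MapReduce framework performs between the two steps, and all function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical patient records; every value here is invented for illustration.
PATIENTS = [
    {"city": "Leeds", "sex": "M", "age": 45, "smoker": True},
    {"city": "Leeds", "sex": "M", "age": 48, "smoker": True},
    {"city": "York", "sex": "M", "age": 42, "smoker": True},
    {"city": "York", "sex": "F", "age": 44, "smoker": True},
    {"city": "Leeds", "sex": "M", "age": 61, "smoker": True},
    {"city": "Hull", "sex": "M", "age": 50, "smoker": False},
]


def map_patient(record):
    """Map step: emit (city, 1) for each record matching the study criteria."""
    if record["sex"] == "M" and 40 <= record["age"] <= 50 and record["smoker"]:
        yield record["city"], 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_city(city, counts):
    """Reduce step: add up the matches for one city."""
    return city, sum(counts)


mapped = [pair for record in PATIENTS for pair in map_patient(record)]
reduced = [reduce_city(city, counts) for city, counts in shuffle(mapped).items()]
ranking = sorted(reduced, key=lambda item: item[1], reverse=True)
print(ranking)
```

Each call to `map_patient` looks at one record in isolation and each call to `reduce_city` looks at one city in isolation, which is what lets a framework spread the work over many machines.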

MapReduce itself is pretty straightforward, but once we start ramping up the amount and types of data used we will need Hadoop's help -- which is where things get a bit more complex.

Saul Sherry
Big Data Explained: What Is Hadoop?

3|5|13   |   1:13   |   (9) comments


Hadoop is the open-source software framework that quickly became almost synonymous with big data. But what does it actually do?

Whereas traditional data queries were run on one server, Hadoop enables you to run data queries across a large number of machines. By spreading the computational load across many servers, Hadoop enables you to deal with big data in a timely fashion.

An example:

Tobias runs an online DVD store -- and he wants to increase sales by recommending products to customers as they check out. But he doesn't just want to recommend bestsellers; he wants a smart system that recommends based on the buyer's demographics and taste.

That's where Hadoop helps out. For each customer, Hadoop enables Tobias to spot patterns across all of his customers' data, based on age, sex, genre preference, actor preference, period of production, and many other defining elements. He can access this information quickly, because different elements of the search can be carried out individually and simultaneously, instead of having to take place on a single machine.
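The "individually and simultaneously" part can be imitated on one computer. The Python sketch below is only an analogy: a thread pool stands in for Hadoop's cluster of machines, the purchase records are hypothetical, and none of the function names come from Hadoop itself.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical purchase records, already split into partitions as if each
# partition lived on a different machine: (customer age band, genre bought).
PARTITIONS = [
    [("18-25", "horror"), ("18-25", "comedy"), ("40-50", "drama")],
    [("18-25", "horror"), ("40-50", "drama"), ("40-50", "western")],
    [("18-25", "horror"), ("40-50", "drama")],
]


def count_partition(records):
    """Work done independently on one partition: count (age band, genre) pairs."""
    return Counter(records)


def favourite_genres(partitions):
    """Run the per-partition counts concurrently, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(count_partition, partitions))
    totals = sum(partials, Counter())
    best = {}
    for (age_band, genre), count in totals.items():
        if age_band not in best or count > best[age_band][1]:
            best[age_band] = (genre, count)
    return {age_band: genre for age_band, (genre, _count) in best.items()}


print(favourite_genres(PARTITIONS))
```

No partition needs to know about any other until the final merge, so adding more partitions (or, in Hadoop's case, more servers) is how the approach keeps up with more data.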

Using MapReduce (as discussed in a previous video), these queries are then returned in a way that can guide the customer and increase Tobias's revenue.

Saul Sherry
Big Data Explained: What Is Pig?

3|21|13   |   1:16   |   (8) comments


Pig basically simplifies the processes needed to get analytics done through Hadoop on your big data sets.

Like the animal, Pig is not a fussy eater, getting its name from its ability to crunch through data, no matter what form it takes. It acts as a scripting interface to Hadoop, meaning a lack of MapReduce programming experience won't hold you back.

Example: Harvey works in a government office, looking to formulate new solutions for his city's parking problems. He knows how to use data, but writing his own map and reduce functions is a little beyond him.

Luckily, he's been set up with access to the databases through Pig, meaning he can draw on sources like parking ticket records and population density maps. Taking advantage of Pig's eat-anything attitude, he can also mine topics from a call for email suggestions his department sent to local residents, as well as sensor information about the amount of traffic on the roads. In spite of his limited programming capabilities, Pig allows Harvey to query these data sets and sketch out some draft suggestions he can use to alleviate the local parking problems.

Saul Sherry
Big Data Explained: What Is HDFS?

4|4|13   |   1:05   |   (13) comments


Big data is awash with acronyms at the moment, none more widely used than HDFS. Let's cut to the chase... it stands for Hadoop Distributed File System.

This is the file system that allows Hadoop to work on huge data sets at speed. It spreads blocks of data across different servers, and also duplicates those blocks and stores the copies on separate servers.

Let's see why with an example.

Sarianne works in the financial markets, and runs a lot of predictive models to make sure her investments are minimum risk.

Utilising HDFS, her queries through Hadoop can run quickly because the data blocks are stored separately -- meaning the computation on each block can happen at the same time, rather than each piece queuing up behind the last.

As an added benefit, if one server fails (as one is bound to, given the number of servers and disk drives needed to run big data projects), it won't stop Sarianne's models from pulling the data they need, because HDFS duplicated those blocks -- meaning Hadoop can still return Sarianne's results in double quick time.
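A toy Python model shows why the duplicated blocks matter. It makes several assumptions of its own: the server names, the tiny block size, and the round-robin placement are all invented for this sketch, and real HDFS uses a more involved placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, servers, replicas):
    """Store each block on `replicas` different servers, chosen round-robin."""
    storage = {server: {} for server in servers}
    for index, block in enumerate(blocks):
        for offset in range(replicas):
            server = servers[(index + offset) % len(servers)]
            storage[server][index] = block
    return storage


def read_file(storage, block_count, failed=()):
    """Reassemble the file, skipping any servers listed as failed."""
    parts = []
    for index in range(block_count):
        holders = [s for s in storage if s not in failed and index in storage[s]]
        if not holders:
            raise IOError(f"block {index} is unavailable")
        parts.append(storage[holders[0]][index])
    return "".join(parts)


SERVERS = ["node-a", "node-b", "node-c"]   # hypothetical server names
DATA = "prices,positions,risk-scores"
blocks = split_into_blocks(DATA, block_size=8)
storage = place_blocks(blocks, SERVERS, replicas=2)

# Even with one server down, every block still has a surviving copy.
print(read_file(storage, len(blocks), failed={"node-b"}) == DATA)
```

With two copies of every block on different servers, any single server can fail without losing data. Losing two servers at once would leave some blocks unreadable in this toy setup.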

Saul Sherry
Big Data Explained: What Is ETL?

5|13|13   |   1:14   |   (9) comments


ETL is central to a lot of big data work, standing for Extract, Transform, and Load. But what does that mean? Let's explain it with an example:

Lauren is a data scientist working at a university, looking to bring together different datasets to make sure students are offered courses which best suit their profiles. To do this, she needs to pull data from lots of places into a centralized data warehouse.

First, she needs to extract data from the original sources, which can include existing university databases, as well as web crawling for social media information on students.

Next, Lauren has to transform this extracted data into a form the centralized data warehouse can use. For this, she can use a series of rules or functions to get the data into shape -- for instance, changing DOBs to reflect age, deriving aggregated values, deduplicating records, or joining data from multiple sources, depending on what the final data warehouse needs.

Finally, Lauren can load this data into the data warehouse, giving her a way to gain new insight on students by mining for patterns in this collected data.
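The three steps can be sketched end to end in Python using only the standard library. Everything specific here is an assumption made for illustration: the CSV text stands in for a university database export, an in-memory SQLite database stands in for the data warehouse, and "today" is pinned to a fixed date so the ages are reproducible.

```python
import csv
import io
import sqlite3
from datetime import date

# Hypothetical source data standing in for a university database export.
SOURCE_CSV = """student_id,name,dob
1,Ada,1994-03-10
2,Ben,1992-11-02
2,Ben,1992-11-02
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, today):
    """Transform: drop duplicate records and turn each DOB into an age."""
    seen, result = set(), []
    for row in rows:
        if row["student_id"] in seen:
            continue
        seen.add(row["student_id"])
        born = date.fromisoformat(row["dob"])
        had_birthday = (today.month, today.day) >= (born.month, born.day)
        age = today.year - born.year - (0 if had_birthday else 1)
        result.append((int(row["student_id"]), row["name"], age))
    return result


def load(records, connection):
    """Load: write the cleaned records into the warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS students "
        "(id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
    )
    connection.executemany("INSERT INTO students VALUES (?, ?, ?)", records)
    connection.commit()


warehouse = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse
load(transform(extract(SOURCE_CSV), today=date(2013, 5, 13)), warehouse)
rows = warehouse.execute("SELECT * FROM students ORDER BY id").fetchall()
print(rows)
```

Keeping extract, transform, and load as separate functions mirrors the way Lauren's work is described: each stage can be changed or rerun without touching the others.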