The Pitfalls of Big Data Prediction Analysis

Terry Simmonds
Anna Young, User Rank: Exabyte Executive
12/9/2012 | 9:22:26 PM


Calling Nate Silver
This is not to trivialize the issue, but just as Nate Silver of FiveThirtyEight beat established polling agencies in predicting the results of the last US presidential election, his methodology of averaging polls and accounting for their differences could help ensure that "Big Data Prediction" is as close as possible to reality.
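
As a rough illustration of the "averaging and accounting for differences" idea, here is a minimal sketch of a weighted average over several estimates, each corrected for a known lean. It is not Nate Silver's actual model; the function name, the bias adjustments, and the sample figures are all hypothetical.

```python
def weighted_average(estimates):
    """Combine estimates given as (value, known_bias, weight) tuples.

    Each value is first corrected by subtracting its known bias, then
    the corrected values are averaged using the weights (for example,
    sample sizes). All inputs here are illustrative assumptions.
    """
    total_weight = sum(weight for _, _, weight in estimates)
    corrected_sum = sum((value - bias) * weight for value, bias, weight in estimates)
    return corrected_sum / total_weight


# Hypothetical polls: (reported share in %, assumed lean in points, sample size)
polls = [
    (52.0, 1.0, 800),
    (49.0, -1.0, 1200),
    (51.0, 0.0, 1000),
]
combined = weighted_average(polls)
print(f"Combined estimate: {combined:.1f}%")
```

The same pattern applies to any set of predictors whose systematic leanings are roughly known: correct each one, then let the better-supported estimates count for more.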

Daniel Gutierrez, User Rank: Blogger
12/7/2012 | 5:51:16 PM


Re: A day in the life of a machine learning algorithm
@Saul, training any classifier has to address the issue of "false positives," which is what we're talking about here: falsely identifying a website as a carrier of malware. The idea is to reduce false positives. But part of the data science project must include a mechanism for re-training the algorithm or, in the case of online learning, updating the parameter vector in real time. How you approach the re-learning depends on the problem being solved.
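
A minimal sketch of the online-learning case, assuming a logistic-regression classifier updated by stochastic gradient descent (the thread does not say which model the blacklisting system actually uses). The feature names, starting weights, and learning rate are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, features):
    """Estimated probability that a site is malware."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def online_update(weights, features, label, learning_rate=0.1):
    """One gradient step on a single labeled example (label 1 = malware)."""
    error = predict(weights, features) - label
    return [w - learning_rate * error * x for w, x in zip(weights, features)]


# Hypothetical features: [bias, is_young_site, is_high_volume]
weights = [0.0, 2.0, 2.0]       # a model that distrusts young, high-volume sites
benign_site = [1.0, 1.0, 1.0]   # a young, popular, legitimate site

before = predict(weights, benign_site)
for _ in range(50):             # repeated feedback that the site is benign
    weights = online_update(weights, benign_site, label=0)
after = predict(weights, benign_site)

print(f"malware probability before: {before:.2f}, after: {after:.2f}")
```

Each confirmed false positive nudges the parameter vector, so the model stops flagging that pattern without a full batch retrain.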

Saul Sherry, User Rank: Blogger
12/7/2012 | 5:35:04 AM


Re: A day in the life of a machine learning algorithm
That seems like it will always be the case in security measures, @legalcio. "Better safe than sorry" seems to ring true. Is there a chance a competitive economic model will develop around who can build the best system, based not just on exclusions but also on how many 'legal' sites AREN'T blocked?

Say Candidate A excludes 95% of all malware, but 10% of what it blocks is revealed as misidentified, while Candidate B also excludes 95% of all malware but is preferred because of a meagre 5% misidentification rate?
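
To put rough numbers on that comparison, here is a small sketch. It assumes the "misidentification rate" means the share of all blocked sites that turn out to be benign, and it assumes a hypothetical population of 10,000 malware sites; neither figure comes from the thread.

```python
def wrongly_blocked(malware_total, detection_rate, misidentified_share):
    """Estimate how many benign sites a system blocks.

    detection_rate: fraction of real malware sites that get blocked.
    misidentified_share: fraction of all blocked sites that are benign
    (an assumed reading of the "misidentification rate").
    """
    correctly_blocked = malware_total * detection_rate
    # If benign / (malware + benign) = s among blocked sites,
    # then benign = malware * s / (1 - s).
    return correctly_blocked * misidentified_share / (1.0 - misidentified_share)


candidate_a = wrongly_blocked(10_000, 0.95, 0.10)
candidate_b = wrongly_blocked(10_000, 0.95, 0.05)
print(f"Candidate A wrongly blocks about {candidate_a:.0f} benign sites")
print(f"Candidate B wrongly blocks about {candidate_b:.0f} benign sites")
```

Under these assumptions both candidates stop the same 9,500 malware sites, but Candidate A blocks roughly twice as many legitimate ones, which is the kind of difference a buyer could compare on.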

Saul Sherry, User Rank: Blogger
12/7/2012 | 5:30:49 AM


Re: A day in the life of a machine learning algorithm
@Daniel, where would that constant re-evaluation sit within an expense decision for an organization? Is it feasible that they think "it's doing its job" and see any further ongoing investment as a sinkhole for cash?

SharCo, User Rank: Bit Player
12/6/2012 | 2:41:04 PM


Re: unfortunate incident
This really is an unfortunate incident. Having something like this happen definitely puts a crimp in the growth of your site and ruins the momentum. If you think about it, it's a bit sad that malware sites get a lot of hits in a short amount of time; that means many people are getting fooled every second. I hope that you find a way to rectify this, Terry.

SharCo, User Rank: Bit Player
12/6/2012 | 2:38:54 PM


Re: unfortunate incident
I remember the discussion you're referring to, Saul. It looks like Terry is part of the 'collateral damage,' and it's ironic that it was brought about by too much success! The system is obviously flawed at this point and, unfortunately, it's the innocents who suffer in the end.

legalcio, User Rank: Exabyte Executive
12/6/2012 | 2:28:11 PM


Re: A day in the life of a machine learning algorithm
While this was an inconvenience, I'd say most telcos and other large companies are going to err on the side of caution. The real problem here is no one has found a way to effectively deal with the tonnage of malware out there. It's like dealing with the TSA at airports. You don't fit the profile, but somewhere out there is a suicide bomber that does, so take your shoes off. We implemented a new firewall solution recently and are still adding perfectly benign sites that for some odd algorithmic reason are blocked.

Daniel Gutierrez, User Rank: Blogger
12/6/2012 | 12:48:32 PM


A day in the life of a machine learning algorithm
Your experience exposes the process by which data scientists design machine learning algorithms to tackle specific problems; case in point, the malware classifier that blacklisted you. I've been in meetings with domain experts during the design phase of an algorithm that would play a critical role in some business process. I recall one meeting where I was designing a support vector machine (SVM) implementation for a spam classifier. My client was insisting on certain rules having to do with e-mail headers and certain domain name patterns. I disagreed because the result would be very similar to the one you report now: too many innocent senders would be classified as spammers. Alas, I lost the battle. It seems like another over-exuberant algorithm got the better of you too.

At the end of the day, all machine learning algorithms need to be re-evaluated on a regular basis as more Big Data flows through. Old features (machine learning lingo for data attributes used for learning purposes) may be discarded, new features adopted. So in the case you bring attention to, the "young, high-volume" feature needs to be revisited. Once the model has been tweaked, then it must be retrained to give better predictions.
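
One way to make "re-evaluated on a regular basis" concrete is to track the false positive rate on decisions that humans have since reviewed, and flag the model for retraining when it drifts too high. This is only a sketch: the 2% threshold, the function names, and the sample review log are hypothetical.

```python
def false_positive_rate(reviewed):
    """Share of genuinely benign sites that the model flagged as malware.

    reviewed: list of (predicted_malware, actually_malware) boolean pairs
    taken from decisions that were later checked by a person.
    """
    flags_on_benign = [predicted for predicted, actual in reviewed if not actual]
    if not flags_on_benign:
        return 0.0
    return sum(flags_on_benign) / len(flags_on_benign)


def needs_retraining(reviewed, max_fpr=0.02):
    """True when the observed false positive rate exceeds the tolerated level."""
    return false_positive_rate(reviewed) > max_fpr


# Hypothetical review log: 95 correct blocks, 10 benign sites wrongly
# blocked, and 190 benign sites correctly allowed.
recent_reviews = (
    [(True, True)] * 95
    + [(True, False)] * 10
    + [(False, False)] * 190
)

rate = false_positive_rate(recent_reviews)
print(f"false positive rate: {rate:.1%}, retrain: {needs_retraining(recent_reviews)}")
```

A check like this gives a trigger for revisiting a feature such as "young, high-volume" before many more legitimate sites are caught.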

Saul Sherry, User Rank: Blogger
12/6/2012 | 10:24:51 AM


Re: unfortunate incident
It brings to mind what Frank Bria has been telling us about fraud detection in internet banking... certain behaviours flag you as guilty. But those behaviours don't reflect every context. I see development here as a trade-off between automation and man-hours. The system works as far as keeping malware out... but fails in that it keeps non-malware out too. Do you leave it as is (flawed but cheap), or do you involve individuals to rectify it, feeding Terry's case (and a few thousand other falsely accused sites) into the model to refine it (costly, but progressive)?

kiran, User Rank: Megabyte Messenger
12/6/2012 | 8:19:36 AM


unfortunate incident
Terry, I must say that what happened to you is unfortunate. I hope your website gets off the blacklist. A surge in hits doesn't always point to malware; it could reflect something positive, like how innovative and good the site is. However, this does bring to mind the thought that Big Data, if not used properly, will not benefit you. One should do detailed research, thinking and pre-planning before using Big Data; it's not just a thing that any organization can implement. One should make sure that the algorithms implemented in their BI systems are working correctly and are not blocking any beneficial sites.
