Big data prediction analysis can be great for giving us suggestions for contacts on LinkedIn or good reads on Amazon. What happens when the big data prediction algorithms get it wrong?
So I try to make contact with someone who might not be my perfect match, or I read a book I don't enjoy. So what? But what happens if the big data prediction algorithm blocks a reputable operator from doing business on the web because the prediction algorithm gets it wrong? What follows is a real-life example, and it happened this week (not last century).
The event and my sin
I am keen to use the web for many things, including promoting my business. My website www.EndureDSandBI.com is still in the early development stages, but from a hits perspective, it is doing quite well. So well, in fact, that my site has been tagged by a large telecommunications company as malware. My sin (as can be seen below) is that the traffic to my website is too high for its age.
My new website is too successful. Therefore, it must be malware. That makes sense, right? As a result, the telco is blacklisting said disreputable website and blocking all internal staff from accessing it. Everyone is happy, right? Not quite. I genuinely wanted the internal staff at this company to access my website. Now all access has been denied, and it is not clear what impact being blacklisted by this company will have on things like search engine results.
To be malware or not to be malware
What is malware? Techterms.com defines it as malicious software designed to destroy your computer. It is clear that a big data prediction algorithm has been established with the firm belief that all young, high-volume sites are malware. This interesting assumption has resulted in my new site being accused of being malicious and of attempting to corrupt computer software. I am therefore being pronounced guilty, with no regard for whether my site is reputable. Of course, this telco needs to protect itself from malicious attacks, and the risk to companies large and small is enormous. Of course, it is essential to block all suspicious URLs. But what happens if this then becomes the norm, stifling successful entrepreneurialism?
This could all be explained as a one-off phenomenon. However, I suspect the big data prediction algorithm is such that any successful entrepreneurial site would get the same blacklisting. So much for successful innovation on the web. Big data prediction analysts, beware: your work may very well be stifling the very essence of the work in which you are involved.
Spurious accuracy be gone, but make sure you get it right
I agree that spurious accuracy is not needed for big data analytics. However, it is critical to make sure your work does not label something incorrectly with potentially damning results. Make sure you get it right. Don't just think you have found the magic bullet without understanding the ramifications of implementing poor big data prediction algorithms.
Has anyone else been impacted by a similar situation?
User Rank: Exabyte Executive 12/9/2012 | 9:22:26 PM
Calling Nate Silver This is not to trivialize the issue, but much as Nate Silver of FiveThirtyEight managed to beat established polling agencies in predicting the results of the last US presidential election, his methodology of averaging polls and accounting for their differences could be useful in ensuring that "Big Data Prediction" is as close as possible to reality.
Re: A day in the life of a machine learning algorithm @Saul, any training of a classifier algorithm will address the issue of "false positives", which is what we're talking about here: falsely identifying the website as a carrier of malware. The idea is to reduce false positives. But part of the data science project must include a mechanism for re-training the algorithm or, in the case of online learning, for updating the parameter vector in real time. How you approach the re-learning depends on the problem being solved.
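To illustrate the online learning point, here is a minimal sketch of a parameter vector being updated one example at a time. It uses plain logistic regression with a gradient step, which is only one possible approach and not necessarily what the telco runs. The feature names, starting weights, and learning rate are all hypothetical, and feeding the same corrected example repeatedly is a simplification of a real feedback stream.

```python
import math

def predict(weights, features):
    """Return the model's estimated probability that a site is malware."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def online_update(weights, features, label, learning_rate=0.5):
    """One stochastic-gradient step of logistic regression on a single example."""
    error = predict(weights, features) - label
    return [w - learning_rate * error * x for w, x in zip(weights, features)]

# Hypothetical features: [bias, site_is_young, traffic_is_high]
weights = [-2.0, 1.5, 1.5]   # assumed starting point that penalizes young, busy sites
site = [1.0, 1.0, 1.0]       # a young, high-traffic site

before = predict(weights, site)

# An analyst reviews the site and reports it as benign (label 0).
for _ in range(20):
    weights = online_update(weights, site, label=0)

after = predict(weights, site)
print(f"malware score before feedback: {before:.3f}, after: {after:.3f}")
```

Each correction nudges the weights so that the model's malware score for this kind of site falls, which is the re-learning mechanism in miniature.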
Re: A day in the life of a machine learning algorithm Seems that will always be the case in security measures @legalcio. "Better safe than sorry" seems to ring true. Is there a chance a competitive economic model will develop out of who can build the best system, based not just on exclusions but also on how many 'legal' sites AREN'T blocked?
Candidate A excludes 95% of all malware, and 10% of that is revealed as misidentified, vs. Candidate B, which also excludes 95% of all malware but is preferred because of a meagre 5% misidentification rate?
Re: A day in the life of a machine learning algorithm @Daniel, where would that constant re-evaluation sit within an expense decision for an organization? Is it feasible they think "it's doing its job" and see any more ongoing investment as a sinkhole for cash?
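One way to make the Candidate A vs. Candidate B comparison concrete is to count the legitimate sites each one blocks. The sketch below assumes "misidentified" means the share of blocked sites that turn out to be legitimate, and it uses a made-up figure of 1,000 malware sites. Neither assumption comes from the comment itself.

```python
def wrongly_blocked(malware_sites, detection_rate, misidentified_share):
    """Estimate how many legitimate sites end up on the block list.

    misidentified_share is read as the fraction of ALL blocked sites
    that are actually legitimate (an assumption, see the lead-in).
    """
    correctly_blocked = malware_sites * detection_rate
    total_blocked = correctly_blocked / (1.0 - misidentified_share)
    return total_blocked - correctly_blocked

MALWARE_SITES = 1000  # hypothetical population size

candidate_a = wrongly_blocked(MALWARE_SITES, 0.95, 0.10)
candidate_b = wrongly_blocked(MALWARE_SITES, 0.95, 0.05)

print(f"Candidate A blocks about {candidate_a:.0f} legitimate sites")
print(f"Candidate B blocks about {candidate_b:.0f} legitimate sites")
```

Under these assumptions both candidates stop the same amount of malware, but A blocks roughly twice as many legitimate sites as B.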
User Rank: Petabyte Pathfinder 12/6/2012 | 2:41:04 PM
Re: unfortunate incident This really is an unfortunate incident. Having something like this happen definitely puts a crimp in the growth of your site and ruins the momentum. If you think about it, it's a bit sad that malware sites get a lot of hits in a short amount of time; that means many people are getting fooled by the second. I hope that you find a way to rectify this, Terry.
User Rank: Petabyte Pathfinder 12/6/2012 | 2:38:54 PM
Re: unfortunate incident I remember the discussion you're referring to Saul. It looks like Terry is part of the 'collateral damage' and it's ironic that it was brought about by too much success! The system is obviously flawed at this point and unfortunately, it's the innocents that suffer in the end.
User Rank: Exabyte Executive 12/6/2012 | 2:28:11 PM
Re: A day in the life of a machine learning algorithm While this was an inconvenience, I'd say most telcos and other large companies are going to err on the side of caution. The real problem here is no one has found a way to effectively deal with the tonnage of malware out there. It's like dealing with the TSA at airports. You don't fit the profile, but somewhere out there is a suicide bomber that does, so take your shoes off. We implemented a new firewall solution recently and are still adding perfectly benign sites that for some odd algorithmic reason are blocked.
A day in the life of a machine learning algorithm Your experience exposes the process by which data scientists design machine learning algorithms to tackle specific problems; case in point, the malware classifier that blacklisted you. I've been in meetings with domain experts during the design phase of an algorithm that would play a critical role in some business process. I recall one meeting where I was designing a support vector machine (SVM) implementation for a spam classifier. My client was insisting on certain rules having to do with e-mail headers and certain domain name patterns. I disagreed because the result would be very similar to the one you report now: too many innocent senders would be classified as spammers. Alas, I lost the battle. It seems like another over-exuberant algorithm got the better of you too.
At the end of the day, all machine learning algorithms need to be re-evaluated on a regular basis as more Big Data flows through. Old features (machine learning lingo for data attributes used for learning purposes) may be discarded, new features adopted. So in the case you bring attention to, the "young, high-volume" feature needs to be revisited. Once the model has been tweaked, then it must be retrained to give better predictions.
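As a rough sketch of what that regular re-evaluation could look like, the snippet below measures the false positive rate of a "young, high-volume" rule on freshly labelled sites and flags when it is time to revisit the feature. The thresholds, the sample data, and the acceptable error rate are all invented for illustration; the actual rule behind the blacklisting is not known.

```python
YOUNG_DAYS = 90      # hypothetical cutoff for a "young" site
HIGH_HITS = 1000     # hypothetical cutoff for "high-volume" daily traffic

def flags_as_malware(site):
    """The naive rule: any young, high-volume site is treated as malware."""
    return site["age_days"] < YOUNG_DAYS and site["daily_hits"] > HIGH_HITS

def false_positive_rate(sites, classifier):
    """Share of genuinely benign sites that the classifier flags anyway."""
    benign = [s for s in sites if not s["is_malware"]]
    if not benign:
        return 0.0
    wrongly_flagged = sum(1 for s in benign if classifier(s))
    return wrongly_flagged / len(benign)

# Made-up sample of recently reviewed sites with analyst-confirmed labels.
recent_sites = [
    {"age_days": 30, "daily_hits": 5000, "is_malware": True},
    {"age_days": 45, "daily_hits": 4000, "is_malware": False},  # new and popular
    {"age_days": 400, "daily_hits": 8000, "is_malware": False},
    {"age_days": 20, "daily_hits": 50, "is_malware": False},
    {"age_days": 700, "daily_hits": 200, "is_malware": False},
]

MAX_ACCEPTABLE_FPR = 0.10  # hypothetical tolerance agreed with the business

fpr = false_positive_rate(recent_sites, flags_as_malware)
needs_retraining = fpr > MAX_ACCEPTABLE_FPR
print(f"false positive rate: {fpr:.2f}, revisit the feature: {needs_retraining}")
```

A check like this only helps if someone keeps supplying confirmed labels, which is where the ongoing cost of re-evaluation comes from.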
User Rank: Blogger 12/6/2012 | 10:24:51 AM
Re: unfortunate incident It brings to mind what Frank Bria has been telling us about financial detection of fraud on internet banking... certain behaviours flag you as guilty. But those behaviors don't reflect every context.
I see development here as a playoff between automation and man hours. The system works as far as keeping malware out... but fails in that it keeps non malware out too.
Do you leave it as is (flawed but cheap), or do you involve individuals to rectify it, feeding Terry's case (and a few thousand other false accusations) into the model to refine it (costly, but progressive)?
User Rank: Megabyte Messenger 12/6/2012 | 8:19:36 AM
unfortunate incident Terry, I must say that what has happened to you is unfortunate. I hope your website gets off the blacklist. A high number of hits isn't always linked to malware; it could reflect something positive, like how innovative and good the site is. However, this does bring to mind the thought that Big Data, if not used properly, will not benefit you. One should do detailed research, thinking, and planning before using Big Data; it's not something that just any organization can implement. One should make sure that the algorithms implemented in their BI systems are working correctly and are not blocking any beneficial sites.