Making data science a sport
An Australian statistician whose "crowd-sourcing for geniuses" website has solved some of the world's most intractable problems has just received a US$11 million injection from some of the biggest investors in Silicon Valley.
Anthony Goldbloom, who coded Kaggle.com in a small apartment in Bondi in Sydney's east after leaving cushy jobs at Treasury and the Reserve Bank of Australia, is now the toast of Silicon Valley.
The founder of PayPal and Slide, Max Levchin, is one of the investors and he has joined Kaggle as its chairman. The others include Index Ventures, Khosla Ventures, SV Angel, Yuri Milner's Start Fund, Stanford Management Co. and a series of high-profile angel investors including Google chief economist Hal Varian.
In a phone interview with Fairfax Media from his new base in San Francisco, Goldbloom, 28, said he spoke to a Wall Street Journal reporter who told him that it was "the most impressive list of investors she's ever seen".
"I've got 500 unread emails in my inbox, mostly from recruiters and job applicants," said Goldbloom.
Kaggle is a platform where companies, researchers and governments can host competitions to help solve huge data-related problems. About 50 competitions have been run to date and members take just days to solve problems that have stumped scientists for years.
"We're kind of making data science into a sport," said Goldbloom.
Kaggle's 17,000 PhD-level members have so far helped NASA come up with models to map the universe's dark matter, helped health care providers predict which customers will get sick and predicted the winners of the Eurovision song contest with greater accuracy than the betting markets.
They get paid handsomely for doing so, with prizes ranging from US$5000 to US$3 million per solution, Goldbloom says. In fact, he predicts that his service could allow scientists to earn as much as entertainers and sports stars as Kaggle grows.
"Business analytics or predictive modelling is a US$100 billion industry and US$41 billion is spent on outsourced business analytics every year. I think that's about twice the size of the movie industry - it's really big," said Goldbloom.
"Our view is that the very best data miners or statisticians can earn as much as the very best golfers or tennis players."
Goldbloom said most modern companies, under the hood, had a predictive model directing the way business is done. For instance, banks use models to work out who is going to default, health care providers use models to predict health outcomes and supermarkets use models of buying behaviour and other factors to determine where to place items in the store.
Google's chief economist Varian has said statistics is the sexy job of the 21st century.
"The reason this is the case is because companies are collecting more and more data than ever before and they're starting to realise that there's a lot of value in that data," said Goldbloom.
He used the example of credit card companies like Visa, MasterCard and American Express, which are now looking at ways to get into ad targeting.
"We think Facebook and Google know a lot about us - who knows more about us than AmEx, MasterCard and Visa? They know exactly what we spend and where we spent it ... so they're looking at ways to unlock it," said Goldbloom.
On Kaggle.com, swathes of data are made available to members who vie to produce the best algorithms.
Their efforts are ranked in real time right up until the deadline, encouraging competition between members. Several of the top solutions earn money, not just the overall winners.
"To date Kaggle has crunched data on dark matter, predicting which used cars are likely to be bad buys, improve the World Chess Federation's official chess rating system, and predicting the likelihood that an HIV patient's infection will become less severe, given a small dataset and limited clinical information," Kaggle claims.
NASA's dark matter competition was actually won by a British glaciologist from the University of Cambridge, Martin O'Leary. The solution was a mathematical model for the tiny distortions in images of galaxies, thought to be caused by dark matter.
NASA, the European space agency and others had been working on the problem for 10 years but O'Leary found the solution in a week and a half. The effort garnered him a mention on the White House website, which compared him to Albert Einstein and Isaac Newton.
"Glaciologists use different techniques to astronomers. It turns out that the techniques that glaciologists use are really powerful on this NASA problem," said Goldbloom.
In another example, the NSW Roads and Traffic Authority used Kaggle to come up with a better model for predicting travel times for motorists. The AU$10,000 prize was won by Costa Rican PhD students but it is unclear whether the solution has been adopted by the RTA.
"They could tell you how long it would take you within one minute of the actual travel time 73 per cent of the time, which is pretty good," said Goldbloom.
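The figure Goldbloom quotes can be read as the share of predictions that land within a fixed tolerance of the true travel time. The sketch below shows that calculation on made-up numbers; the function name `share_within` and the sample data are illustrative assumptions, and the article does not say this was the competition's official scoring rule.

```python
def share_within(predicted, actual, tolerance=1.0):
    """Fraction of predictions within `tolerance` minutes of the actual time."""
    hits = sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual))
    return hits / len(actual)

# Hypothetical travel times in minutes (not data from the RTA competition).
predicted_minutes = [12.0, 18.5, 25.0, 9.0]
actual_minutes = [12.5, 21.0, 24.2, 9.9]

# Three of the four predictions fall within one minute of the actual time.
share = share_within(predicted_minutes, actual_minutes)
print(f"{share:.0%} of predictions within one minute")
```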
Banks have used the service to work out models for loan defaulting.
For instance, a bank might have two years' worth of information on borrowers, including whether or not each borrower has defaulted.
It would release the first 18 months of data and ask participants to predict which of the borrowers in the last six months would default on their loans. The bank already knows the answers, but it can use the most accurate solution as a model for predicting future defaults.
"You want to evaluate future borrowers but in order to train an algorithm that will help you identify future defaults you have to train it and evaluate it on past data," Goldbloom explained.
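The bank example amounts to a time-based holdout: fit a model on the earlier records, then score it on later records whose outcomes are already known. The following is a minimal sketch of that idea, assuming invented loan records and a deliberately simple one-variable threshold model; none of the names or numbers come from an actual Kaggle competition.

```python
from datetime import date

# Hypothetical records: (origination date, debt-to-income ratio, defaulted?)
loans = [
    (date(2010, 1, 15), 0.20, False),
    (date(2010, 4, 2), 0.55, True),
    (date(2010, 8, 20), 0.30, False),
    (date(2010, 11, 5), 0.60, True),
    (date(2011, 2, 11), 0.35, False),
    (date(2011, 5, 30), 0.50, True),
    (date(2011, 8, 9), 0.25, False),
    (date(2011, 9, 14), 0.40, True),
    (date(2011, 10, 17), 0.65, True),
    (date(2011, 11, 22), 0.45, False),
    (date(2011, 12, 3), 0.52, True),
]

# The first 18 months are released; the last six months are held back.
CUTOFF = date(2011, 7, 1)
train = [r for r in loans if r[0] < CUTOFF]
holdout = [r for r in loans if r[0] >= CUTOFF]

def accuracy(records, threshold):
    """Share of records where 'ratio >= threshold' matches the known outcome."""
    correct = sum((ratio >= threshold) == defaulted for _, ratio, defaulted in records)
    return correct / len(records)

def fit_threshold(records):
    """Pick the ratio cut-off that classifies the training records best."""
    candidates = sorted({ratio for _, ratio, _ in records})
    return max(candidates, key=lambda t: accuracy(records, t))

threshold = fit_threshold(train)                 # learned from the released data only
holdout_accuracy = accuracy(holdout, threshold)  # scored against the hidden answers
print(threshold, holdout_accuracy)
```

Splitting by date rather than at random keeps the test honest: the model never sees information from the period it is asked to predict.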
Kaggle's competitions are being compared to a competition run by Netflix several years ago to improve its recommendation algorithm, which suggests movies a user might like. A team of seven engineers from across the industry won the $1 million prize.
Goldbloom said Kaggle could be explained as a "99designs for predictive modelling", where crowd-sourced statistical models replace graphic design. But the big difference with Kaggle is that the problems it is solving are fundamental to the way big companies operate, and are therefore worth a lot of money.
This is part of the reason for the large US$11 million investment, which is unusual for a first funding round and for a team of just four staff. Goldbloom would not reveal how much of the company he owned but said the founders remained the majority shareholders.
The biggest prize on offer on the site is US$3 million, offered by a managed care provider that wants to predict which of its clients will go to hospital in the next year. The problem remains unsolved but has had scores of entries.
"They receive a fixed fee for each patient and if they can keep the patient out of hospital then they pay out less on that patient and make a bigger margin," said Goldbloom.
"They want to predict in advance who's at risk of going to hospital so they can call them into the doctor's office."
The competition to predict the winner of the Eurovision song contest last year was Kaggle's first competition. The contest has very strong voting patterns, such as the fact that Greece always votes for Cyprus and vice-versa.
"Most of the top performers [on Kaggle] were better at picking the ordering in the top 10 than the betting markets were," said Goldbloom.
"If I took the average of all the statistical models, they got seven of the top 10 right whereas the betting markets only got five of the top 10."
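One way to read "the average of all the statistical models" is to average each entrant's predicted finishing position across models, then count how many of the resulting top picks appear in the actual top group. The sketch below does this for a toy field of five entrants and a top three; the entrant labels, rankings and function names are invented for illustration, and the article does not describe how Goldbloom actually combined the models.

```python
from statistics import mean

def average_ranks(model_rankings):
    """Average each entrant's predicted position across models (1 = best)."""
    entrants = model_rankings[0]
    return {e: mean(r.index(e) + 1 for r in model_rankings) for e in entrants}

def top_k_hits(avg_rank, actual_order, k):
    """Count entrants in the averaged top k that also finished in the actual top k."""
    predicted_top = set(sorted(avg_rank, key=avg_rank.get)[:k])
    return len(predicted_top & set(actual_order[:k]))

# Hypothetical rankings from three models, best entrant first.
models = [
    ["A", "B", "C", "D", "E"],
    ["B", "A", "D", "C", "E"],
    ["A", "C", "B", "E", "D"],
]
actual = ["A", "C", "E", "B", "D"]  # hypothetical final result

avg = average_ranks(models)
hits = top_k_hits(avg, actual, k=3)
print(f"{hits} of the top 3 picked correctly")
```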
Of course, for some of the data sets, such as health care data, privacy is paramount. Goldbloom said it was possible to set private competitions, where the data is only accessible to certain selected Kaggle members, who sign an NDA with the company involved.
Goldbloom never expected to become an entrepreneur.
When he was at the RBA he entered an essay competition at The Economist magazine and won. His piece posited that the subprime mortgage crisis wasn't a big deal.
Goldbloom was awarded a three-month internship at the magazine, where he was assigned a story about big data, data mining and how it all fit into business.
The people he spoke to, including the guy who was doing the "microtargeting" for Barack Obama's presidential campaign, opened his eyes to the huge opportunity.
He also spoke to many CIOs who all said they wanted to do more predictive modelling but their enthusiasm didn't match actual adoption. Goldbloom concluded that the problem was that the product was too technical and the barriers to entry too high.
"I just became obsessed," said Goldbloom. "It was so much better than trying to predict unemployment next month and getting it wrong."
Sunday Star Times