Google Flu Trends will supplement crowdsourced approach with CDC data

November 03, 2014

07:53 am

Without specifically acknowledging criticisms about its accuracy, Google announced Friday that it is making changes to Google Flu Trends, the service that uses search data to track and predict flu activity in the United States and around the world. Specifically, Flu Trends will no longer rely solely on search data to make predictions; it will instead continually combine that search data with publicly available data from the CDC.

Earlier this year, Google was subjected to accusations -- backed up with data in publications like Science -- that Google Flu Trends was not just wrong about the flu one time, in 2012, but had been massively and systematically inaccurate for years. David Lazer, Professor in Political Science and Computer and Information Science at Northeastern University, and Ryan Kennedy, a visiting professor at Northeastern, published papers about it in Science and the Social Science Research Network in March.

“So we started digging deeper and deeper and it got more and more puzzling,” Lazer said at the time. “It was clear that it went off the rails years ago. Nobody at Google noticed. It was just like a boat where all the crew had died or something like that. It was just sort of floating around.”

Lazer and Kennedy's theory was that the constant changes to Google's proprietary and secret search algorithm were tainting the data from Flu Trends and making its old modeling algorithms (based on an older version of search) inaccurate. Their suggestion? Add in the CDC data for a hybrid approach.

In a blog post about the change, Google Senior Software Engineer Christian Stefansen didn't really address the reason for the inaccuracies, but he did describe their extent in much softer terms than Lazer. According to Stefansen, bringing in the CDC data is just another tweak in an iterative learning process for Google Flu Trends.

"The original model performed surprisingly well despite its simplicity," he wrote. "It was retrained just once per year, and typically used only the 50 to 300 queries that produced the best estimates for prior seasons. We then left it to perform through the new season and evaluated it at the end. It didn’t use the official CDC data for estimation during the season—only in the initial training."

"In the 2012/2013 season, we significantly overpredicted compared to the CDC’s reported US flu levels. We investigated and in the 2013/2014 season launched a retrained model (still using the original method). It performed within the historic range, but we wondered: could we do even better? Could we improve the accuracy significantly with a more robust model that learns continuously from official flu data?"

While the text of Stefansen's blog post doesn't mention Lazer and Kennedy's work by name, it does link to their Science paper in a citation.

"So for the 2014/2015 season, we’re launching a new Flu Trends model in the US that — like many of the best performing methods in the literature — takes official CDC flu data into account as the flu season progresses," he wrote, citing Lazer and Kennedy's paper next to "best performing methods". "We’ll publish the details in a technical paper soon."
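Google has not yet published the details of the new model, so the following is only a toy sketch of the general idea, not Google's method. It assumes a single weekly search signal, a plain least-squares line as the model, and CDC reports that arrive two weeks late; the function names and the `lag` parameter are hypothetical. It contrasts a model trained once before the season with one that refits every week on whatever CDC data has arrived so far.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # No variation in the search signal: fall back to the mean.
        return mean_y, 0.0
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = cov_xy / var_x
    return mean_y - b * mean_x, b


def static_estimates(search, cdc, train_weeks):
    """Train once on the first `train_weeks` weeks, then never look at CDC data again."""
    a, b = fit_line(search[:train_weeks], cdc[:train_weeks])
    return [a + b * s for s in search[train_weeks:]]


def continuous_estimates(search, cdc, train_weeks, lag=2):
    """Refit every week on all CDC reports available so far.

    Assumption: the CDC figure for a given week becomes available `lag` weeks later.
    """
    estimates = []
    for t in range(train_weeks, len(search)):
        # Weeks 0 .. known-1 have an official CDC figure at time t.
        known = max(train_weeks, t - lag + 1)
        a, b = fit_line(search[:known], cdc[:known])
        estimates.append(a + b * search[t])
    return estimates


def mean_abs_error(estimates, actual):
    """Average absolute gap between estimates and the official figures."""
    return sum(abs(e - y) for e, y in zip(estimates, actual)) / len(estimates)
```

In a setup like this, if search behavior shifts mid-season (for example, because the search product itself changes), the static model keeps applying the old relationship, while the continuously refit model drifts back toward the official numbers as they arrive.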
