Georgia Senate runoff: Using predictive analytics and big data properly
Even though election modeling as it is currently practiced is broken, we remain firm believers in the overall power of big data and predictive analytics, when done properly.
By DOV GREENBAUM, MARK GERSTEIN
Given all the noise resulting from the last election, it is surprising to hear so little polling news regarding the current pair of Senate runoffs in Georgia. The silence is especially shocking given that the Senate majority hangs in the balance. We believe this silence reflects a larger reality – that people have given up on the polls and that, for many, the loser in the election was neither Trump nor Biden but data science – in particular, the field of social data science.

Day in and day out, for months leading up to the election, Biden and the Democratic Party held on to a strong and ultimately expanding lead in the polls; the prediction was not just that he and the Democrats would win, but that they would win handily. We all know now that reality begged to differ: Even when all the votes were ultimately counted, Biden eked out only razor-thin margins in some states, and the Democrats lost seats.

Not only was much of the polling and the associated reporting inaccurate, but it was carried out with full awareness of the 2016 polling train wreck: Trump vs. Clinton. Notably, there seems to be less public handwringing than in 2016. For the general public, at least, the consequences of being off by a little seem substantially less problematic than those of being off by a lot.

We disagree. Accurately taking the nation’s temperature from a small sample is important to the smooth working of democracy. Surprises are disruptive politically, socially and economically. Practically, they often lead to misallocated donations and misdirected effort courting specific pockets of voters.

Despite all the wonders that we’ve heard about what we can do by mining big data sets – predicting, for instance, someone’s political affiliation from their shopping habits – the polls in what was billed as the most important election in a generation were not only wrong, they were, arguably, stupendously so. Billions were spent on polling and modeling, much more than on many other data science challenges, such as curing cancer.
And yet, they were still clearly inadequate and inaccurate.

In the immediate aftermath of the election, we saw some knee-jerk Wednesday-morning quarterbacking serving up the many possible mistakes that the pollsters may have made. Whether the polls were wrong because of known unknowns, such as the certainty that people lie to pollsters or that polling misses demographic pockets (e.g., shy or rural voters), or because of any of the other excuses offered, each such problem ought to have been, and likely was, factored into the final model.

Moreover, in the days leading up to the election, we were repeatedly promised that pollsters, still licking their four-year-old wounds, were properly accounting for these and other concerns. Yet clearly, they still missed something. Mistakes were made in 2016, and they were made even more egregiously now, given our 20/20 hindsight of that debacle.

But even though election modeling as it is currently practiced is broken, we remain firm believers in the overall power of big data and predictive analytics, when done properly. Modern society relies on it too heavily for its smooth operation for it to be so wrong.

All analytics have inherent biases. There are even organizations, such as the Israeli company CybeRighTech, that aim to identify and remove those biases in the growing number of predictive algorithms that we now encounter daily. There are at least two main sources of bias in polling models: the pollsters and the respondents, with their respective and seemingly intractable implicit biases. As such, the best way to increase the accuracy of polling is to reduce our reliance on both.

To this end, the polling industry ought to look to open science and markets.
In the natural sciences, an error is found and removed by reproducing experiments, with both the methods and the data kept transparent and broadly available so that others can test and retest the results. As scientists, we argue that pollster biases can be reduced by rapidly releasing the underlying data and the models that process them so that others can tinker and attempt to reproduce the results – similar to the open review of submitted clinical trial data before drug approval.

Another useful methodology would result from the increased use of market mechanisms in mainstream election forecasting. To varying degrees, this already exists abroad. The odds from offshore election betting (e.g., on the UK’s Betfair Exchange), where gamblers put profits over political expediency, appear to have been considerably more accurate than the local polls. A truly open and liquid market that allows stakeholders to monetize their predictions of election outcomes, with the necessary caveats to prevent abuse, would perhaps allow for more accurate and stable forecasts for presidential elections and important referendums like Brexit.

Dov Greenbaum is a professor of law at IDC Herzliya and director of the Zvi Meitar Institute for Legal Implications of Emerging Technologies, a research affiliate in the Department of Molecular Biophysics and Biochemistry at Yale University, and a research fellow at Singapore Management University Center for AI and Data Governance.

Mark Gerstein is the Albert L. Williams Professor of Biomedical Informatics and professor of molecular biophysics and biochemistry, of computer science, and of statistics & data science at Yale University. He is also co-director of the Yale Center for Biomedical Data Science.