{"id":284675,"date":"2014-09-29T06:00:24","date_gmt":"2014-09-29T13:00:24","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=284675"},"modified":"2016-08-31T08:40:43","modified_gmt":"2016-08-31T15:40:43","slug":"data-driven-crystal-ball","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/data-driven-crystal-ball\/","title":{"rendered":"A Data-Driven Crystal Ball"},"content":{"rendered":"

\u201cScottish independence: polls show it\u2019s too close to call.\u201d<\/p>\n

\u201cScotland\u2019s vote likely to be a nail-biter.\u201d<\/p>\n

\u201cScottish independence vote on a knife edge as polls put both Yes AND No ahead.\u201d<\/p>\n

If there was any consensus in the days running up to the momentous Sept. 18 vote in Scotland, it was that no one could predict the outcome. Headlines from Edinburgh, London, and across the globe were in complete agreement: It was impossible to say with any confidence what would happen.<\/p>\n

And then there was David Rothschild, a Microsoft researcher and leading expert in a new kind of data-driven predictive methodology. Three days ahead of the vote in Scotland, he put the chances of a No outcome at 77.4 percent. Two days later, he inched it up to 79.5 percent. On the morning of the vote, before any returns were announced, he went on record on his blog with an 84 percent chance of defeat for Scottish independence.<\/p>\n

Miro Dud\u00edk (at whiteboard) confers with Microsoft Prediction Lab colleagues David Rothschild (left) and David Pennock.<\/p><\/div>\n

This isn\u2019t a mere parlor game for Rothschild, who, along with colleagues at Microsoft and elsewhere, correctly predicted the winners of all 15 World Cup knockout games earlier this year and got the Obama vs. Romney outcome right in 50 of 51 jurisdictions (the states plus the District of Columbia) in the 2012 U.S. presidential election. It seems no contest is beyond the purview of Rothschild\u2019s predictive powers, whether it\u2019s congressional races, the Super Bowl, the Oscars, or the Eurovision Song Contest.<\/p>\n

In an era in which traditional political polling is taking a huge reputational hit\u2014just ask Eric Cantor, former majority leader of the U.S. House of Representatives, who lost his Republican primary election in Virginia by 11 percentage points despite his own pollster putting him 34 points ahead\u2014Rothschild\u2019s success rate is gaining notice.<\/p>\n

That momentum culminates this week with the launch of a new, interactive platform, Microsoft Prediction Lab, which serves as a website-based showcase and as a laboratory for his ongoing work.<\/p>\n

\u201cThe polls track the sentiment of the people who are answering the poll at the time,\u201d Rothschild said as he awaited the results in Scotland. \u201cMy forecast predicts what will happen on Election Day. Clearly, the sentiment of the people at the time of the polls is a critical component on any forecast of Election Day, but not the only one.\u201d<\/p>\n

\u201cIt may actually be reasonably convincing,\u201d he said of the victory for the No side. And convincing it was: 55 to 45 percent.<\/p>\n

The Problem with Representational Polling<\/h2>\n

Consider conventional political polling, which has a solid track record but is expensive and time-consuming. In recent decades, polling companies have relied on random-digit landline phone calls to determine voter sentiment. The accuracy of such results depends significantly on reaching a representative sample of people who actually will go to the polls. In the era of mobile phones and caller ID, the obstacles are mounting.<\/p>\n

Among the insights that Rothschild has documented and that he puts to considerable use in his methodology is that polls of voters\u2019 expectations\u2014who they think will win\u2014are a more accurate basis for forecasting than polls asking people how they intend to vote.<\/p>\n

\u201c[T]his is because we are polling from a broader information set, and voters respond as if they had polled 20 of their friends,\u201d he wrote in a 2013 paper co-written with Justin Wolfers of the University of Michigan. Not surprisingly, then, Rothschild regularly includes data from betting markets in generating his predictions, including his forecast of the Scottish independence vote.<\/p>\n

Another major contribution from Rothschild, who has a doctorate in applied economics from the Wharton School of Business at the University of Pennsylvania, is that by applying the appropriate statistical adjustments, highly unrepresentative samples can be used to generate remarkably accurate forecasts.<\/p>\n

He and several colleagues demonstrated this in a novel experiment that polled Xbox users before the 2012 U.S. presidential election. They conducted an opt-in poll in the 45 days before the election and enabled people to participate once a day. In addition to asking, \u201cIf the election were held today, who would you vote for?\u201d they collected basic demographic information: sex, race, age, education, state of residence, party identification, political leanings, and how the respondent voted in the 2008 presidential election.<\/p>\n

As you might expect, the vast majority of Xbox users\u2014and thus survey respondents\u2014were male and relatively young. They would make a terrible sample for standard polling. But they served the researchers\u2019 purposes.<\/p>\n

\u201cStandard polling looks at a respondent as, for example, a male from New York,\u201d Rothschild says. \u201cThe way we look at it is: a male and a person from New York. I hope to find other potential polltakers who are male and other potential polltakers who are from New York. And from that, by breaking people into their demographics, we\u2019re able to allow all users to inform the likely polling of all other users.\u201d<\/p>\n

So even though they were short on women older than 65, for example, they had a number of female respondents and some respondents older than 65, along with others who shared certain other characteristics with older women. In the end, the data from more than 750,000 Xbox surveys taken by almost 350,000 unique respondents yielded 176,000 different demographic \u201ccells,\u201d each with a distinct combination of characteristics.<\/p>\n

From there, the researchers \u201cpost-stratified\u201d the Xbox responses to mimic a representative sample of likely voters, calculating cell weights by cross-tabulating with exit polls from the 2008 presidential election. As Election Day approached, they used the accumulated data to update their forecasts daily for each state.<\/p>\n
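<p>The post-stratification step described above can be illustrated with a minimal sketch. It is not the researchers\u2019 actual code: the function name, the two demographic cells, the responses, and the population shares are all hypothetical, and it uses raw per-cell averages, whereas sparse or empty cells would in practice need a statistical model to borrow information from similar respondents.<\/p>\n

```python
from collections import defaultdict


def poststratify(responses, population_share):
    # responses: iterable of (cell, vote) pairs, where vote is 1 or 0.
    # population_share: dict mapping cell -> share of the target electorate.
    totals = defaultdict(lambda: [0, 0])  # cell -> [sum of votes, respondents]
    for cell, vote in responses:
        totals[cell][0] += vote
        totals[cell][1] += 1

    weighted_sum = 0.0
    covered_share = 0.0
    for cell, share in population_share.items():
        if cell not in totals:
            continue  # empty cell: skipped in this sketch (an assumption)
        votes, count = totals[cell]
        weighted_sum += share * (votes / count)
        covered_share += share
    if covered_share == 0:
        raise ValueError('sample and population share no cells')
    # Renormalize over the cells that actually had respondents.
    return weighted_sum / covered_share


# Hypothetical sample that over-represents young men.
young_men = ('male', '18-29')
older_women = ('female', '65+')
responses = (
    [(young_men, 1)] * 6 + [(young_men, 0)] * 2
    + [(older_women, 1)] * 1 + [(older_women, 0)] * 1
)
# Hypothetical electorate shares, e.g. taken from an exit poll.
population_share = {young_men: 0.25, older_women: 0.75}

raw = sum(vote for _, vote in responses) / len(responses)
adjusted = poststratify(responses, population_share)
print(round(raw, 4), round(adjusted, 4))
```

<p>In this toy example the unweighted sample average overstates support because young men dominate the sample; reweighting each cell by its assumed share of the electorate pulls the estimate toward the older women\u2019s responses.<\/p>\n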

\u201cNot only did we match the accuracy of major polling companies,\u201d Rothschild says, \u201cbut we also provided a lot of insight that they weren\u2019t able to get, through the fact that we had people coming back again and again.\u201d<\/p>\n

Each predictive exercise that Rothschild runs draws from a different pool of data, which is often a combination of polling data, historical results, Internet betting data, routinely collected statistics, and user-generated data. For Major League Baseball playoffs, for example, massive amounts of data are available from the regular season. World Cup soccer doesn\u2019t have that kind of buildup, so it makes sense to engage the crowd to collect new data to augment historical data about the players and teams and the results of the qualifying rounds.<\/p>\n

\u201cThere\u2019s always something missing\u2014always data we wish we had that didn\u2019t quite exist,\u201d Rothschild says. \u201cSo we\u2019ve done a lot of fun experiments.\u201d These include Oscars prediction games and NFL prediction games that were designed to attract people with a high level of expertise in those areas.<\/p>\n

\u201cThe way I\u2019ve always looked at it,\u201d Rothschild says, \u201cis that any individual\u2014you, me, the guy on the street\u2014has a certain amount of information about the things the person cares about, but no one has been unlocking it.\u201d<\/p>\n

The conventional pollsters \u201cdon\u2019t think about somebody who is self-selected,\u201d he explains. \u201cThey go to random people. They also use very simple aggregation methods, rather than modeling the results they have. That\u2019s what computers are for. That\u2019s what our new knowledge is for.\u201d<\/p>\n

Rothschild and his colleagues apply deep expertise in machine learning to test and calibrate their models against historical data, and they use advanced algorithms to account for a host of variables, such as the advantages of incumbency and the tendency of bettors to overstate long-shot wins.<\/p>\n
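<p>To make the long-shot adjustment concrete, here is a hedged sketch of one common way to turn betting odds into probabilities. The article does not say which correction Rothschild\u2019s team uses; the power-style adjustment below is a swapped-in illustration, and the function names and odds are hypothetical.<\/p>\n

```python
def proportional_probabilities(decimal_odds):
    # Invert each price, then rescale so the probabilities sum to 1.
    raw = {name: 1.0 / odds for name, odds in decimal_odds.items()}
    total = sum(raw.values())  # above 1 when the bookmaker takes a margin
    return {name: p / total for name, p in raw.items()}


def power_probabilities(decimal_odds, tol=1e-12):
    # Raise each inverted price to a common exponent k >= 1 chosen so the
    # results sum to 1. Small probabilities shrink proportionally more than
    # large ones, which discounts long shots (an illustrative choice).
    if any(odds <= 1.0 for odds in decimal_odds.values()):
        raise ValueError('decimal odds must be greater than 1')
    raw = {name: 1.0 / odds for name, odds in decimal_odds.items()}
    if sum(raw.values()) <= 1.0:
        raise ValueError('expected inverted odds to sum above 1')

    def total(k):
        return sum(p ** k for p in raw.values())

    low, high = 1.0, 2.0
    while total(high) > 1.0:  # widen the bracket until the sum drops below 1
        high *= 2.0
    while high - low > tol:  # bisection on the exponent
        mid = (low + high) / 2.0
        if total(mid) > 1.0:
            low = mid
        else:
            high = mid
    return {name: p ** high for name, p in raw.items()}


# Hypothetical two-outcome market, not real odds from any event.
odds = {'favorite': 1.2, 'longshot': 4.0}
proportional = proportional_probabilities(odds)
power = power_probabilities(odds)
print(proportional)
print(power)
```

<p>With these made-up odds, simple rescaling leaves the long shot at roughly 23 percent, while the power-style adjustment gives it a smaller share, reflecting the assumption that bettors overprice unlikely outcomes.<\/p>\n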

Reinventing Survey Research<\/h2>\n

The interactive platform that Rothschild and other researchers launched today houses all of the ongoing predictive work that Rothschild has been featuring on his blog and in academic journals and presentations. The Microsoft Prediction Lab displays his data-driven predictions\u2014some of them updated in real time\u2014in a wide range of fields, from sports and entertainment to politics and economics.<\/p>\n

\u201cWe\u2019re building an infrastructure,\u201d he says, \u201cthat\u2019s incredibly scalable, so we can be answering questions along a massive continuum.\u201d<\/p>\n

Rothschild sees the new platform as \u201ca great laboratory for researchers\u201d as well as \u201ca very socialized experience\u201d for interested users. Among other contests, he plans to predict the results of every upcoming U.S. House, Senate, and gubernatorial race. Users will be able to customize views on the site based on their geographic location and their interests. The idea is to collect data quickly and update it as often as possible.<\/p>\n

A sample of the Microsoft Prediction Lab interface users will see for every U.S. House, Senate, and gubernatorial race in 2014.<\/p><\/div>\n

\u201cIt\u2019s also important to be agnostic and not be wed to one type of data,\u201d Rothschild says. He looks at any data that can contribute to the predictive model, whether it\u2019s stock-market data, Internet page views, or trending topics and word co-occurrence on social media. Collecting \u201ccrowd wisdom\u201d will be a big component of the endeavor.<\/p>\n

\u201cBy really reinventing survey research, we feel that we can open it up to a whole new realm of questions that, previously, people used to say you can only use a model for,\u201d Rothschild says. \u201cFrom whom you survey to the questions you ask to the aggregation method that you utilize to the incentive structure, we see places to innovate. We\u2019re trying to be extremely disruptive.\u201d<\/p>\n

That disruption has ramifications for the polling industry\u2014and beyond.<\/p>\n

\u201cThere are two reasons to experiment with nonprobability polling,\u201d he says. \u201cFirst, I firmly believe the standard polling will reach a point where the response rate and the coverage is so low that something bad will happen. Then, the standard polling technology will be completely destroyed, so it is prudent to invest in alternative methods.<\/p>\n

\u201cSecond, even if nothing ever happened to standard polling, nonprobability polling data will unlock market intelligence for us that no standard polling could ever provide. Ultimately, we will be able to gather data so quickly that the idea of a decision-maker waiting a few weeks for a poll will seem crazy.\u201d<\/p>\n

The ready availability of such data will enable businesses to make strategic investment decisions, such as where to locate a data center or how to invest marketing resources to attain the optimal yield.<\/p>\n

\u201cWe will be able,\u201d Rothschild says, \u201cto gather so much detail from repeated users\u2014and the quantity of users we can reach\u2014that decision-makers will come to cherish the nearly infinite number of data points that can be efficiently generated to answer the exact questions the question-maker has, not the expedient question or the historical norm.\u201d<\/p>\n

One caveat, though: The market intelligence derived from the nonprobability polling data must prove accurate.<\/p>\n

\u201cThat is what this research is all about,\u201d he adds, \u201creaching that point where the quick, relevant, and cost-effective market intelligence is as accurate as what it supplants. At that point, the demise of standard polling becomes irrelevant, because it will become strictly dominated by nonprobability data collection and analytical techniques.\u201d<\/p>\n

The new Microsoft Prediction Lab website draws on the expertise of researchers in Microsoft\u2019s New York City, Redmond, and India labs. Key contributors include noted computer scientists Miro Dud\u00edk and David Pennock, as well as a research team led by Harry Shum, Microsoft executive vice president of Technology and Research, and the office of Microsoft\u2019s chief economist, Preston McAfee.<\/p>\n

\u201cIt has been,\u201d Rothschild confirms, \u201can incredibly collaborative effort.\u201d<\/p>\n

\u201cMost researchers get the opportunity to explore a much more narrow set of questions and a much more narrow set of data,\u201d he says. \u201cBut through collaboration with an awesome set of researchers, this really allows me to explore things that are so buried. And that\u2019s really the most exciting thing about this. It\u2019s not any individual outcome\u2014it\u2019s the massive amount of questions that we\u2019ll be able to answer in the near future.\u201d<\/p>\n","protected":false},"excerpt":{"rendered":"

\u201cScottish independence: polls show it\u2019s too close to call.\u201d \u201cScotland\u2019s vote likely to be a nail-biter.\u201d \u201cScottish independence vote on a knife edge as polls put both Yes AND No ahead.\u201d If there was any consensus in the days running up to the momentous Sept. 18 vote in Scotland, it was that no one could […]<\/p>\n","protected":false},"author":39507,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"categories":[194475,194455],"tags":[211454,211448,211445,211433,187369,211223,201249,211457,211451,211469,211226,211466,211460,203353,211439,211463,211442],"research-area":[13556,13563],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-284675","post","type-post","status-publish","format-standard","hentry","category-database-data-analytics-platforms","category-machine-learning","tag-accurate-forecasts","tag-applied-economics","tag-broader-information-set","tag-conventional-political-polling","tag-crowd-wisdom","tag-data-driven-predictive-methodology","tag-david-rothschild","tag-demographic-information","tag-highly-unrepresentative-samples","tag-market-intelligence","tag-microsoft-prediction-lab","tag-nonprobability-polling","tag-polltakers","tag-predictwise","tag-random-digit-landline-phone-calls","tag-survey-research","tag-voters-expectations","msr-research-area-artificial-intelligence","msr-research-area-data-platform-analytics","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199562,199565,199571],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"re
lated-groups":[],"related-projects":[],"related-events":[],"related-researchers":[],"msr_type":"Post","byline":"","formattedDate":"September 29, 2014","formattedExcerpt":"\u201cScottish independence: polls show it\u2019s too close to call.\u201d \u201cScotland\u2019s vote likely to be a nail-biter.\u201d \u201cScottish independence vote on a knife edge as polls put both Yes AND No ahead.\u201d If there was any consensus in the days running up to the momentous Sept.…","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/284675"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/39507"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=284675"}],"version-history":[{"count":7,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/284675\/revisions"}],"predecessor-version":[{"id":285746,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/284675\/revisions\/285746"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=284675"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=284675"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=284675"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=284675"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-js
on\/wp\/v2\/msr-region?post=284675"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=284675"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=284675"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=284675"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=284675"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=284675"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=284675"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}