Big search data offers the opportunity to identify new and potentially real-time measures and predictors of important political, geographic, social, cultural, economic, and epidemiological phenomena, measures that might serve an important role as leading indicators in forecasts and nowcasts. However, it also presents vast new risks that scientists or the public will identify meaningless and totally spurious ‘relationships’ between variables. This study is the first to quantify that risk in the context of search data. We find that spurious correlations arise at exceptionally high frequencies among the random variables examined, which were based upon gamma(1, 1) and Gaussian random walk distributions. Quantifying these spurious correlations and their likely magnitude for various distributions has value for several reasons. First, analysts can make progress toward accurate inference. Second, they can avoid unwarranted credulity. Third, they can demand appropriate disclosure from study authors.
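The kind of risk described above can be illustrated with a small simulation. The following is a minimal sketch and not the authors' procedure: it draws pairs of independent series and reports how often their Pearson correlation exceeds a cutoff in absolute value. The series length, number of pairs, cutoff, and all function names are illustrative assumptions, and the gamma(1, 1) series are assumed here to be independent draws with shape 1 and scale 1.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gaussian_walk(rng, length):
    """Random walk: cumulative sum of independent standard normal steps."""
    pos, path = 0.0, []
    for _ in range(length):
        pos += rng.gauss(0.0, 1.0)
        path.append(pos)
    return path


def gamma_series(rng, length):
    """Independent draws from a gamma distribution (shape=1, scale=1)."""
    return [rng.gammavariate(1.0, 1.0) for _ in range(length)]


def share_large(make_series, rng, length=100, pairs=2000, threshold=0.5):
    """Fraction of independent pairs whose |r| exceeds the threshold.

    The default length, pair count, and threshold are arbitrary
    illustrative choices, not values taken from the article.
    """
    hits = 0
    for _ in range(pairs):
        r = pearson(make_series(rng, length), make_series(rng, length))
        if abs(r) > threshold:
            hits += 1
    return hits / pairs


rng = random.Random(0)  # fixed seed so repeated runs give the same shares
walk_share = share_large(gaussian_walk, rng)
gamma_share = share_large(gamma_series, rng)
print(f"random walks: {walk_share:.3f}  gamma(1, 1) draws: {gamma_share:.3f}")
```

Every pair in this sketch is independent by construction, so any large correlation it reports is spurious. Pairwise shares like these understate the risk in a search setting, where one target series is compared against many candidates and only the strongest match is kept.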
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International (CC BY 4.0) License.
Article states: Data and other replication materials are available at “Replication Data for: Measuring and Answering the Challenge of Spurious Correlations in Big Search Data”, https://doi.org/10.7910/DVN/UW1UYR (accessed on 26 December 2022), Harvard Dataverse.
Original Publication Citation
Richman, J. T., & Roberts, R. J. (2023). Assessing spurious correlations in big search data. Forecasting, 5(1), 285-296. https://doi.org/10.3390/forecast5010015
Richman, Jesse T. and Roberts, Ryan J., "Assessing Spurious Correlations in Big Search Data" (2023). Political Science & Geography Faculty Publications. 55.