Validating symptom responses from the COVID-19 Survey with COVID-19 outcomes

Working with Carnegie Mellon University (CMU) and the University of Maryland (UMD), Facebook helped enable a comprehensive and privacy-driven daily survey to monitor the spread and impact of the COVID-19 pandemic in the U.S. and around the world. The COVID-19 survey is an ongoing effort, taken by approximately 45,000 people in the U.S. every day. Respondents provide information about COVID-related symptoms, vaccine acceptance, contacts and behaviors, risk factors, and demographics so researchers can examine regional trends around the world. To date, the survey has collected more than 50 million responses worldwide.

In addition to visualizing this data on the Facebook Data for Good website, researchers can find publicly available aggregated data via the COVIDcast API and UMD API, as well as downloadable CSVs (US, World). The analyses shown here are all based on publicly available data from CMU and other public data sources (e.g., the US Census Bureau and the Institute for Health Metrics and Evaluation). Microdata is also available upon request to academic and nonprofit researchers under data license agreements.

Now that the survey has been running for several months, the aggregated, publicly available datasets can be analyzed to determine key characteristics of the COVID-related symptom signals obtained through the survey. Here, we first investigate whether survey responses provide leading indicators of COVID-19 outbreaks. We find that symptom-related survey signals can lead COVID-19 deaths, and even cases, by many days, although the strength of the correlation may depend on population size and the height of the peak of the pandemic.

Following this observation, we analyze the conditions under which these leading indicators can be detected. We find that small-sample statistics and the presence of a small but significant “confuser” signal can contribute to an offset in the signals that obscures actual changes in the COVID-19 survey signals.

Survey responses can provide leading indicators of COVID-19 outbreaks

For the following analyses we used publicly available aggregated data from the downloadable CMU CSVs, which have been smoothed and weighted. We focus on illness indicators (symptoms) that respondents reported either about themselves or about people they know in their community between May 1, 2020 and January 4, 2021. To determine whether the symptom signals from the survey act as leading indicators of new COVID-19 cases or deaths, we aggregate data at the US state level, shift the symptom signals in time by a given lag, and compute the correlation with COVID-19 outcomes (e.g., new daily cases or new daily deaths).
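The lag-and-correlate step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function names `pearson_r` and `lagged_pearson` are hypothetical, the two series are synthetic, and the sketch assumes two equal-length daily series with no missing days.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def lagged_pearson(signal, outcome, lag):
    """Correlate signal[t] with outcome[t + lag].

    A positive lag compares the signal with later outcomes, i.e. it
    tests the signal as a leading indicator of the outcome.
    """
    n = len(signal)
    if lag >= 0:
        s, o = signal[:n - lag], outcome[lag:]
    else:
        s, o = signal[-lag:], outcome[:n + lag]
    return pearson_r(s, o)

# Synthetic example (not survey data): the outcome is the same bump as
# the signal, scaled by 3 and arriving 12 days later.
signal = [math.exp(-((t - 60) / 15) ** 2) for t in range(150)]
outcome = [3.0 * math.exp(-((t - 72) / 15) ** 2) for t in range(150)]

r_unshifted = lagged_pearson(signal, outcome, 0)
r_shifted = lagged_pearson(signal, outcome, 12)
print(f"r at lag 0: {r_unshifted:.3f}, r at lag 12: {r_shifted:.3f}")
```

Because the synthetic outcome is an exact delayed copy of the signal, the correlation is highest when the lag matches the 12-day delay. Real survey and outcome series are noisy, so the peak is broader and lower.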

In the figure below, we show how Community CLI (COVID-like illness in the local community) from the survey is a leading signal of new daily deaths in Texas, as tabulated by the Institute for Health Metrics and Evaluation (IHME). In the top row, we compare the estimated percentage of respondents who know people in their community with CLI symptoms (fever along with cough, shortness of breath, or difficulty breathing) with new daily deaths over time when the symptom signal is lagged by 0, 12, or 24 days. In the bottom row, we plot the time-lagged Community CLI against new daily deaths and compute the Pearson correlation coefficient (Pearson’s r: 0.57, 0.86, and 0.98, respectively).

We can use this approach with the various illness indicators captured in the survey and with multiple lag times to determine how “leading” each signal is relative to COVID-19 outcomes, as shown in the following figure. In the top row, we show time series plots of symptom signals in the survey (% CLI, % Community CLI, % CLI + Anosmia, and % Anosmia), new daily cases, and new daily deaths in Texas and Arizona from May 2020 through December 2020. In the bottom row, we plot the Pearson correlation coefficient between each symptom signal and new daily deaths when the symptom signal is lagged by between -10 and 40 days.

For any US state, we can approximate how leading a symptom signal is by determining the optimal time lag, that is, the time lag that gives the highest Pearson’s r for that symptom. However, this method will not find the optimal time lag if a region 1) has poorly measured outcomes (e.g., inadequate testing), 2) is less populated and has too few samples (see below), or 3) has data for only one side of an outcome peak (e.g., cases that are constantly falling or constantly rising), because the optimal lag is then not well defined. In the figure below, we show the optimal time lag (days, mean ± 95% CI) for four symptom signals (CLI, Community CLI, CLI + Anosmia, and Anosmia) in 39 U.S. states with large populations that have seen major COVID-19 outbreaks.
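As a rough sketch of the optimal-lag search, the snippet below scans lags from -10 to 40 days, keeps the lag with the highest Pearson’s r, and then summarizes the per-state lags as a mean with a 95% confidence interval. The helper names (`best_lag`, `mean_ci95`, `bump`) and the three synthetic “states” are hypothetical. The interval uses a normal approximation (1.96 standard errors), which is an assumption: the text does not say how the reported intervals were computed.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def best_lag(signal, outcome, lags=range(-10, 41)):
    """Return (lag, r) for the lag that maximizes Pearson's r between
    signal[t] and outcome[t + lag]."""
    n = len(signal)
    best = None
    for lag in lags:
        if lag >= 0:
            s, o = signal[:n - lag], outcome[lag:]
        else:
            s, o = signal[-lag:], outcome[:n + lag]
        r = pearson_r(s, o)
        if best is None or r > best[1]:
            best = (lag, r)
    return best

def mean_ci95(values):
    """Mean and half-width of a normal-approximation 95% CI (assumed method)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, 1.96 * sd / math.sqrt(n)

def bump(center, n=200, width=15):
    """Synthetic single-peak series standing in for a signal or an outcome."""
    return [math.exp(-((t - center) / width) ** 2) for t in range(n)]

# Three hypothetical states whose outcome trails the signal by 10, 20, 30 days.
optimal_lags = [best_lag(bump(80), bump(80 + delay))[0] for delay in (10, 20, 30)]
mean_lag, half_width = mean_ci95(optimal_lags)
print(optimal_lags, f"{mean_lag:.1f} ± {half_width:.1f} days")
```

The synthetic series contain a full peak, so the maximum is unambiguous. With a series that only rises or only falls, many lags give nearly the same r, which is the third failure case listed above.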

While all four symptom signals lead new deaths by many days, the CLI symptom signal appears to lead new deaths by more time than new cases do (left, CLI: 21.3 ± 3.0 days, new cases: 17.7 ± 2.3 days). This is confirmed when the same analysis is performed for all four symptom signals using new daily cases as the outcome (right, CLI leads new cases by 8.2 ± 4.0 days).

Detectability of COVID-19-related signals

In regions with relatively large outbreaks and reliable COVID prevalence data, the strength of symptom-outcome correlations depends on the height of the peak of the pandemic. The graph below shows that states with larger populations or with a high pandemic peak (maximum number of COVID-19 cases per million people) have a better correlation between CLI and new cases than smaller states or states that avoided a large outbreak.

We observed two main influences that decreased the quality of the survey signal for these poorly correlated states: 1) statistical noise due to a limited number of survey responses and 2) the presence of a confuser signal in the data.

The statistical noise in the COVID-19 survey arises because surveys are conducted on small samples of a population (read more about our sampling and weighting method here). That is, if you ask a random person about their health on a random day, they are unlikely to be experiencing COVID-19 symptoms at that moment. If not enough people are surveyed, the survey may not reach even a single person with COVID-19 symptoms. This means that survey signals for rare symptoms such as CLI have a higher relative variance than Community CLI, because the likelihood that a respondent knows someone with COVID-like symptoms is usually higher than the likelihood that the respondent has symptoms themselves.
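The effect of sample size on a rare signal can be made concrete with a little arithmetic. The sketch below assumes simple random sampling with independent yes/no responses (a binomial model), which ignores the survey’s weighting and smoothing. The prevalence values and the sample size are made-up illustrations, not survey estimates.

```python
import math

def relative_standard_error(p, n):
    """Relative standard error of a sample proportion with true rate p
    and n independent respondents: sqrt(p(1-p)/n) / p."""
    return math.sqrt((1 - p) / (n * p))

def prob_no_positive(p, n):
    """Probability that none of n independent respondents reports the symptom."""
    return (1 - p) ** n

n = 500  # hypothetical number of daily responses in a small state
for label, p in [("CLI", 0.005), ("Community CLI", 0.10)]:
    rse = relative_standard_error(p, n)
    p_zero = prob_no_positive(p, n)
    print(f"{label}: relative SE = {rse:.2f}, P(no positive response) = {p_zero:.3g}")
```

For a fixed number of responses, the rarer signal has the larger relative error, which is why a self-reported symptom signal is noisier than a community-level one in small samples.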

Regarding the second point, our analysis found that even without a COVID-19 outbreak, there is a persistent baseline for symptom signals such as CLI and Community CLI. Regardless of the source of this confuser signal (one explanation is survey participants who happen to have COVID-like symptoms but not COVID-19), it can mask actual outbreaks even in situations with large numbers of survey responses and low statistical noise.

Take Washington State from April 2020 to December 2020 as an example. In the left panel, % CLI (green) shows high relative variance and never falls below the confuser baseline of ~0.25 percent, which hides the summer 2020 COVID-19 outbreak and makes the autumn outbreak barely visible. By contrast, % Community CLI (orange) shows lower relative variance, and both the summer and autumn outbreaks are clearly visible. The right panel confirms this and shows that the Community CLI survey signal correlates very well with new deaths at a lag of about 7 days, while CLI does not.


In our preliminary examination of the illness indicators available in the public COVID-19 survey datasets, we find that symptom signals such as COVID-like illness (CLI) and Community CLI correlate with COVID-19 outcomes and sometimes lead new COVID-19 cases and deaths by weeks. Additionally, we observe two main components of these signals that are not related to COVID-19 and that may ultimately obscure the actual impact of an outbreak in the data. More work is needed to quantify this more precisely and to extend this analysis globally.

Because the COVID-19 surveys are conducted daily and worldwide and are not subject to the reporting delays associated with COVID test results, survey responses can represent the current state of the pandemic better than official case numbers. In line with previous work showing that COVID-19 survey signals can improve short-term forecasts, our analysis here shows the survey’s potential to aid COVID-19 hotspot detection algorithms or improve pandemic forecasts.

Facebook and our partners encourage researchers, public health officials, and the general public to use the COVID-19 survey data (available via the COVIDcast API and UMD API) and other datasets (such as Facebook Data for Good’s Population Density Maps and Disease Prevention Maps) for new analyses and insights. Microdata from the surveys are also available upon request to academic and nonprofit researchers under data license agreements.
