An early analysis of the COVID-19 pandemic


This dataset was collected from public agencies and news media and contains detailed information about some 1400 COVID-19 cases confirmed in and outside China. It is free to use and share under the CC-BY-4.0 license, provided that appropriate credit is given. It can be loaded in R as a package:


More details about the dataset can be found in


and in this arXiv preprint.

Statistical inference: the BETS model

We have developed a generative model for four key epidemiological events: Beginning of exposure, End of exposure, time of Transmission, and time of Symptom onset (BETS). This package implements likelihood inference for the BETS model. Try:


Details of the model and methodology can be found in this preprint on arXiv. In short, we find that several published early analyses were severely biased by sample selection. All our analyses, regardless of the subsample and model used, point to an epidemic doubling time of 2 to 2.5 days during the early outbreak in Wuhan.
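To put the doubling time in perspective, it can be converted to a daily growth rate. The relation below is a back-of-the-envelope sketch that assumes purely exponential growth over the period; the symbols N(t) (number of infections at time t), r (growth rate per day) and T_d (doubling time in days) are notation introduced here for illustration and are not taken from the package.

$$
N(t) = N(0)\, e^{rt}, \qquad N(T_d) = 2\,N(0) \;\Longrightarrow\; r = \frac{\ln 2}{T_d}
$$

Under this assumption, a doubling time of 2 days corresponds to r ≈ 0.35 per day (infections grow by about 41% each day), and a doubling time of 2.5 days corresponds to r ≈ 0.28 per day (about 32% each day).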

A Bayesian nonparametric analysis further suggests that 5% of the symptomatic cases may not develop symptoms within 14 days of infection. Code for the Bayesian model and MCMC sampler can be found in the bayesian folder.



Many people have contributed to the data collection and given helpful suggestions. We thank Rajen Shah, Yachong Yang, Cindy Chen, Yang Chen, Dylan Small, Michael Levy, Hera He, Zilu Zhou, Yunjin Choi, James Robins, Marc Lipsitch, and Andrew Rosenfeld.

Earlier work

This project started from a preliminary analysis of some international COVID-19 cases exported from Wuhan. The report of that first analysis can be found on medRxiv, and its code is in the report1 branch.