Difference between revisions of "COVID-19 dataset clearinghouse"

From Polymath Wiki
(Other lists)
(Visualizations and summaries)
(18 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Data cleaning proposal ==
+
This is a repository for public data sets relating to the COVID-19 pandemic.  It was also initially envisioned as a clearinghouse for matching requests for data cleaning of such datasets with volunteers willing to perform this cleaning.  However, the [https://discourse.data-against-covid.org/c/i-have-data/15 existing clearinghouse] at [http://united-against-covid.org/ United against COVID-19] is already up and running for this purpose, so we are redirecting such requests to that site in order not to fragment the pools of requests and volunteers.
  
* [https://terrytao.files.wordpress.com/2020/03/covid_19_polymath_project-1.pdf PDF format]
+
For discussion of this project, see [https://terrytao.wordpress.com/2020/03/25/polymath-proposal-clearinghouse-for-crowdsourcing-covid-19-data-and-data-cleaning-requests this blog post].
* [https://www.overleaf.com/project/5e7acd0e03821500012262bb Overleaf format]
+
* [https://terrytao.wordpress.com/2020/03/25/polymath-proposal-clearinghouse-for-crowdsourcing-covid-19-data-and-data-cleaning-requests Blog post discussing the proposal]
+
  
== Instructions for posting a request for a data set to be cleaned ==
+
== Data sets ==
  
Ideally, the submission should consist of a single plain text file that clearly states your request, including what your “cleaned” data set should contain and the format in which it should be saved (e.g. csv, npy, mat, json). The file should also link to a webpage where the raw data to be cleaned can easily be accessed and/or downloaded, with specific instructions for locating the data set on that page.
+
Further contributions are very welcome, and can be made either directly to this wiki page (after requesting an account), or placed in the comments to [https://terrytao.wordpress.com/2020/03/25/polymath-proposal-clearinghouse-for-crowdsourcing-covid-19-data-and-data-cleaning-requests/ this blog post], or by email to tao@math.ucla.edu.
 
+
We do not yet have a platform for these requests, so for now please post them at [https://terrytao.wordpress.com/2020/03/25/polymath-proposal-clearinghouse-for-crowdsourcing-covid-19-data-and-data-cleaning-requests the above blog post] or email tao@math.ucla.edu.
+
 
+
== Data sets ==
+
  
 
=== Epidemiology ===
 
  
* [https://www.kaggle.com/tags/covid19 COVID-19 data sets], Kaggle
+
* [https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset Novel Corona Virus 2019 Dataset - Day level information on covid-19 affected cases], Kaggle
** [https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset Novel Corona Virus 2019 Dataset - Day level information on covid-19 affected cases]
+
 
* [https://ourworldindata.org/coronavirus Coronavirus Disease (COVID-19) – Statistics and Research], Our World in Data, by Max Roser, Hannah Ritchie and Esteban Ortiz-Ospina
 
 
* [https://github.com/CSSEGISandData/COVID-19 Novel Coronavirus (COVID-19) Cases], Johns Hopkins University Center for Systems Science and Engineering
 
Line 27: Line 20:
 
* [https://docs.google.com/spreadsheets/d/1jS24DjSPVWa4iuxuD4OAXrE3QeI8c9BC1hSlqr-NMiU/edit#gid=1187587451 Google sheets from DXY.cn]  
 
 
** Contains some patient information [age,gender,etc]
 
 +
* [https://www.kaggle.com/imdevskp/corona-virus-report COVID-19 Complete Dataset (Updated every 24hrs)], Kaggle
 +
* [https://datarepository.wolframcloud.com/resources/Epidemic-Data-for-Novel-Coronavirus-COVID-19 Epidemic Data for Novel Coronavirus COVID-19], Wolfram
  
 
==== North America ====
 
Line 32: Line 27:
 
* [https://github.com/COVID19Tracking/covid-tracking-data COVID Tracking Data], from the [https://covidtracking.com/ COVID tracking project]
 
 
** A daily updated repository with CSV representations of data from the [https://github.com/COVID19Tracking/covid-tracking-api/blob/master/README.md Covid Tracking API].   
 
 +
* [https://www.kaggle.com/sudalairajkumar/covid19-in-usa COVID-19 in USA], Kaggle
 
* [https://coronavirus.1point3acres.com/en COVID-19 in US and Canada]
 
 
** [https://coronavirus.1point3acres.com/en/data Data request form]
 
Line 39: Line 35:
 
** [https://covidtracking.com/api/ API]
 
 
* [https://github.com/kgjenkins/covid-19-ny Covid-19 coronavirus cases in New York State]
 +
* [https://www.nytimes.com/article/coronavirus-county-data-us.html Coronavirus Case Data for Every U.S. County], New York Times
  
==== Other regional data ====
+
==== Europe ====
 +
 
 +
* [https://leoss.net/ Studying SARS-CoV-2 in European patients], Lean European Open Survey on SARS-CoV-2 Infected Patients (LEOSS)
 +
* [https://github.com/pcm-dpc/COVID-19 COVID-19 Italia - Monitoraggio situazione]
 +
* [https://www.epicentro.iss.it/coronavirus/sars-cov-2-sorveglianza-dati Sorveglianza integrata COVID-19: i principali dati nazionali] (Italy), Epicentro
 +
* [https://www.kaggle.com/sudalairajkumar/covid19-in-italy COVID-19 in Italy], Kaggle
 +
* [https://npgeo-corona-npgeo-de.hub.arcgis.com/search?groupIds=b28109b18022405bb965c602b13e1bbc RKI COVID19] (Germany), NPGEO Corona
 +
* [https://www.bag.admin.ch/bag/en/home/krankheiten/ausbrueche-epidemien-pandemien/aktuelle-ausbrueche-epidemien/novel-cov/situation-schweiz-und-international.html New coronavirus: Current situation – Switzerland and international], Bundesamt für Gesundheit.
 +
** [https://www.bag.admin.ch/dam/bag/de/dokumente/mt/k-und-i/aktuelle-ausbrueche-pandemien/2019-nCoV/covid-19-datengrundlage-lagebericht.xlsx.download.xlsx/200325_Datengrundlage_Grafiken_COVID-19-Bericht.xlsx data set]
 +
* [https://www.mscbs.gob.es/profesionales/saludPublica/ccayes/alertasActual/nCov-China/situacionActual.htm Situacion Actual] (Spain), Ministerio de Sanidad, Consumo y Bienestar
 +
 
 +
==== Asia ====
  
 
* [https://www.covid19india.org/ India COVID-19 tracker]
 
 
** [https://docs.google.com/spreadsheets/d/e/2PACX-1vSc_2y5N0I67wDU38DjDh35IZSIS30rQf7_NYZhtYYGU1jJYT6_kDx4YpF-qw0LSlGsBYP8pqM_a1Pd/pubhtml Patient database]
 
 +
* [https://www.kaggle.com/sudalairajkumar/covid19-in-india Dataset on Novel Corona Virus Disease 2019 in India], Kaggle
 +
* [https://www.kaggle.com/imdevskp/covid19-corona-virus-india-dataset COVID-19 Corona Virus India Dataset], Kaggle
 +
** State/UT/NCR wise COVID-19 data
 
* [https://github.com/jihoo-kim/Data-Science-for-COVID-19-old Data Science for COVID-19 in South Korea]
 
* [https://github.com/pcm-dpc/COVID-19 COVID-19 Italia - Monitoraggio situazione]
+
** [https://www.kaggle.com/kimjihoo/coronavirusdataset The data set on Kaggle]
 +
* [https://www.cdc.go.kr/board/board.es?mid=a30402000000&bid=0030 Press releases], Korea Centers for Disease Control and Prevention
 +
 
 +
==== Other regional data ====
 +
 
 +
* [https://www.kaggle.com/unanimad/corona-virus-brazil Coronavirus (COVID-19) - Brazil Dataset], Kaggle
 +
* [https://www.health.nsw.gov.au/Infectious/diseases/Pages/covid-19-latest.aspx Latest updates on COVID-19], New South Wales
  
 
=== Genomics and homology ===
 
Line 55: Line 72:
 
* [https://www.kaggle.com/paultimothymooney/coronavirus-genome-sequence Coronavirus Genome Sequence], Kaggle
 
 
* [https://www.kaggle.com/paultimothymooney/repository-of-coronavirus-genomes Repository of Coronavirus Genomes], Kaggle
 
 +
* [https://www.kaggle.com/jamzing/sars-coronavirus-accession SARS coronavirus accession], Kaggle
 +
** Exploration of mutations of the SARS corona virus with complete genome
 +
* [https://datarepository.wolframcloud.com/resources/Genetic-Sequences-for-the-SARS-CoV-2-Coronavirus Genetic Sequences for the SARS-CoV-2 Coronavirus], Wolfram
 +
** Nucleotide sequences of the SARS-CoV-2 virus (the virus associated with the COVID-19 disease, formerly known as 2019-nCoV) including location, collection time and similar supporting data.
 
* [https://3dprint.nih.gov/discover/3DPX-012867 Wuhan coronavirus 2019-nCoV protease homology model], National Institutes of Health
 
  
Line 61: Line 82:
 
* [https://www.ncbi.nlm.nih.gov/research/coronavirus/ LitCovid] - a curated literature hub for tracking up-to-date scientific information about the 2019 novel Coronavirus
 
 
* [https://connect.biorxiv.org/relate/content/181 COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv]
 
 +
* [https://pages.semanticscholar.org/coronavirus-research COVID-19 Open Research Dataset (CORD-19)], Allen Institute for AI, Microsoft, NLM, CZI, Georgetown University
 +
** Over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community
 +
** Requested by the White House Office of Science and Technology Policy, and part of the [https://www.whitehouse.gov/briefings-statements/call-action-tech-community-new-machine-readable-covid-19-dataset/ Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset]
 +
 +
=== Medical imagery and records ===
 +
 +
* [https://datarepository.wolframcloud.com/resources/Patient-Medical-Data-for-Novel-Coronavirus-COVID-19 Patient Medical Data for Novel Coronavirus COVID-19], Wolfram
 +
* [https://www.kaggle.com/darshan1504/covid19-detection-xray-dataset COVID-19 Detection X-Ray Dataset], Kaggle
 +
* [https://www.sirm.org/category/senza-categoria/covid-19/ COVID-19: casistica radiologica Italiana], Società Italiana di Radiologia Medica e Interventistica
  
 
=== Other data ===
 
Line 70: Line 100:
 
** Open geospatial work to support health systems' capacity (providers, supplies, ventilators, beds, meds) to effectively care for rapidly growing COVID19 patient needs
 
 
** [https://www.covidcaremap.org/maps/us-healthcare-system-capacity/#6.07/40.085/-75.195 Open map data on US health system capacity to care for COVID-19 patients]
 
* [https://www.kaggle.com/darshan1504/covid19-detection-xray-dataset COVID-19 Detection X-Ray Dataset], Kaggle
+
* [http://www.panacealab.org/covid19/ Covid-19 Twitter chatter dataset for scientific use], Panacea Lab, Georgia State University
  
 
=== Data scrapers and aggregators ===
 
Line 88: Line 118:
 
* [https://www.ft.com/coronavirus-latest Coronavirus tracked: the latest figures as the pandemic spreads], Financial Times
 
 
* [https://www.mygov.in/covid-19/ COVID-19] - official Indian government site
 
 +
* [https://www.kaggle.com/imdevskp/covid-19-analysis-visualization-comparisons/data COVID-19 - Analysis, Visualization & Comparisons], Kaggle
 +
* [https://covidactnow.org/ COVID Act Now] - predictions of COVID cases in the US by state
 +
** [https://covidactnow.org/model The model used]
  
 
=== Other lists ===
 
  
 +
* [https://www.kaggle.com/datasets?search=covid-19 COVID-19 data sets], Kaggle
 
* [https://www.reddit.com/r/datasets/comments/exnzrd/coronavirus_datasets/ Reddit thread collecting coronavirus datasets]
 
 
* [https://www.programmableweb.com/news/apis-to-track-coronavirus-covid-19/review/2020/03/18 Review of COVID-19 APIs], Wendell Santos
 
* [https://www.data-against-covid.org/ Data against COVID-19]
+
* [https://npgeo-corona-npgeo-de.hub.arcgis.com/ NPGEO Corona Hub 2020], Nationale Plattform für geografische Daten (NPGEO)
 +
* [https://datarepository.wolframcloud.com/search?i=COVID Data sets for COVID], Wolfram Data Repository
 +
* [https://www.tableau.com/covid-19-coronavirus-data-resources COVID-19 Data Hub], Tableau
  
== Data cleaning requests ==
+
== Data or Data cleaning requests ==
  
We do not have a platform yet to handle queries or submissions to these cleaning requests, so for now please use the comment thread at [https://terrytao.wordpress.com/2020/03/25/polymath-proposal-clearinghouse-for-crowdsourcing-covid-19-data-and-data-cleaning-requests/ this blog post] for these.
+
As mentioned at the top of this page, future requests for data or data cleaning should be directed to [http://united-against-covid.org/ this data discourse page] at [http://united-against-covid.org/ United Against COVID-19].  Below are the legacy requests of this project prior to this redirect.
  
 
=== From Chris Strohmeier (UCLA), Mar 25 ===
 
 
  
 
The biorxiv_medrxiv folder at https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge contains another folder titled biorxiv_medrxiv, which in turn contains hundreds of json files. Each file corresponds to a research article that is at least tangentially related to COVID-19.
Line 114: Line 149:
 
Contact: c.strohmeier@math.ucla.edu
 
  
== Miscellaneous links ==
+
=== From Juan José Piñero de Armas (U. Católica de Murcia), Mar 27 ===
 +
 
 +
We request per-person information in order to perform survival analyses, regressions with random effects, etc.  Some data exist, for instance, at
 +
 
 +
https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset/data
 +
https://www.kaggle.com/kimjihoo/coronavirusdataset
 +
https://www.kaggle.com/imdevskp/covid-19-analysis-visualization-comparisons/data
 +
https://www.sirm.org/category/senza-categoria/covid-19/
 +
 
 +
but we need much more detail (date of diagnosis for each person, date of infection for the same person, discharge date, date of death, gender, age, treatments, temperatures...), not just summaries or country-aggregated data.
  
* [https://united-against-covid.org/ United Against COVID-19], which also crowdsources scientific and coding efforts to study the COVID-19 pandemic
+
Contact: jjpinero@ucam.edu

Revision as of 00:42, 29 March 2020

This is a repository for public data sets relating to the COVID-19 pandemic. It was also initially envisioned as a clearinghouse for matching requests for data cleaning of such datasets with volunteers willing to perform this cleaning. However, the existing clearinghouse at United against COVID-19 is already up and running for this purpose, so we are redirecting such requests to that site in order not to fragment the pools of requests and volunteers.

For discussion of this project, see this blog post.

Data sets

Further contributions are very welcome, and can be made either directly to this wiki page (after requesting an account), or placed in the comments to this blog post, or by email to tao@math.ucla.edu.

Epidemiology

North America

Europe

Asia

Other regional data

Genomics and homology

Literature

Medical imagery and records

Other data

Data scrapers and aggregators

Visualizations and summaries

Other lists

Data or Data cleaning requests

As mentioned at the top of this page, future requests for data or data cleaning should be directed to this data discourse page at United Against COVID-19. Below are the legacy requests of this project prior to this redirect.

From Chris Strohmeier (UCLA), Mar 25

The biorxiv_medrxiv folder at https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge contains another folder titled biorxiv_medrxiv, which in turn contains hundreds of json files. Each file corresponds to a research article that is at least tangentially related to COVID-19.

We are requesting:

  • A tf-idf matrix for the subset of the above collection that contains full-text articles (some files appear to have only abstracts).
  • The rows should correspond to the most commonly used words (e.g. the 5000 most common).
  • The columns should correspond to the individual json files.
  • The cleaned data should be stored as a npy or mat file (or both).
  • Finally, there should be a csv or text document (or both) explaining the meaning of the rows and columns of the matrix: which word each row corresponds to, and which file each column corresponds to.
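
The request above could be approached along the following lines. This is only a minimal sketch using the Python standard library, not the method used by anyone on this project. It assumes one common tf-idf convention (term frequency as a fraction of the document's tokens, idf = log(N/df)); the request does not say which variant is wanted. The helper names (full_text, tokenize, tfidf_matrix) are hypothetical, and the "body_text" layout of the json files is an assumption that should be checked against the actual data.

```python
import math
import re
from collections import Counter


def full_text(article):
    """Join the paragraphs of one parsed json article.

    Assumption: the article is a dict with a "body_text" list whose
    entries carry a "text" field. Articles without it yield "".
    """
    return " ".join(p.get("text", "") for p in article.get("body_text", []))


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_matrix(docs, vocab_size=5000):
    """Return (vocab, doc_ids, matrix) with rows = words, columns = documents.

    docs maps a document id (e.g. a json file name) to its full text.
    """
    doc_ids = sorted(docs)
    counts = [Counter(tokenize(docs[d])) for d in doc_ids]

    # Vocabulary: the vocab_size most frequent words in the whole collection,
    # with ties broken alphabetically so the result is reproducible.
    total = Counter()
    for c in counts:
        total.update(c)
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [word for word, _ in ranked[:vocab_size]]

    n_docs = len(doc_ids)
    lengths = [sum(c.values()) for c in counts]
    matrix = []
    for word in vocab:
        # Document frequency is at least 1 because the word occurs somewhere.
        df = sum(1 for c in counts if word in c)
        idf = math.log(n_docs / df)
        row = [
            (c[word] / n) * idf if n else 0.0
            for c, n in zip(counts, lengths)
        ]
        matrix.append(row)
    return vocab, doc_ids, matrix


# Toy usage with two made-up documents standing in for the json files.
example_docs = {
    "a.json": "virus spreads fast",
    "b.json": "virus mutates",
}
vocab, doc_ids, matrix = tfidf_matrix(example_docs, vocab_size=3)
```

Here vocab and doc_ids are exactly the row and column labels that the requested csv or text document would record. If NumPy or SciPy is available, the nested list could be converted to an array and written with numpy.save (npy) or scipy.io.savemat (mat).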

Contact: c.strohmeier@math.ucla.edu

From Juan José Piñero de Armas (U. Católica de Murcia), Mar 27

We request per-person information in order to perform survival analyses, regressions with random effects, etc. Some data exist, for instance, at

https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset/data
https://www.kaggle.com/kimjihoo/coronavirusdataset
https://www.kaggle.com/imdevskp/covid-19-analysis-visualization-comparisons/data
https://www.sirm.org/category/senza-categoria/covid-19/

but we need much more detail (date of diagnosis for each person, date of infection for the same person, discharge date, date of death, gender, age, treatments, temperatures...), not just summaries or country-aggregated data.
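
To illustrate why per-person dates matter for this request, here is a rough sketch of what one record might look like and how it would feed a survival analysis. Every field name and the survival_observation helper are hypothetical; none of the linked data sets is known to use this layout. Treating discharge as right-censoring is a simplifying assumption, and a real analysis might model it differently.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class PatientRecord:
    """One row per person (hypothetical schema based on the fields requested)."""
    patient_id: str
    age: Optional[int] = None
    gender: Optional[str] = None
    infection_date: Optional[date] = None
    diagnosis_date: Optional[date] = None
    discharge_date: Optional[date] = None
    death_date: Optional[date] = None
    treatments: List[str] = field(default_factory=list)


def survival_observation(rec, study_end):
    """Return (duration_days, event_observed) measured from diagnosis.

    event_observed is True only if the person died. Discharge, or the end
    of the study period, is treated as right-censoring (an assumption).
    Returns None when the diagnosis date is missing.
    """
    if rec.diagnosis_date is None:
        return None
    if rec.death_date is not None:
        return (rec.death_date - rec.diagnosis_date).days, True
    end = rec.discharge_date or study_end
    return (end - rec.diagnosis_date).days, False


# Made-up example record, not real patient data.
example = PatientRecord(
    "p1",
    age=60,
    diagnosis_date=date(2020, 3, 1),
    discharge_date=date(2020, 3, 15),
)
observation = survival_observation(example, study_end=date(2020, 3, 29))
```

Country-level totals cannot be turned into (duration, event) pairs like these, which is why summaries and aggregated data are not sufficient for this kind of analysis.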

Contact: jjpinero@ucam.edu