Difference between revisions of "COVID-19 dataset clearinghouse"

From Polymath Wiki
 
* [https://terrytao.files.wordpress.com/2020/03/covid_19_polymath_project-1.pdf PDF format]
* [https://www.overleaf.com/project/5e7acd0e03821500012262bb Overleaf format]
* [https://terrytao.wordpress.com/2020/03/25/polymath-proposal-clearinghouse-for-crowdsourcing-covid-19-data-and-data-cleaning-requests Blog post discussing the proposal]

== Instructions for posting a request for a data set to be cleaned ==

Ideally, a submission should consist of a single plain text file that clearly states your request, including what the “cleaned” data set should contain and the format in which it should be saved (e.g. csv, npy, mat, json). The text file should also link to a webpage where the raw data can easily be accessed or downloaded, with specific instructions for locating the data set on that page.

We do not yet have a platform for these requests, so for now please post them at [https://terrytao.wordpress.com/2020/03/25/polymath-proposal-clearinghouse-for-crowdsourcing-covid-19-data-and-data-cleaning-requests the above blog post] or email tao@math.ucla.edu.

== Data sets ==

=== Epidemiology ===

* [https://www.kaggle.com/tags/covid19 COVID-19 data sets], Kaggle
** [https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset Novel Corona Virus 2019 Dataset - Day level information on covid-19 affected cases]
* [https://ourworldindata.org/coronavirus Coronavirus Disease (COVID-19) – Statistics and Research], Our World in Data, by Max Roser, Hannah Ritchie and Esteban Ortiz-Ospina
* [https://github.com/CSSEGISandData/COVID-19 Novel Coronavirus (COVID-19) Cases], Johns Hopkins University Center for Systems Science and Engineering
** [https://github.com/datasets/covid-19 Novel Coronavirus 2019 time series data on cases], sourced and cleaned from the above data set
* [https://github.com/covid19-data/covid19-data 2019-nCoV Data Processing Pipelines and datasets]
** Country and state names are normalized to ISO 3166-1 codes.
* [https://github.com/beoutbreakprepared/nCoV2019 Location for summaries and analysis of data related to n-CoV 2019, first reported in Wuhan, China], Outbreak and Pandemic Preparedness team at the Institute for Health Metrics and Evaluation, University of Washington
** A [https://www.healthmap.org/covid-19/ visualization of one of the data sets]
* [https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide Daily data on the geographic distribution of COVID-19 cases worldwide], European Centre for Disease Prevention and Control
* [https://docs.google.com/spreadsheets/d/1jS24DjSPVWa4iuxuD4OAXrE3QeI8c9BC1hSlqr-NMiU/edit#gid=1187587451 Google sheets from DXY.cn]
** Contains some patient information (age, gender, etc.)

==== North America ====

* [https://github.com/COVID19Tracking/covid-tracking-data COVID Tracking Data], from the [https://covidtracking.com/ COVID tracking project]
** A daily updated repository with CSV representations of data from the [https://github.com/COVID19Tracking/covid-tracking-api/blob/master/README.md Covid Tracking API].
* [https://coronavirus.1point3acres.com/en COVID-19 in US and Canada]
** [https://coronavirus.1point3acres.com/en/data Data request form]
* [https://covidtracking.com/ COVID tracking project]
** Includes positive and negative results, pending tests, and total people tested for each state in the US
** [https://docs.google.com/spreadsheets/u/2/d/e/2PACX-1vRwAqp96T9sYYq2-i7Tj0pvTf6XVHjDSMIKBdZHXiCGGdNC0ypEU9NbngS8mxea55JuCFuua1MUeOj5/pubhtml raw data]
** [https://covidtracking.com/api/ API]
* [https://github.com/kgjenkins/covid-19-ny Covid-19 coronavirus cases in New York State]

==== Other regional data ====

* [https://www.covid19india.org/ India COVID-19 tracker]
** [https://docs.google.com/spreadsheets/d/e/2PACX-1vSc_2y5N0I67wDU38DjDh35IZSIS30rQf7_NYZhtYYGU1jJYT6_kDx4YpF-qw0LSlGsBYP8pqM_a1Pd/pubhtml Patient database]
* [https://github.com/jihoo-kim/Data-Science-for-COVID-19-old Data Science for COVID-19 in South Korea]
* [https://github.com/pcm-dpc/COVID-19 COVID-19 Italia - Monitoraggio situazione]

=== Genomics and homology ===

* [https://www.gisaid.org/ GISAID data] (Global Initiative on Sharing All Influenza Data)
** [https://www.gisaid.org/registration/register/ Registration] is required.
** [https://github.com/nextstrain/ncov Nextstrain build for novel coronavirus (nCoV)], based on GISAID data
*** [https://nextstrain.org/ncov Genomic epidemiology of novel coronavirus]
* [https://www.kaggle.com/paultimothymooney/coronavirus-genome-sequence Coronavirus Genome Sequence], Kaggle
* [https://www.kaggle.com/paultimothymooney/repository-of-coronavirus-genomes Repository of Coronavirus Genomes], Kaggle
* [https://3dprint.nih.gov/discover/3DPX-012867 Wuhan coronavirus 2019-nCoV protease homology model], National Institutes of Health

=== Literature ===

* [https://www.ncbi.nlm.nih.gov/research/coronavirus/ LitCovid] - a curated literature hub for tracking up-to-date scientific information about the 2019 novel Coronavirus
* [https://connect.biorxiv.org/relate/content/181 COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv]

=== Other data ===

* [https://docs.google.com/forms/d/e/1FAIpQLSc501xfAzEPADOwRmsdHmu-v8aN14jnKHBmEmdJJcTgRLddqw/viewform Aggregated foot traffic data], Safegraph
** Requires executing a non-commercial agreement.
** [https://www.safegraph.com/dashboard/covid19-commerce-patterns?is=5e7a3815f20d617a17a33173 Sample visualization of Safegraph data]
* [https://www.covidcaremap.org COVID Care Map]
** Open geospatial work to support health systems' capacity (providers, supplies, ventilators, beds, meds) to care effectively for rapidly growing COVID-19 patient needs
** [https://www.covidcaremap.org/maps/us-healthcare-system-capacity/#6.07/40.085/-75.195 Open map data on US health system capacity to care for COVID-19 patients]
* [https://www.kaggle.com/darshan1504/covid19-detection-xray-dataset COVID-19 Detection X-Ray Dataset], Kaggle

=== Data scrapers and aggregators ===

* [https://coronadatascraper.com/#home Corona Data Scraper]
* [https://github.com/jagsfan82/Covid19-WebScrape-Plus Covid19-WebScrape-Plus]
* [http://covid-19.seektable.com/ COVID-19], Seektable

=== Visualizations and summaries ===

* [https://www.worldometers.info/coronavirus/ COVID-19 Coronavirus Pandemic], Worldometer
* [https://bnonews.com/index.php/2020/03/the-latest-coronavirus-cases/ Tracking coronavirus: Map, data and timeline], BNO News
* [https://gisanddata.maps.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6 Coronavirus COVID-19 Global Cases], JHU CSSE
* [https://infection2020.com/ Infection2020]
* [https://covy.app/ covy.app]
* [https://ncov.dxy.cn/ncovh5/view/pneumonia?from=dxy&source=&link=&share= COVID-19 Global Pandemic Real-Time report], dxy.cn ([https://ncov.dxy.cn/ncovh5/view/en_pneumonia?from=dxy&source=&link=&share= English version])
* [https://www.ft.com/coronavirus-latest Coronavirus tracked: the latest figures as the pandemic spreads], Financial Times
* [https://www.mygov.in/covid-19/ COVID-19] - official Indian government site

=== Other lists ===

* [https://www.reddit.com/r/datasets/comments/exnzrd/coronavirus_datasets/ Reddit thread collecting coronavirus datasets]
* [https://www.programmableweb.com/news/apis-to-track-coronavirus-covid-19/review/2020/03/18 Review of COVID-19 APIs], Wendell Santos
* [https://www.data-against-covid.org/ Data against COVID-19]

== Data cleaning requests ==

We do not yet have a platform to handle queries about, or submissions for, these cleaning requests, so for now please use the comment thread at [https://terrytao.wordpress.com/2020/03/25/polymath-proposal-clearinghouse-for-crowdsourcing-covid-19-data-and-data-cleaning-requests/ this blog post].

=== From Chris Strohmeier (UCLA), Mar 25 ===

The biorxiv_medrxiv file at https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge contains another folder titled biorxiv_medrxiv, which in turn contains hundreds of json files. Each file corresponds to a research article that is at least tangentially related to COVID-19.

We are requesting:

* A tf-idf matrix for the subset of the above collection that contains full-text articles (some files appear to have only abstracts).
* The rows should correspond to the (e.g. 5000) most commonly used words.
* The columns should correspond to the individual json files.
* The clean data should be stored as a npy or mat file (or both).
* Finally, there should be a csv or text document (or both) explaining the meaning of the individual rows and columns of the matrix (which word does each row correspond to, and which file does each column correspond to?).
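
The request above could be approached along the following lines. This is only a minimal sketch, not the requester's specification. It assumes that each json file stores its full text as a list of paragraphs under a <code>body_text</code> key, each with a <code>text</code> field, and that abstract-only records have an empty list there. It also picks one common tf-idf variant (relative term frequency times log(N/df), without smoothing), since the request does not fix one. The function names, the regular-expression tokenizer, and the output file names are all hypothetical.

```python
import csv
import json
import math
import re
from collections import Counter
from pathlib import Path

TOKEN = re.compile(r"[a-z]+")  # assumed tokenizer: lowercase alphabetic runs


def load_full_texts(folder):
    """Return {file name: body text} for json files with a non-empty body_text."""
    texts = {}
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as fh:
            paper = json.load(fh)
        # Assumed schema: "body_text" is a list of {"text": ...} paragraphs.
        body = " ".join(p.get("text", "") for p in paper.get("body_text", []))
        if body.strip():  # skip records that only carry an abstract
            texts[path.name] = body
    return texts


def tfidf_matrix(texts, vocab_size=5000):
    """Rows: the vocab_size most frequent words. Columns: documents."""
    names = list(texts)
    counts = [Counter(TOKEN.findall(texts[n].lower())) for n in names]
    total = Counter()
    for c in counts:
        total.update(c)
    vocab = [w for w, _ in total.most_common(vocab_size)]
    n_docs = len(names)
    matrix = []
    for w in vocab:
        doc_freq = sum(1 for c in counts if w in c)
        idf = math.log(n_docs / doc_freq)
        row = []
        for c in counts:
            length = sum(c.values())
            tf = c[w] / length if length else 0.0
            row.append(tf * idf)
        matrix.append(row)
    return vocab, names, matrix


def save_outputs(vocab, names, matrix, stem="tfidf"):
    """Write the matrix as .npy and the row/column labels as .csv."""
    import numpy as np  # third-party; only needed for the .npy output

    np.save(f"{stem}.npy", np.array(matrix))
    with open(f"{stem}_labels.csv", "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["axis", "index", "label"])
        writer.writerows(("row", i, w) for i, w in enumerate(vocab))
        writer.writerows(("column", j, n) for j, n in enumerate(names))
```

With the Kaggle archive unpacked, one would call <code>load_full_texts</code> on the inner biorxiv_medrxiv folder, pass the result to <code>tfidf_matrix</code>, and hand the three return values to <code>save_outputs</code>. A mat file could be produced from the same array with SciPy's <code>scipy.io.savemat</code>.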

Contact: c.strohmeier@math.ucla.edu

== Miscellaneous links ==

* [https://united-against-covid.org/ United Against COVID-19], which also crowdsources scientific and coding efforts to study the COVID-19 pandemic

Revision as of 15:08, 27 March 2020

Data cleaning proposal
