Prof. Harbert
The growth of data generation and collection is far outpacing the ability to process data in many fields. Today we will discuss what makes up "Data Science" and how biology and bioinformatics fit in the context of:
"Data science has been called "the sexiest job of the 21st century," presumably by someone who has never visited a fire station." -- From "Data Science From Scratch"
Data Scientist (n.):
"Person who is better at statistics than any software engineer and better at software engineering than any statistician."
-Josh Wills (Data Engineer)
The "Data Science" of Biology
In both Data Science and Bioinformatics, the jobs of the future will require the basics of data-oriented disciplines.
The 4 'V's of big data:
And more
The top data producer in biology is clearly DNA sequencing:
The alphabet of the genome has 4 letters (ACGT). However, much like languages, the code of the genome depends on series of nucleotides (i.e., genes, regulatory elements).
So how does this increase complexity?
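One way to make this concrete: with a 4-letter alphabet, the number of possible sequences of length k is 4^k, so the space of possible "words" grows exponentially with length. The sketch below is only an illustration; the function names and the toy sequence are hypothetical and do not come from any real dataset or library.

```python
from collections import Counter

ALPHABET = "ACGT"  # the four nucleotides


def possible_kmers(k: int) -> int:
    """Number of distinct sequences of length k over a 4-letter alphabet (4**k)."""
    return len(ALPHABET) ** k


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer (substring of length k) in a sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy sequence for illustration, not real genomic data.
toy_sequence = "ACGTACGTGA"

# The space of possible sequences grows exponentially with k.
for k in (1, 3, 10, 20):
    print(f"k={k:>2}: {possible_kmers(k):,} possible sequences")

# Overlapping 3-mers observed in the toy sequence and how often each occurs.
print(count_kmers(toy_sequence, 3))
```

Even at k = 20, far shorter than a typical gene, there are over a trillion possible sequences, which is part of why meaning in the genome depends on context and not on individual letters.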
"The current Illumina HiSeq X sequencing platform can generate 900 billion nucleotides of raw DNA sequence in under 3 days-4 times the number of annotated nucleotides currently stored in GenBank, the United States "reference library" of DNA sequences [1, 2]. In the last decade, a 50,000-fold reduction in the cost of DNA sequencing [3] has led to an accumulation of 9.3 quadrillion (million billion) nucleotides of raw sequence data in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). The amount of sequence in the SRA doubled on average every 6-8 months from 2007-2016 [4, 5]." ref
90% of NSF grant-funded labs say they will need help analyzing the data they are generating now.
"The problem is the growing gap between the accumulation of big data and researchers' knowledge about how to use it effectively."
"Beyond genotypic data, big data are flooding biology from all quarters-phenotypic data from agricultural field trials, patient medical records, and clinical trials; image data from microscopy, medical scanning, and museum specimens; interaction data from biochemical, cellular, physiological, and ecological systems; as well as an influx of data from translational fields such as bioengineering, materials science, and biogeography" ref
Phenotypic Data
Ecological Data Aggregation
A recent trend in data collection has been to offload the work of observation to non-professionals. Interested participants use a web interface (typically a smartphone app) to record observations of something scientists are interested in.
Earth observation satellites have a growing history of contributing to our ability to map and model ecosystems and global ecological change. The scale of Earth observation image collection is on the order of tens of terabytes of data per day!
Pick one of the "big-data" resources mentioned in the preceding slides, research the 4 V's as they relate to that project, and prepare to speak for ~2 minutes on Monday to share your findings with the class.