Biology as a Data Science

BIO331 Introduction to Bioinformatics

Prof. Harbert

The growth of data generation and collection is far outpacing the ability to process data in many fields. Today we will discuss what makes up "Data Science" and where biology and bioinformatics fit in, covering:

  • Definitions of "Data Science" and how it differs (or not) from statistics and computational sciences.
  • Types and volume of data in Biology.
  • Speculation about the future of data-driven biological inquiry.

Defining Data Science

Data Scientist (n.):

"Person who is better at statistics than any software engineer and better at software engineering than any statistician."

-Josh Wills (Data Engineer)

Via Twitter

Bioinformatics

The "Data Science" of Biology

  • "Person who is better at biology than any programmer and better at programming than any biologist."
  • An intersection of coding skill, math/stats proficiency, and biological specialization.
  • Current Example



In both Data Science and Bioinformatics, the jobs of the future will require the basics of data-oriented disciplines.

  • Data processing skills are the "new normal" expected for professionals in all fields.

Big Data in Biology | Is there a "big-data" problem or a "BIG" data problem?

  • What qualifies as "big-data"?
  • What is the biggest source of biological data?
  • Where are other sources of potentially big data in biology?

Big Data: Volume, velocity, variety, and veracity

The 4 'V's of big data:

  • How much data is there now?
  • How quickly are we making/recording/streaming more?
  • How 'dirty' is it?
  • Can we trust data as truth?

And more

How big is biological data?

Genomics, Astronomy, Twitter, YouTube comparison:

[Figure: comparison of genomics, astronomy, Twitter, and YouTube data]

Genomic data "Variety"

The alphabet of the genome has 4 letters (ACGT). However, much like in languages, the meaning of the genome's code depends on series of nucleotides (e.g., genes, regulatory elements).

So how does this increase complexity?

  • What is the probability of randomly selecting any given 12 letter DNA word?
  • What about doing this twice? (Is it likely to get the same sequence of DNA in a genome by chance?)
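
The two questions above can be worked through numerically. Below is a minimal sketch in Python, assuming every base is drawn independently and that A, C, G, and T are equally likely (real genomes are not uniform, so this is only a baseline). The genome length of roughly 3.1 billion bases is an assumed round figure, about the size of the human genome, and the variable names are illustrative.

```python
# Baseline model (an assumption): every base is independent and A, C, G, T
# are equally likely. Real genomes violate this, so treat results as rough.
ALPHABET_SIZE = 4
WORD_LENGTH = 12

# Number of distinct 12-letter DNA words and the chance of drawing one of them.
n_words = ALPHABET_SIZE ** WORD_LENGTH
p_word = 1 / n_words

# Chance of drawing that same specific word in two independent draws.
p_same_word_twice = p_word ** 2

# Expected number of times a given 12-letter word appears by chance in a
# random sequence of this length (assumed size, roughly a human genome).
genome_length = 3_100_000_000
n_positions = genome_length - WORD_LENGTH + 1
expected_hits = n_positions * p_word

print(f"distinct 12-letter words: {n_words:,}")
print(f"P(one given word):        {p_word:.3e}")
print(f"P(same word twice):       {p_same_word_twice:.3e}")
print(f"expected chance hits:     {expected_hits:.1f}")
```

Under this model a specific 12-letter word is rare in any single draw, but a genome offers billions of starting positions, so chance repeats of short words are expected. Longer words are needed before a match becomes unlikely by chance alone.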

Growing data and unmet needs:

"The current Illumina HiSeq X sequencing platform can generate 900 billion nucleotides of raw DNA sequence in under 3 days - 4 times the number of annotated nucleotides currently stored in GenBank, the United States 'reference library' of DNA sequences [1, 2]. In the last decade, a 50,000-fold reduction in the cost of DNA sequencing [3] has led to an accumulation of 9.3 quadrillion (million billion) nucleotides of raw sequence data in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). The amount of sequence in the SRA doubled on average every 6-8 months from 2007-2016 [4, 5]." ref
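
To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only numbers from the quote (900 billion nucleotides in 3 days; doubling every 6-8 months over 2007-2016) and assumes that period spans about 9 years. The helper name `fold_growth` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Throughput implied by "900 billion nucleotides in under 3 days".
nucleotides = 900e9
run_days = 3
per_second = nucleotides / (run_days * SECONDS_PER_DAY)


def fold_growth(months_elapsed, doubling_months):
    """Fold increase after months_elapsed if data doubles every doubling_months."""
    return 2 ** (months_elapsed / doubling_months)


# Assumption: 2007-2016 is treated as 9 years = 108 months.
months = 9 * 12
fast = fold_growth(months, 6)  # doubling every 6 months
slow = fold_growth(months, 8)  # doubling every 8 months

print(f"nucleotides per second:   {per_second:,.0f}")
print(f"growth, 6-month doubling: {fast:,.0f}x")
print(f"growth, 8-month doubling: {slow:,.0f}x")
```

The gap between the two growth estimates shows how sensitive exponential growth is to the doubling time: a two-month difference changes the total by more than an order of magnitude.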

Labor needs

Other areas of Biological data growth

"Beyond genotypic data, big data are flooding biology from all quarters - phenotypic data from agricultural field trials, patient medical records, and clinical trials; image data from microscopy, medical scanning, and museum specimens; interaction data from biochemical, cellular, physiological, and ecological systems; as well as an influx of data from translational fields such as bioengineering, materials science, and biogeography." ref

Other areas of Biological data growth

Citizen Science

A recent trend in data collection has been to offload the work of observation to non-professionals. Interested participants use a web interface or, more typically, a smartphone app to record observations of something scientists are interested in.

  • People love birding: eBird has over 500 million citizen science observations. ref
  • iNaturalist: a cool smartphone app with nifty image-recognition software

Satellite Remote Sensing

Earth observation satellites have a growing history of contributing to our ability to map and model ecosystems and global ecological change. The scale of Earth observation image collection is on the order of tens of terabytes of data per day!

For next time:

Pick one of the "big-data" resources mentioned in the preceding slides, research the 4 V's as they relate to that project, and prepare to speak for ~2 minutes on Monday to share your findings with the class.

  • So that we don't have duplicates, please sign up on the Slack #topics channel.
  • If you find or know of a big-data project not mentioned here (and there are many!), feel free to research what you find interesting.
