Cool stuff:

https://aeon.co/amp/videos/watch-a-single-cell-become-a-complete-organism-in-six-pulsing-minutes-of-timelapse?__twitter_impression=true

R Reminders

Accessing RStudio Online

To access your RStudio account for this course, navigate to http://ip.of.the.day:8787. The IP address can change from time to time, so if you have trouble accessing the server later, email your instructor.

Data Visualization

Recent news:

The BBC Visual and Data Journalism team uses R and ggplot to create their publication graphics. A few weeks ago they released their toolbox on GitHub. More from Medium

Read Census Data

For this section we will read in a table of US Census Demographic data from https://www.kaggle.com/muonneutrino/us-census-demographic-data. The file can be found on the RStudio server at ‘/usr/share/data/kaggle/census.csv’.

To read data into R we use one of: read.table(), read.csv(), or read.delim().

census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)

LOOK BACK: How do you write a table to a new file?
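One possible answer, as a minimal sketch: write.csv() writes a data frame to a file, and read.csv() or read.table() reads it back. The data frame `toy` and the temporary file are made up for this illustration and are not part of the course data.

```r
# Toy data invented for this illustration
toy <- data.frame(State = c("Oregon", "Alabama"), TotalPop = c(1500, 2300))

# Write the table to a temporary comma-separated file
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# read.csv() assumes a header row and comma separators
a <- read.csv(tmp)

# read.table() needs both of those settings spelled out
b <- read.table(tmp, header = TRUE, sep = ",")

# Both calls should return the same two-row table
all.equal(a, b)
```

Setting row.names = FALSE keeps R from writing an extra column of row numbers into the file.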

More ggplot2 Graphics

For help and an overview of all of the graphics types available in ggplot, check out the R ggplot2 cheatsheet.

library(ggplot2)

Recall: Scatterplots using ‘geom_point()’

ggplot(data = census) +
  geom_point(mapping = aes(x = Citizen, y = TotalPop))

Graphs for single parameters:

geom_histogram()

ggplot(data=census) +
  geom_histogram(aes(x=Citizen), binwidth=100)

But the x-axis extends to the maximum value rather than the effective range of the data, so the distribution is tough to read.

ggplot(data=census) +
  geom_histogram(aes(x=Citizen), binwidth=100)+
  xlim(c(0,10000))

## Warning: Removed 158 rows containing non-finite values (stat_bin).
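The warning appears because xlim() sets the limits of the x scale, and values outside those limits are treated as missing before the histogram is computed. As a hedged sketch of an alternative, coord_cartesian() zooms the view without discarding rows. The data frame `toy` below is made up for illustration; it is not the census table.

```r
library(ggplot2)

# Made-up values: 100 small numbers plus one extreme outlier
toy <- data.frame(Citizen = c(1:100, 5000))

# xlim() removes the outlier before the bins are counted
p_limited <- ggplot(data = toy) +
  geom_histogram(aes(x = Citizen), binwidth = 10) +
  xlim(c(0, 200))

# coord_cartesian() keeps every row and only changes the visible window
p_zoomed <- ggplot(data = toy) +
  geom_histogram(aes(x = Citizen), binwidth = 10) +
  coord_cartesian(xlim = c(0, 200))

# Total count across all bins of the zoomed plot: every row is still binned
sum(layer_data(p_zoomed)$count)
```

Which one to use depends on the goal: xlim() changes what goes into the statistic, while coord_cartesian() only changes what is shown.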

That also got rid of an artifact near zero. Let’s inspect the data for ‘Citizen’ to see what data could be causing that:

#summary(census$Citizen)
sum(census$Citizen <=0)

## [1] 691

Are zero values real?

Sidetrack: Logical subsetting

We can use logical operators and the data.frame index notation to find ONLY those rows of data that have a zero value in the Citizen column.

# test for Citizen ==0
logic = census$Citizen==0
# Then use this vector of logical results to index the data frame
zerocit = census[logic, ] # the logical vector goes in the row position, to the left of the comma
#summary(zerocit)
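To see what each step returns, here is the same pattern on a four-row table. The data frame `toy` and its values are invented for this sketch.

```r
# Small made-up table standing in for the census data
toy <- data.frame(Tract = c("A", "B", "C", "D"),
                  Citizen = c(0, 1200, 0, 860))

# Comparing a column to a value gives one TRUE/FALSE per row
logic <- toy$Citizen == 0
print(logic)

# TRUE counts as 1, so sum() gives the number of matching rows
sum(logic)

# Logical vector in the row position; nothing after the comma means all columns
zero_rows <- toy[logic, ]
print(zero_rows$Tract)
```

Only tracts A and C have a zero in the Citizen column, so those are the two rows that are kept.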

Density:

ggplot(data=census) +
  geom_density(aes(x=Citizen)) +
  xlim(c(0,10000))

## Warning: Removed 158 rows containing non-finite values (stat_density).

geom_area()

ggplot(data=census) +
  geom_area(stat='bin', bins = 30, aes(x=Citizen)) +
  xlim(c(0,10000))

## Warning: Removed 158 rows containing non-finite values (stat_bin).

For Discrete values: geom_bar()

ggplot(data=census) +
  geom_bar(aes(x=State))

But, this is hard to read. A few fixes:

Rotate x-axis labels (Use ‘theme()’)
Sort states by count (Use library(forcats) – Tools for working with categorical variables)

library(forcats)
ggplot(data=census) +
  geom_bar(aes(x=fct_infreq(State))) +
  xlab('') +
  theme(axis.text.x  = element_text(angle=90))
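The sorting comes from how fct_infreq() reorders factor levels, since geom_bar() draws bars in level order. A minimal sketch with made-up state labels:

```r
library(forcats)

# Made-up labels: Oregon appears 3 times, Alabama 2 times, Ohio once
states <- factor(c("Alabama", "Oregon", "Ohio", "Oregon", "Alabama", "Oregon"))

# A plain factor keeps its levels in alphabetical order
levels(states)

# fct_infreq() reorders the levels from most to least frequent
by_count <- fct_infreq(states)
levels(by_count)
```

The values themselves are unchanged; only the order of the levels, and therefore the order of the bars, is different.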

Challenge 1:

Pick a plot method appropriate for the data in the MeanCommute column.

Graphs for two parameters

Continuous parameters

geom_point()

ggplot(data=census) + 
    geom_point(aes(x=IncomePerCap, y=Poverty))

## Warning: Removed 881 rows containing missing values (geom_point).
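This warning means that some rows have no value (NA) for at least one of the two plotted columns. A sketch of how such rows could be counted before plotting, using made-up numbers rather than the census table:

```r
# Made-up values with a few gaps (NA)
toy <- data.frame(IncomePerCap = c(21000, NA, 35000, 18000, NA),
                  Poverty      = c(12.5, 30.1, NA, 22.0, NA))

# A point cannot be drawn if either coordinate is missing
missing_either <- is.na(toy$IncomePerCap) | is.na(toy$Poverty)
sum(missing_either)

# complete.cases() marks the rows that have both values
plotted <- toy[complete.cases(toy[, c("IncomePerCap", "Poverty")]), ]
nrow(plotted)
```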

See also: geom_jitter() and geom_label()

geom_smooth()

ggplot(data=census) + 
    geom_point(aes(x=IncomePerCap, y=Poverty), alpha=0.1) +
    geom_smooth(aes(x=IncomePerCap, y=Poverty))

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 881 rows containing non-finite values (stat_smooth).

## Warning: Removed 881 rows containing missing values (geom_point).

Contour and density plots

ggplot(data=census) + 
    geom_hex(aes(x=IncomePerCap, y=Poverty))

## Warning: Removed 881 rows containing non-finite values (stat_binhex).

ggplot(data=census) + 
    geom_density2d(aes(x=IncomePerCap, y=Poverty))

## Warning: Removed 881 rows containing non-finite values (stat_density2d).

One Discrete X One Continuous parameter

Boxplots

ggplot(data=census[1:10000,]) + 
    geom_boxplot(aes(x=State, y=Poverty))

## Warning: Removed 72 rows containing non-finite values (stat_boxplot).

Violin charts

ggplot(data=census[1:10000,]) + 
    geom_violin(aes(x=State, y=Poverty))

## Warning: Removed 72 rows containing non-finite values (stat_ydensity).

Faceting: Multipanel plots

Plus more logical subsetting using %in%

ggplot(data=census[census$State %in% c('Alabama', 'Oregon', 'New York', 'California'),]) + 
    geom_point(aes(x=Employed, y=Poverty)) +
    facet_grid(~State) + theme(axis.text.x = element_text(angle=90))

## Warning: Removed 165 rows containing missing values (geom_point).
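The %in% operator asks, for each value on the left, whether it appears anywhere in the vector on the right. A small sketch on an invented table:

```r
# Made-up table; the values are for illustration only
toy <- data.frame(State = c("Alabama", "Texas", "Oregon", "Ohio", "Oregon"),
                  Poverty = c(18.2, 15.1, 13.4, 14.0, 12.9))

# One TRUE/FALSE per row: is this row's State in the list of wanted states?
keep <- toy$State %in% c("Alabama", "Oregon")
print(keep)

# Use the logical vector in the row position, as with == above
subset_rows <- toy[keep, ]
nrow(subset_rows)
```

Unlike ==, which compares against a single value, %in% can test against several values at once.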

Challenge 2

Develop a plot showing the relationship between MeanCommute and Transit. Then attempt to facet this by State for five states of your choosing.

Just for fun:

Try installing and running the code found here to create ‘mathart’ examples from Marcus Volz.

Saving your workspace:

Either Session -> Save Workspace As

OR:

save.image('workspace_1.RData')
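To get saved objects back in a later session, load() reads an .RData file. The sketch below uses save() on a single made-up object and a temporary file so that it can run anywhere; save.image() does the same thing for every object in the workspace.

```r
# Made-up object standing in for your workspace contents
scores <- c(a = 1, b = 2)

# Save it to a temporary .RData file
tmp <- tempfile(fileext = ".RData")
save(scores, file = tmp)

# Remove the object, then restore it from the file
rm(scores)
load(tmp)
print(scores)
```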

You ALSO want to save the R script file that you have been putting your code in. Use File -> Save (As) and give it the file extension .R to identify it as an R code file.

Blog

Create a new blog RMarkdown document outlining the basics of plotting with ggplot and R. Then open a new file using:

newdata = read.csv('/usr/share/data/classfiles/readlengths3.txt', sep=' ')

These data are read lengths for three samples run on an Illumina sequencing instrument. I want to know how the read length distributions differ between these three samples. The DNA came from fossilized packrat middens dating back up to 3500 years, and the degree of fragmentation may tell something about the age and/or preservation state of the sample.

The sequencing run generated paired-end reads of 125 base pairs each (recall the Illumina sequencing by synthesis process). These read lengths come from merged overlapping read pairs, so merged reads >125bp represent molecules that were originally no longer than 240bp (required minimum 10bp overlap).
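The 240bp figure follows from the read length and the overlap requirement. As a quick worked check (the variable names are chosen here for illustration):

```r
read_length <- 125  # length of each read in a pair, in bp
min_overlap <- 10   # minimum overlap required to merge a pair, in bp

# Longest molecule that can still be merged: two full-length reads
# that share only the minimum overlap
max_merged <- 2 * read_length - min_overlap
max_merged
```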

Create a new figure showing some attribute of the read length distribution that can be compared between these samples and include it in your blog for this week.

Data Visualization in R

Prof. Harbert

04 February, 2019