
Data Visualization

Today we will be using ‘ggplot’, an R package for data visualization. These tools will help you create publication-quality plots and figures directly from your R code. If there is one thing that you get out of this class, I hope that it is some confidence with these tools.

What is ‘ggplot’? The name refers to the “Grammar of Graphics”, a structured language for building data visualizations from components such as data, aesthetic mappings, and geometric layers.
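As a minimal sketch of what that grammar looks like in code, a plot is assembled from a data set, a mapping of columns to aesthetics, and one or more geometric layers joined with `+`. The example uses R's built-in `mtcars` data as a stand-in, purely for illustration.

```r
library(ggplot2)

# Data: the built-in mtcars data frame (an illustrative stand-in)
# Aesthetics: map the wt column to x and the mpg column to y
# Geometry: draw one point per row
p = ggplot(data = mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg))

# Printing the object draws the plot
print(p)
```

Swapping `geom_point()` for another geom changes the type of plot while the data and mapping stay the same.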

Recent news: ggplot in the press

The BBC Visual and Data Journalism team uses R and ggplot to create their publication graphics. A few weeks ago they released their toolbox on GitHub. More from Medium

Visualization Practice

Read Census Data

For this section we will read in a table of US Census Demographic data from https://www.kaggle.com/muonneutrino/us-census-demographic-data. The file can be found on the RStudio server at ‘/usr/share/data/kaggle/census.csv’.

To read data into R we use: read.table(), read.csv(), or read.delim().
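These three functions differ mainly in their defaults. Here is a minimal sketch using a small made-up table passed through the `text` argument, so no file is needed. The column values are invented for illustration.

```r
# A tiny comma-separated table, invented for this example
csv_text = "State,TotalPop\nAlabama,1948\nOregon,2172"

# read.csv() assumes a header row and comma separators
a = read.csv(text = csv_text)

# read.table() assumes no header and whitespace separators,
# so both must be set explicitly for the same input
b = read.table(text = csv_text, header = TRUE, sep = ",")

# read.delim() assumes a header row and tab separators
tsv_text = "State\tTotalPop\nAlabama\t1948\nOregon\t2172"
d = read.delim(text = tsv_text)

# Compare the results of the three calls
all.equal(a, b)
all.equal(a, d)
```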

census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)

LOOK BACK: How do you write a table to a new file?

ggplot2 Graphics

For help and an overview of all of the graphics types that are available in ggplot, check out the R ggplot2 cheatsheet.

library(ggplot2)

Recall: Scatterplots using ‘geom_point()’

ggplot(data = census) +
  geom_point(mapping = aes(x = Citizen, y = TotalPop))

Graphs for single parameters:

geom_histogram()

ggplot(data=census) +
  geom_histogram(aes(x=Citizen), binwidth=100) 

By default the x-axis extends to the maximum value rather than the range where most of the data fall, so the distribution is hard to read. Setting explicit limits with xlim() fixes this:

ggplot(data=census) +
  geom_histogram(aes(x=Citizen), binwidth=100)+
  xlim(c(0,10000))
## Warning: Removed 158 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
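The warnings above appear because `xlim()` discards rows outside the limits before the bins are computed. As a hedged alternative, `coord_cartesian()` zooms the view without dropping any data. This is a minimal sketch with made-up values. The `toy` data frame is hypothetical and not part of the census table.

```r
library(ggplot2)

# Made-up values with one extreme outlier
toy = data.frame(Citizen = c(100, 250, 300, 450, 500, 50000))

# xlim() turns out-of-range values into NA, which are then removed
p_xlim = ggplot(data = toy) +
  geom_histogram(aes(x = Citizen), binwidth = 100) +
  xlim(c(0, 1000))

# coord_cartesian() keeps every row and only changes the visible range
p_zoom = ggplot(data = toy) +
  geom_histogram(aes(x = Citizen), binwidth = 100) +
  coord_cartesian(xlim = c(0, 1000))
```

Which one to use depends on whether the out-of-range rows should influence the statistics. For a histogram the bin counts inside the window look the same, but for a smoother or a density the results can differ.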

That also got rid of an artifact near zero. Let’s inspect the data for ‘Citizen’ to see what data could be causing that:

#summary(census$Citizen)
sum(census$Citizen <=0)
## [1] 691

Are zero values real?

Sidetrack: Logical subsetting

We can use logical operators and the data.frame index notation to find ONLY those rows of data that have a zero value in the Citizen column.

# test for Citizen ==0
logic = census$Citizen==0
# Then use this vector of logical results to index the data frame
zerocit = census[logic, ] # the logical vector goes in the row position, before the comma
#summary(zerocit)
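The same two steps can be tried on a small made-up data frame, where it is easy to check the result by eye. The `toy` table and its values are invented for illustration.

```r
# A small made-up data frame standing in for the census table
toy = data.frame(
  State   = c("Alabama", "Oregon", "Oregon", "New York"),
  Citizen = c(1200, 0, 850, 0)
)

# Step 1: the comparison returns one TRUE/FALSE value per row
logic = toy$Citizen == 0
logic

# Step 2: put the logical vector before the comma to keep the matching rows
# (after the comma it would be used to select columns instead)
zero_rows = toy[logic, ]
nrow(zero_rows)
```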

Density:

ggplot(data=census) +
  geom_density(aes(x=Citizen)) +
  xlim(c(0,10000))
## Warning: Removed 158 rows containing non-finite values (stat_density).

geom_area()

ggplot(data=census) +
  geom_area(stat='bin', bins = 30, aes(x=Citizen)) +
  xlim(c(0,10000))
## Warning: Removed 158 rows containing non-finite values (stat_bin).

For Discrete values: geom_bar()

ggplot(data=census) +
  geom_bar(aes(x=State))

But, this is hard to read. A few fixes:

  • Rotate x-axis labels (use ‘theme()’)
  • Sort states by count (use library(forcats) – tools for working with categorical variables)

library(forcats)
ggplot(data=census) +
  geom_bar(aes(x=fct_infreq(State))) +
  xlab('') +
  theme(axis.text.x  = element_text(angle=90))
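To see what `fct_infreq()` does on its own, here is a minimal sketch on a short made-up factor. The state labels are invented for illustration.

```r
library(forcats)

# Made-up state labels: Oregon appears 3 times, Alabama 2, Texas 1
states = factor(c("Alabama", "Oregon", "Texas", "Oregon", "Alabama", "Oregon"))

# Default factor levels are alphabetical
levels(states)

# fct_infreq() reorders the levels from most to least frequent
levels(fct_infreq(states))
```

Because ggplot draws discrete axes in level order, reordering the levels is what sorts the bars.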

Challenge 1:

Pick a plot method appropriate for the data in the MeanCommute column.

Graphs for two parameters

Continuous parameters

geom_point()

Scatterplots

ggplot(data=census) + 
    geom_point(aes(x=IncomePerCap, y=Poverty))
## Warning: Removed 881 rows containing missing values (geom_point).

See also: geom_jitter() and geom_label()

geom_smooth

Add a smoothed trend line

ggplot(data=census) + 
    geom_point(aes(x=IncomePerCap, y=Poverty), alpha=0.1) +
    geom_smooth(aes(x=IncomePerCap, y=Poverty)) 
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 881 rows containing non-finite values (stat_smooth).
## Warning: Removed 881 rows containing missing values (geom_point).
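Both layers above repeat the same `aes()` call. A minimal sketch of an equivalent pattern sets the mapping once in `ggplot()`, where every layer inherits it. It uses the built-in `mtcars` data as a stand-in, and `method = "lm"` is an illustrative choice that fits a straight line rather than the default smoother.

```r
library(ggplot2)

# The mapping given to ggplot() is shared by all layers added afterwards
p = ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.5) +      # points inherit x and y
  geom_smooth(method = "lm")     # the fitted line inherits them too

print(p)
```

A layer can still override the shared mapping by passing its own `aes()`.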

Contour and density plots

ggplot(data=census) + 
    geom_hex(aes(x=IncomePerCap, y=Poverty)) 
## Warning: Removed 881 rows containing non-finite values (stat_binhex).

ggplot(data=census) + 
    geom_density2d(aes(x=IncomePerCap, y=Poverty)) 
## Warning: Removed 881 rows containing non-finite values (stat_density2d).

One discrete and one continuous parameter

Boxplots

ggplot(data=census[1:10000,]) + 
    geom_boxplot(aes(x=State, y=Poverty)) 
## Warning: Removed 72 rows containing non-finite values (stat_boxplot).

Violin charts

ggplot(data=census[1:10000,]) + 
    geom_violin(aes(x=State, y=Poverty)) 
## Warning: Removed 72 rows containing non-finite values (stat_ydensity).

Faceting: Multipanel plots

This example also uses more logical subsetting, this time with the ‘%in%’ operator, which tests whether each value appears in a given set.

ggplot(data=census[census$State %in% c('Alabama', 'Oregon', 'New York', 'California'),]) + 
    geom_point(aes(x=Employed, y=Poverty)) +
    facet_grid(~State) + theme(axis.text.x = element_text(angle=90))
## Warning: Removed 165 rows containing missing values (geom_point).
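The `%in%` test used in the subset above can be sketched on a short made-up vector. The values are invented for illustration.

```r
# Made-up state labels
states = c("Alabama", "Oregon", "Texas", "New York", "Oregon")

# %in% returns TRUE for each element found in the set on the right
keep = states %in% c("Oregon", "New York")
keep

# Use the logical vector to keep only the matching elements
states[keep]
```

Using `==` against a multi-element vector would instead compare element by element with recycling, which is rarely what is wanted when testing membership.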

Challenge 2

Develop a plot showing the relationship between MeanCommute and Transit. Then attempt to facet this by State for a subset (use %in%) of five states of your choosing.

Save your work

You should have been keeping your code in an R script file. Save that file now so that you can refer back to it later.

Homework

Part 1

Open a new data file using:

newdata = read.csv('/usr/share/data/classfiles/readlengths3.txt', sep=' ')

These data are read lengths for three samples run on an Illumina sequencing instrument. I want to know how the read length distributions differ between these three samples. The DNA came from fossilized packrat middens dating back up to 3500 years, and the degree of fragmentation may tell us something about the age and/or preservation state of each sample.

Create a plot showing some attribute of the read length distribution that can be compared between these samples. Post it to #plots.

Part 2

Re-make your plot from Monday and post it to #plots.
