Today we will be using ‘ggplot’ (the ggplot2 package), an R package for data visualization. It will help you create publication-quality plots and figures directly from your R code. If there is one thing that you get out of this class, I hope that it is some confidence with these tools.
What is ‘ggplot’? The “gg” in the name stands for the “Grammar of Graphics”, a structured language for building data visualizations out of components such as data, aesthetic mappings, and geometric layers.
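As a rough illustration of that grammar, the sketch below builds one plot from separate pieces joined with `+`. It uses `mtcars`, a dataset that ships with R, purely as a stand-in; the axis labels are illustrative choices, not part of the class data.

```r
library(ggplot2)

# mtcars is a built-in dataset, used here only as a stand-in
p <- ggplot(data = mtcars) +                     # 1. the data
  geom_point(mapping = aes(x = wt, y = mpg)) +   # 2. a geometry plus an aesthetic mapping
  labs(x = "Weight", y = "Miles per gallon")     # 3. axis labels

print(p)
```

Each piece can be swapped independently: change the geometry and keep the data, or keep the geometry and map different columns.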
For this section we will read in a table of US Census Demographic data from https://www.kaggle.com/muonneutrino/us-census-demographic-data. The file can be found on the RStudio server at ‘/usr/share/data/kaggle/census.csv’.
To read data into R we use one of: read.table(), read.csv(), or read.delim().
census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)
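The three readers named above differ mainly in their defaults. The sketch below shows this on two tiny temporary files; the file contents and the variable names (`a`, `b`, `d`) are made up for illustration.

```r
# Write a tiny comma-separated file to a temporary location (made-up values)
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("State,TotalPop", "Alabama,1948", "Oregon,2172"), tmp_csv)

# read.csv() assumes a header row and comma separators
a <- read.csv(tmp_csv)

# read.table() is the general form; header and sep must be given explicitly
b <- read.table(tmp_csv, header = TRUE, sep = ",")

# read.delim() assumes a header row and tab separators
tmp_tsv <- tempfile(fileext = ".tsv")
writeLines(c("State\tTotalPop", "Alabama\t1948", "Oregon\t2172"), tmp_tsv)
d <- read.delim(tmp_tsv)

# All three calls give the same two-row data frame
str(a)
```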
LOOK BACK: How do you write a table to a new file?
For help and an overview of all of the graphics types that are available in ggplot, check out the R ggplot2 cheatsheet.
library(ggplot2)
ggplot(data = census) +
geom_point(mapping = aes(x = Citizen, y = TotalPop))
ggplot(data=census) +
geom_histogram(aes(x=Citizen), binwidth=100)
But the x-axis extends to the maximum value rather than covering just the range where most of the data fall, so the distribution is tough to read.
ggplot(data=census) +
geom_histogram(aes(x=Citizen), binwidth=100)+
xlim(c(0,10000))
## Warning: Removed 158 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
That also got rid of an artifact near zero. Let’s inspect the data for ‘Citizen’ to see what data could be causing that:
#summary(census$Citizen)
sum(census$Citizen <=0)
## [1] 691
Are zero values real?
We can use logical operators and the data.frame index notation to find ONLY those rows of data that have a zero value in the Citizen column.
# test for Citizen ==0
logic = census$Citizen==0
# Then use this vector of logical results to index the data frame
zerocit = census[logic, ]  # the row index goes before the comma; the empty column slot keeps all columns
#summary(zerocit)
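The same two steps on a miniature table may make the mechanics easier to see. The `toy` data frame and its values are hypothetical.

```r
# Hypothetical miniature of the census table
toy <- data.frame(
  State   = c("Alabama", "Oregon", "Oregon", "New York"),
  Citizen = c(1200, 0, 850, 0)
)

# Step 1: one TRUE/FALSE value per row
logic <- toy$Citizen == 0

# Step 2: rows go before the comma, columns after;
# an empty column slot keeps every column
zero_rows <- toy[logic, ]

# sum() counts the TRUE values, because TRUE is treated as 1
n_zero <- sum(logic)

print(zero_rows)
```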
ggplot(data=census) +
geom_density(aes(x=Citizen)) +
xlim(c(0,10000))
## Warning: Removed 158 rows containing non-finite values (stat_density).
ggplot(data=census) +
geom_area(stat='bin', bins = 30, aes(x=Citizen)) +
xlim(c(0,10000))
## Warning: Removed 158 rows containing non-finite values (stat_bin).
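The "Removed ... rows" warnings above arise because xlim() discards values outside the limits before the statistic is computed. When you only want to zoom the view and keep every row in the calculation, coord_cartesian() is an alternative. A minimal sketch, using a made-up `toy` data frame in place of the census table:

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for census$Citizen: mostly small values plus a few large ones
toy <- data.frame(Citizen = c(runif(500, 0, 8000), 20000, 30000, 50000))

# xlim() drops out-of-range rows before binning (and warns about it)
p_drop <- ggplot(data = toy) +
  geom_histogram(aes(x = Citizen), binwidth = 100) +
  xlim(c(0, 10000))

# coord_cartesian() bins every row, then zooms the view
p_zoom <- ggplot(data = toy) +
  geom_histogram(aes(x = Citizen), binwidth = 100) +
  coord_cartesian(xlim = c(0, 10000))

print(p_zoom)
```

For a histogram the two plots look much the same inside the window, but for summaries such as smoothers or boxplots the dropped rows can change the result, so the choice matters.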
ggplot(data=census) +
geom_bar(aes(x=State))
But, this is hard to read. A few fixes:
library(forcats)
ggplot(data=census) +
geom_bar(aes(x=fct_infreq(State))) +
xlab('') +
theme(axis.text.x = element_text(angle=90))
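The reordering in that plot comes from fct_infreq(), which sorts factor levels by how often each one occurs. A small sketch with made-up state labels:

```r
library(forcats)

# Hypothetical state labels: Oregon x3, Texas x2, Alabama x1
states <- factor(c("Oregon", "Alabama", "Oregon", "Texas", "Oregon", "Texas"))

levels(states)              # alphabetical: "Alabama" "Oregon" "Texas"
levels(fct_infreq(states))  # by frequency: "Oregon" "Texas" "Alabama"
```

Because geom_bar() draws one bar per level in level order, reordering the levels reorders the bars from tallest to shortest.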
Pick a plot method appropriate for the data in the MeanCommute column.
Scatterplots
ggplot(data=census) +
geom_point(aes(x=IncomePerCap, y=Poverty))
## Warning: Removed 881 rows containing missing values (geom_point).
See also: geom_jitter() and geom_label()
Add a line
ggplot(data=census) +
geom_point(aes(x=IncomePerCap, y=Poverty), alpha=0.1) +
geom_smooth(aes(x=IncomePerCap, y=Poverty))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 881 rows containing non-finite values (stat_smooth).
## Warning: Removed 881 rows containing missing values (geom_point).
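The message above shows that geom_smooth() picked a smoothing method on its own. If a straight-line fit is wanted instead, the method can be set explicitly. The sketch below uses the built-in `mtcars` dataset as a stand-in and also moves aes() into ggplot() so that both layers share the mapping without repeating it.

```r
library(ggplot2)

# aes() given once in ggplot() is inherited by every layer
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x)  # linear fit with a confidence band

print(p)
```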
ggplot(data=census) +
geom_hex(aes(x=IncomePerCap, y=Poverty))
## Warning: Removed 881 rows containing non-finite values (stat_binhex).
ggplot(data=census) +
geom_density2d(aes(x=IncomePerCap, y=Poverty))
## Warning: Removed 881 rows containing non-finite values (stat_density2d).
ggplot(data=census[1:10000,]) +
geom_boxplot(aes(x=State, y=Poverty))
## Warning: Removed 72 rows containing non-finite values (stat_boxplot).
ggplot(data=census[1:10000,]) +
geom_violin(aes(x=State, y=Poverty))
## Warning: Removed 72 rows containing non-finite values (stat_ydensity).
We can also do more logical subsetting, this time with the ‘%in%’ operator, which tests whether each value appears in a given set.
ggplot(data=census[census$State %in% c('Alabama', 'Oregon', 'New York', 'California'),]) +
geom_point(aes(x=Employed, y=Poverty)) +
facet_grid(~State) + theme(axis.text.x = element_text(angle=90))
## Warning: Removed 165 rows containing missing values (geom_point).
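To see what ‘%in%’ returns on its own, here is a small sketch on a made-up character vector. The result is a logical vector, which is why it can be used as a row index just like the `==` test earlier.

```r
# Hypothetical vector of state names
states <- c("Alabama", "Texas", "Oregon", "Ohio", "Oregon")

# TRUE wherever the element matches any value in the set
keep <- states %in% c("Alabama", "Oregon")
keep          # TRUE FALSE TRUE FALSE TRUE

# Use the logical vector to subset
states[keep]  # "Alabama" "Oregon" "Oregon"
```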
Develop a plot showing the relationship between MeanCommute and Transit. Then attempt to facet this by State for a subset (use %in%) of five states of your choosing.
You should have been keeping your code in an R script file. Save that file now so that you can refer back to it later.
Open a new data file using:
newdata = read.csv('/usr/share/data/classfiles/readlengths3.txt', sep=' ')
These data are read lengths for three samples run on an Illumina sequencing instrument. I want to know how the read length distributions differ between these three samples. The DNA came from fossilized packrat middens dating back as far as 3,500 years, and the degree of fragmentation may tell us something about the age and/or preservation state of each sample.
Create a plot showing some attribute of the read length distribution that can be compared between these samples. Post it to #plots.