To access your RStudio account for this course, navigate to http://ip.of.the.day:8787. This address can change from time to time, so if you have trouble accessing it later, email your instructor.
The BBC Visual and Data Journalism team uses R and ggplot to create their publication graphics. A few weeks ago they released their toolbox on GitHub. More from Medium
For this section we will read in a table of US Census Demographic data from https://www.kaggle.com/muonneutrino/us-census-demographic-data. The file can be found on the RStudio server at ‘/usr/share/data/kaggle/census.csv’.
To read data into R we use read.table(), read.csv(), or read.delim():
census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)
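The three readers differ mainly in their defaults. Here is a minimal sketch that uses small inline strings through the `text` argument instead of real files; the toy values are made up for illustration only:

```r
# read.csv() assumes a header row and comma-separated fields
csv_df <- read.csv(text = "State,TotalPop\nOregon,100\nAlabama,200")

# read.delim() assumes a header row and tab-separated fields
tab_df <- read.delim(text = "State\tTotalPop\nOregon\t100\nAlabama\t200")

# read.table() assumes no header and whitespace-separated fields,
# so the header and separator are stated explicitly here
tbl_df <- read.table(text = "State,TotalPop\nOregon,100\nAlabama,200",
                     header = TRUE, sep = ",")

# Inspect the structure of each result
str(csv_df)
str(tab_df)
str(tbl_df)
```

In practice, pick the reader whose defaults match your file, then override `sep` or `header` only where needed.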
LOOK BACK: How do you write a table to a new file?
For help and an overview of all of the graphics types that are available in ggplot, check out the R ggplot2 cheatsheet.
library(ggplot2)
ggplot(data = census) +
geom_point(mapping = aes(x = Citizen, y = TotalPop))
ggplot(data=census) +
geom_histogram(aes(x=Citizen), binwidth=100)
But the x-axis extends to the maximum value rather than covering only the range where most of the data lie, so the distribution is tough to read.
ggplot(data=census) +
geom_histogram(aes(x=Citizen), binwidth=100)+
xlim(c(0,10000))
## Warning: Removed 158 rows containing non-finite values (stat_bin).
That also got rid of an artifact near zero. Let’s inspect the data for ‘Citizen’ to see what data could be causing that:
#summary(census$Citizen)
sum(census$Citizen <=0)
## [1] 691
Are zero values real?
We can use logical operators and the data.frame index notation to find ONLY those rows of data that have a zero value in the Citizen column.
# test for Citizen ==0
logic = census$Citizen==0
# Then use this vector of logical results to index the data frame
zerocit = census[logic, ] # the row index goes to the left of the comma, the column index to the right
#summary(zerocit)
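The same indexing pattern can be tried on a small data frame where the result is easy to check by eye. This is only a sketch: `toy` and its values are hypothetical stand-ins for the census table.

```r
# Hypothetical toy data standing in for the census table
toy <- data.frame(Tract = c("a", "b", "c", "d"),
                  Citizen = c(0, 1500, 0, 320))

# The comparison returns one TRUE/FALSE value per row
is_zero <- toy$Citizen == 0

# Rows go before the comma, columns after;
# leaving the column slot empty keeps every column
zero_rows <- toy[is_zero, ]

# Negating the logical vector keeps the non-zero rows instead
nonzero_rows <- toy[!is_zero, ]

nrow(zero_rows)
nrow(nonzero_rows)
```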
ggplot(data=census) +
geom_density(aes(x=Citizen)) +
xlim(c(0,10000))
## Warning: Removed 158 rows containing non-finite values (stat_density).
ggplot(data=census) +
geom_area(stat='bin', bins = 30, aes(x=Citizen)) +
xlim(c(0,10000))
## Warning: Removed 158 rows containing non-finite values (stat_bin).
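The "Removed 158 rows" warnings appear because xlim() sets the scale limits, and rows outside those limits are dropped before the statistic is computed. If you only want to zoom in, coord_cartesian() is an alternative that leaves the data untouched. Here is a minimal sketch on a hypothetical toy data frame (the values are made up):

```r
library(ggplot2)

# Hypothetical toy data: most values are small, two are far outside the range
toy <- data.frame(Citizen = c(200, 450, 800, 1200, 3000, 25000, 40000))

# xlim() sets the scale limits: values outside 0-10000 are dropped
# before the bins are computed (ggplot2 reports them in a warning)
p_limits <- ggplot(data = toy) +
  geom_histogram(aes(x = Citizen), binwidth = 100) +
  xlim(c(0, 10000))

# coord_cartesian() only zooms the view: every row still goes into the bins
p_zoom <- ggplot(data = toy) +
  geom_histogram(aes(x = Citizen), binwidth = 100) +
  coord_cartesian(xlim = c(0, 10000))

# Number of toy rows that fall outside the xlim() range
n_outside <- sum(toy$Citizen > 10000)
```

For statistics such as a density or a smoother, the two approaches can give different curves, because the dropped rows no longer contribute to the estimate.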
ggplot(data=census) +
geom_bar(aes(x=State))
But this is hard to read. A few fixes: order the states by count, drop the x-axis label, and rotate the tick labels.
library(forcats)
ggplot(data=census) +
geom_bar(aes(x=fct_infreq(State))) +
xlab('') +
theme(axis.text.x = element_text(angle=90))
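To see what fct_infreq() does to the axis order, it can help to look at the factor levels directly. This sketch uses a hypothetical toy vector of state names rather than the census data:

```r
library(forcats)

# Hypothetical toy vector of state names
states <- c("Oregon", "Oregon", "Oregon",
            "New York", "New York",
            "California")

# A plain factor orders its levels alphabetically
alphabetical <- factor(states)
levels(alphabetical)

# fct_infreq() reorders the levels from most to least frequent,
# which is the order geom_bar() then uses along the x-axis
by_count <- fct_infreq(factor(states))
levels(by_count)
```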
Pick a plot method appropriate for the data in the MeanCommute column.
ggplot(data=census) +
geom_point(aes(x=IncomePerCap, y=Poverty))
## Warning: Removed 881 rows containing missing values (geom_point).
See also: geom_jitter() and geom_label()
ggplot(data=census) +
geom_point(aes(x=IncomePerCap, y=Poverty), alpha=0.1) +
geom_smooth(aes(x=IncomePerCap, y=Poverty))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 881 rows containing non-finite values (stat_smooth).
## Warning: Removed 881 rows containing missing values (geom_point).
ggplot(data=census) +
geom_hex(aes(x=IncomePerCap, y=Poverty))
## Warning: Removed 881 rows containing non-finite values (stat_binhex).
ggplot(data=census) +
geom_density2d(aes(x=IncomePerCap, y=Poverty))
## Warning: Removed 881 rows containing non-finite values (stat_density2d).
ggplot(data=census[1:10000,]) +
geom_boxplot(aes(x=State, y=Poverty))
## Warning: Removed 72 rows containing non-finite values (stat_boxplot).
ggplot(data=census[1:10000,]) +
geom_violin(aes(x=State, y=Poverty))
## Warning: Removed 72 rows containing non-finite values (stat_ydensity).
Plus more logical subsetting using %in%
ggplot(data=census[census$State %in% c('Alabama', 'Oregon', 'New York', 'California'),]) +
geom_point(aes(x=Employed, y=Poverty)) +
facet_grid(~State) + theme(axis.text.x = element_text(angle=90))
## Warning: Removed 165 rows containing missing values (geom_point).
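The %in% subsetting used above can be checked on a small example. This is a sketch on a hypothetical toy data frame; the states and poverty values are invented for illustration.

```r
# Hypothetical toy data standing in for the census table
toy <- data.frame(State = c("Alabama", "Texas", "Oregon", "Ohio", "Oregon"),
                  Poverty = c(18.2, 15.1, 13.0, 14.4, 11.7))

# The states we want to keep
keep <- c("Alabama", "Oregon")

# %in% returns TRUE for each row whose State appears in 'keep'
in_keep <- toy$State %in% keep

# Use the logical vector in the row position, as with == earlier
subset_rows <- toy[in_keep, ]

unique(subset_rows$State)
```

One reason to prefer %in% over chaining several == tests with | is that it stays a single expression no matter how many states are listed.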
Develop a plot showing the relationship between MeanCommute and Transit. Then attempt to facet this by State for five states of your choosing.
Try installing and running the code found here to create ‘mathart’ examples from Marcus Volz.
Either Session -> Save Workspace As
OR:
save.image('workspace_1.RData')
You ALSO want to save the R script file that you have been putting your code in, using File -> Save (As) and the file extension .R to identify it as an R code file.
Create a new blog RMarkdown document outlining the basics of plotting with ggplot and R. Then open a new file using:
newdata = read.csv('/usr/share/data/classfiles/readlengths3.txt', sep=' ')
These data are read lengths for three samples run on an Illumina sequencing instrument. I want to know how the read length distributions differ between these three samples. The DNA came from fossilized packrat middens dating back up to 3500 years ago and the degree of fragmentation may tell something about the age and/or preservation state of the sample.
The sequencing run generated paired-end reads of 125 base pairs (recall the Illumina sequencing by synthesis process). These read lengths come from merged overlapping read pairs, so merged reads longer than 125 bp represent molecules that were originally no more than 240 bp long (2 × 125 bp minus the required minimum 10 bp overlap).
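The 240 bp figure follows from simple arithmetic on the numbers above. The helper `merged_length()` below is a hypothetical name introduced only for this sketch, and it assumes both reads in a pair have the full read length:

```r
# Merged length of a read pair = read 1 + read 2 - shared overlap
# (hypothetical helper, assumes two full-length reads)
merged_length <- function(read_length, overlap) {
  2 * read_length - overlap
}

read_length <- 125   # read length, from the text
min_overlap <- 10    # required minimum overlap, from the text

# Longest molecule that can still be merged
max_merged <- merged_length(read_length, min_overlap)

# A pair that overlaps by 100 bp comes from a shorter molecule
example_merged <- merged_length(read_length, 100)

max_merged
example_merged
```

Molecules longer than `max_merged` cannot be merged at all, so they are absent from this file rather than appearing as long reads.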
Create a new figure showing some attribute of the read length distribution that can be compared between these samples and include it in your blog for this week.