In this lab, we will explore the basics of plotting with ggplot and R.
To start, we must load a few packages:
library(ggplot2)
library(forcats)
Now, we shall read in some data and view it so that we can see what the data looks like.
census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)
Since the data has been successfully read into R, we can now do some visualization of the data.
There are a few types of visual tools that we can use.
First, we can make a scatterplot of the data using the citizen and TotalPop columns.
ggplot(data = census) + geom_point(mapping = aes(x = Citizen, y = TotalPop))
We can also make a histogram of the data.
ggplot(data=census) + geom_histogram(aes(x=Citizen), binwidth=100) + xlim(c(0,15000))
## Warning: Removed 22 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
Another type of graph that can be made using ggplot2 is a density plot.
ggplot(data=census) + geom_density(aes(x=Citizen)) + xlim(c(0,15000))
## Warning: Removed 22 rows containing non-finite values (stat_density).
We can also make an Area plot.
ggplot(data=census) + geom_area(stat='bin', bins = 30, aes(x=Citizen)) + xlim(c(0,10000))
## Warning: Removed 158 rows containing non-finite values (stat_bin).
Bar graphs can also be made using these packages.
ggplot(data=census) + geom_bar(aes(x=State))
This graph can be a little tedious to read. There are, however, some tools that we can use to make this more readable.
ggplot(data=census) + geom_bar(aes(x=fct_infreq(State))) + xlab('') + theme(axis.text.x = element_text(angle=90))
This graph is much easier to read. Rotating the labels on the x-axis and sorting the data largest to smallest made this a much better graph to use.
The next few graphs are used for datasets with continuous parameters.
ggplot(data=census) +
geom_point(aes(x=IncomePerCap, y=Poverty), alpha=0.1) +
geom_smooth(aes(x=IncomePerCap, y=Poverty))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 881 rows containing non-finite values (stat_smooth).
## Warning: Removed 881 rows containing missing values (geom_point).
Contour Plots are also possible.
ggplot(data=census) +
geom_hex(aes(x=IncomePerCap, y=Poverty))
## Warning: Removed 881 rows containing non-finite values (stat_binhex).
Density Plots with more than one parameters can also be created.
ggplot(data=census) +
geom_density2d(aes(x=IncomePerCap, y=Poverty))
## Warning: Removed 881 rows containing non-finite values (stat_density2d).
Boxplots can also be made.
ggplot(data=census[1:10000,]) +
geom_boxplot(aes(x=State, y=Poverty))
## Warning: Removed 72 rows containing non-finite values (stat_boxplot).
Similarly to boxplots, violin charts can be made using ggplot2.
ggplot(data=census[1:10000,]) +
geom_violin(aes(x=State, y=Poverty))
## Warning: Removed 72 rows containing non-finite values (stat_ydensity).
Facetting is a really useful feature, as it allows one to view multiple graphs right next to each other. Here is an example:
ggplot(data=census[census$State %in% c('Alabama', 'Oregon', 'New York', 'California'),]) +
geom_point(aes(x=Employed, y=Poverty)) +
facet_grid(~State) + theme(axis.text.x = element_text(angle=90))
## Warning: Removed 165 rows containing missing values (geom_point).
The facet_grid function allows one to separate each graph based on a certain variable, which can be quite useful.
There is a dataset involve DNA from packrat middens that are up to 3500 years old. Using the visualization tools, we can create a figure of this data.
First, we need to read the file into R.
newdata = read.csv('/usr/share/data/classfiles/readlengths3.txt', sep=' ')
Now, lets take a look at the file and see what it looks like.
head(newdata)
Now, a density plot can be created to compare the read length distribution between the three samples.
ggplot(data=newdata[newdata$SampleID %in% c('GC100B', 'FRT504', 'FRT511A'),]) + geom_density(aes(x=SampleID))
With this density graph, we can see that sample FRT504 has the most reads, while sample GC100B has the smallest amount of reads. Sample FRT511A has about half the density of sample FRT504, meaning it has a little more reads than sample GCB100B. So, this shows that DNA fragmentation increases over time. If the DNA is fragmenting, the number of reads will increase. This is what is happening with sample FRT504. It has the highest read count, so it must be the oldest, as DNA fragmentation increases over time. Sample GC100B must be the youngest, as it has the smallest density. This means it has the lowest number of reads and the least amount of DNA fragmentation. Sample FRT511A has an intermediate amount of reads, as it has the the middle density of the three samples. It probably has not experienced a lot of DNA fragmentation, however, as its density appears to be similar to that of sample GC100B.