Once you have read in the data table, you can create simple graphs of one or two variables, such as the scatterplot and histogram shown below. Use the head() function to look at your data and choose which variables to plot.
library(ggplot2)
library(forcats)
census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)
## CensusTract State County TotalPop Men Women Hispanic White Black
## 1 1001020100 Alabama Autauga 1948 940 1008 0.9 87.4 7.7
## 2 1001020200 Alabama Autauga 2156 1059 1097 0.8 40.4 53.3
## 3 1001020300 Alabama Autauga 2968 1364 1604 0.0 74.5 18.6
## 4 1001020400 Alabama Autauga 4423 2172 2251 10.5 82.8 3.7
## 5 1001020500 Alabama Autauga 10763 4922 5841 0.7 68.5 24.8
## 6 1001020600 Alabama Autauga 3851 1787 2064 13.1 72.9 11.9
## Native Asian Pacific Citizen Income IncomeErr IncomePerCap
## 1 0.3 0.6 0.0 1503 61838 11900 25713
## 2 0.0 2.3 0.0 1662 32303 13538 18021
## 3 0.5 1.4 0.3 2335 44922 5629 20689
## 4 1.6 0.0 0.0 3306 54329 7003 24125
## 5 0.0 3.8 0.0 7666 51965 6935 27526
## 6 0.0 0.0 0.0 2642 63092 9585 30480
## IncomePerCapErr Poverty ChildPoverty Professional Service Office
## 1 4548 8.1 8.4 34.7 17.0 21.3
## 2 2474 25.5 40.3 22.3 24.7 21.5
## 3 2817 12.7 19.7 31.4 24.9 22.1
## 4 2870 2.1 1.6 27.0 20.8 27.0
## 5 2813 11.4 17.5 49.6 14.2 18.2
## 6 7550 14.4 21.9 24.2 17.5 35.4
## Construction Production Drive Carpool Transit Walk OtherTransp
## 1 11.9 15.2 90.2 4.8 0 0.5 2.3
## 2 9.4 22.0 86.3 13.1 0 0.0 0.7
## 3 9.2 12.4 94.8 2.8 0 0.0 0.0
## 4 8.7 16.4 86.6 9.1 0 0.0 2.6
## 5 2.1 15.8 88.0 10.5 0 0.0 0.6
## 6 7.9 14.9 82.7 6.9 0 0.0 6.0
## WorkAtHome MeanCommute Employed PrivateWork PublicWork SelfEmployed
## 1 2.1 25.0 943 77.1 18.3 4.6
## 2 0.0 23.4 753 77.0 16.9 6.1
## 3 2.5 19.6 1373 64.1 23.6 12.3
## 4 1.6 25.3 1782 75.7 21.2 3.1
## 5 0.9 24.8 5037 67.1 27.6 5.3
## 6 4.5 19.8 1560 79.4 14.7 5.8
## FamilyWork Unemployment
## 1 0 5.4
## 2 0 13.3
## 3 0 6.2
## 4 0 10.8
## 5 0 4.2
## 6 0 10.9
ggplot(data = census) +
  geom_point(mapping = aes(x = Men, y = TotalPop))
ggplot(data=census) +
  geom_histogram(aes(x=Citizen), binwidth=200) +
  xlim(c(0,15000))
## Warning: Removed 22 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
This scatterplot compares the number of men with the total population of each census tract. Most tracts seem to be about half men, and the other half of the population would be women. When making histograms, you should almost always change the default bin width; decreasing the bin width may offer more detail when observing the data’s distribution. You can also get rid of empty or unused space by putting limits on the graph’s axes, which makes the graph easier to read. Values that fall outside the limits, such as outliers, are dropped from the plot along with missing values, which is why the warnings above report removed rows.
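The effect of the bin width and of axis limits can be checked on a small example. The sketch below uses a made-up `toy` data frame in place of the census table (the values are invented for illustration only). It also shows `coord_cartesian()`, which is not used in the original code: it zooms the view without dropping rows, whereas `xlim()` removes out-of-range rows before binning.

```r
library(ggplot2)

# Toy data standing in for census$Citizen (values are made up for illustration)
set.seed(1)
toy <- data.frame(Citizen = c(round(runif(200, 0, 10000)), 20000, 30000))

# Wide bins give a coarse picture, narrow bins show more detail
p_wide   <- ggplot(toy) + geom_histogram(aes(x = Citizen), binwidth = 1000)
p_narrow <- ggplot(toy) + geom_histogram(aes(x = Citizen), binwidth = 200)

# xlim() drops rows outside the range before binning (hence the warning)
p_xlim <- p_narrow + xlim(c(0, 15000))

# coord_cartesian() only zooms the view; all rows are still binned
p_zoom <- p_narrow + coord_cartesian(xlim = c(0, 15000))

# Number of bins each histogram computes
n_wide   <- nrow(ggplot_build(p_wide)$data[[1]])
n_narrow <- nrow(ggplot_build(p_narrow)$data[[1]])

# Total count across bins in the zoomed plot (no rows were dropped)
n_zoom_rows <- sum(ggplot_build(p_zoom)$data[[1]]$count)
```

If the removed-rows warning is unwanted and the outliers should still count toward the bins, `coord_cartesian()` may be the better choice; if the outliers should be excluded entirely, `xlim()` does that.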
Density plots are good for visualizing the distribution of data over a continuous interval or time period. Axis limits can be added to these plots as well, and to many other plot types.
ggplot(data=census) +
  geom_density(aes(x=Citizen)) +
  xlim(c(0,15000))
## Warning: Removed 22 rows containing non-finite values (stat_density).
ggplot(data=census) +
  geom_area(stat='bin', bins = 50, aes(x=Citizen)) +
  xlim(c(0,10000))
## Warning: Removed 158 rows containing non-finite values (stat_bin).
Discrete data can take only a limited set of distinct values, such as whole-number counts or categories like State, while continuous data can take any value within a range.
ggplot(data=census) +
  geom_bar(aes(x=fct_infreq(State))) +
  xlab('') +
  theme(axis.text.x = element_text(angle=90))
The ‘fct_infreq()’ function from the forcats package sorts the states by count, which makes them easier to compare. The ‘theme()’ call rotates the axis labels so they can be read more clearly.
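To see what the reordering does on its own, here is a minimal sketch with a made-up `state` vector standing in for `census$State`:

```r
library(forcats)

# Toy vector standing in for census$State (made-up values)
state <- factor(c("Texas", "Maine", "Texas", "Florida", "Texas", "Florida"))

# Default factor levels are in alphabetical order
default_levels <- levels(state)

# fct_infreq() reorders the levels from most to least frequent
sorted_levels <- levels(fct_infreq(state))

default_levels
sorted_levels
```

Because geom_bar() draws one bar per factor level in level order, reordering the levels is what puts the tallest bar first.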
Continuous data can take on any numerical value, such as income or poverty rate. ggplot2 can plot two continuous variables against each other and add a smoothed trend line through the data using ‘geom_smooth()’.
ggplot(data=census) +
  geom_point(aes(x=IncomePerCap, y=ChildPoverty), alpha=0.1) +
  geom_smooth(aes(x=IncomePerCap, y=ChildPoverty))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 1127 rows containing non-finite values (stat_smooth).
## Warning: Removed 1127 rows containing missing values (geom_point).
The trend seems to be that as income per capita increases, child poverty decreases. Two-dimensional density plots can be explored with the functions ‘geom_hex()’ and ‘geom_density2d()’. If you want to plot one discrete variable against a continuous one, use ‘geom_boxplot()’, and pick a few levels of the discrete variable (here, four states) to compare in the boxplot.
ggplot(data=census[census$State %in% c('Massachusetts', 'New York', 'Florida', 'Texas'),]) +
  geom_boxplot(aes(x=State, y=Poverty))
## Warning: Removed 225 rows containing non-finite values (stat_boxplot).
These four states seem to have similar median percentages of people below the poverty line.
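A visual impression like this can be checked numerically by computing the median per group. The sketch below uses a made-up `toy` data frame, not the real census values, and base R’s tapply():

```r
# Toy rows standing in for the census table (values are made up)
toy <- data.frame(
  State   = c("Texas", "Texas", "Texas", "Florida", "Florida", "Florida"),
  Poverty = c(10.0, 20.0, NA, 12.0, 14.0, 19.0)
)

# Median Poverty per State; na.rm = TRUE skips missing values,
# similar to the rows ggplot2 warns about removing
medians <- tapply(toy$Poverty, toy$State, median, na.rm = TRUE)
medians
```

The same call on the real table, for example `tapply(census$Poverty, census$State, median, na.rm = TRUE)`, would give the values behind the middle lines of the boxes.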
ggplot(data=census[1:10000,]) +
  geom_violin(aes(x=State, y=Poverty))
## Warning: Removed 72 rows containing non-finite values (stat_ydensity).
ggplot(data=census) +
  geom_density2d(aes(x=IncomePerCap, y=Poverty))
## Warning: Removed 881 rows containing non-finite values (stat_density2d).
Multipanel plots are useful when you want to compare the levels of a categorical variable in separate panels, which makes it easier to see differences between groups. Each panel shows a different subset of the data.
ggplot(data=census[census$State %in% c('Massachusetts', 'Maine', 'Florida', 'Texas', 'Wisconsin'),]) +
  geom_point(aes(x=MeanCommute, y=Transit)) +
  facet_grid(~State) +
  theme(axis.text.x = element_text(angle=90))
## Warning: Removed 184 rows containing missing values (geom_point).
Of those five states, Massachusetts shows the highest percentage of people commuting by public transportation.
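When there are too many panels to fit in a single row, ‘facet_wrap()’ is an alternative that wraps the panels onto several rows. The sketch below is an illustration on a made-up `toy` data frame, not part of the original analysis:

```r
library(ggplot2)

# Toy data standing in for the census table (values are made up)
toy <- data.frame(
  State       = rep(c("Maine", "Texas", "Florida"), each = 3),
  MeanCommute = c(20, 25, 30, 22, 27, 33, 24, 28, 31),
  Transit     = c(0.5, 1.0, 2.0, 1.0, 2.5, 4.0, 0.0, 1.5, 3.0)
)

# One panel per State; ncol sets how many panels sit side by side
p <- ggplot(data = toy) +
  geom_point(aes(x = MeanCommute, y = Transit)) +
  facet_wrap(~State, ncol = 2)

# Count the panels the plot will draw
n_panels <- length(unique(ggplot_build(p)$data[[1]]$PANEL))
n_panels
```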
DNA was extracted from fossilized packrat middens that date back up to 3,500 years. The three samples, GC100B, FRT504, and FRT511A, were analyzed by Illumina sequencing. We expect DNA fragmentation to increase over time; that is, the shortest fragments should be in the oldest sample (GC100B). The following figure shows the read length distribution for the three samples.
newdata = read.csv('/usr/share/data/classfiles/readlengths3.txt', sep=' ')
summary(newdata)
## readlength SampleID
## Min. : 30.0 FRT504 :19796901
## 1st Qu.: 85.0 FRT511A:12708845
## Median :138.0 GC100B :19220089
## Mean :138.8
## 3rd Qu.:193.0
## Max. :239.0
str(newdata)
## 'data.frame': 51725835 obs. of 2 variables:
## $ readlength: int 101 101 101 101 101 101 101 101 101 101 ...
## $ SampleID : Factor w/ 3 levels "FRT504","FRT511A",..: 3 3 3 3 3 3 3 3 3 3 ...
ggplot(data=newdata[newdata$SampleID %in% c('FRT504', 'FRT511A', 'GC100B'),]) +
  geom_boxplot(aes(x=SampleID, y=readlength), outlier.colour = "red", outlier.alpha = 0.5)
#save.image('Workspace_3.RData')
Looking at this graph, the FRT511A sample seems to have a lot of outliers, and it also shows the longest median read length.
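The number of outliers per sample can also be counted directly rather than judged by eye. The sketch below assumes the usual boxplot rule, where a point is an outlier if it lies more than 1.5 times the interquartile range beyond the quartiles. The `toy` data frame and the `count_outliers()` helper are hypothetical and made up for illustration:

```r
# Toy read lengths for two hypothetical samples (values are made up)
toy <- data.frame(
  SampleID   = c(rep("A", 7), rep("B", 7)),
  readlength = c(100, 101, 102, 103, 104, 105, 239,
                 90, 95, 100, 105, 110, 115, 120)
)

# Count points more than 1.5 * IQR below Q1 or above Q3
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

# Apply the helper to each sample separately
out <- tapply(toy$readlength, toy$SampleID, count_outliers)
out
```

On the real data, the same pattern would be `tapply(newdata$readlength, newdata$SampleID, count_outliers)`, though with over 50 million rows it may take a while to run.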