R Data and Visualization

Once you have read in the data tables, you can create simple graphs of one or two variables, such as the scatterplot and histogram shown below. Use the head() function to preview the data and choose which variables to plot.

library(ggplot2)
library(forcats)
census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)

##   CensusTract   State  County TotalPop  Men Women Hispanic White Black
## 1  1001020100 Alabama Autauga     1948  940  1008      0.9  87.4   7.7
## 2  1001020200 Alabama Autauga     2156 1059  1097      0.8  40.4  53.3
## 3  1001020300 Alabama Autauga     2968 1364  1604      0.0  74.5  18.6
## 4  1001020400 Alabama Autauga     4423 2172  2251     10.5  82.8   3.7
## 5  1001020500 Alabama Autauga    10763 4922  5841      0.7  68.5  24.8
## 6  1001020600 Alabama Autauga     3851 1787  2064     13.1  72.9  11.9
##   Native Asian Pacific Citizen Income IncomeErr IncomePerCap
## 1    0.3   0.6     0.0    1503  61838     11900        25713
## 2    0.0   2.3     0.0    1662  32303     13538        18021
## 3    0.5   1.4     0.3    2335  44922      5629        20689
## 4    1.6   0.0     0.0    3306  54329      7003        24125
## 5    0.0   3.8     0.0    7666  51965      6935        27526
## 6    0.0   0.0     0.0    2642  63092      9585        30480
##   IncomePerCapErr Poverty ChildPoverty Professional Service Office
## 1            4548     8.1          8.4         34.7    17.0   21.3
## 2            2474    25.5         40.3         22.3    24.7   21.5
## 3            2817    12.7         19.7         31.4    24.9   22.1
## 4            2870     2.1          1.6         27.0    20.8   27.0
## 5            2813    11.4         17.5         49.6    14.2   18.2
## 6            7550    14.4         21.9         24.2    17.5   35.4
##   Construction Production Drive Carpool Transit Walk OtherTransp
## 1         11.9       15.2  90.2     4.8       0  0.5         2.3
## 2          9.4       22.0  86.3    13.1       0  0.0         0.7
## 3          9.2       12.4  94.8     2.8       0  0.0         0.0
## 4          8.7       16.4  86.6     9.1       0  0.0         2.6
## 5          2.1       15.8  88.0    10.5       0  0.0         0.6
## 6          7.9       14.9  82.7     6.9       0  0.0         6.0
##   WorkAtHome MeanCommute Employed PrivateWork PublicWork SelfEmployed
## 1        2.1        25.0      943        77.1       18.3          4.6
## 2        0.0        23.4      753        77.0       16.9          6.1
## 3        2.5        19.6     1373        64.1       23.6         12.3
## 4        1.6        25.3     1782        75.7       21.2          3.1
## 5        0.9        24.8     5037        67.1       27.6          5.3
## 6        4.5        19.8     1560        79.4       14.7          5.8
##   FamilyWork Unemployment
## 1          0          5.4
## 2          0         13.3
## 3          0          6.2
## 4          0         10.8
## 5          0          4.2
## 6          0         10.9

ggplot(data = census) +
  geom_point(mapping = aes(x = Men, y = TotalPop))

ggplot(data=census) +
  geom_histogram(aes(x=Citizen), binwidth=200)+xlim(c(0,15000))

## Warning: Removed 22 rows containing non-finite values (stat_bin).

## Warning: Removed 2 rows containing missing values (geom_bar).

The scatterplot plots the number of men against the total population, where each point is one census tract. Most tracts seem to be about half men, so the other half of the population would be women. When making histograms, you will usually need to change the default bin width; decreasing the bin width may offer more detail when observing the data’s distribution. You can also get rid of empty or unused space by putting limits on the graph’s axes, which makes the graph easier to read. Note that xlim() drops observations that fall outside the limits, such as outliers, which is why the warnings above report removed rows.
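As a rough illustration of the bin-width point, the sketch below counts how many bins two different widths produce on the same data. The data frame `toy` is made up for this example and only stands in for the Citizen column; `layer_data()` is used here to inspect the computed bins without drawing the plot.

```r
library(ggplot2)

# Hypothetical stand-in for census$Citizen: evenly spaced values from 0 to 10000
toy <- data.frame(Citizen = seq(0, 10000, by = 20))

# Same data, two bin widths
p_wide   <- ggplot(data = toy) + geom_histogram(aes(x = Citizen), binwidth = 1000)
p_narrow <- ggplot(data = toy) + geom_histogram(aes(x = Citizen), binwidth = 200)

# Each row of the layer data is one histogram bar
n_wide   <- nrow(layer_data(p_wide))
n_narrow <- nrow(layer_data(p_narrow))

n_narrow > n_wide  # the narrower width splits the same range into more bars
```

More bars means finer detail, but very narrow bins can also make the histogram noisy, so it is worth trying a few widths.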

Density Plots

Density plots are good for visualizing the distribution of data over a continuous interval or time period. Axis limits can be added to these plots too, just as with histograms and most other plot types.

ggplot(data=census) +
  geom_density(aes(x=Citizen)) +
  xlim(c(0,15000))

## Warning: Removed 22 rows containing non-finite values (stat_density).

ggplot(data=census) +
  geom_area(stat='bin', bins = 50, aes(x=Citizen)) +
  xlim(c(0,10000))

## Warning: Removed 158 rows containing non-finite values (stat_bin).
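The "Removed ... rows" warnings above appear because xlim() discards values outside the limits before the bins or density are computed. The sketch below, on a made-up data frame `toy`, compares this with `coord_cartesian()`, a ggplot2 function not used in the original lab that zooms the view without dropping data. Treat it as an illustration under those assumptions rather than a recommendation for this dataset.

```r
library(ggplot2)

# Made-up values with one extreme observation (not the real census data)
toy <- data.frame(Citizen = c(100, 200, 300, 400, 50000))

# xlim() sets scale limits: values outside 0..1000 are removed before binning
p_xlim <- ggplot(data = toy) +
  geom_histogram(aes(x = Citizen), binwidth = 100) +
  xlim(c(0, 1000))

# coord_cartesian() only changes the visible window; all rows are still binned
p_zoom <- ggplot(data = toy) +
  geom_histogram(aes(x = Citizen), binwidth = 100) +
  coord_cartesian(xlim = c(0, 1000))

# Total number of observations that ended up in a bin
kept_xlim <- sum(suppressWarnings(layer_data(p_xlim))$count)
kept_zoom <- sum(layer_data(p_zoom)$count)
```

Here `kept_xlim` counts only the four values inside the limits, while `kept_zoom` counts all five.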

Discrete Values

Discrete variables take on a limited set of distinct values, such as integer counts or categories like State, while continuous variables can take any value within a range.

ggplot(data=census) +
  geom_bar(aes(x=fct_infreq(State))) +
  xlab('') +
  theme(axis.text.x  = element_text(angle=90))

The ‘fct_infreq()’ function from the forcats package (loaded with ‘library(forcats)’) sorts the states by count so they are easier to compare. Passing ‘axis.text.x = element_text(angle=90)’ to ‘theme()’ rotates the x-axis labels so they can be read more clearly.
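To see what the reordering does on its own, here is a small sketch using a made-up factor `states` (not the census column). It compares the default alphabetical level order with the order produced by `fct_infreq()`.

```r
library(forcats)

# Made-up factor: Texas appears 3 times, Florida 2 times, Maine once
states <- factor(c("Texas", "Maine", "Texas", "Florida", "Texas", "Florida"))

default_order <- levels(states)              # alphabetical
by_count      <- levels(fct_infreq(states))  # most frequent level first

default_order  # "Florida" "Maine"   "Texas"
by_count       # "Texas"   "Florida" "Maine"
```

Because geom_bar() draws bars in level order, reordering the levels is what moves the tallest bar to the left.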

Two Parameter Graphs; Continuous and Discrete Variables

Continuous data can take on any numerical value, such as income or poverty rate. ggplot2 can plot two continuous variables against each other and add a smoothed trend line through the data using ‘geom_smooth()’.

ggplot(data=census) + 
    geom_point(aes(x=IncomePerCap, y=ChildPoverty), alpha=0.1) +
    geom_smooth(aes(x=IncomePerCap, y=ChildPoverty))

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

## Warning: Removed 1127 rows containing non-finite values (stat_smooth).

## Warning: Removed 1127 rows containing missing values (geom_point).
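The message above shows that geom_smooth() picked a GAM smoother by default for this many points. If a straight best-fit line is wanted instead, `method = "lm"` can be requested. The sketch below assumes a made-up data frame `toy` with a roughly linear downward trend; it is not the census data.

```r
library(ggplot2)

# Made-up values, only to illustrate the call
toy <- data.frame(
  IncomePerCap = c(10, 20, 30, 40, 50, 60),
  ChildPoverty = c(41, 29, 26, 14, 11, 4)
)

# Putting aes() in ggplot() lets both layers share the same mapping
p <- ggplot(data = toy, aes(x = IncomePerCap, y = ChildPoverty)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)  # straight least-squares line

# Layer 2 is the smoother; its fitted values are ordered from low x to high x
fit <- layer_data(p, 2)
slopes_down <- fit$y[1] > fit$y[nrow(fit)]
```

Declaring the mapping once in ggplot() avoids repeating the same aes() call in every layer.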

The trend seems to be that as income per capita increases, child poverty decreases. Two-dimensional density plots can be explored with the functions ‘geom_hex()’ and ‘geom_density2d()’. If you want to plot one discrete variable against a continuous one, use ‘geom_boxplot()’, and pick a few values of the discrete variable (here, four states) to compare in the boxplot.

ggplot(data=census[census$State %in% c('Massachusetts', 'New York', 'Florida', 'Texas'),]) + 
    geom_boxplot(aes(x=State, y=Poverty))

## Warning: Removed 225 rows containing non-finite values (stat_boxplot).

These four states seem to have a similar median percent of people under the poverty line.
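A median read off a boxplot can be checked numerically. The sketch below does this with base R's `tapply()` on a made-up data frame `toy` (the values are illustrative, not taken from the census file); `na.rm = TRUE` mirrors the way the plot drops missing values.

```r
# Made-up poverty percentages for two states, including one missing value
toy <- data.frame(
  State   = c("Florida", "Florida", "Florida", "Texas", "Texas", "Texas"),
  Poverty = c(10, 20, NA, 5, 15, 25)
)

# Median of Poverty within each State, ignoring missing values
medians <- tapply(toy$Poverty, toy$State, median, na.rm = TRUE)
medians
```

In this toy example both states come out with a median of 15, even though their spreads differ, which is the kind of pattern a boxplot makes visible.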

ggplot(data=census[1:10000,]) + 
  geom_violin(aes(x=State, y=Poverty))

## Warning: Removed 72 rows containing non-finite values (stat_ydensity).

ggplot(data=census) + 
    geom_density2d(aes(x=IncomePerCap, y=Poverty))

## Warning: Removed 881 rows containing non-finite values (stat_density2d).

Facetting: Multipanel Plots

Multipanel plots are useful when you want to compare the levels of a categorical variable in separate panels, which makes it easier to see differences in discrete or continuous values. Each panel shows a different subset of the data.

ggplot(data=census[census$State %in% c('Massachusetts', 'Maine', 'Florida', 'Texas', 'Wisconsin'),]) + 
  geom_point(aes(x=MeanCommute, y=Transit)) +
  facet_grid(~State) + theme(axis.text.x = element_text(angle=90))

## Warning: Removed 184 rows containing missing values (geom_point).

Massachusetts shows the highest percent of people commuting on public transportation out of those five states.
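The way facet_grid(~State) splits the data can be checked on a small example. The sketch below uses a made-up data frame `toy` with three states and counts the panels via the `PANEL` column that ggplot2 adds to the layer data.

```r
library(ggplot2)

# Made-up commute data for three states (not the census values)
toy <- data.frame(
  State       = c("Maine", "Maine", "Texas", "Texas", "Florida", "Florida"),
  MeanCommute = c(20, 25, 26, 30, 24, 28),
  Transit     = c(1, 2, 1, 3, 2, 4)
)

p <- ggplot(data = toy) +
  geom_point(aes(x = MeanCommute, y = Transit)) +
  facet_grid(~State)  # one column of panels per state

# Each point is assigned to exactly one panel
n_panels <- length(unique(layer_data(p)$PANEL))
n_panels
```

One panel is created per distinct value of State, so three states give three panels.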

Illumina Sequence Read Length Distributions

DNA was extracted from fossilized packrat middens that date back up to 3500 years. The three samples, GC100B, FRT504, and FRT511A, were analyzed by Illumina sequencing. We expect DNA fragmentation to increase over time; that is, the shortest fragments should be in the oldest sample (GC100B). The following figure shows the read length distribution for the three samples.

newdata = read.csv('/usr/share/data/classfiles/readlengths3.txt', sep=' ')
summary(newdata)

##    readlength       SampleID       
##  Min.   : 30.0   FRT504 :19796901  
##  1st Qu.: 85.0   FRT511A:12708845  
##  Median :138.0   GC100B :19220089  
##  Mean   :138.8                     
##  3rd Qu.:193.0                     
##  Max.   :239.0

str(newdata)

## 'data.frame':    51725835 obs. of  2 variables:
##  $ readlength: int  101 101 101 101 101 101 101 101 101 101 ...
##  $ SampleID  : Factor w/ 3 levels "FRT504","FRT511A",..: 3 3 3 3 3 3 3 3 3 3 ...

ggplot(data=newdata[newdata$SampleID %in% c('FRT504', 'FRT511A', 'GC100B'),]) + 
    geom_boxplot(aes(x=SampleID, y=readlength),outlier.colour = "red", outlier.alpha = 0.5)

#save.image('Workspace_3.RData')

In this graph, the FRT511A sample appears to have many outliers and also shows the longest median read length.
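The points drawn as outliers are, by geom_boxplot()'s default rule, values more than 1.5 times the interquartile range beyond the box. Base R's `boxplot.stats()` applies a closely related rule (its hinges are computed slightly differently), so it can serve as a quick check. The vector `reads` below is made up for illustration and is not taken from the read-length file.

```r
# Made-up read lengths with one unusually short read
reads <- c(100, 101, 102, 103, 104, 30)

stats <- boxplot.stats(reads)  # default coef = 1.5

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values flagged as outliers
```

Only the read of length 30 is flagged here. With tens of millions of reads per sample, even a small fraction of flagged values can look like a dense column of points, so counting `stats$out` relative to the sample size may be more informative than the visual impression.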
