
Introduction to R

R Basics

Accessing RStudio Online

To access your RStudio account for this course, navigate to http://ip.of.the.day:8787. The IP address can change from time to time, so if you have trouble accessing the server later, email your instructor. I’ll let you know if this procedure changes.

Log in when prompted using your username and password.

This RStudio server will host all of our activities for this course, but you may also want to have RStudio on your laptop. To do that download and install (in order):

R: Select the correct version for your operating system

RStudio: Select the FREE Desktop version

R via the command line

You should know that everything we do in the RStudio console can be done in R on the command line as well.

To get started, type ‘R’ at your command line.

What version of R do you have? R prints its version number in the banner that appears at startup.
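
From inside R, the session can also report its own version. This is a minimal sketch using only built-in objects; the result depends on your installation, so no output is shown here:

```r
# Version of the running R session as a single string
print(R.version.string)

# The same information as a version object that can be compared
getRversion() >= "4.0.0"   # TRUE on R 4.0.0 or later, FALSE on older versions
```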

Basic R

We can now get started with R in RStudio.

Simple Variables

Declare variables with ‘=’ or ‘<-’. Either operator assigns a value to a variable.

x=2
print(x) ##Print method
## [1] 2
class(x)
## [1] "numeric"
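
The two assignment operators behave the same at the top level, but differ inside a function call. The sketch below illustrates this; the names a, b, v, and m are made up for the example:

```r
a <- 2          # arrow assignment
b = 2           # equals assignment, equivalent at the top level
identical(a, b)
## [1] TRUE

# Inside a function call '=' names an argument, while '<-' still assigns
m <- median(v <- c(1, 5, 9))   # creates v in the workspace and passes it on
v
## [1] 1 5 9
m
## [1] 5
```

This is one reason many R style guides prefer ‘<-’ for assignment and keep ‘=’ for naming function arguments.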

Data classes

R uses what is known as “dynamic typing”: you do not declare a type when you create a new variable or object. Instead, the class of the object is determined by the value you assign to it. Different classes of data have different properties in R, and many functions require specific classes.

The main classes that you will see are:

Numeric - (default for numbers)

x = 2
class(x)
## [1] "numeric"

Integer - (Sometimes required, but usually behaves the same as numeric for whole numbers. Rarely the default: a typed number such as 2 is numeric)

y = as.integer(x)
class(y)
## [1] "integer"

Character - (Non-numeric character strings)

ch = 'two' # named 'ch' because 'c' is easy to confuse with the c() function
class(ch)
## [1] "character"

Factor - (Categorical variables; the default for character fields in tables read by R versions before 4.0.0)

cf = as.factor(ch)
class(cf)
## [1] "factor"
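
To make dynamic typing concrete, here is a short sketch in which one name holds different classes in turn, followed by a few conversions. All variable names (v, n, i, f) are made up for the example:

```r
v = 2                 # numeric
class(v)
## [1] "numeric"
v = 'two'             # the same name now holds a character string
class(v)
## [1] "character"

n = as.numeric('3.5') # convert a string that looks like a number
n + 1
## [1] 4.5

i = 2L                # the L suffix creates an integer directly
class(i)
## [1] "integer"

f = factor(c('low', 'high', 'low'))
levels(f)             # categories are stored once, sorted alphabetically
## [1] "high" "low"
as.integer(f)         # each value is stored as an index into the levels
## [1] 2 1 2
```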

Operators

R has built in mathematical and logical operators that we can use to set up calculations.

Math: addition (+), subtraction (-), multiplication (*), division (/), exponent (^ or **)

Logical: greater than / less than (> / <), equals (==; it is IMPORTANT to use two equals signs, because a single = assigns a value), greater than or equal to (>=), not (!) and not equal (!=), “in” (%in%; useful for checking whether a list or vector contains some value)

Examples:

Math

x=3
x+x
## [1] 6
x*x
## [1] 9
x**x
## [1] 27
x/x
## [1] 1

Logical

x>1
## [1] TRUE
x<=3
## [1] TRUE
x!=3
## [1] FALSE
x %in% c(1,2,3,4,5)
## [1] TRUE
x==2
## [1] FALSE
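
The operators written between percent signs are easy to mix up: %in% tests membership, while %/% and %% are integer division and remainder. A brief sketch, which also shows how logical tests combine with & and |:

```r
7 %/% 2                 # integer division: how many times 2 fits into 7
## [1] 3
7 %% 2                  # modulo: the remainder of that division
## [1] 1
3 %in% c(1, 2, 3)       # membership test
## [1] TRUE
c(2, 9) %in% c(1, 2, 3) # tests each element on the left
## [1]  TRUE FALSE
!(3 > 1)                # logical NOT
## [1] FALSE
(3 > 1) & (3 < 2)       # logical AND
## [1] FALSE
(3 > 1) | (3 < 2)       # logical OR
## [1] TRUE
```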

Vectors (Arrays/Lists)

Multiple values can be assigned to a single variable, or object, by using R data structures known as vectors or lists. In other programming languages these are often called arrays.

Multi-value objects can often be created with functions like c(), seq(), and vector().

x=c(1:10) # Create a vector
class(x)
## [1] "integer"
print(x)
##  [1]  1  2  3  4  5  6  7  8  9 10

Accessing individual values from an object (R indexes from 1, not 0):

print(x[1]) # First index of vector
## [1] 1
print(x[1:5])
## [1] 1 2 3 4 5
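
The seq() and vector() functions mentioned earlier, along with two other common ways of indexing, can be sketched as follows. The names s, e, and v are made up for the example:

```r
s = seq(0, 1, by = 0.25)   # evenly spaced values from 0 to 1
print(s)
## [1] 0.00 0.25 0.50 0.75 1.00
e = vector('numeric', 3)   # a numeric vector of length 3, initialised to 0
print(e)
## [1] 0 0 0

v = c(1:10)
v[-1]                      # negative index: everything except the first value
## [1]  2  3  4  5  6  7  8  9 10
v[v > 7]                   # logical index: only the values that pass the test
## [1]  8  9 10
length(v)                  # number of values stored
## [1] 10
```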

2-dimensional objects are known as matrices or data.frames. Think of tables in Excel or similar:

y = matrix(nrow=5, ncol=5) # create a 5x5 matrix
print(y)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   NA   NA   NA   NA   NA
## [2,]   NA   NA   NA   NA   NA
## [3,]   NA   NA   NA   NA   NA
## [4,]   NA   NA   NA   NA   NA
## [5,]   NA   NA   NA   NA   NA
class(y)
## [1] "matrix"
y[1,1] = 5 # Assign a value to row 1, column 1
print(y)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5   NA   NA   NA   NA
## [2,]   NA   NA   NA   NA   NA
## [3,]   NA   NA   NA   NA   NA
## [4,]   NA   NA   NA   NA   NA
## [5,]   NA   NA   NA   NA   NA
y = cbind(1:5, 1:5, 1:5, 1:5, 1:5) # five identical columns holding 1 to 5

class(y)
## [1] "matrix"
z = as.data.frame(y)
class(z) # indexing with [row, column] works the same for data.frames
## [1] "data.frame"
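
As a sketch of what access looks like on a data.frame, the example below builds a small table with cbind() and reads from it. The names m and d are made up; the column names V1, V2, V3 are the ones R assigns automatically to unnamed matrix columns:

```r
m = cbind(1:5, 1:5, 1:5)   # small matrix built with cbind()
d = as.data.frame(m)       # unnamed columns are labelled V1, V2, ...
names(d)
## [1] "V1" "V2" "V3"
d[2, 3]                    # row 2, column 3, exactly as for a matrix
## [1] 2
d$V1                       # data.frames also allow access by column name
## [1] 1 2 3 4 5
d[d$V1 > 3, ]              # keep only the rows where V1 is greater than 3
##   V1 V2 V3
## 4  4  4  4
## 5  5  5  5
```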

For ALL multi-value objects we can apply basic operators to all values:

y+2
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    3    3    3    3    3
## [2,]    4    4    4    4    4
## [3,]    5    5    5    5    5
## [4,]    6    6    6    6    6
## [5,]    7    7    7    7    7
y**2
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    4    4    4    4    4
## [3,]    9    9    9    9    9
## [4,]   16   16   16   16   16
## [5,]   25   25   25   25   25
y>2
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,]  TRUE  TRUE  TRUE  TRUE  TRUE
## [4,]  TRUE  TRUE  TRUE  TRUE  TRUE
## [5,]  TRUE  TRUE  TRUE  TRUE  TRUE
y!=4
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,]  TRUE  TRUE  TRUE  TRUE  TRUE
## [2,]  TRUE  TRUE  TRUE  TRUE  TRUE
## [3,]  TRUE  TRUE  TRUE  TRUE  TRUE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,]  TRUE  TRUE  TRUE  TRUE  TRUE

Built in math functions

R provides you with several useful functions for basic math operations that you don’t want to calculate by hand:

sum(c(1:10)) #calculate the sum
## [1] 55
sqrt(88) #square root
## [1] 9.380832
log(10) #logarithm, default: natural log but base can be defined
## [1] 2.302585
# Many others: min(), max(), mean(), median()
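
A few more of these functions in action, including a logarithm with an explicit base. The vector v is made up for the example:

```r
log(100, base = 10)        # logarithm with an explicit base
## [1] 2
log2(8)                    # shortcut for base 2
## [1] 3
exp(log(10))               # exp() undoes the natural log
## [1] 10
v = c(4, 8, 15, 16, 23, 42)
min(v); max(v)
## [1] 4
## [1] 42
median(v)
## [1] 15.5
length(v)                  # number of values
## [1] 6
round(sqrt(88), 2)         # round to two decimal places
## [1] 9.38
```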

Challenge 1:

Given a vector of numeric data:

data= c(1.1,1.5,1.2,0.7,1.6,0.2,1.1, 1.1, 1.2, 1.7,0.2, 0.7, 0.8, 0.9)

Calculate the mean and standard deviation using only the basic operators and the sum() function.

Save your approach for inclusion in your first Blog entry.

Data input/output

For this section we will read in a table of US Census Demographic data from https://www.kaggle.com/muonneutrino/us-census-demographic-data. The file can be found on the RStudio server at ‘/usr/share/data/kaggle/census.csv’.

Read table data from text files with read.table(), read.csv(), or read.delim():

census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)

Write data to a file with write.csv():

write.csv(census, 'newfile.csv') # a .csv extension matches the file's contents
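
To see reading and writing work together without touching the census file, here is a self-contained round trip. It is only a sketch: the table toy and its values are made up, and the file goes to a temporary location:

```r
# Small made-up table standing in for real data
toy = data.frame(Group = c('A', 'B'), Count = c(10L, 20L))

path = tempfile(fileext = '.csv')        # temporary file in the session's temp directory
write.csv(toy, path, row.names = FALSE)  # without this, row numbers become an extra column

back = read.csv(path, stringsAsFactors = FALSE)  # keep text columns as character
str(back)
identical(toy$Count, back$Count)
## [1] TRUE
```

Setting stringsAsFactors explicitly makes the result the same on old and new versions of R.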

Saving your workspace:

Either Session -> Save Workspace As

OR:

save.image('workspace_1.RData')
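
A saved workspace is restored with load(). The sketch below uses save(), which stores only the objects you name, instead of save.image(), which stores everything. The object my_results and the temporary path are made up for the example:

```r
my_results = c(1.5, 2.5)                # an object worth keeping
path = tempfile(fileext = '.RData')     # hypothetical location for the example
save(my_results, file = path)           # save() stores only the named objects

rm(my_results)                          # simulate starting a fresh session
exists('my_results')
## [1] FALSE
load(path)                              # restores objects under their original names
my_results
## [1] 1.5 2.5
```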

Basic plotting

One of R’s biggest advantages is the ability to create high quality graphics in nearly any format or style. Today we will be working with the basic plotting features, but later we will take a look at the ggplot2 library, which is one of the most widely used graphics packages for R.

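Plot with base R’s ‘plot’ (This can be really bad! Calling plot() on a whole data.frame tries to draw every pair of columns against each other, which is slow and unreadable with 37 variables)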

head(census)
str(census) #How many variables/columns do we have?
## 'data.frame':    74001 obs. of  37 variables:
##  $ CensusTract    : num  1e+09 1e+09 1e+09 1e+09 1e+09 ...
##  $ State          : Factor w/ 52 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ County         : Factor w/ 1928 levels "Añasco","Abbeville",..: 90 90 90 90 90 90 90 90 90 90 ...
##  $ TotalPop       : int  1948 2156 2968 4423 10763 3851 2761 3187 10915 5668 ...
##  $ Men            : int  940 1059 1364 2172 4922 1787 1210 1502 5486 2897 ...
##  $ Women          : int  1008 1097 1604 2251 5841 2064 1551 1685 5429 2771 ...
##  $ Hispanic       : num  0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
##  $ White          : num  87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
##  $ Black          : num  7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
##  $ Native         : num  0.3 0 0.5 1.6 0 0 0 3.1 0 0 ...
##  $ Asian          : num  0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
##  $ Pacific        : num  0 0 0.3 0 0 0 0 0 0 0 ...
##  $ Citizen        : int  1503 1662 2335 3306 7666 2642 2060 2391 7778 4217 ...
##  $ Income         : int  61838 32303 44922 54329 51965 63092 34821 73728 60063 41287 ...
##  $ IncomeErr      : int  11900 13538 5629 7003 6935 9585 7867 2447 8602 7857 ...
##  $ IncomePerCap   : int  25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
##  $ IncomePerCapErr: int  4548 2474 2817 2870 2813 7550 3245 4669 2233 4149 ...
##  $ Poverty        : num  8.1 25.5 12.7 2.1 11.4 14.4 28.9 13 13.9 6.8 ...
##  $ ChildPoverty   : num  8.4 40.3 19.7 1.6 17.5 21.9 41.9 25.9 18.3 10 ...
##  $ Professional   : num  34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
##  $ Service        : num  17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
##  $ Office         : num  21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
##  $ Construction   : num  11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
##  $ Production     : num  15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
##  $ Drive          : num  90.2 86.3 94.8 86.6 88 82.7 92.4 84.3 90.1 88.7 ...
##  $ Carpool        : num  4.8 13.1 2.8 9.1 10.5 6.9 7.6 8.1 8.6 7.9 ...
##  $ Transit        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Walk           : num  0.5 0 0 0 0 0 0 0 0 0 ...
##  $ OtherTransp    : num  2.3 0.7 0 2.6 0.6 6 0 1.7 0 1.2 ...
##  $ WorkAtHome     : num  2.1 0 2.5 1.6 0.9 4.5 0 5.9 1.3 2.1 ...
##  $ MeanCommute    : num  25 23.4 19.6 25.3 24.8 19.8 20 24.3 29.4 32.9 ...
##  $ Employed       : int  943 753 1373 1782 5037 1560 1166 1502 4348 2485 ...
##  $ PrivateWork    : num  77.1 77 64.1 75.7 67.1 79.4 82 78.1 73.3 77.9 ...
##  $ PublicWork     : num  18.3 16.9 23.6 21.2 27.6 14.7 14.6 14.8 22.1 15.2 ...
##  $ SelfEmployed   : num  4.6 6.1 12.3 3.1 5.3 5.8 3.4 7.1 4.6 6.9 ...
##  $ FamilyWork     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Unemployment   : num  5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
# plot(census) # left commented out: this may fail

Pick two variables, it gets better:

plot(census[,7:8])

Add plot elements: e.g., title, marker symbols, colors

plot(census[,7:8], main="Percent of population white vs hispanic", pch=20, cex =0.2, col = 'grey')

OR we can create other types of plots by calling other functions, e.g., a histogram or boxplot:

hist(census[,'Citizen'])

boxplot(census[,'Citizen'])

Density plotting: wrap the density() function in plot(). Add further series with points() instead of plot():

plot(density(census[,'Citizen']))
points(density(census[,'TotalPop']), type = 'l', col ='blue')
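
One thing to watch when overlaying series is that the axes are fixed by the first plot() call, so a later curve can run off the edge. The sketch below uses simulated data (not the census table) and made-up names a, b, da, and db. It sizes the axes for both curves first and uses lines(), a common alternative to points(type = 'l'):

```r
set.seed(1)                              # make the simulated data reproducible
a = rnorm(500, mean = 0)                 # simulated values, not census data
b = rnorm(500, mean = 2)

da = density(a)                          # kernel density estimates
db = density(b)

# Size the axes to fit both curves before drawing the first one
plot(da, xlim = range(da$x, db$x), ylim = range(da$y, db$y),
     main = 'Two simulated distributions')
lines(db, col = 'blue')                  # lines() adds to the existing plot
legend('topright', legend = c('a', 'b'), col = c('black', 'blue'), lty = 1)
```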

ggplot2 Graphics

library(ggplot2)
demo = census #Just rename the object for the next lines

SCATTERPLOTS

ggplot(data = demo) +
  geom_point(mapping = aes(x = Citizen, y = TotalPop))

Change the x and y axis titles (each row of this table is a census tract, not a county)

ggplot(data = demo) +
  geom_point(mapping = aes(x = Citizen, y = TotalPop)) +
  xlab("No. of Citizens per Census Tract") + ylab("Total Census Tract Population")

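Use transparency with ‘alpha’. A fixed value belongs outside aes(), which is reserved for mapping columns of the data:

ggplot(data = demo) +
  geom_point(mapping = aes(x = Citizen, y = TotalPop), alpha = 0.5) +
  xlab("No. of Citizens per Census Tract") + ylab("Total Census Tract Population")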

Or color by a group column. The column goes inside aes(), while the fixed alpha stays outside:

ggplot(data = demo) +
  geom_point(mapping = aes(x = Citizen, y = TotalPop, col = State), alpha = 0.2) +
  xlab("No. of Citizens per Census Tract") + ylab("Total Census Tract Population")

Tour RMarkdown

Open the file /usr/share/docs/rmarkdown_template.Rmd

Tour editing RMarkdown docs.

Blog

Using RMarkdown and the census data: Create a document that guides you through the steps of reading in data, accessing parts of it, and creating a basic data visualization. Alongside the code, include comments about what is happening at each step.

Then create a new data visualization (different parameters) and discuss any patterns or trends you observe using the census data. If using ggplot consult the ggplot cheatsheet for guidance beyond scatterplots.

When finished “knit” your document to html (use the “Knit” button in RStudio). Make sure this runs without errors.

Upload homework to Git

If you have not done so already, get the GitHub Desktop app (https://desktop.github.com/) and sign in with your account.

Save your document on the server, then download it with ‘scp’ from a terminal on your laptop:

scp 'username@ip.of.the.day:path/to/file.Rmd' 'path/to/Git'
scp 'username@ip.of.the.day:path/to/file.html' 'path/to/Git'

Then commit and push to your repository using the GitHub Desktop app.
