To access your RStudio account for this course, navigate to http://ip.of.the.day:8787. The IP address can change from time to time, so if you have trouble accessing the server later, email your instructor. I’ll let you know if this protocol changes.
Log in when prompted using your username and password.
This RStudio server will host all of our activities for this course, but you may also want to have RStudio on your laptop. To do that download and install (in order):
R: Select the correct version for your operating system
RStudio: Select the FREE Desktop version
You should know that everything we do in the RStudio console can be done in R on the commandline as well.
To get started type: ‘R’ at your command line.
What version of R do you have? The startup message reports it on its first line.
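The version can also be checked from inside a running session. A small sketch using only functions that ship with base R:

```r
# Ask the running session which version of R it is
R.version.string       # the full version string
getRversion()          # the version number as a comparable object
```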
We can now get started with R in RStudio.
Declare variables with ‘=’ OR ‘<-’. Either one assigns a value to a variable.
x=2
print(x) ##Print method
## [1] 2
class(x)
## [1] "numeric"
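The example above uses ‘=’. As a quick sketch, the arrow form gives the same result (the variable names here are only for illustration):

```r
x <- 2            # arrow assignment
y = 2             # equals-sign assignment
identical(x, y)   # both variables hold the same value
## [1] TRUE
```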
R uses what is known as “dynamic typing” when you create a new variable or object. This means that the class of an object is determined by the value you assign to it rather than declared in advance. Different classes of data have different properties in R, and many functions will require specific classes.
The main classes that you will see are:
Numeric - (default for numbers)
x = 2
class(x)
## [1] "numeric"
Integer - (Whole numbers only. Sometimes required by functions, but a whole number typed directly, like 2, is stored as numeric, so integer is rarely the default)
y = as.integer(x)
class(y)
## [1] "integer"
Character - (Non-numeric character strings)
ch = 'two' # named ch rather than c to avoid confusion with the built-in c() function
class(ch)
## [1] "character"
Factor - (Categorical variables; before R 4.0 this was the default for character fields in tables)
cf = as.factor(ch)
class(cf)
## [1] "factor"
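To see dynamic typing and class conversion in action, here is a short sketch. The variable names and values are made up for illustration:

```r
v = 2
class(v)
## [1] "numeric"
v = 'two'                  # the same name now holds a character string
class(v)
## [1] "character"
n = as.numeric('2.5')      # convert text that looks like a number
n + 1
## [1] 3.5
f = as.factor(c('low', 'high', 'low'))
levels(f)                  # the distinct categories, sorted alphabetically
## [1] "high" "low"
```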
R has built in mathematical and logical operators that we can use to set up calculations.
Math: Addition (+), Subtraction (-), Multiplication (*), Division (/), Exponent (^ OR **)
Logical: Greater than/Less than (>/<), Greater than or equal to/Less than or equal to (>=/<=), Equals (==; IMPORTANT to use two equals signs!), Not (!), Not equal (!=), “In” (%in%; useful for checking if a list or vector contains some value)
Examples:
Math
x=3
x+x
## [1] 6
x*x
## [1] 9
x**x
## [1] 27
x/x
## [1] 1
Logical
x>1
## [1] TRUE
x<=3
## [1] TRUE
x!=3
## [1] FALSE
x %in% c(1,2,3,4,5)
## [1] TRUE
x==2
## [1] FALSE
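Comparisons return logical values, which can be stored and reused like any other value. A brief sketch with made-up variable names:

```r
x = 3
is_small = x < 5             # store the result of a comparison
class(is_small)
## [1] "logical"
vals = c(1, 3, 7)
vals %in% c(1, 2, 3, 4, 5)   # %in% checks each element in turn
## [1]  TRUE  TRUE FALSE
!(x == 2)                    # same result as x != 2
## [1] TRUE
```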
Multiple values can be assigned to a single variable, or object, by using R data structures known as vectors or lists. In other programming languages these are often called arrays.
Multi-value objects can be created with functions like c(), seq(), and vector().
x=c(1:10) # Create a vector
class(x)
## [1] "integer"
print(x)
## [1] 1 2 3 4 5 6 7 8 9 10
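The other two functions mentioned above work along the same lines. A minimal sketch (the values are arbitrary):

```r
s = seq(from = 0, to = 1, by = 0.25)       # evenly spaced values
print(s)
## [1] 0.00 0.25 0.50 0.75 1.00
v = vector(mode = 'numeric', length = 3)   # pre-allocated vector filled with zeros
print(v)
## [1] 0 0 0
l = list(1, 'two', TRUE)                   # a list may mix classes, a vector may not
length(l)
## [1] 3
```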
Accessing individual values from an object (R starts counting at 1):
print(x[1]) # First index of vector
## [1] 1
print(x[1:5])
## [1] 1 2 3 4 5
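Square brackets also accept conditions and negative positions. A short sketch using the same kind of vector:

```r
x = c(1:10)
x[x > 7]       # keep only the values greater than 7
## [1]  8  9 10
x[-1]          # everything except the first element
## [1]  2  3  4  5  6  7  8  9 10
x[c(2, 4)]     # pick positions 2 and 4
## [1] 2 4
```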
2-dimensional objects are known as matrices or data.frames. Think of tables in Excel or similar:
y = matrix(nrow=5, ncol=5) # create a 5x5 matrix
print(y)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   NA   NA   NA   NA   NA
## [2,]   NA   NA   NA   NA   NA
## [3,]   NA   NA   NA   NA   NA
## [4,]   NA   NA   NA   NA   NA
## [5,]   NA   NA   NA   NA   NA
class(y)
## [1] "matrix"
y[1,1] = 5 # Access (and here assign) a cell by [row, column]
print(y)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    5   NA   NA   NA   NA
## [2,]   NA   NA   NA   NA   NA
## [3,]   NA   NA   NA   NA   NA
## [4,]   NA   NA   NA   NA   NA
## [5,]   NA   NA   NA   NA   NA
y = cbind(1:5, 1:5, 1:5, 1:5, 1:5) # bind five identical columns, each holding 1 to 5
class(y)
## [1] "matrix"
z = as.data.frame(y)
class(z) #access is the same for data.frames
## [1] "data.frame"
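A brief sketch of what access looks like on the data.frame. The V1 to V5 column names are the defaults R assigns when the matrix has no column names:

```r
y = cbind(1:5, 1:5, 1:5, 1:5, 1:5)
z = as.data.frame(y)
names(z)       # default column names
## [1] "V1" "V2" "V3" "V4" "V5"
z[2, 3]        # row 2, column 3, same syntax as for a matrix
## [1] 2
z$V1           # a whole column by name (data.frames only)
## [1] 1 2 3 4 5
dim(z)         # number of rows and columns
## [1] 5 5
```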
For ALL multi-value objects we can apply basic operators to every value at once:
y+2
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    3    3    3    3    3
## [2,]    4    4    4    4    4
## [3,]    5    5    5    5    5
## [4,]    6    6    6    6    6
## [5,]    7    7    7    7    7
y**2
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    1    1    1    1
## [2,]    4    4    4    4    4
## [3,]    9    9    9    9    9
## [4,]   16   16   16   16   16
## [5,]   25   25   25   25   25
y>2
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,]  TRUE  TRUE  TRUE  TRUE  TRUE
## [4,]  TRUE  TRUE  TRUE  TRUE  TRUE
## [5,]  TRUE  TRUE  TRUE  TRUE  TRUE
y!=4
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,]  TRUE  TRUE  TRUE  TRUE  TRUE
## [2,]  TRUE  TRUE  TRUE  TRUE  TRUE
## [3,]  TRUE  TRUE  TRUE  TRUE  TRUE
## [4,] FALSE FALSE FALSE FALSE FALSE
## [5,]  TRUE  TRUE  TRUE  TRUE  TRUE
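The TRUE/FALSE results above are useful for counting and filtering. A small sketch that rebuilds the same matrix so it can be run on its own:

```r
y = cbind(1:5, 1:5, 1:5, 1:5, 1:5)
sum(y > 2)     # TRUE counts as 1, so this counts the matching cells
## [1] 15
y[y > 4]       # a logical matrix can pull out the matching values
## [1] 5 5 5 5 5
v = c(1, 2, 3)
v * 10         # the same rule applies to plain vectors
## [1] 10 20 30
```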
R provides you with several useful functions for basic math operations that you don’t want to calculate by hand:
sum(c(1:10)) #calculate the sum
## [1] 55
sqrt(88) #square root
## [1] 9.380832
log(10) #logarithm, default: natural log but base can be defined
## [1] 2.302585
##Many others... min(), max(), mean(), median()
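A quick sketch of a few of these, including a logarithm with a chosen base. The numbers are arbitrary:

```r
vals = c(4, 8, 15, 16, 23, 42)
min(vals)               # smallest value
## [1] 4
max(vals)               # largest value
## [1] 42
mean(vals)              # arithmetic mean
## [1] 18
median(vals)            # middle value (average of the two middle ones here)
## [1] 15.5
log(100, base = 10)     # logarithm with base 10 instead of the natural log
## [1] 2
```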
Given a vector of numeric data:
data= c(1.1,1.5,1.2,0.7,1.6,0.2,1.1, 1.1, 1.2, 1.7,0.2, 0.7, 0.8, 0.9)
Calculate the mean and standard deviation using only the basic operators and the sum() function.
Save your approach for inclusion in your first Blog entry.
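As a reminder, the two quantities in this exercise are usually defined as follows. This assumes the sample standard deviation (dividing by n - 1), which is also what R's built-in sd() computes, with n the number of values and x_i the individual values:

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
$$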
For this section we will read in a table of US Census Demographic data from https://www.kaggle.com/muonneutrino/us-census-demographic-data. The file can be found on the RStudio server at ‘/usr/share/data/kaggle/census.csv’.
Functions for reading tables: read.table(), read.csv(), read.delim()
census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)
Function for writing tables: write.csv()
write.csv(census, 'newfile.txt')
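To check that a table survives the trip to disk and back, here is a minimal sketch. It uses a tiny made-up table and a temporary file rather than the census file:

```r
# Small stand-in table (values are illustrative, not the real census file)
df = data.frame(State = c('Alabama', 'Alaska'), TotalPop = c(1948, 2156))
path = tempfile(fileext = '.csv')         # temporary file location
write.csv(df, path, row.names = FALSE)    # row.names = FALSE skips the extra index column
back = read.csv(path)                     # read the file back in
dim(back)                                 # same shape as the original table
## [1] 2 2
```

Without row.names = FALSE, write.csv() adds a column of row numbers that comes back as an extra column when the file is read again.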
To save your workspace, either use Session -> Save Workspace As
OR:
save.image('workspace_1.RData')
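A saved file is restored with load(). As a sketch, save() works like save.image() but stores only the objects you name. The object names and temporary file here are made up:

```r
a = 1:3
b = 'keep me'
path = tempfile(fileext = '.RData')
save(a, b, file = path)    # save only the named objects
rm(a, b)                   # remove them from the session
load(path)                 # restores a and b under their original names
print(b)
## [1] "keep me"
```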
One of R’s biggest advantages is the ability to create high quality graphics in nearly any format or style. Today we will be working with the basic plotting features, but later we will take a look at the ggplot2 library. ggplot2 is one of the most widely used graphics packages for R.
Plot with base R’s ‘plot’ (On a whole table this can be really bad!)
head(census)
str(census) #How many variables/columns do we have?
## 'data.frame': 74001 obs. of 37 variables:
## $ CensusTract : num 1e+09 1e+09 1e+09 1e+09 1e+09 ...
## $ State : Factor w/ 52 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ County : Factor w/ 1928 levels "Añasco","Abbeville",..: 90 90 90 90 90 90 90 90 90 90 ...
## $ TotalPop : int 1948 2156 2968 4423 10763 3851 2761 3187 10915 5668 ...
## $ Men : int 940 1059 1364 2172 4922 1787 1210 1502 5486 2897 ...
## $ Women : int 1008 1097 1604 2251 5841 2064 1551 1685 5429 2771 ...
## $ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
## $ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
## $ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
## $ Native : num 0.3 0 0.5 1.6 0 0 0 3.1 0 0 ...
## $ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
## $ Pacific : num 0 0 0.3 0 0 0 0 0 0 0 ...
## $ Citizen : int 1503 1662 2335 3306 7666 2642 2060 2391 7778 4217 ...
## $ Income : int 61838 32303 44922 54329 51965 63092 34821 73728 60063 41287 ...
## $ IncomeErr : int 11900 13538 5629 7003 6935 9585 7867 2447 8602 7857 ...
## $ IncomePerCap : int 25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
## $ IncomePerCapErr: int 4548 2474 2817 2870 2813 7550 3245 4669 2233 4149 ...
## $ Poverty : num 8.1 25.5 12.7 2.1 11.4 14.4 28.9 13 13.9 6.8 ...
## $ ChildPoverty : num 8.4 40.3 19.7 1.6 17.5 21.9 41.9 25.9 18.3 10 ...
## $ Professional : num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
## $ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
## $ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
## $ Construction : num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
## $ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
## $ Drive : num 90.2 86.3 94.8 86.6 88 82.7 92.4 84.3 90.1 88.7 ...
## $ Carpool : num 4.8 13.1 2.8 9.1 10.5 6.9 7.6 8.1 8.6 7.9 ...
## $ Transit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Walk : num 0.5 0 0 0 0 0 0 0 0 0 ...
## $ OtherTransp : num 2.3 0.7 0 2.6 0.6 6 0 1.7 0 1.2 ...
## $ WorkAtHome : num 2.1 0 2.5 1.6 0.9 4.5 0 5.9 1.3 2.1 ...
## $ MeanCommute : num 25 23.4 19.6 25.3 24.8 19.8 20 24.3 29.4 32.9 ...
## $ Employed : int 943 753 1373 1782 5037 1560 1166 1502 4348 2485 ...
## $ PrivateWork : num 77.1 77 64.1 75.7 67.1 79.4 82 78.1 73.3 77.9 ...
## $ PublicWork : num 18.3 16.9 23.6 21.2 27.6 14.7 14.6 14.8 22.1 15.2 ...
## $ SelfEmployed : num 4.6 6.1 12.3 3.1 5.3 5.8 3.4 7.1 4.6 6.9 ...
## $ FamilyWork : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Unemployment : num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
#plot(census) May fail or run for a very long time: it tries to plot every column against every other
Pick two variables and it gets better:
plot(census[,7:8])
Add plot elements: e.g., title, marker symbols, colors
plot(census[,7:8], main="Percent of population white vs hispanic", pch=20, cex =0.2, col = 'grey')
OR we can create other types of plots by calling other functions, e.g., a histogram or boxplot:
hist(census[,'Citizen'])
boxplot(census[,'Citizen'])
Density plotting: Wrap the density() function in plot(). Add further series with points() instead of plot():
plot(density(census[,'Citizen']))
points(density(census[,'TotalPop']), type = 'l', col ='blue')
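Plots can also be written straight to a file. A minimal sketch using simulated values as a stand-in for a census column (the numbers and the temporary file name are not from the course data):

```r
set.seed(1)                                # reproducible stand-in data
vals = rnorm(500, mean = 50, sd = 10)      # simulated values, not census data
out = tempfile(fileext = '.pdf')
pdf(out, width = 7, height = 5)            # open a PDF graphics device
plot(density(vals), main = 'Density of simulated values')
abline(v = mean(vals), col = 'blue')       # add a vertical line at the mean
dev.off()                                  # close the device to finish writing the file
file.exists(out)
```

Everything drawn between pdf() and dev.off() goes to the file instead of the plot window.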
library(ggplot2)
demo = census #Just rename the object for the next lines
ggplot(data = demo) +
  geom_point(mapping = aes(x = Citizen, y = TotalPop))
### change x and y axis titles (each row of the table is one census tract)
ggplot(data = demo) +
  geom_point(mapping = aes(x = Citizen, y = TotalPop)) +
  xlab("No. of Citizens per Census Tract") + ylab("Total Census Tract Population")
### make points semi-transparent (a fixed value like alpha goes outside aes())
ggplot(data = demo) +
  geom_point(mapping = aes(x = Citizen, y = TotalPop), alpha = 0.5) +
  xlab("No. of Citizens per Census Tract") + ylab("Total Census Tract Population")
### color points by State (a column of the data, so it goes inside aes())
ggplot(data = demo) +
  geom_point(mapping = aes(x = Citizen, y = TotalPop, col = State), alpha = 0.2) +
  xlab("No. of Citizens per Census Tract") + ylab("Total Census Tract Population")
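Other plot types follow the same pattern with a different geom. A hedged sketch of a histogram, using simulated counts in place of the census table and assuming the ggplot2 package is installed:

```r
# Runs only if ggplot2 is available; the data below are simulated, not census values
if (requireNamespace('ggplot2', quietly = TRUE)) {
  library(ggplot2)
  set.seed(1)
  toy = data.frame(Citizen = rpois(200, lambda = 3000))   # made-up stand-in column
  p = ggplot(data = toy) +
    geom_histogram(mapping = aes(x = Citizen), bins = 30) +
    xlab("No. of Citizens") + ylab("Count")
  # Write the plot to a temporary PDF file
  ggsave(tempfile(fileext = '.pdf'), plot = p, width = 6, height = 4)
}
```

Storing the plot in a variable (p here) makes it possible to save it or add further layers later.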
Open the file /usr/share/docs/rmarkdown_template.Rmd
Tour editing RMarkdown docs.
Using rmarkdown and the census data: Create a document that guides you through the steps of reading in data, accessing parts of it, and creating a basic data visualization. Alongside the code, include comments about what is happening at each step.
Then create a new data visualization (different parameters) and discuss any patterns or trends you observe using the census data. If using ggplot consult the ggplot cheatsheet for guidance beyond scatterplots.
When finished “knit” your document to html (use the “Knit” button in RStudio). Make sure this runs without errors.
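Knitting can also be started from the console. A sketch assuming the rmarkdown package is installed, where 'my_report.Rmd' is a placeholder for your own file name:

```r
input = 'my_report.Rmd'    # placeholder: replace with the name of your document
# Render only if the package is installed and the file actually exists
if (requireNamespace('rmarkdown', quietly = TRUE) && file.exists(input)) {
  rmarkdown::render(input, output_format = 'html_document')
}
```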
If you have not done so already, get the GitHub Desktop app: https://desktop.github.com/ and sign in with your account.
Save your document to the server, then download it to your laptop with ‘scp’:
scp 'username@ip.of.the.day:path/to/file.Rmd' 'path/to/Git'
scp 'username@ip.of.the.day:path/to/file.html' 'path/to/Git'
Then commit and push to your repository using the GitHub Desktop app.