Here we will explore a way to read data files and get a glimpse of the structure and variables within the data. US Census Demographic Data will be used to observe demographic and economic factors among the Unites States.
To manipulate data, we will call certain columns of the data file up.
Using ggplot and other graphical data codes, we can compare variables to observe patterns that occur in the data.
#How to read a csv text data file
census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)
str(census)
## 'data.frame': 74001 obs. of 37 variables:
## $ CensusTract : num 1e+09 1e+09 1e+09 1e+09 1e+09 ...
## $ State : Factor w/ 52 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ County : Factor w/ 1928 levels "Abbeville","Acadia",..: 90 90 90 90 90 90 90 90 90 90 ...
## $ TotalPop : int 1948 2156 2968 4423 10763 3851 2761 3187 10915 5668 ...
## $ Men : int 940 1059 1364 2172 4922 1787 1210 1502 5486 2897 ...
## $ Women : int 1008 1097 1604 2251 5841 2064 1551 1685 5429 2771 ...
## $ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
## $ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
## $ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
## $ Native : num 0.3 0 0.5 1.6 0 0 0 3.1 0 0 ...
## $ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
## $ Pacific : num 0 0 0.3 0 0 0 0 0 0 0 ...
## $ Citizen : int 1503 1662 2335 3306 7666 2642 2060 2391 7778 4217 ...
## $ Income : int 61838 32303 44922 54329 51965 63092 34821 73728 60063 41287 ...
## $ IncomeErr : int 11900 13538 5629 7003 6935 9585 7867 2447 8602 7857 ...
## $ IncomePerCap : int 25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
## $ IncomePerCapErr: int 4548 2474 2817 2870 2813 7550 3245 4669 2233 4149 ...
## $ Poverty : num 8.1 25.5 12.7 2.1 11.4 14.4 28.9 13 13.9 6.8 ...
## $ ChildPoverty : num 8.4 40.3 19.7 1.6 17.5 21.9 41.9 25.9 18.3 10 ...
## $ Professional : num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
## $ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
## $ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
## $ Construction : num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
## $ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
## $ Drive : num 90.2 86.3 94.8 86.6 88 82.7 92.4 84.3 90.1 88.7 ...
## $ Carpool : num 4.8 13.1 2.8 9.1 10.5 6.9 7.6 8.1 8.6 7.9 ...
## $ Transit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Walk : num 0.5 0 0 0 0 0 0 0 0 0 ...
## $ OtherTransp : num 2.3 0.7 0 2.6 0.6 6 0 1.7 0 1.2 ...
## $ WorkAtHome : num 2.1 0 2.5 1.6 0.9 4.5 0 5.9 1.3 2.1 ...
## $ MeanCommute : num 25 23.4 19.6 25.3 24.8 19.8 20 24.3 29.4 32.9 ...
## $ Employed : int 943 753 1373 1782 5037 1560 1166 1502 4348 2485 ...
## $ PrivateWork : num 77.1 77 64.1 75.7 67.1 79.4 82 78.1 73.3 77.9 ...
## $ PublicWork : num 18.3 16.9 23.6 21.2 27.6 14.7 14.6 14.8 22.1 15.2 ...
## $ SelfEmployed : num 4.6 6.1 12.3 3.1 5.3 5.8 3.4 7.1 4.6 6.9 ...
## $ FamilyWork : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Unemployment : num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
Above you will see the beginning entries of the data set, beginning alphabetically with Alabama, listing its first few demographic and economic parameters.
Using the summary(census) command, you will be able to see the min/max, median, mean, and quartiles of each variable. This can be useful for getting a gauge on how these variables are comparable to eachother.
summary(census)
## CensusTract State County
## Min. :1.001e+09 California : 8057 Los Angeles: 2346
## 1st Qu.:1.304e+10 Texas : 5265 Cook : 1326
## Median :2.805e+10 New York : 4918 Orange : 939
## Mean :2.839e+10 Florida : 4245 Jefferson : 927
## 3rd Qu.:4.200e+10 Pennsylvania: 3218 Maricopa : 916
## Max. :7.215e+10 Illinois : 3123 Montgomery : 833
## (Other) :45175 (Other) :66714
## TotalPop Men Women Hispanic
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.00
## 1st Qu.: 2891 1st Qu.: 1409 1st Qu.: 1461 1st Qu.: 2.40
## Median : 4063 Median : 1986 Median : 2066 Median : 7.00
## Mean : 4326 Mean : 2128 Mean : 2198 Mean : 16.86
## 3rd Qu.: 5442 3rd Qu.: 2674 3rd Qu.: 2774 3rd Qu.: 20.40
## Max. :53812 Max. :27962 Max. :27250 Max. :100.00
## NA's :690
## White Black Native Asian
## Min. : 0.00 Min. : 0.00 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 39.40 1st Qu.: 0.70 1st Qu.: 0.0000 1st Qu.: 0.200
## Median : 71.40 Median : 3.70 Median : 0.0000 Median : 1.400
## Mean : 62.03 Mean : 13.27 Mean : 0.7277 Mean : 4.588
## 3rd Qu.: 88.30 3rd Qu.: 14.40 3rd Qu.: 0.4000 3rd Qu.: 4.800
## Max. :100.00 Max. :100.00 Max. :100.0000 Max. :91.300
## NA's :690 NA's :690 NA's :690 NA's :690
## Pacific Citizen Income IncomeErr
## Min. : 0.000 Min. : 0 Min. : 2611 Min. : 390
## 1st Qu.: 0.000 1st Qu.: 2037 1st Qu.: 37683 1st Qu.: 5317
## Median : 0.000 Median : 2863 Median : 51094 Median : 7732
## Mean : 0.145 Mean : 3043 Mean : 57226 Mean : 9134
## 3rd Qu.: 0.000 3rd Qu.: 3838 3rd Qu.: 70117 3rd Qu.: 11258
## Max. :84.700 Max. :37416 Max. :248750 Max. :123116
## NA's :690 NA's :1100 NA's :1100
## IncomePerCap IncomePerCapErr Poverty ChildPoverty
## Min. : 128 Min. : 85 Min. : 0.00 Min. : 0.00
## 1st Qu.: 19123 1st Qu.: 2312 1st Qu.: 7.20 1st Qu.: 7.00
## Median : 25344 Median : 3127 Median : 13.40 Median : 17.80
## Mean : 28491 Mean : 3943 Mean : 16.96 Mean : 22.49
## 3rd Qu.: 33894 3rd Qu.: 4537 3rd Qu.: 23.10 3rd Qu.: 33.80
## Max. :254204 Max. :134380 Max. :100.00 Max. :100.00
## NA's :740 NA's :740 NA's :835 NA's :1118
## Professional Service Office Construction
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 24.10 1st Qu.: 13.4 1st Qu.: 20.10 1st Qu.: 5.000
## Median : 32.60 Median : 17.9 Median : 23.80 Median : 8.400
## Mean : 34.80 Mean : 19.1 Mean : 23.95 Mean : 9.292
## 3rd Qu.: 43.88 3rd Qu.: 23.6 3rd Qu.: 27.50 3rd Qu.: 12.500
## Max. :100.00 Max. :100.0 Max. :100.00 Max. :100.000
## NA's :807 NA's :807 NA's :807 NA's :807
## Production Drive Carpool Transit
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 7.10 1st Qu.: 72.00 1st Qu.: 6.000 1st Qu.: 0.000
## Median : 11.80 Median : 79.70 Median : 8.800 Median : 1.100
## Mean : 12.86 Mean : 75.53 Mean : 9.627 Mean : 5.456
## 3rd Qu.: 17.40 3rd Qu.: 84.90 3rd Qu.: 12.300 3rd Qu.: 4.700
## Max. :100.00 Max. :100.00 Max. :100.000 Max. :100.000
## NA's :807 NA's :797 NA's :797 NA's :797
## Walk OtherTransp WorkAtHome MeanCommute
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 1.20
## 1st Qu.: 0.400 1st Qu.: 0.400 1st Qu.: 1.800 1st Qu.:20.80
## Median : 1.400 Median : 1.100 Median : 3.500 Median :25.00
## Mean : 3.123 Mean : 1.892 Mean : 4.368 Mean :25.67
## 3rd Qu.: 3.500 3rd Qu.: 2.500 3rd Qu.: 5.900 3rd Qu.:29.80
## Max. :100.000 Max. :100.000 Max. :100.000 Max. :80.00
## NA's :797 NA's :797 NA's :797 NA's :949
## Employed PrivateWork PublicWork SelfEmployed
## Min. : 0 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 1249 1st Qu.: 74.60 1st Qu.: 9.60 1st Qu.: 3.500
## Median : 1846 Median : 80.10 Median : 13.40 Median : 5.500
## Mean : 1984 Mean : 78.98 Mean : 14.62 Mean : 6.234
## 3rd Qu.: 2553 3rd Qu.: 84.60 3rd Qu.: 18.20 3rd Qu.: 8.100
## Max. :24075 Max. :100.00 Max. :100.00 Max. :100.000
## NA's :807 NA's :807 NA's :807
## FamilyWork Unemployment
## Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 5.100
## Median : 0.0000 Median : 7.700
## Mean : 0.1698 Mean : 9.029
## 3rd Qu.: 0.0000 3rd Qu.: 11.400
## Max. :26.5000 Max. :100.000
## NA's :807 NA's :802
To look at a specific variable from a set of data, add a ‘$’ symbol after the data file name followed by the varaible of interest. The print command used below will being printing out all of the values for that variable but will reach a max point and omit the remaining values of that variable.
summary(census$County)
## Los Angeles Cook Orange
## 2346 1326 939
## Jefferson Maricopa Montgomery
## 927 916 833
## Harris Kings Wayne
## 791 788 768
## Clark Queens San Diego
## 694 669 628
## Dallas Washington Middlesex
## 565 562 535
## Suffolk Miami-Dade Franklin
## 527 519 497
## Marion Riverside Cuyahoga
## 457 453 447
## Jackson Lake Hillsborough
## 435 427 407
## Allegheny King Essex
## 402 399 392
## Monroe Philadelphia Santa Clara
## 390 384 372
## San Bernardino Bexar Hamilton
## 369 366 365
## Broward Alameda Tarrant
## 362 361 357
## Bronx Douglas Oakland
## 339 338 338
## Palm Beach Erie Polk
## 338 328 325
## Madison Sacramento Shelby
## 322 317 317
## Hennepin Milwaukee Nassau
## 299 298 297
## El Paso New York Lee
## 291 288 275
## St. Louis Fairfax Fulton
## 266 258 253
## Pinellas Fairfield Honolulu
## 246 244 244
## Mecklenburg Johnson Oklahoma
## 242 241 241
## Pima Cumberland Union
## 241 240 240
## Delaware Hartford Westchester
## 236 224 223
## Prince George's Travis Macomb
## 218 218 217
## DuPage Baltimore Salt Lake
## 216 214 212
## Contra Costa Kent Fayette
## 208 207 201
## Baltimore city Fresno Orleans
## 200 199 198
## San Francisco York Davidson
## 197 196 195
## DeKalb Lancaster New Haven
## 195 190 190
## Worcester Pierce Wake
## 189 188 187
## Adams Greene Bergen
## 186 182 179
## District of Columbia Duval Knox
## 179 177 176
## Tulsa San Juan Ventura
## 175 174 174
## Warren Multnomah Richmond
## 173 171 170
## (Other)
## 37326
Use the plot command to compare two variables. Titles and color can also be added to enhance the plot.
plot(census[,5:6], main="Percent of population men vs women", pch=20, cex =0.2, col = 'blue')
Other types of graphs ie histograms and boxplots can be created to visualize the data.
hist(census[,'Income'])
boxplot(census[,'Income'])
Density plotting allows for a smoother distribution of plot values.
plot(density(census[,'Women']))
points(density(census[,'TotalPop']), type = 'l', col ='blue')
More complex visualizations can also be made using ggplot, ie scatterplots. Adding an alpha parameter changes the transparency of the data points.
library(ggplot2)
demo = census
ggplot(data = demo) +
geom_point(mapping = aes(x = TotalPop, y = Unemployment, alpha=0.2, col=State)) +
xlab("Total County Population") + ylab("No. of Unemployed per County")
## Warning: Removed 802 rows containing missing values (geom_point).
Looking at this scatterplot, it seems that as the total county population size increases, the number of unemployed persons decreases. There is a high density of unemployment at county populations below 20,000.