home

Reading Data Files

Here we will explore a way to read data files and get a glimpse of the structure and variables within the data. US Census Demographic Data will be used to observe demographic and economic factors among the Unites States.

Manipulating Data

To manipulate data, we will call certain columns of the data file up.

Observing Data Patterns

Using ggplot and other graphical data codes, we can compare variables to observe patterns that occur in the data.

Run R

#How to read a csv text data file
census = read.csv('/usr/share/data/kaggle/census.csv')
head(census)
str(census)
## 'data.frame':    74001 obs. of  37 variables:
##  $ CensusTract    : num  1e+09 1e+09 1e+09 1e+09 1e+09 ...
##  $ State          : Factor w/ 52 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ County         : Factor w/ 1928 levels "Abbeville","Acadia",..: 90 90 90 90 90 90 90 90 90 90 ...
##  $ TotalPop       : int  1948 2156 2968 4423 10763 3851 2761 3187 10915 5668 ...
##  $ Men            : int  940 1059 1364 2172 4922 1787 1210 1502 5486 2897 ...
##  $ Women          : int  1008 1097 1604 2251 5841 2064 1551 1685 5429 2771 ...
##  $ Hispanic       : num  0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
##  $ White          : num  87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
##  $ Black          : num  7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
##  $ Native         : num  0.3 0 0.5 1.6 0 0 0 3.1 0 0 ...
##  $ Asian          : num  0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
##  $ Pacific        : num  0 0 0.3 0 0 0 0 0 0 0 ...
##  $ Citizen        : int  1503 1662 2335 3306 7666 2642 2060 2391 7778 4217 ...
##  $ Income         : int  61838 32303 44922 54329 51965 63092 34821 73728 60063 41287 ...
##  $ IncomeErr      : int  11900 13538 5629 7003 6935 9585 7867 2447 8602 7857 ...
##  $ IncomePerCap   : int  25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
##  $ IncomePerCapErr: int  4548 2474 2817 2870 2813 7550 3245 4669 2233 4149 ...
##  $ Poverty        : num  8.1 25.5 12.7 2.1 11.4 14.4 28.9 13 13.9 6.8 ...
##  $ ChildPoverty   : num  8.4 40.3 19.7 1.6 17.5 21.9 41.9 25.9 18.3 10 ...
##  $ Professional   : num  34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
##  $ Service        : num  17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
##  $ Office         : num  21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
##  $ Construction   : num  11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
##  $ Production     : num  15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
##  $ Drive          : num  90.2 86.3 94.8 86.6 88 82.7 92.4 84.3 90.1 88.7 ...
##  $ Carpool        : num  4.8 13.1 2.8 9.1 10.5 6.9 7.6 8.1 8.6 7.9 ...
##  $ Transit        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Walk           : num  0.5 0 0 0 0 0 0 0 0 0 ...
##  $ OtherTransp    : num  2.3 0.7 0 2.6 0.6 6 0 1.7 0 1.2 ...
##  $ WorkAtHome     : num  2.1 0 2.5 1.6 0.9 4.5 0 5.9 1.3 2.1 ...
##  $ MeanCommute    : num  25 23.4 19.6 25.3 24.8 19.8 20 24.3 29.4 32.9 ...
##  $ Employed       : int  943 753 1373 1782 5037 1560 1166 1502 4348 2485 ...
##  $ PrivateWork    : num  77.1 77 64.1 75.7 67.1 79.4 82 78.1 73.3 77.9 ...
##  $ PublicWork     : num  18.3 16.9 23.6 21.2 27.6 14.7 14.6 14.8 22.1 15.2 ...
##  $ SelfEmployed   : num  4.6 6.1 12.3 3.1 5.3 5.8 3.4 7.1 4.6 6.9 ...
##  $ FamilyWork     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Unemployment   : num  5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...

Above you will see the beginning entries of the data set, beginning alphabetically with Alabama, listing its first few demographic and economic parameters.

Using the summary(census) command, you will be able to see the min/max, median, mean, and quartiles of each variable. This can be useful for getting a gauge on how these variables are comparable to eachother.

summary(census)
##   CensusTract                 State               County     
##  Min.   :1.001e+09   California  : 8057   Los Angeles: 2346  
##  1st Qu.:1.304e+10   Texas       : 5265   Cook       : 1326  
##  Median :2.805e+10   New York    : 4918   Orange     :  939  
##  Mean   :2.839e+10   Florida     : 4245   Jefferson  :  927  
##  3rd Qu.:4.200e+10   Pennsylvania: 3218   Maricopa   :  916  
##  Max.   :7.215e+10   Illinois    : 3123   Montgomery :  833  
##                      (Other)     :45175   (Other)    :66714  
##     TotalPop          Men            Women          Hispanic     
##  Min.   :    0   Min.   :    0   Min.   :    0   Min.   :  0.00  
##  1st Qu.: 2891   1st Qu.: 1409   1st Qu.: 1461   1st Qu.:  2.40  
##  Median : 4063   Median : 1986   Median : 2066   Median :  7.00  
##  Mean   : 4326   Mean   : 2128   Mean   : 2198   Mean   : 16.86  
##  3rd Qu.: 5442   3rd Qu.: 2674   3rd Qu.: 2774   3rd Qu.: 20.40  
##  Max.   :53812   Max.   :27962   Max.   :27250   Max.   :100.00  
##                                                  NA's   :690     
##      White            Black            Native             Asian       
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0000   Min.   : 0.000  
##  1st Qu.: 39.40   1st Qu.:  0.70   1st Qu.:  0.0000   1st Qu.: 0.200  
##  Median : 71.40   Median :  3.70   Median :  0.0000   Median : 1.400  
##  Mean   : 62.03   Mean   : 13.27   Mean   :  0.7277   Mean   : 4.588  
##  3rd Qu.: 88.30   3rd Qu.: 14.40   3rd Qu.:  0.4000   3rd Qu.: 4.800  
##  Max.   :100.00   Max.   :100.00   Max.   :100.0000   Max.   :91.300  
##  NA's   :690      NA's   :690      NA's   :690        NA's   :690     
##     Pacific          Citizen          Income         IncomeErr     
##  Min.   : 0.000   Min.   :    0   Min.   :  2611   Min.   :   390  
##  1st Qu.: 0.000   1st Qu.: 2037   1st Qu.: 37683   1st Qu.:  5317  
##  Median : 0.000   Median : 2863   Median : 51094   Median :  7732  
##  Mean   : 0.145   Mean   : 3043   Mean   : 57226   Mean   :  9134  
##  3rd Qu.: 0.000   3rd Qu.: 3838   3rd Qu.: 70117   3rd Qu.: 11258  
##  Max.   :84.700   Max.   :37416   Max.   :248750   Max.   :123116  
##  NA's   :690                      NA's   :1100     NA's   :1100    
##   IncomePerCap    IncomePerCapErr     Poverty        ChildPoverty   
##  Min.   :   128   Min.   :    85   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 19123   1st Qu.:  2312   1st Qu.:  7.20   1st Qu.:  7.00  
##  Median : 25344   Median :  3127   Median : 13.40   Median : 17.80  
##  Mean   : 28491   Mean   :  3943   Mean   : 16.96   Mean   : 22.49  
##  3rd Qu.: 33894   3rd Qu.:  4537   3rd Qu.: 23.10   3rd Qu.: 33.80  
##  Max.   :254204   Max.   :134380   Max.   :100.00   Max.   :100.00  
##  NA's   :740      NA's   :740      NA's   :835      NA's   :1118    
##   Professional       Service          Office        Construction    
##  Min.   :  0.00   Min.   :  0.0   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.: 24.10   1st Qu.: 13.4   1st Qu.: 20.10   1st Qu.:  5.000  
##  Median : 32.60   Median : 17.9   Median : 23.80   Median :  8.400  
##  Mean   : 34.80   Mean   : 19.1   Mean   : 23.95   Mean   :  9.292  
##  3rd Qu.: 43.88   3rd Qu.: 23.6   3rd Qu.: 27.50   3rd Qu.: 12.500  
##  Max.   :100.00   Max.   :100.0   Max.   :100.00   Max.   :100.000  
##  NA's   :807      NA's   :807     NA's   :807      NA's   :807      
##    Production         Drive           Carpool           Transit       
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Min.   :  0.000  
##  1st Qu.:  7.10   1st Qu.: 72.00   1st Qu.:  6.000   1st Qu.:  0.000  
##  Median : 11.80   Median : 79.70   Median :  8.800   Median :  1.100  
##  Mean   : 12.86   Mean   : 75.53   Mean   :  9.627   Mean   :  5.456  
##  3rd Qu.: 17.40   3rd Qu.: 84.90   3rd Qu.: 12.300   3rd Qu.:  4.700  
##  Max.   :100.00   Max.   :100.00   Max.   :100.000   Max.   :100.000  
##  NA's   :807      NA's   :797      NA's   :797       NA's   :797      
##       Walk          OtherTransp        WorkAtHome       MeanCommute   
##  Min.   :  0.000   Min.   :  0.000   Min.   :  0.000   Min.   : 1.20  
##  1st Qu.:  0.400   1st Qu.:  0.400   1st Qu.:  1.800   1st Qu.:20.80  
##  Median :  1.400   Median :  1.100   Median :  3.500   Median :25.00  
##  Mean   :  3.123   Mean   :  1.892   Mean   :  4.368   Mean   :25.67  
##  3rd Qu.:  3.500   3rd Qu.:  2.500   3rd Qu.:  5.900   3rd Qu.:29.80  
##  Max.   :100.000   Max.   :100.000   Max.   :100.000   Max.   :80.00  
##  NA's   :797       NA's   :797       NA's   :797       NA's   :949    
##     Employed      PrivateWork       PublicWork      SelfEmployed    
##  Min.   :    0   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.: 1249   1st Qu.: 74.60   1st Qu.:  9.60   1st Qu.:  3.500  
##  Median : 1846   Median : 80.10   Median : 13.40   Median :  5.500  
##  Mean   : 1984   Mean   : 78.98   Mean   : 14.62   Mean   :  6.234  
##  3rd Qu.: 2553   3rd Qu.: 84.60   3rd Qu.: 18.20   3rd Qu.:  8.100  
##  Max.   :24075   Max.   :100.00   Max.   :100.00   Max.   :100.000  
##                  NA's   :807      NA's   :807      NA's   :807      
##    FamilyWork       Unemployment    
##  Min.   : 0.0000   Min.   :  0.000  
##  1st Qu.: 0.0000   1st Qu.:  5.100  
##  Median : 0.0000   Median :  7.700  
##  Mean   : 0.1698   Mean   :  9.029  
##  3rd Qu.: 0.0000   3rd Qu.: 11.400  
##  Max.   :26.5000   Max.   :100.000  
##  NA's   :807       NA's   :802

To look at a specific variable from a set of data, add a ‘$’ symbol after the data file name followed by the varaible of interest. The print command used below will being printing out all of the values for that variable but will reach a max point and omit the remaining values of that variable.

summary(census$County)
##          Los Angeles                 Cook               Orange 
##                 2346                 1326                  939 
##            Jefferson             Maricopa           Montgomery 
##                  927                  916                  833 
##               Harris                Kings                Wayne 
##                  791                  788                  768 
##                Clark               Queens            San Diego 
##                  694                  669                  628 
##               Dallas           Washington            Middlesex 
##                  565                  562                  535 
##              Suffolk           Miami-Dade             Franklin 
##                  527                  519                  497 
##               Marion            Riverside             Cuyahoga 
##                  457                  453                  447 
##              Jackson                 Lake         Hillsborough 
##                  435                  427                  407 
##            Allegheny                 King                Essex 
##                  402                  399                  392 
##               Monroe         Philadelphia          Santa Clara 
##                  390                  384                  372 
##       San Bernardino                Bexar             Hamilton 
##                  369                  366                  365 
##              Broward              Alameda              Tarrant 
##                  362                  361                  357 
##                Bronx              Douglas              Oakland 
##                  339                  338                  338 
##           Palm Beach                 Erie                 Polk 
##                  338                  328                  325 
##              Madison           Sacramento               Shelby 
##                  322                  317                  317 
##             Hennepin            Milwaukee               Nassau 
##                  299                  298                  297 
##              El Paso             New York                  Lee 
##                  291                  288                  275 
##            St. Louis              Fairfax               Fulton 
##                  266                  258                  253 
##             Pinellas            Fairfield             Honolulu 
##                  246                  244                  244 
##          Mecklenburg              Johnson             Oklahoma 
##                  242                  241                  241 
##                 Pima           Cumberland                Union 
##                  241                  240                  240 
##             Delaware             Hartford          Westchester 
##                  236                  224                  223 
##      Prince George's               Travis               Macomb 
##                  218                  218                  217 
##               DuPage            Baltimore            Salt Lake 
##                  216                  214                  212 
##         Contra Costa                 Kent              Fayette 
##                  208                  207                  201 
##       Baltimore city               Fresno              Orleans 
##                  200                  199                  198 
##        San Francisco                 York             Davidson 
##                  197                  196                  195 
##               DeKalb            Lancaster            New Haven 
##                  195                  190                  190 
##            Worcester               Pierce                 Wake 
##                  189                  188                  187 
##                Adams               Greene               Bergen 
##                  186                  182                  179 
## District of Columbia                Duval                 Knox 
##                  179                  177                  176 
##                Tulsa             San Juan              Ventura 
##                  175                  174                  174 
##               Warren            Multnomah             Richmond 
##                  173                  171                  170 
##              (Other) 
##                37326

Use the plot command to compare two variables. Titles and color can also be added to enhance the plot.

plot(census[,5:6], main="Percent of population men vs women", pch=20, cex =0.2, col = 'blue')

Other types of graphs ie histograms and boxplots can be created to visualize the data.

hist(census[,'Income'])

boxplot(census[,'Income'])

Density plotting allows for a smoother distribution of plot values.

plot(density(census[,'Women']))
points(density(census[,'TotalPop']), type = 'l', col ='blue')

More complex visualizations can also be made using ggplot, ie scatterplots. Adding an alpha parameter changes the transparency of the data points.

library(ggplot2)
demo = census
ggplot(data = demo) +
  geom_point(mapping = aes(x = TotalPop, y = Unemployment, alpha=0.2, col=State)) +
  xlab("Total County Population") + ylab("No. of Unemployed per County")
## Warning: Removed 802 rows containing missing values (geom_point).

Looking at this scatterplot, it seems that as the total county population size increases, the number of unemployed persons decreases. There is a high density of unemployment at county populations below 20,000.

home