Earthquake Analysis (2/4): Categorical Variables Exploratory Analysis

This is the second part of our post series on the exploratory analysis of a publicly available dataset reporting earthquakes and similar events within a time window of 30 days. In this part, we analyze the categorical variables of the dataset. A categorical variable takes one of a limited, and usually fixed, number of possible values. R stores categorical variables as factors, whose underlying values can be either numeric or character. Their analysis may require statistical tools different from the ones used for quantitative variables.
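
As a minimal sketch of how R represents a categorical variable, the snippet below builds a factor from a small character vector. The values are invented for illustration and are not taken from the earthquake dataset.

```r
# Toy values, invented for illustration (not read from the earthquake dataset)
event_type <- c("earthquake", "explosion", "earthquake", "quarry blast")

# Convert the character vector into a factor
event_factor <- factor(event_type)

levels(event_factor)      # the distinct categories, sorted alphabetically
as.integer(event_factor)  # the integer codes R stores for each observation
nlevels(event_factor)     # how many distinct categories there are
```

Each observation is stored as an integer code pointing into the vector of levels, which is why a factor can hold only a fixed set of values.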

Packages

I am going to take advantage of the following packages.

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(Hmisc))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(vcd))
suppressPackageStartupMessages(library(vcdExtra))
suppressPackageStartupMessages(library(gmodels))

Package versions are listed below.

packages <- c("ggplot2", "dplyr", "Hmisc", "lubridate", "vcd", "vcdExtra", "gmodels")
version <- lapply(packages, packageVersion)
version_c <- do.call(c, version)
data.frame(packages=packages, version = as.character(version_c))
##    packages version
## 1   ggplot2   3.1.0
## 2     dplyr 0.8.0.1
## 3     Hmisc   4.2.0
## 4 lubridate   1.7.4
## 5       vcd   1.4.4
## 6  vcdExtra   0.7.1
## 7   gmodels  2.18.1

The analysis runs on Windows 10 with the following R version.

R.version
##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          5.2                         
## year           2018                        
## month          12                          
## day            20                          
## svn rev        75870                       
## language       R                           
## version.string R version 3.5.2 (2018-12-20)
## nickname       Eggshell Igloo

Getting Data

As shown in the first post, we start our analysis by downloading the earthquake dataset from the earthquake.usgs.gov site, specifically the last 30 days flavor of the dataset. Please note that this earthquake dataset is updated every day to cover the last 30 days of data collection. Furthermore, the copy used here is not the most recent one available, as I collected it some weeks ago. If the dataset is not already present in our working directory, we download and save it, and then load it into the quakes local variable.

# Download the 30-day feed only if a local copy is not already present
if (!file.exists("all_month.csv")) {
  url <- "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"
  download.file(url = url, destfile = "all_month.csv")
}
# Read the same file that was checked for and downloaded above
quakes <- read.csv("all_month.csv", header=TRUE, sep=',', stringsAsFactors = FALSE)

quakes$time <- ymd_hms(quakes$time)
quakes$updated <- ymd_hms(quakes$updated)
quakes$magType <- as.factor(quakes$magType)
quakes$net <- as.factor(quakes$net)
quakes$type <- as.factor(quakes$type)
quakes$status <- as.factor(quakes$status)
quakes$locationSource <- as.factor(quakes$locationSource)
quakes$magSource <- as.factor(quakes$magSource)
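
The factor conversions above can also be written in one pass. The following is only a sketch on a toy data frame; the name `toy`, its values, and the helper vector `to_factor` are hypothetical and serve just to illustrate the idea.

```r
# Toy data frame, invented for illustration (column names mimic the dataset)
toy <- data.frame(
  net    = c("ak", "ci", "ak"),
  status = c("reviewed", "automatic", "reviewed"),
  mag    = c(1.2, 2.5, 0.8),
  stringsAsFactors = FALSE
)

# Names of the columns to be treated as categorical (hypothetical helper vector)
to_factor <- c("net", "status")

# Convert all of the listed columns to factors at once
toy[to_factor] <- lapply(toy[to_factor], as.factor)

sapply(toy, class)  # net and status are now factors, mag stays numeric
```

Listing the column names in one vector keeps the conversion step short when many columns are involved.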

Exploratory Analysis – Categorical Variables

The categorical variables can be detected by testing if their class is factor.

(factor_vars <- names(which(sapply(quakes, is.factor))))
## [1] "magType"        "net"            "type"           "status"        
## [5] "locationSource" "magSource"

length(factor_vars)
## [1] 6

The describe() function within the Hmisc package can be useful for categorical variables as well.

describe(quakes[,factor_vars])
## quakes[, factor_vars] 
## 
##  6  Variables      8407  Observations
## ---------------------------------------------------------------------------
## magType 
##        n  missing distinct 
##     8407        0       10 
##                                                                       
## Value               mb mb_lg    md    mh    ml   mun    mw   mwr   mww
## Frequency      2   604    47  2423    14  5203     2     4    19    89
## Proportion 0.000 0.072 0.006 0.288 0.002 0.619 0.000 0.000 0.002 0.011
## ---------------------------------------------------------------------------
## net 
##        n  missing distinct 
##     8407        0       14 
## 
## ak (2469, 0.294), ci (1344, 0.160), hv (253, 0.030), ismpkansas (8,
## 0.001), ld (4, 0.000), mb (157, 0.019), nc (1435, 0.171), nm (28, 0.003),
## nn (604, 0.072), pr (427, 0.051), se (15, 0.002), us (897, 0.107), uu
## (588, 0.070), uw (178, 0.021)
## ---------------------------------------------------------------------------
## type 
##        n  missing distinct 
##     8407        0        7 
## 
## chemical explosion (2, 0.000), earthquake (8232, 0.979), explosion (58,
## 0.007), ice quake (16, 0.002), other event (3, 0.000), quarry blast (95,
## 0.011), rock burst (1, 0.000)
## ---------------------------------------------------------------------------
## status 
##        n  missing distinct 
##     8407        0        2 
##                               
## Value      automatic  reviewed
## Frequency       1691      6716
## Proportion     0.201     0.799
## ---------------------------------------------------------------------------
## locationSource 
##        n  missing distinct 
##     8407        0       15 
##                                                                       
## Value         ak    ci    hv  ismp    ld    mb    nc    nm    nn    ok
## Frequency   2470  1344   253     8     4   157  1435    28   604     6
## Proportion 0.294 0.160 0.030 0.001 0.000 0.019 0.171 0.003 0.072 0.001
##                                         
## Value         pr    se    us    uu    uw
## Frequency    427    15   890   588   178
## Proportion 0.051 0.002 0.106 0.070 0.021
## ---------------------------------------------------------------------------
## magSource 
##        n  missing distinct 
##     8407        0       15 
##                                                                       
## Value         ak    ci    hv  ismp    ld    mb    nc    nm    nn    ok
## Frequency   2480  1344   253     8     4   157  1435    28   604     5
## Proportion 0.295 0.160 0.030 0.001 0.000 0.019 0.171 0.003 0.072 0.001
##                                         
## Value         pr    se    us    uu    uw
## Frequency    427    15   881   588   178
## Proportion 0.051 0.002 0.105 0.070 0.021
## ---------------------------------------------------------------------------

We notice from the magType description that two records have an empty string as magType. We then replace them with the NA value.

quakes$magType[quakes$magType == ""] <- NA
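
Note that assigning NA does not remove the empty string from the factor levels: it remains as an unused level with a count of zero. If that is not desired, base R's droplevels() can remove it. The following is a sketch on a toy factor with invented values, not the actual dataset.

```r
# Toy factor with two empty strings, invented for illustration
mag_type <- factor(c("ml", "", "md", "ml", ""))

# Replace the empty strings with NA, as done for quakes$magType
mag_type[mag_type == ""] <- NA
table(mag_type)  # the "" level is still listed, with a count of 0

# Optionally drop the levels that are no longer used
mag_type <- droplevels(mag_type)
levels(mag_type)
table(mag_type, useNA = "ifany")  # NA values are counted in their own column
```

Whether to drop the unused level is a matter of taste; keeping it does not affect the counts of the other levels.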

To understand relationships or dependencies among categorical variables, we take advantage of various types of tables and graphical methods. Stratifying variables can also be included in order to highlight whether the relationship between two primary variables is the same or different across the levels of the stratifying variable under consideration.

A contingency table is said to be one-way when it involves just one categorical variable, two-way when it involves two categorical variables, and so on (N-way).
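
To make the distinction concrete, here is a minimal sketch that builds a one-way and a two-way table with base R's table(). The data frame `toy` and its values are invented for illustration.

```r
# Toy data frame, invented for illustration
toy <- data.frame(
  type   = c("earthquake", "earthquake", "explosion", "earthquake", "explosion"),
  status = c("reviewed", "automatic", "reviewed", "reviewed", "reviewed")
)

# One-way table: counts for a single categorical variable
one_way <- table(type = toy$type)
one_way

# Two-way table: counts for each combination of two categorical variables
two_way <- table(type = toy$type, status = toy$status)
two_way

# Add row and column totals to the two-way table
addmargins(two_way)
```

Combinations that never occur, such as an automatic explosion in this toy data, appear in the two-way table with a count of zero.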

For example, here is the one-way contingency table for the magType variable.

(tbl <- table(quakes$magType))
## 
##          mb mb_lg    md    mh    ml   mun    mw   mwr   mww 
##     0   604    47  2423    14  5203     2     4    19    89
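
Counts in a one-way table are often easier to compare as proportions. The sketch below uses base R's prop.table() on a toy vector of invented values; it is meant only as an illustration of the idea, not as output from the dataset.

```r
# Toy vector of magnitude types, invented for illustration
counts <- table(c("ml", "md", "ml", "mb", "ml", "md"))

# Convert the counts into proportions that sum to 1
props <- prop.table(counts)
round(props, 3)

# Sort the counts so that the most frequent category comes first
sort(counts, decreasing = TRUE)
```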