This is the second part of our post series about the exploratory analysis of a publicly available dataset reporting earthquakes and similar events within a specific time window of 30 days. In the following, we are going to analyze the categorical variables of our dataset. The categorical variables can take on one of a limited, and usually fixed a number of possible values. Factor variables are categorical variables that can be either numeric or string variables. R stores categorical variables into a factor. Their analysis may require statistical tools different from the ones used for quantitative variables.

Packages

I am going to take advantage of the following packages.

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(Hmisc))
suppressPackageStartupMessages(library(lubridate))
suppressPackageStartupMessages(library(vcd))
suppressPackageStartupMessages(library(vcdExtra))
suppressPackageStartupMessages(library(gmodels))

Packages versions are herein listed.

packages <- c("ggplot2", "dplyr", "Hmisc", "lubridate", "vcd", "vcdExtra", "gmodels")
version <- lapply(packages, packageVersion)
version_c <- do.call(c, version)
data.frame(packages=packages, version = as.character(version_c))
##    packages version
## 1   ggplot2   3.1.0
## 2     dplyr 0.8.0.1
## 3     Hmisc   4.2.0
## 4 lubridate   1.7.4
## 5       vcd   1.4.4
## 6  vcdExtra   0.7.1
## 7   gmodels  2.18.1

Running on Windows-10 the following R language version.

R.version
##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          5.2                         
## year           2018                        
## month          12                          
## day            20                          
## svn rev        75870                       
## language       R                           
## version.string R version 3.5.2 (2018-12-20)
## nickname       Eggshell Igloo

Getting Data

As shown in the first post, we start our analysis by downloading the earthquake dataset from earthquake.usgs.gov site, specifically the last 30 days dataset flavor. Please note that such eartquake dataset is day by day updated to cover the last 30 days of data collection. Furthermore, it is not the most recent dataset available, as I collected it some weeks ago. If such dataset is not already present into our workspace, we download and save it to be loaded into the quakes local variable.

if ("all_week.csv" %in% dir(".") == FALSE) {
  url <- "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv"
  download.file(url = url, destfile = "all_week.csv")
}
quakes <- read.csv("all_month.csv", header=TRUE, sep=',', stringsAsFactors = FALSE)

quakes$time <- ymd_hms(quakes$time)
quakes$updated <- ymd_hms(quakes$updated)
quakes$magType <- as.factor(quakes$magType)
quakes$net <- as.factor(quakes$net)
quakes$type <- as.factor(quakes$type)
quakes$status <- as.factor(quakes$status)
quakes$locationSource <- as.factor(quakes$locationSource)
quakes$magSource <- as.factor(quakes$magSource)

Exploratory Analysis – Categorical Variables

The categorical variables can be detected by testing if their class is factor.

(factor_vars <- names(which(sapply(quakes, class) == "factor")))
## [1] "magType"        "net"            "type"           "status"        
## [5] "locationSource" "magSource"
length(factor_vars)
## [1] 6

The describe() function within HMisc package can be useful for categorical variables as well.

describe(quakes[,factor_vars])
## quakes[, factor_vars] 
## 
##  6  Variables      8407  Observations
## ---------------------------------------------------------------------------
## magType 
##        n  missing distinct 
##     8407        0       10 
##                                                                       
## Value               mb mb_lg    md    mh    ml   mun    mw   mwr   mww
## Frequency      2   604    47  2423    14  5203     2     4    19    89
## Proportion 0.000 0.072 0.006 0.288 0.002 0.619 0.000 0.000 0.002 0.011
## ---------------------------------------------------------------------------
## net 
##        n  missing distinct 
##     8407        0       14 
## 
## ak (2469, 0.294), ci (1344, 0.160), hv (253, 0.030), ismpkansas (8,
## 0.001), ld (4, 0.000), mb (157, 0.019), nc (1435, 0.171), nm (28, 0.003),
## nn (604, 0.072), pr (427, 0.051), se (15, 0.002), us (897, 0.107), uu
## (588, 0.070), uw (178, 0.021)
## ---------------------------------------------------------------------------
## type 
##        n  missing distinct 
##     8407        0        7 
## 
## chemical explosion (2, 0.000), earthquake (8232, 0.979), explosion (58,
## 0.007), ice quake (16, 0.002), other event (3, 0.000), quarry blast (95,
## 0.011), rock burst (1, 0.000)
## ---------------------------------------------------------------------------
## status 
##        n  missing distinct 
##     8407        0        2 
##                               
## Value      automatic  reviewed
## Frequency       1691      6716
## Proportion     0.201     0.799
## ---------------------------------------------------------------------------
## locationSource 
##        n  missing distinct 
##     8407        0       15 
##                                                                       
## Value         ak    ci    hv  ismp    ld    mb    nc    nm    nn    ok
## Frequency   2470  1344   253     8     4   157  1435    28   604     6
## Proportion 0.294 0.160 0.030 0.001 0.000 0.019 0.171 0.003 0.072 0.001
##                                         
## Value         pr    se    us    uu    uw
## Frequency    427    15   890   588   178
## Proportion 0.051 0.002 0.106 0.070 0.021
## ---------------------------------------------------------------------------
## magSource 
##        n  missing distinct 
##     8407        0       15 
##                                                                       
## Value         ak    ci    hv  ismp    ld    mb    nc    nm    nn    ok
## Frequency   2480  1344   253     8     4   157  1435    28   604     5
## Proportion 0.295 0.160 0.030 0.001 0.000 0.019 0.171 0.003 0.072 0.001
##                                         
## Value         pr    se    us    uu    uw
## Frequency    427    15   881   588   178
## Proportion 0.051 0.002 0.105 0.070 0.021
## ---------------------------------------------------------------------------

We notice from the magType description that two records have a null string magType. We then replace them we the NA value.

quakes$magType[quakes$magType == ""] <- NA

To understand relationship or dependencies among categorical variables, we take advantage of various types of tables and graphical methods. Also stratifying variables can be encompassed in order to highlight if the relationship between two primary variables is the same or different for all levels of the stratifying variable under consideration.

The contingency table are said to be of one-way flavor when involving just one categorical variable. They are said two-way when involving two categorical variables, and so on (N-way).

For example, here is the one-way contingency table for the magType variable.

(tbl <- table(quakes$magType))
## 
##          mb mb_lg    md    mh    ml   mun    mw   mwr   mww 
##     0   604    47  2423    14  5203     2     4    19    89

A graphical representation of the same as bar plot is shown.

ggplot(data = quakes, aes(x=magType, fill = magType)) + geom_bar(stat='count') + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + guides(fill=FALSE)

One-way contingency table of events based on the network the event has been registered from.

table(quakes$net)
## 
##         ak         ci         hv ismpkansas         ld         mb 
##       2469       1344        253          8          4        157 
##         nc         nm         nn         pr         se         us 
##       1435         28        604        427         15        897 
##         uu         uw 
##        588        178

A graphical representation of the same as bar plot is shown.

ggplot(data = quakes, aes(x = net, fill = net)) + geom_bar(stat = 'count') + guides(fill = FALSE)

One-way contingency table of the events based on their type.

table(quakes$type)
## 
## chemical explosion         earthquake          explosion 
##                  2               8232                 58 
##          ice quake        other event       quarry blast 
##                 16                  3                 95 
##         rock burst 
##                  1

A graphical representation of the same as bar plot is shown.

ggplot(data = quakes, aes(x = type, fill = type)) + geom_bar(stat = 'count') + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + guides(fill = FALSE)

One-way contingency table of events based on their status.

table(quakes$status)
## 
## automatic  reviewed 
##      1691      6716

A graphical representation of the same as bar plot is shown.

ggplot(data = quakes, aes(x = status, fill = status)) + geom_bar(stat='count') + guides(fill=FALSE)

One-way contingency table of events based on the location source.

table(quakes$locationSource)
## 
##   ak   ci   hv ismp   ld   mb   nc   nm   nn   ok   pr   se   us   uu   uw 
## 2470 1344  253    8    4  157 1435   28  604    6  427   15  890  588  178

A graphical representation of the same as bar plot is shown.

ggplot(data = quakes, aes(x = locationSource, fill = locationSource)) + geom_bar(stat='count') + guides(fill=FALSE)

One-way contingency table of the events based on the network that originally authored the reported magnitude for this event.

table(quakes$magSource)
## 
##   ak   ci   hv ismp   ld   mb   nc   nm   nn   ok   pr   se   us   uu   uw 
## 2480 1344  253    8    4  157 1435   28  604    5  427   15  881  588  178

A graphical representation of the same as bar plot is shown.

ggplot(data = quakes, aes(x = magSource, fill = magSource)) + geom_bar(stat='count') + guides(fill=FALSE)

Two-way contingency table based upon the network as data contributor and the magType, i.e. the method or algorithm used to calculate the preferred magnitude for the event

table(quakes$net, quakes$magType)
##             
##                     mb mb_lg   md   mh   ml  mun   mw  mwr  mww
##   ak            0    1     0    0    0 2467    0    0    0    1
##   ci            0    0     0    0    3 1339    2    0    0    0
##   hv            0    0     0  145    0  108    0    0    0    0
##   ismpkansas    0    0     0    0    0    8    0    0    0    0
##   ld            0    0     0    0    0    4    0    0    0    0
##   mb            0    0     0    6    0  151    0    0    0    0
##   nc            0    0     0 1423    0    7    0    3    0    0
##   nm            0    0     0   28    0    0    0    0    0    0
##   nn            0    0     0    0    0  604    0    0    0    0
##   pr            0    0     0  425    0    2    0    0    0    0
##   se            0    0     0   15    0    0    0    0    0    0
##   us            0  603    47    0    0  140    0    0   19   88
##   uu            0    0     0  365    0  222    0    1    0    0
##   uw            0    0     0   16   11  151    0    0    0    0

A graphical representation of the same as bar plot is shown.

ggplot(data = quakes, aes(x=net, fill = magType)) + geom_bar(stat='count')

Starting from this two-way contingency table:

(tbl <- table(quakes$net, quakes$status))
##             
##              automatic reviewed
##   ak              1105     1364
##   ci                29     1315
##   hv                90      163
##   ismpkansas         0        8
##   ld                 0        4
##   mb                 0      157
##   nc               460      975
##   nm                 0       28
##   nn                 6      598
##   pr                 0      427
##   se                 0       15
##   us                 0      897
##   uu                 0      588
##   uw                 1      177

its corresponding row proportions table is shown. Each row shows the fraction of automatic/reviewed earthquakes given a certain network. Each row sum up to one.

prop.table(tbl, 1)
##             
##                automatic    reviewed
##   ak         0.447549615 0.552450385
##   ci         0.021577381 0.978422619
##   hv         0.355731225 0.644268775
##   ismpkansas 0.000000000 1.000000000
##   ld         0.000000000 1.000000000
##   mb         0.000000000 1.000000000
##   nc         0.320557491 0.679442509
##   nm         0.000000000 1.000000000
##   nn         0.009933775 0.990066225
##   pr         0.000000000 1.000000000
##   se         0.000000000 1.000000000
##   us         0.000000000 1.000000000
##   uu         0.000000000 1.000000000
##   uw         0.005617978 0.994382022

Column proportions table. Each row shows the fraction of earthquakes network given a specific status (automatic/reviewed). Each column sums up to one.

prop.table(tbl, 2)
##             
##                 automatic     reviewed
##   ak         0.6534594914 0.2030970816
##   ci         0.0171496156 0.1958010721
##   hv         0.0532229450 0.0242703990
##   ismpkansas 0.0000000000 0.0011911852
##   ld         0.0000000000 0.0005955926
##   mb         0.0000000000 0.0233770101
##   nc         0.2720283856 0.1451756998
##   nm         0.0000000000 0.0041691483
##   nn         0.0035481963 0.0890410959
##   pr         0.0000000000 0.0635795116
##   se         0.0000000000 0.0022334723
##   us         0.0000000000 0.1335616438
##   uu         0.0000000000 0.0875521144
##   uw         0.0005913661 0.0263549732

Let us condider the bar plot that can be the plot to represent the net based events counts and, at the same time, highlighting its status with different fill colors.

ggplot(data = quakes, aes(x = net, fill = status)) + geom_bar(stat = 'count')

We want to give a better graphical representation, where the different proportion in status can be better perceived. We then build a dataframe collecting our network + status earthquakes information in frequency form. Starting from it, we will show the resulting spineplot.

tbl_df <- as.data.frame(tbl)
colnames(tbl_df) <- c("net", "status", "Freq")
tbl_df$net <- factor(tbl_df$net)
tbl_df
##           net    status Freq
## 1          ak automatic 1105
## 2          ci automatic   29
## 3          hv automatic   90
## 4  ismpkansas automatic    0
## 5          ld automatic    0
## 6          mb automatic    0
## 7          nc automatic  460
## 8          nm automatic    0
## 9          nn automatic    6
## 10         pr automatic    0
## 11         se automatic    0
## 12         us automatic    0
## 13         uu automatic    0
## 14         uw automatic    1
## 15         ak  reviewed 1364
## 16         ci  reviewed 1315
## 17         hv  reviewed  163
## 18 ismpkansas  reviewed    8
## 19         ld  reviewed    4
## 20         mb  reviewed  157
## 21         nc  reviewed  975
## 22         nm  reviewed   28
## 23         nn  reviewed  598
## 24         pr  reviewed  427
## 25         se  reviewed   15
## 26         us  reviewed  897
## 27         uu  reviewed  588
## 28         uw  reviewed  177

The spineplot makes more evident count differences of the status among net providing with a common scale in the range [0,1].

(xtabs_res <- xtabs(Freq ~ net + status, data = tbl_df))
##             status
## net          automatic reviewed
##   ak              1105     1364
##   ci                29     1315
##   hv                90      163
##   ismpkansas         0        8
##   ld                 0        4
##   mb                 0      157
##   nc               460      975
##   nm                 0       28
##   nn                 6      598
##   pr                 0      427
##   se                 0       15
##   us                 0      897
##   uu                 0      588
##   uw                 1      177
spineplot(xtabs_res)

The xtabs() function creates cross-tabulations of data using the formula style input. Further, applying the summary() to the xtabs() result, we get a chi-squared test of independence of all factors, while indicating the number of cases and dimensions.

type_net_count <- quakes %>% group_by(status, net) %>% dplyr::summarise(Freq = n())
xtabs_res <- xtabs(Freq ~ net + status, data = type_net_count)
summary(xtabs_res)
## Call: xtabs(formula = Freq ~ net + status, data = type_net_count)
## Number of cases in table: 8407 
## Number of factors: 2 
## Test for independence of all factors:
##  Chisq = 2082.2, df = 13, p-value = 0
##  Chi-squared approximation may be incorrect

Same spineplot as resulting by a slightly different approach.

spineplot(xtabs_res)

Margin tables are another way to summarise categorical data.

(mt <- margin.table(tbl, 2:1))
##            
##               ak   ci   hv ismpkansas   ld   mb   nc   nm   nn   pr   se
##   automatic 1105   29   90          0    0    0  460    0    6    0    0
##   reviewed  1364 1315  163          8    4  157  975   28  598  427   15
##            
##               us   uu   uw
##   automatic    0    0    1
##   reviewed   897  588  177

Per row and per column sums can be added by means of the addmargins() function.

addmargins(mt)
##            
##               ak   ci   hv ismpkansas   ld   mb   nc   nm   nn   pr   se
##   automatic 1105   29   90          0    0    0  460    0    6    0    0
##   reviewed  1364 1315  163          8    4  157  975   28  598  427   15
##   Sum       2469 1344  253          8    4  157 1435   28  604  427   15
##            
##               us   uu   uw  Sum
##   automatic    0    0    1 1691
##   reviewed   897  588  177 6716
##   Sum        897  588  178 8407

The Cross Tabulation result is shown.

mt <- margin.table(table(quakes$status, quakes$type), 2:1)
CrossTable(mt, prop.chisq = FALSE, prop.c = TRUE, prop.r = TRUE, format = "SPSS")
## 
##    Cell Contents
## |-------------------------|
## |                   Count |
## |             Row Percent |
## |          Column Percent |
## |           Total Percent |
## |-------------------------|
## 
## Total Observations in Table:  8407 
## 
##                    |  
##                    | automatic  |  reviewed  | Row Total | 
## -------------------|-----------|-----------|-----------|
## chemical explosion |        0  |        2  |        2  | 
##                    |    0.000% |  100.000% |    0.024% | 
##                    |    0.000% |    0.030% |           | 
##                    |    0.000% |    0.024% |           | 
## -------------------|-----------|-----------|-----------|
##         earthquake |     1688  |     6544  |     8232  | 
##                    |   20.505% |   79.495% |   97.918% | 
##                    |   99.823% |   97.439% |           | 
##                    |   20.079% |   77.840% |           | 
## -------------------|-----------|-----------|-----------|
##          explosion |        3  |       55  |       58  | 
##                    |    5.172% |   94.828% |    0.690% | 
##                    |    0.177% |    0.819% |           | 
##                    |    0.036% |    0.654% |           | 
## -------------------|-----------|-----------|-----------|
##          ice quake |        0  |       16  |       16  | 
##                    |    0.000% |  100.000% |    0.190% | 
##                    |    0.000% |    0.238% |           | 
##                    |    0.000% |    0.190% |           | 
## -------------------|-----------|-----------|-----------|
##        other event |        0  |        3  |        3  | 
##                    |    0.000% |  100.000% |    0.036% | 
##                    |    0.000% |    0.045% |           | 
##                    |    0.000% |    0.036% |           | 
## -------------------|-----------|-----------|-----------|
##       quarry blast |        0  |       95  |       95  | 
##                    |    0.000% |  100.000% |    1.130% | 
##                    |    0.000% |    1.415% |           | 
##                    |    0.000% |    1.130% |           | 
## -------------------|-----------|-----------|-----------|
##         rock burst |        0  |        1  |        1  | 
##                    |    0.000% |  100.000% |    0.012% | 
##                    |    0.000% |    0.015% |           | 
##                    |    0.000% |    0.012% |           | 
## -------------------|-----------|-----------|-----------|
##       Column Total |     1691  |     6716  |     8407  | 
##                    |   20.114% |   79.886% |           | 
## -------------------|-----------|-----------|-----------|
## 
## 

With the help of the structable within vcd package, we show a 3-WAY contingency table.

structable(type ~ status + magType, data = quakes)
##                   type chemical explosion earthquake explosion ice quake other event quarry blast rock burst
## status    magType                                                                                           
## automatic                               0          0         0         0           0            0          0
##           mb                            0          1         0         0           0            0          0
##           mb_lg                         0          0         0         0           0            0          0
##           md                            0        515         0         0           0            0          0
##           mh                            0          0         0         0           0            0          0
##           ml                            0       1172         3         0           0            0          0
##           mun                           0          0         0         0           0            0          0
##           mw                            0          0         0         0           0            0          0
##           mwr                           0          0         0         0           0            0          0
##           mww                           0          0         0         0           0            0          0
## reviewed                                0          0         0         0           0            0          0
##           mb                            0        603         0         0           0            0          0
##           mb_lg                         0         47         0         0           0            0          0
##           md                            0       1893         2         0           0           13          0
##           mh                            0         14         0         0           0            0          0
##           ml                            0       3873        53        16           3           82          1
##           mun                           2          0         0         0           0            0          0
##           mw                            0          4         0         0           0            0          0
##           mwr                           0         19         0         0           0            0          0
##           mww                           0         89         0         0           0            0          0

Geographical Regions vs. Depth Category Earthquakes Analysis

Based on ref. [5] definitions we create new factor variables.

earthquakes <- quakes %>% filter(type == "earthquake")
earthquakes <- earthquakes[complete.cases(earthquakes[, c("latitude", "longitude", "depth")]),]

emisphere_ns <- as.factor(ifelse(earthquakes$latitude >= 0, "north", "south"))
emisphere_ew <- as.factor(ifelse(earthquakes$longitude >= 0, "east", "west"))

earthquakes$region <- ifelse(earthquakes$latitude > 0 & earthquakes$longitude > 0, "NE", 
                             ifelse(earthquakes$latitude > 0  & earthquakes$longitude < 0, "NW",
                                    ifelse(earthquakes$latitude < 0 & earthquakes$longitude > 0, "SE", "SW")))

earthquakes$depth_type <- ifelse(earthquakes$depth <= 70, "shallow", ifelse(earthquakes$depth <= 300, "intermediate", "deep"))
earthquakes$region <- factor(earthquakes$region)
earthquakes$depth_type <- factor(earthquakes$depth_type)

Resulting tables are shown.

(tbl <- table(earthquakes$region, earthquakes$depth_type))
##     
##      deep intermediate shallow
##   NE    9           65     197
##   NW    0          526    7109
##   SE   10           40     109
##   SW   31           45      91
prop.table(tbl, 1)
##     
##            deep intermediate    shallow
##   NE 0.03321033   0.23985240 0.72693727
##   NW 0.00000000   0.06889325 0.93110675
##   SE 0.06289308   0.25157233 0.68553459
##   SW 0.18562874   0.26946108 0.54491018
prop.table(tbl, 2)
##     
##            deep intermediate    shallow
##   NE 0.18000000   0.09615385 0.02624567
##   NW 0.00000000   0.77810651 0.94710898
##   SE 0.20000000   0.05917160 0.01452172
##   SW 0.62000000   0.06656805 0.01212363

We wonder if the region plays a role for the depth type classification. We run the following chi-square test to figure it out.

chisq.test(earthquakes$region, earthquakes$depth_type, simulate.p.value = TRUE)
## 
##  Pearson's Chi-squared test with simulated p-value (based on 2000
##  replicates)
## 
## data:  earthquakes$region and earthquakes$depth_type
## X-squared = 1322.4, df = NA, p-value = 0.0004998

And it is statistically significant as the reported p-value highlights.

The Cross Tabulation is as follows.

mt <- margin.table(tbl, 2:1)
CrossTable(mt, prop.chisq = FALSE, prop.c = TRUE, prop.r = TRUE, format = "SPSS")
## 
##    Cell Contents
## |-------------------------|
## |                   Count |
## |             Row Percent |
## |          Column Percent |
## |           Total Percent |
## |-------------------------|
## 
## Total Observations in Table:  8232 
## 
##              |  
##              |       NE  |       NW  |       SE  |       SW  | Row Total | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##         deep |        9  |        0  |       10  |       31  |       50  | 
##              |   18.000% |    0.000% |   20.000% |   62.000% |    0.607% | 
##              |    3.321% |    0.000% |    6.289% |   18.563% |           | 
##              |    0.109% |    0.000% |    0.121% |    0.377% |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## intermediate |       65  |      526  |       40  |       45  |      676  | 
##              |    9.615% |   77.811% |    5.917% |    6.657% |    8.212% | 
##              |   23.985% |    6.889% |   25.157% |   26.946% |           | 
##              |    0.790% |    6.390% |    0.486% |    0.547% |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
##      shallow |      197  |     7109  |      109  |       91  |     7506  | 
##              |    2.625% |   94.711% |    1.452% |    1.212% |   91.181% | 
##              |   72.694% |   93.111% |   68.553% |   54.491% |           | 
##              |    2.393% |   86.358% |    1.324% |    1.105% |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total |      271  |     7635  |      159  |      167  |     8232  | 
##              |    3.292% |   92.748% |    1.931% |    2.029% |           | 
## -------------|-----------|-----------|-----------|-----------|-----------|
## 
## 

The spineplot is able to highlight the better difference in proportions than the barplot. We herein below show the resulting spineplot.

type_region_count <- earthquakes %>% group_by(depth_type, region) %>% dplyr::summarise(Freq = n())
xtabs_res <- xtabs(Freq ~ depth_type + region, data = type_region_count)
spineplot(xtabs_res)

Under independence hypothesis, we may compute expected frequencies as follows.

(exp <- independence_table(mt))
##               
##                        NE         NW          SE         SW
##   deep           1.646016   46.37391   0.9657434   1.014334
##   intermediate  22.254130  626.97522  13.0568513  13.713800
##   shallow      247.099854 6961.65087 144.9774052 152.271866

We now bring our contingency table in frequency form.

tbl_df <- as.data.frame(tbl)
colnames(tbl_df) <- c("region", "depth_type", "Freq")
tbl_df$region <- factor(tbl_df$region)
tbl_df$depth_type <- factor(tbl_df$depth_type)
tbl_df
##    region   depth_type Freq
## 1      NE         deep    9
## 2      NW         deep    0
## 3      SE         deep   10
## 4      SW         deep   31
## 5      NE intermediate   65
## 6      NW intermediate  526
## 7      SE intermediate   40
## 8      SW intermediate   45
## 9      NE      shallow  197
## 10     NW      shallow 7109
## 11     SE      shallow  109
## 12     SW      shallow   91

Pattern of association can be revealed by the CMHtest() within vcdExtra package or the sieve plot.

CMHtest(Freq ~ region + depth_type, data = earthquakes)
## Cochran-Mantel-Haenszel Statistics for region by depth_type 
## 
##                  AltHypothesis   Chisq Df        Prob
## cor        Nonzero correlation  271.56  1  5.1951e-61
## rmeans  Row mean scores differ  817.17  3 8.1840e-177
## cmeans  Col mean scores differ  609.18  2 5.2283e-133
## general    General association 1322.22  6 1.6789e-282

The sieve plot highlight frequencies different from the expected as determined by independency. The area of each rectangle is always proportional to expected frequency, however the observed frequency is shown by the number of squares in each rectangle.

sieve(Freq ~ region + depth_type, data = earthquakes, shade = TRUE, legend = TRUE, labeling = labeling_values)

Further, color and intensity of shadows together with the Pearson residuals legend highlight deviations from the expected.

The association plot puts deviation from independence in the foreground, in the sense that the area of each box is made proportional to the (observed – expected) frequency. Cells with observed > expected frequency rise above the baseline representing independence, while cells that contain less than the expected frequency fall below it.

assoc(~ region + depth_type, data = earthquakes, shade = TRUE, legend = TRUE, labeling = labeling_values)

The mosaic plot can be applied to N-way contingency tables. For a 2-way table as herein shown, the mosaic display is like a grouped barchart, where the heights (or widths) of the bars show the relative frequencies of one variable and widths (or heights) of the sections in each bar show the conditional frequencies of the second variable given the first.

mosaic(~ region + depth_type, data = earthquakes, shade = TRUE, legend = TRUE, labeling = labeling_residuals)

Please note that in above mosaic plot, the frequency equal to zero for NW region deep earthquake does not find an explicit graphical representation. There are several labeling option for the mosaic plot, see Table 5.1 at ref. [4]. In our example, we choose to label the cells with the Pearson residuals value. The Pearson residuals can be computed as herein shown.

resids <- (mt - exp)/sqrt(exp)
round(resids, 1)
##                  NE   NW   SE   SW
##   deep          5.7 -6.8  9.2 29.8
##   intermediate  9.1 -4.0  7.5  8.4
##   shallow      -3.2  1.8 -3.0 -5.0

Where mt is the margin table, exp is the expected value based on indipendence hypothesis as previously computed above.

If you want to go more in detail with categorical data exploratory analysis, read ref. [4] as a valuable source of technical content and examples.

If you have any questions, please feel free to comment below.

References

  1. Earthquake dataset
  2. Eathquake dataset terms
  3. Introductory Statistics with R, 2nd Edition, P. Dalgaard, Springer
  4. Discrete Data Analysis with R, M. Friendly D. Meyer, CRC press
  5. Determining the Depth of an Earthquake