Recently, I have become interested in scraping data from web pages, such as Wikipedia, and visualizing it with R. As in my previous post, I use the rvest package to get the data from the web page and the ggplot2 package to visualize it.
In this post, I will map the life expectancy of White and African-American people in the US, state by state.
Load the required packages.
## LOAD THE PACKAGES ####
library(rvest)
library(ggplot2)
library(dplyr)
library(scales)
# map_data() and coord_map(), used below, also need the maps and mapproj packages installed
Import the data from Wikipedia.
## LOAD THE DATA ####
le = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_life_expectancy")
le = le %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(fill = TRUE)
Now I have to clean the data. The comments in the code below explain what each step does.
## CLEAN THE DATA ####
# check the structure of the dataset
str(le)
# select only the columns with data
le = le[, 1:8]
# take the column names from the 3rd row
names(le) = as.character(unlist(le[3, ]))
# delete the rows and columns I am not interested in
le = le[-c(1:3), ]
le = le[, -c(5:7)]
# rename the 4th and 5th columns
names(le)[c(4, 5)] = c("le_black", "le_white")
# convert the variables to numeric ("-" entries become NA, with a warning)
le = le %>%
  mutate(
    le_black = as.numeric(le_black),
    le_white = as.numeric(le_white))
# check the structure of the dataset again
str(le)

The two str() calls print the following (the second listing already includes le_diff and region, which are created in the steps that follow):

'data.frame': 54 obs. of 417 variables:
 $ X1 : chr "" "Rank\nState\nLife Expectancy, All\n(in years)\nLife Expectancy, African American\n(in years)\nLife Expectancy, Asian American\n"| __truncated__ "Rank" "1" ...
 $ X2 : chr NA "Rank" "State" "Hawaii" ...
 $ X3 : chr NA "State" "Life Expectancy, All\n(in years)" "81.3" ...
 $ X4 : chr NA "Life Expectancy, All\n(in years)" "Life Expectancy, African American\n(in years)" "-" ...
 $ X5 : chr NA "Life Expectancy, African American\n(in years)" "Life Expectancy, Asian American\n(in years)" "82.0" ...
 $ X6 : chr NA "Life Expectancy, Asian American\n(in years)" "Life Expectancy, Latino\n(in years)" "76.8" ...
 $ X7 : chr NA "Life Expectancy, Latino\n(in years)" "Life Expectancy, Native American\n(in years)" "-" ...
 .....
 .....
'data.frame': 51 obs. of 7 variables:
 $ Rank : chr "1" "2" "3" "4" ...
 $ State : chr "Hawaii" "Minnesota" "Connecticut" "California" ...
 $ Life Expectancy, All (in years): chr "81.3" "81.1" "80.8" "80.8" ...
 $ le_black : num NA 79.7 77.8 75.1 78.8 77.4 NA NA 75.5 NA ...
 $ le_white : num 80.4 81.2 81 79.8 80.4 80.5 80.4 80.1 80.3 80.1 ...
 $ le_diff : num NA 1.5 3.2 4.7 1.6 ...
 $ region : chr "hawaii" "minnesota" "connecticut" "california" ...
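In the scraped table, missing values appear as "-", and as.numeric() turns them into NA while emitting a coercion warning. If the warning is distracting, the placeholder can be replaced explicitly first. The following is only a sketch on a toy data frame: le_toy and to_numeric_or_na() are hypothetical names introduced here, and the three values are the ones visible in the str() output above.

```r
# Toy data mimicking the scraped character column (values taken from the str() output)
le_toy <- data.frame(
  State    = c("Hawaii", "Minnesota", "Connecticut"),
  le_black = c("-", "79.7", "77.8"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the placeholder with NA, then convert to numeric
to_numeric_or_na <- function(x, na_marker = "-") {
  x[x %in% na_marker] <- NA
  as.numeric(x)
}

le_toy$le_black <- to_numeric_or_na(le_toy$le_black)
str(le_toy)
```

Because the "-" entries are already NA by the time as.numeric() runs, no coercion warning is raised, and any warning that does appear would point to an unexpected value in the table.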
Since life expectancy differs between White and African-American people, I will calculate the difference for each state and map it as well.
le = le %>% mutate(le_diff = (le_white - le_black))
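Before mapping, it can help to look at the gap as a sorted table. The sketch below uses base R on a toy data frame; gap is a hypothetical name, and the three rows reuse the Minnesota, Connecticut and California values shown in the str() output earlier.

```r
# Toy subset with values taken from the str() output of the cleaned data
gap <- data.frame(
  State    = c("Minnesota", "Connecticut", "California"),
  le_black = c(79.7, 77.8, 75.1),
  le_white = c(81.2, 81.0, 79.8),
  stringsAsFactors = FALSE
)

# Same calculation as in the post: White minus African-American life expectancy
gap$le_diff <- gap$le_white - gap$le_black

# Sort so that the states with the largest gap come first
gap <- gap[order(gap$le_diff, decreasing = TRUE), ]
print(gap)
```

On the full dataset the same order() call would work unchanged; states with a missing le_black get an NA difference and are placed last by default.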
Next, I will load the map data and merge the two datasets together.
## LOAD THE MAP DATA ####
states = map_data("state")
str(states)
# create a lower-case state name to match the map data
le$region = tolower(le$State)
# merge the datasets
states = merge(states, le, by = "region", all.x = TRUE)
# merge() can reorder rows, so restore the polygon drawing order
states = states[order(states$order), ]
str(states)

The two str() calls print:

'data.frame': 15537 obs. of 6 variables:
 $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
 $ group : num 1 1 1 1 1 1 1 1 1 1 ...
 $ order : int 1 2 3 4 5 6 7 8 9 10 ...
 $ region : chr "alabama" "alabama" "alabama" "alabama" ...
 $ subregion: chr NA NA NA NA ...
'data.frame': 15537 obs. of 12 variables:
 $ region : chr "alabama" "alabama" "alabama" "alabama" ...
 $ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
 $ lat : num 30.4 30.4 30.4 30.3 30.3 ...
 $ group : num 1 1 1 1 1 1 1 1 1 1 ...
 $ order : int 1 2 3 4 5 6 7 8 9 10 ...
 $ subregion : chr NA NA NA NA ...
 $ Rank : chr "49" "49" "49" "49" ...
 $ State : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ Life Expectancy, All (in years): chr "75.4" "75.4" "75.4" "75.4" ...
 $ le_black : num 72.9 72.9 72.9 72.9 72.9 72.9 72.9 72.9 72.9 72.9 ...
 $ le_white : num 76 76 76 76 76 76 76 76 76 76 ...
 $ le_diff : num 3.1 3.1 3.1 3.1 3.1 ...
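The merge relies on the lower-cased state names matching the region names in the map data exactly, so it is worth checking which names fail to match. The sketch below uses small hand-written vectors instead of the real data; map_regions, le_regions and unmatched are hypothetical names. As far as I know, map_data("state") covers only the contiguous states plus the District of Columbia, so Hawaii and Alaska would be expected to drop out.

```r
# Toy stand-ins: a few region names as they appear in the map data ...
map_regions <- c("alabama", "minnesota", "district of columbia")
# ... and a few state names from the life expectancy table, lower-cased as in the post
le_regions <- tolower(c("Hawaii", "Minnesota", "Alabama"))

# Names present in the life expectancy table but absent from the map data
unmatched <- setdiff(le_regions, map_regions)
print(unmatched)
```

With the real objects, the equivalent check would be setdiff(le$region, unique(map_data("state")$region)), run before the merge.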
Now it's time to make the plots. First I will plot the life expectancy of African Americans in the US. For a few states we don’t have data, so I will color them grey.
## MAKE THE PLOT ####
# Life expectancy in African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49",
                      guide = "colorbar", na.value = "#eeeeee",
                      breaks = pretty_breaks(n = 5)) +
  labs(title = "Life expectancy in African American") +
  coord_map()
The code below does the same for White people in the US.
# Life expectancy in White American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_white)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49",
                      guide = "colorbar", na.value = "Gray",
                      breaks = pretty_breaks(n = 5)) +
  labs(title = "Life expectancy in White") +
  coord_map()
Finally, I will map the difference in life expectancy between White and African-American people in the US.
# Differences in Life expectancy between White and African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_diff)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49",
                      guide = "colorbar", na.value = "#eeeeee",
                      breaks = pretty_breaks(n = 5)) +
  labs(title = "Differences in Life Expectancy between \nWhite and African Americans by States in US") +
  coord_map()
On my previous post I got a comment asking for a pop-up effect when hovering over the states. This is a simple task, as Andrea explained in his comment. All you have to do is install the plotly package, store the ggplot in an object, and then call ggplotly(map_plot) to plot it.
library(plotly)
map_plot = ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49",
                      guide = "colorbar", na.value = "#eeeeee",
                      breaks = pretty_breaks(n = 5)) +
  labs(title = "Life expectancy in African American") +
  coord_map()
ggplotly(map_plot)
That’s all! Leave a comment below if you have any questions.