Inspired by this article about sentiment analysis and this guide to webscraping, I have decided to get my hands dirty by scraping and analyzing a sample of reviews on the website Goodreads.
The goal of this project is to demonstrate a complete example, going from data collection to machine learning analysis, and to illustrate a few of the dead ends and mistakes I encountered on my journey. We’ll be looking at the reviews for five popular romance books. I have deliberately chosen books in the same genre to make the review texts more homogeneous a priori; these five books are popular enough that I can easily pull a few thousand reviews for each, yielding a sizable corpus with minimal effort. If you don’t like romance books, feel free to replicate the analysis with your genre of choice!
To make the article more digestible, I have divided it into three parts:
- Part 1: Webscraping
- Part 2: Exploratory data analysis and sentiment analysis
- Part 3: Predictive analytics with machine learning
This post covers Part 1; the two following parts will be posted on DataScience+ at one-week intervals.
Part 1: Webscraping
Goodreads’s reviews are a trove of text content begging to be scraped, with an interesting non-text variable attached: the ratings left by the reviewers. But there is a problem: navigation between pages of comments is done through a JavaScript button, not an HTML link. Fear not: this problem actually has a pretty simple solution, through the use of the RSelenium package (which has a nice vignette here).
Setup
Let’s load the libraries we’ll require during the analyses, and define some variables we’ll use later.
library(data.table) # Required for rbindlist
library(dplyr)      # Required to use the pipes %>% and some table manipulation commands
library(magrittr)   # Required to use the pipes %>%
library(rvest)      # Required for read_html
library(RSelenium)  # Required for webscraping with JavaScript

url <- "https://www.goodreads.com/book/show/18619684-the-time-traveler-s-wife#other_reviews"
book.title <- "The time traveler's wife"
output.filename <- "GR_TimeTravelersWife.csv"
Note that I’m working on a book-by-book basis. This means we have to manually change the variables above and re-run the script for each book. This could be automated to work on a grander scale (see the sketch below), but that’s good enough for what I want to do here. Also, I’d rather not overload Goodreads’s servers by pulling massive amounts of data from them.
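For reference, here is a minimal sketch of what that automation could look like: a small lookup table of books and a hypothetical scrape_book() function wrapping the per-book code shown in the rest of this post (both the function name and the table are my own illustration, not part of the original script).

books <- data.frame(
  url   = c("https://www.goodreads.com/book/show/18619684-the-time-traveler-s-wife#other_reviews"),
  title = c("The time traveler's wife"),
  file  = c("GR_TimeTravelersWife.csv"),
  stringsAsFactors = FALSE
)
# scrape_book() is a hypothetical wrapper around the scraping code below
for (i in seq_len(nrow(books))) {
  scrape_book(books$url[i], books$title[i], books$file[i])
}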
Let’s now start the RSelenium server. I had some trouble with Firefox and had to reinstall a previous version of the browser (which Firefox frowns upon), so your mileage may vary here.
startServer()
remDr <- remoteDriver(browserName = "firefox", port = 4444) # instantiate remote driver to connect to Selenium Server
remDr$open() # open web browser
remDr$navigate(url)
These instructions start an instance of Firefox and navigate to the url as if you were directly interacting with the browser.
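If you want to confirm that the session is live before going further, you can query the driver (this check is my own addition, not part of the original script):

remDr$getTitle()      # should return the title of the Goodreads page
remDr$getCurrentUrl() # should match the url defined above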
We then initialize the data frame that we’ll be populating with the data later.
global.df <- data.frame(book = character(),
                        reviewer = character(),
                        rating = character(),
                        review = character(),
                        stringsAsFactors = FALSE)
We are now ready to proceed with the webscraping process itself!
The webscraping process
To extract the content we want, we’ll be looping through the 100 or so pages of comments for each book. Here I strip out the loop to show the code going through one page only, and explain its workings line by line.
First, we need to identify “where” the reviews appear in the page code. This is done with SelectorGadget, a Chrome extension that lets you identify CSS selectors. Once we have identified the proper CSS selector (here “#bookReviews .stacked”), we just pass it to the findElements method of the RSelenium remote driver.
reviews <- remDr$findElements("css selector", "#bookReviews .stacked")
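A quick sanity check that the selector caught what we expect is to count the matches; Goodreads typically displays around 30 reviews per page, so the count should be in that neighborhood.

length(reviews) # number of review blocks found on the current page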
We extract the html code for the reviews, then the text component.
reviews.html <- lapply(reviews, function(x){x$getElementAttribute("outerHTML")[[1]]})
reviews.list <- lapply(reviews.html, function(x){read_html(x) %>% html_text()})
reviews.text <- unlist(reviews.list)
We now have the text of the reviews in list format, but a quick inspection shows that there is still a lot of work to do before the text is clean. We’ll do that work with regular expressions (regex).
Cleaning the reviews with Regex
In my experience with text analytics, regex are both a blessing and a curse. A blessing, because how else could you remove all non-letter characters from a string in one short command? And a curse, because it’s a fairly esoteric language that is hard to understand or remember when you re-read your code later. So if you are not familiar with regex, I would definitely advise very generous commenting during the brief moment in time when you actually understand what your code does.
# Removing all characters that are not letters or dashes, plus runs of periods
reviews.text2 <- gsub("[^A-Za-z\\-]|\\.+", " ", reviews.text)
# Removing the end-of-line characters and extra spaces
reviews.clean <- gsub("\n|[ \t]+", " ", reviews.text2)
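To see what these two substitutions actually do, here is a toy example on a made-up string:

messy <- "Liz rated it 5 stars!!!\n\n   ...great book... 10/10"
step1 <- gsub("[^A-Za-z\\-]|\\.+", " ", messy) # non-letters and runs of periods become spaces
step2 <- gsub("\n|[ \t]+", " ", step1)         # runs of whitespace collapse to a single space
step2
# [1] "Liz rated it stars great book "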
In order to write these commands, I have found these resources useful:
- http://www.regular-expressions.info/
- http://stat545.com/block022_regular-expression.html
- https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
Putting the reviews in table format
We now have our reviews in a reasonably clean state. But due to the underlying structure of the HTML code, we have a problem: for each review, the name of the reviewer and his/her rating are in one string, and the review itself is in the following string. On top of that, the review-preview system means that the beginning of each review appears twice in its string. We’ll have to clean all that up, again using regex, to get our data into table format.
We start by counting the number of reviews we have (half the number of strings, since each review spans two of them) and creating a temporary data frame that we’ll use to store the data before transferring it to the main data frame.
n <- floor(length(reviews)/2)
reviews.df <- data.frame(book = character(n),
                         reviewer = character(n),
                         rating = character(n),
                         review = character(n),
                         stringsAsFactors = FALSE)
We loop through our strings and populate our data frame, extracting the fields we want review by review, based on recurring stop words. A for loop will do for this non-production example, but for production code you’d probably want to vectorize everything you can (see the short vectorized sketch after the loop below).
The following code might appear a bit cryptic, so first I’ll explain what I’m going to do:
- In the first part, I identify several expressions that can appear between the reviewer’s name and the rating, and use them in a regex to determine the ending point of the name in the string, then extract the name.
- In the second part, I identify several expressions that can appear at the end of the rating, and use them in a regex to determine the ending point of the rating; sometimes none of these expressions appear, so I have a conditional telling R to go to the end of the string when there is no match (by convention, regexpr returns -1 in that case). Then I extract the rating.
- In the third part, I remove the beginning of each review, which is repeated in the HTML file, by looking for the position in the string where the first 50 characters of the string appear a second time. I have a conditional in place to deal with cases where the review is short enough that its beginning is not repeated. I deal with the end of the review in the same way I did with the end of the rating.
- Finally, note the structure of the loop: I’m not looping through the strings one by one, but through the reviews, each review spanning two consecutive strings, hence the 2*j and 2*j-1 indices.
for(j in 1:n){
  reviews.df$book[j] <- book.title

  # Isolating the name of the author of the review
  auth.rat.sep <- regexpr(" rated it | marked it | added it ", reviews.clean[2*j-1])
  reviews.df$reviewer[j] <- substr(reviews.clean[2*j-1], 5, auth.rat.sep-1)

  # Isolating the rating
  rat.end <- regexpr("· | Shelves| Recommend| review of another edition", reviews.clean[2*j-1])
  if (rat.end == -1){rat.end <- nchar(reviews.clean[2*j-1])}
  reviews.df$rating[j] <- substr(reviews.clean[2*j-1], auth.rat.sep+10, rat.end-1)

  # Removing the beginning of each review that was repeated in the html file
  short.str <- substr(reviews.clean[2*j], 1, 50)
  rev.start <- unlist(gregexpr(short.str, reviews.clean[2*j]))[2]
  if (is.na(rev.start)){rev.start <- 1}
  rev.end <- regexpr("\\.+more|Blog", reviews.clean[2*j])
  if (rev.end == -1){rev.end <- nchar(reviews.clean[2*j])}
  reviews.df$review[j] <- substr(reviews.clean[2*j], rev.start, rev.end-1)
}
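As mentioned above, production code would vectorize these extractions rather than loop. As a minimal illustration (my own sketch, not part of the original script), here is how the reviewer-name step alone could be vectorized, assuming the same reviews.clean structure:

# Vectorized extraction of the reviewer names (sketch)
headers <- reviews.clean[seq(1, 2*n, by = 2)] # odd-numbered strings hold name and rating
seps <- regexpr(" rated it | marked it | added it ", headers) # regexpr is vectorized over its text argument
reviewers <- substr(headers, 5, seps - 1)     # substr is vectorized too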
Now that our temporary data frame has been populated, we can transfer its content to our main data frame.
global.lst <- list(global.df, reviews.df)
global.df <- rbindlist(global.lst)
And finally, we tell RSelenium to “click” on the next-page button, by passing the proper CSS selector that we identified with SelectorGadget. One final trick: I found that in the initial iterations, RSelenium was too slow to load the pages and was not responding in time to the instructions at the beginning of the next loop iteration, so we tell R to wait 3 seconds at the end of each iteration.
NextPageButton <- remDr$findElement("css selector", ".next_page")
NextPageButton$clickElement()
Sys.sleep(3)
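A fixed pause works, but as an alternative (my own suggestion, not something from the original script) you could poll for the review elements until they appear, up to a timeout. Note that right after the click the previous page’s elements may briefly still be attached, so in practice the fixed sleep is simpler and was sufficient here.

# Alternative to a fixed sleep: poll until review blocks are present (sketch)
wait_for_reviews <- function(remDr, timeout = 10) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    found <- remDr$findElements("css selector", "#bookReviews .stacked")
    if (length(found) > 0) return(invisible(TRUE))
    Sys.sleep(0.5)
  }
  stop("Timed out waiting for the reviews to load")
}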
After closing the overall loop, we can save the final data frame in a file.
write.csv(global.df, output.filename)
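In Parts 2 and 3 we’ll work from this file, which can be read back into a data frame with:

# Read the saved reviews back in for the analysis parts
global.df <- read.csv(output.filename, stringsAsFactors = FALSE) # the first unnamed column holds the row numbers written by write.csv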
The resulting data frame looks like this:
| book | reviewer | rating | review |
|---|---|---|---|
| The time traveler’s wife | Liz S | it was ok | I recently read… |
| Eleanor & Park | Danielle | did not like it | Why can’t there be… |
| Me Before You | Swaps | it was amazing | This review has been… |
You can find the full code, including the loops I have omitted here, on my GitHub.