A significant chunk of the data that we encounter on a daily basis is available in an unstructured, free text format. Hence, the ability to glean useful bits of information from this unstructured pile can be quite valuable.
In this post, we will attempt a basic analysis of the text from the first Presidential debate between Clinton and Trump.
A good part of this post involves the data manipulation steps needed to convert the raw transcript text of the debate into a more structured, ordered form that you can then start analyzing. This initial transformation of raw text into a form suitable for further analysis or modelling is a key step in any text analytics effort, and hence a key focus of this post.
Once the data is transformed and structured, we attempt to answer a few simple questions from it (such as who spoke more, who interrupted more, and what the key discussion points were).
You will not get any profound political insights from this analysis! But hopefully it will help showcase some of the most commonly used functions and techniques for manipulating strings and free text.
Read the data in R
A little googling gave me this link, which has the entire transcript of the debate. You can either copy-paste the content into a text file and read it into R, or use web scraping to get the data into R. I will use the web scraping approach here (without going into the scraping code in detail, since that's a separate topic in itself).
library(rvest)

# Read the link
transcript_link <- read_html("https://www.washingtonpost.com/news/the-fix/wp/2016/09/26/the-first-trump-clinton-presidential-debate-transcript-annotated/")

# I use a browser add-in (Selector Gadget) to identify the relevant html tag for the transcript
# The tag is titled "#main-content" - I then extract the content using the following functions from the rvest package
transcript <- transcript_link %>% html_nodes("#main-content") %>% html_text()
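As a quick sanity check (an optional step, not part of the scraping itself), you can confirm that the result is a single long character string:

# transcript should be a character vector of length 1 holding the full page text
length(transcript)
nchar(transcript)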
Structure the data
Right now, the entire text is just one undifferentiated whole. As a prelude to any meaningful analysis, we want to separate out the text corresponding to each of the 3 participants (i.e. including the moderator).
How do we do that?
There are different ways in which you can attempt this problem; one possible approach is provided below. This is not necessarily the most concise approach (I can think of at least one other solution that would result in more compact code), but since the objective of this post is to demonstrate some of the most useful string manipulation functions, we will go with it.
So this is the approach:
If you look at the transcript, the speaker name is highlighted in caps to indicate when a particular speaker starts talking (i.e. one of these 3 values: “CLINTON”, “TRUMP” or “HOLT”). We can think of these as markers in our text, and if we extract the position of these markers, we can identify when a particular speaker starts talking. We use the str_locate_all function from the stringr package for this.
library(stringr)

# We have 3 different patterns to search for - we include them in our expression using the or operator '|'
# str_locate_all will give the index (start and end position) of all the matches
markers <- str_locate_all(transcript, pattern = "CLINTON|TRUMP|HOLT")

# This returns a list with one component - we extract out that component
markers <- markers[[1]]

# Now markers is a matrix indicating the start and end positions
# We are only interested in the start positions, so we extract that out
markers <- markers[, 1]
Now markers holds the starting index of each point where a particular speaker starts talking. As the next step, we use the markers vector to separate out the conversation into chunks. As an example, look at the first 2 values of the markers vector: markers[1:2]
which gives 1017 & 2292
We can then use the base substr function to extract these chunks. For example,
substr(transcript, 1017, 2291)
will pull out the first chunk (spoken by Holt) from our text. (You may have noticed that I have reduced the end index by 1; otherwise, the code would also pick up the first letter of the next marker, the C from CLINTON in this case.)
Since we need to do this for the entire length of the markers vector, we put this code in a loop:
# Initialize a vector to store the results
res <- vector(mode = "character", length = length(markers) - 1)

for (i in 1:(length(markers) - 1)) {
  res[i] <- substr(transcript, markers[i], markers[i + 1] - 1)
}
Ok. So now we have each chunk of the conversation as a separate value in a character vector.
As a final step, we need to separate out the chunks related to Clinton from those related to Trump (let's exclude Holt for now). This is how we can do it:
Each element of our res vector starts off with the speaker name (i.e. our marker). We can use the grepl function to search for these markers (grepl is logical grep and returns a TRUE/FALSE output).
So, for example, grepl("CLINTON", res[[1]]) will return FALSE, since that chunk is spoken by Holt, whereas grepl("CLINTON", res[[2]]) will return TRUE, since it is spoken by Clinton.
Since we need to check this for each element in our res vector, we need a loop. But rather than use an explicit for loop, we will use the higher order function sapply. Finally, the output of sapply (which will return a logical vector) is used to subset the res vector. This is the final code:
clinton <- res[sapply(res, function(x) grepl("CLINTON", x))]
trump <- res[sapply(res, function(x) grepl("TRUMP", x))]
Great! So we finally have a more structured and ordered representation of our data. We now have separate vectors (represented by the variables “clinton” and “trump”) holding the text specific to each of these speakers.
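As a quick check (again optional), you can look at how many chunks each speaker has and peek at the start of the first one:

# Number of conversation chunks per speaker
length(clinton)
length(trump)

# First few characters of Clinton's first chunk
substr(clinton[1], 1, 80)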
Analyzing the data
Now that we have ordered the data by speaker, we can perform various comparative analyses. I will showcase just a couple of basic analyses in this post.
Who spoke more? (i.e. as per the number of words spoken)
We can split the text into words using the str_split function from the stringr package. As before, since we need to do this for each element in our vector, we loop through the vector using sapply:
tot_words_clinton <- unlist(sapply(clinton, function(x) str_split(x, " ")))

# This also returns some blank values which we exclude
tot_words_clinton <- tot_words_clinton[tot_words_clinton != ""]

length(tot_words_clinton)
The output is 6491 for Clinton vis-a-vis 8817 for Trump. (Note: this may not be the exact number of words. For example, if we have some punctuation separated by spaces, it may get extracted as well. But this should still be fairly close. Also, I believe some of the word annotation functions (from the openNLP package, for instance) should take care of this and give you a more precise number.)
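The Trump figure can be obtained by repeating the same steps on the trump vector:

tot_words_trump <- unlist(sapply(trump, function(x) str_split(x, " ")))
tot_words_trump <- tot_words_trump[tot_words_trump != ""]
length(tot_words_trump)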
What were the key topics discussed?
A basic approach to this would be to generate a word frequency list and then review the list to get a sense of the discussion topics. We can do this as follows:
library(dplyr)

# We use the count function from the plyr package on the "tot_words_clinton" variable which has the "bag-of-words"
# We then use the dplyr function arrange to sort the words in descending order of frequency
word_freq <- plyr::count(tot_words_clinton) %>% arrange(desc(freq))
Quickly glancing through the list can help us identify relatively high frequency words such as “nuclear”, “jobs”, “police”, “economy”, “gun”, “debt”, etc., which gives us a sense of the key topics that were discussed.
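For instance, one convenient way to glance through the list is simply to print its most frequent entries (the column names x and freq come from plyr::count):

# Top 30 most frequent words
head(word_freq, 30)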
However, there is a fair amount of manual effort involved in scrolling through the word list to identify words of interest. How can we more quickly pinpoint interesting words?
One slight improvement to the above code would be to filter out the “stop words” (i.e. common, high-frequency words). The code for this would be as follows:
library(tm)

# Extract the list of common stop words
stop_words <- stopwords()

# Remove these words from our "bag-of-words" list
tot_words_clinton <- tot_words_clinton[!(tot_words_clinton %in% stop_words)]
Running a word frequency count on this generates a list with around 100 fewer words, making it a bit easier to identify words of interest.
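The word frequency step here is simply a re-run of the earlier code on the filtered vector:

# Recompute the frequency list after removing stop words
word_freq <- plyr::count(tot_words_clinton) %>% arrange(desc(freq))
head(word_freq, 30)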
Total number of interruptions
As per this article, “Trump interrupted Clinton 51 times, while she interrupted him 17 times”. Can we get this from our data?
Well, I tried an approach which gave me something close, though not these exact numbers. The idea is as follows: if there were no interruptions, each conversation chunk would end with a complete sentence followed by appropriate punctuation (a period or a question mark, for example). Conversations which are cut off are indicated with an ellipsis (…) in the transcript. For example, read the 11th element of the “clinton” vector: clinton[11]
which gives
[1] "CLINTON: And I have -- well, not quite that long. I think my husband did a pretty good job in the 1990s. I think a lot about what worked and how we can make it work again... "
So the number of incomplete conversations can be a rough indicator for the number of interruptions that a speaker encountered. We compute this as follows:
# Loop through each element of our vector
# Use a regular expression to match the ellipsis pattern
sum(sapply(trump, function(x) grepl("[.]{3}", x)))
sum(sapply(clinton, function(x) grepl("[.]{3}", x)))
This gives a value of 40 for Clinton and 18 for Trump (which is, at least, a rough approximation of the actual values)
What else can we do ?
A lot more! For example, you may want to do a sentiment analysis and compare the scores for each speaker. What I find more interesting (and challenging), though, is to see how we can use Part of Speech (POS) tagging, among other NLP techniques, to glean more valuable information from the data.
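As a minimal sketch of the sentiment idea (using the syuzhet package purely as an illustration; any sentiment lexicon or package would do):

library(syuzhet)

# Score each conversation chunk and compare the average sentiment per speaker
clinton_sentiment <- get_sentiment(clinton, method = "afinn")
trump_sentiment <- get_sentiment(trump, method = "afinn")

mean(clinton_sentiment)
mean(trump_sentiment)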
For example, I came across this link recently which talks about Trump’s “strange speaking style”. Can we use NLP to determine how well formed a sentence is, and use that to analyze claims such as this? It would be quite interesting to find out!
Hope you find this post useful. In case of any queries or inputs, please feel free to comment below. Thanks!