In a previous post of mine published at DataScience+, I analyzed the text of the first presidential debate using relatively simple string manipulation functions to answer some high-level questions from the available text.
In this post, we leverage a few other NLP techniques to analyze another text corpus – a collection of tweets. Given a tweet, we would like to extract the keywords or phrases that convey the gist of its meaning. For the purpose of this demo, we will extract President Donald Trump’s tweets (~3000 in total) from Twitter using Twitter’s API.
I will not cover the Twitter data extraction part in this post and will jump directly into the actual analysis (the data extraction code is in Python). However, I have uploaded a CSV file with the extracted tweets here. You can download the file to your local machine & load it in your R session if you wish to follow along with the tutorial.
So let’s jump in now to the actual analysis!
How do we extract key phrases from the text
There are various approaches one can try for this. In this post, we will try out one specific approach – POS (Part of Speech) tagging followed by pattern-based chunking and extraction.
Rationale behind this approach
Consider the following tweet from our sample:
After being forced to apologize for its bad and inaccurate coverage of me after winning the election, the FAKE NEWS @nytimes is still lost!
A POS tag output for this text would be as follows:
After/IN being/VBG forced/VBN to/TO apologize/VB for/IN its/PRP$ bad/JJ and/CC inaccurate/JJ coverage/NN of/IN me/PRP after/IN winning/VBG the/DT election/NN ,/, the/DT FAKE/NNP NEWS/NNP @nytimes/NNP is/VBZ still/RB lost/VBN !/.
A definition of each of these tags can be found here.
We can make a couple of simple observations from the POS tagged text:
- As is obvious, certain tags are more informative than others. For example, noun tags (starting with NN) carry more information than prepositions or conjunctions. Similarly, if we would like to know “what” is being spoken about, noun words are likely to be more relevant than others.
- Secondly, chunks of words carry more meaning than individual words looked at in isolation. For example, in the text above, the word “coverage” alone does not adequately convey the meaning of the text, whereas “inaccurate coverage” very nicely captures one of the key themes of this tweet (the short sketch after this list illustrates both points).
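To make these observations concrete, here is a minimal sketch using hand-tagged tokens copied from the POS output above (no tagger is needed for this illustration):

# Hand-tagged tokens from the sample tweet (tags copied from the POS output above)
tokens <- c("its", "bad", "and", "inaccurate", "coverage", "of", "me")
tags   <- c("PRP$", "JJ", "CC", "JJ", "NN", "IN", "PRP")

# Keeping only the informative (noun/adjective) words already hints at the theme
tokens[grepl("^(NN|JJ)", tags)]
# [1] "bad"        "inaccurate" "coverage"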
How do we implement it
This is where you may face some challenges if you are trying to implement this in R. Python is generally preferred for tasks such as this, and most of the POS tagging & chunking examples you will find online are based on Python’s NLTK library. While NLTK is great to work with, I tried implementing this in R using the openNLP package.
This is the flow that we will follow:
Text --> Sentence Annotation --> Word Annotation --> POS Tagging --> Chunking
Given below is an explanation of the approach:
Step 1: POS Tagging
The POS tagging code is pretty much based on the code examples that are given as part of openNLP’s documentation. Here is the code:
# Load the required packages
# Note: the openNLP annotators also need the openNLPmodels.en package, which is
# installed from the datacube repository:
# install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at/", type = "source")
library(NLP)
library(openNLP)

# We will try the approach for 1 tweet; finally we will convert this into a function
x <- "the text of the tweet"
x <- as.String(x)

# Before POS tagging, we need to do sentence annotation followed by word annotation
wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator()))

# POS tag the words & extract the "words" from the output
POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
POSwords <- subset(POSAnnotation, type == "word")

# Extract the tags from the words
tags <- sapply(POSwords$features, '[[', "POS")

# Create a data frame with words and tags
tokenizedAndTagged <- data.frame(Tokens = x[POSwords], Tags = tags)
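To sanity check the output, we can inspect the resulting data frame. For the sample tweet from earlier, it would look something like this (the exact tags depend on the model version):

head(tokenizedAndTagged, 4)
#   Tokens Tags
# 1  After   IN
# 2  being  VBG
# 3 forced  VBN
# 4     to   TO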
Step 2: Chunking and Extraction
For us to chunk the POS tagged text, we first have to define what POS pattern we consider a chunk. For example, an Adjective-Noun(s) combination (JJ-NN) can be a useful pattern to extract (in the example above, this pattern would have given us the “inaccurate coverage” chunk). Similarly, we may wish to chunk and extract proper nouns. For example, in the tweet “Hope you like my nomination of Judge Neil Gorsuch for the United States Supreme Court. He is a good and brilliant man, respected by all.”, chunking and extracting proper nouns (NNP, NNPS) would give us “Judge Neil Gorsuch” & “United States Supreme Court”.
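To make this concrete, these are the kinds of patterns we will match in the extraction step below. The chunk names will be formed by joining the tags of consecutive words with “-”; the pattern names in this snippet are just my own shorthand:

# Candidate chunk patterns (the names are my own shorthand, not openNLP terminology)
patterns <- c(
  adjective_noun = "JJ-NN",  # e.g. "inaccurate coverage"
  noun_noun      = "NN.-NN"  # consecutive nouns, incl. NNP/NNS variants, e.g. "Judge Neil Gorsuch"
)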
So once we define what POS pattern we consider as a chunk, the next step is to extract them.
This is a bit trickier: Python’s NLTK has a very useful function that extracts chunks from POS tagged text using a regex-based pattern search, but I couldn’t find a similar function in openNLP, so I wrote a simple workaround for it in R.
Given below is the code with explanatory comments:
# Define a flag (Tags_mod) for POS tags - set to 1 if the tag is one we are interested in, else 0
# In this case we only want noun and adjective tags (NN, JJ)
# Note that this will also capture variations such as NNP, NNPS etc.
tokenizedAndTagged$Tags_mod = grepl("NN|JJ", tokenizedAndTagged$Tags)

# Initialize a vector to store chunk indexes
chunk = vector()

# Iterate through each word and assign each one to a group:
# - if the word doesn't belong to NN|JJ tags (i.e. Tags_mod flag is 0), assign it to the default group (0)
# - if the ith tag is in NN|JJ (i.e. Tags_mod flag is 1) and the (i-1)th flag is also 1,
#   assign it to the same group as word i-1; else assign it to a new group
chunk[1] = as.numeric(tokenizedAndTagged$Tags_mod[1])
for (i in 2:nrow(tokenizedAndTagged)) {
  if (!tokenizedAndTagged$Tags_mod[i]) {
    chunk[i] = 0
  } else if (tokenizedAndTagged$Tags_mod[i] == tokenizedAndTagged$Tags_mod[i - 1]) {
    chunk[i] = chunk[i - 1]
  } else {
    chunk[i] = max(chunk) + 1
  }
}
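As a quick sanity check of the grouping logic, here is the same loop run on a small hand-tagged stand-in for the tagger output (hypothetical data; toy and toy_chunk are names I chose for this check):

# Hand-tagged stand-in for the tagger output
toy <- data.frame(
  Tokens = c("FAKE", "NEWS", "@nytimes", "is", "still", "lost"),
  Tags   = c("NNP", "NNP", "NNP", "VBZ", "RB", "VBN")
)
toy$Tags_mod <- grepl("NN|JJ", toy$Tags)

# Same grouping logic as above
toy_chunk <- as.numeric(toy$Tags_mod[1])
for (i in 2:nrow(toy)) {
  if (!toy$Tags_mod[i]) {
    toy_chunk[i] <- 0
  } else if (toy$Tags_mod[i] == toy$Tags_mod[i - 1]) {
    toy_chunk[i] <- toy_chunk[i - 1]
  } else {
    toy_chunk[i] <- max(toy_chunk) + 1
  }
}
toy_chunk
# [1] 1 1 1 0 0 0  ->  "FAKE NEWS @nytimes" ends up in a single group (1)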
Finally, we extract the chunks matching our patterns:
# Split the tokens and tags into chunks
text_chunk <- split(as.character(tokenizedAndTagged$Tokens), chunk)
tag_pattern <- split(as.character(tokenizedAndTagged$Tags), chunk)
names(text_chunk) <- sapply(tag_pattern, function(x) paste(x, collapse = "-"))

# Extract chunks matching the pattern
# We will extract JJ-NN chunks and two or more consecutive NN tags
# "NN.-NN" -> the "." in this regex will match all variants of NN: NNP, NNS etc.
res = text_chunk[grepl("JJ-NN|NN.-NN", names(text_chunk))]
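For the sample tweet from the beginning of the post, given the tags shown earlier, res would come out along these lines (the exact groups depend on the tagger output):

res
# $`JJ-NN`
# [1] "inaccurate" "coverage"
#
# $`NNP-NNP-NNP`
# [1] "FAKE"     "NEWS"     "@nytimes"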
The overall function is available here.
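In case the link is not accessible, here is a minimal sketch of how the steps above could be wrapped into a single function. The name extractChunks is my own, and the final paste() step (which collapses each chunk into a single phrase, matching the sample outputs below) is an addition for readability; it assumes openNLP and openNLPmodels.en are installed:

library(NLP)
library(openNLP)

# A minimal sketch combining the POS tagging and chunking steps above
extractChunks <- function(x, chunk_pattern = "JJ-NN|NN.-NN") {
  x <- as.String(x)

  # Sentence + word annotation, then POS tagging (as in Step 1)
  wordAnnotation <- annotate(x, list(Maxent_Sent_Token_Annotator(),
                                     Maxent_Word_Token_Annotator()))
  POSAnnotation <- annotate(x, Maxent_POS_Tag_Annotator(), wordAnnotation)
  POSwords <- subset(POSAnnotation, type == "word")
  tags <- sapply(POSwords$features, '[[', "POS")
  tokenizedAndTagged <- data.frame(Tokens = x[POSwords], Tags = tags)

  # Group consecutive NN/JJ words into chunks (as in Step 2)
  tokenizedAndTagged$Tags_mod <- grepl("NN|JJ", tokenizedAndTagged$Tags)
  chunk <- as.numeric(tokenizedAndTagged$Tags_mod[1])
  for (i in 2:nrow(tokenizedAndTagged)) {
    if (!tokenizedAndTagged$Tags_mod[i]) {
      chunk[i] <- 0
    } else if (tokenizedAndTagged$Tags_mod[i] == tokenizedAndTagged$Tags_mod[i - 1]) {
      chunk[i] <- chunk[i - 1]
    } else {
      chunk[i] <- max(chunk) + 1
    }
  }

  # Name each chunk by its tag sequence and keep the ones matching the pattern
  text_chunk <- split(as.character(tokenizedAndTagged$Tokens), chunk)
  tag_pattern <- split(as.character(tokenizedAndTagged$Tags), chunk)
  names(text_chunk) <- sapply(tag_pattern, function(x) paste(x, collapse = "-"))
  matches <- text_chunk[grepl(chunk_pattern, names(text_chunk))]

  # Collapse each chunk's tokens into a single phrase
  unname(sapply(matches, paste, collapse = " "))
}

# Example (output is indicative):
# extractChunks("Jeff Sessions is an honest man.")
# [1] "Jeff Sessions" "honest man"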
Testing the Approach
Testing the approach on a few tweets gave fairly promising results; some sample results are provided below.
Sample Tweet 1:
“Jeff Sessions is an honest man. He did not say anything wrong. He could have stated his response more accurately, but it was clearly not….”
Chunking Output:
c("Jeff Sessions", "honest man")
Sample Tweet 2:
Since November 8th, Election Day, the Stock Market has posted $3.2 trillion in GAINS and consumer confidence is at a 15 year high. Jobs!
Chunking Output:
c("Election Day", "Stock Market")
Sample Tweet 3:
Russia talk is FAKE NEWS put out by the Dems, and played up by the media, in order to mask the big election defeat and the illegal leaks
Chunking Output:
c("Russia talk", "FAKE NEWS", "big election defeat", "illegal leaks")
Conclusion
As can be seen, this approach seems to work fairly well: the extracted chunks convey some of the key themes present in the text.
We can, of course, try out more complex techniques and alternate approaches, and if time permits, I will try exploring & blogging about a few of them in the future.
However, the intent of this post was to give readers a quick overview of POS tagging and chunking, and of the usefulness of these techniques for certain NLP tasks.
Hope you find this useful!