It seems rather strange to think that, just under eight months ago, I had not written any computer code (I’m not including little bits of BASIC from the ’80s), and yet lines of code or the blinking cursor of Terminal no longer instil a sense of rising panic. Although programming has a very steep learning curve, it is relatively easy to gain a basic understanding, and, with this, the confidence to experiment.
R has rapidly become my favourite programming language, so I was interested to follow a link from Scott Weingart’s blog post ‘Not Enough Perspectives Pt. 1’ to Matthew Jockers’ new R package ‘Syuzhet’. As this is an area I hope to research as part of my PhD, I decided to give it a try, using the ‘Introduction to the Syuzhet Package’ (Jockers, 2015) as a guide. I used a short text from Jane Austen’s juvenilia – ‘LETTER the FOURTH From a YOUNG LADY rather impertinent to her friend’ – removing the speech marks from the text, as they caused problems with the code.
```r
# Experiment based on
# http://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html
# 'Introduction to the Syuzhet Package', Jockers 20-2-2015

# Having installed the syuzhet package from CRAN, access it using library:
library(syuzhet)

# Input a text; for longer texts use get_text_as_string()
# This text is from Austen's juvenilia, from Project Gutenberg
example_text <- "We dined yesterday with Mr Evelyn where we were introduced to a very agreable looking Girl his Cousin...
# [I haven't included the whole text I used - it can be viewed here:
# http://www.gutenberg.org/files/1212/1212-h/1212-h.htm#link2H_4_0036]
This was an answer I did not expect - I was quite silenced, and never felt so awkward in my Life - -."

# Use get_sentences to create a character vector of sentences
s_v <- get_sentences(example_text)

# Check that all is well!
class(s_v)
str(s_v)
head(s_v)

# Use get_sentiment to assess the sentiment of each sentence. This function
# takes the character vector and one of four possible extraction methods
sentiment_vector <- get_sentiment(s_v, method = "bing")
sentiment_vector

# The different methods give slightly different results - same text, different method
afinn_s_v <- get_sentiment(s_v, method = "afinn")
afinn_s_v

# An estimate of the "overall emotional valence" of the passage or text
sum(sentiment_vector)

# To calculate "the central tendency, the mean emotional valence"
mean(sentiment_vector)

# A summary of the emotions in the text
summary(sentiment_vector)

# To visualise this using a line plot
plot(
  sentiment_vector, type = "l",
  main = "Plot Trajectory 'LETTER the FOURTH From a YOUNG LADY'",
  xlab = "Narrative Time",
  ylab = "Emotional Valence"
)
abline(h = 0, col = "red")

# To extract the sentence with the most negative emotional valence
negative <- s_v[which.min(sentiment_vector)]
negative

# and to extract the most positive sentence
positive <- s_v[which.max(sentiment_vector)]
positive

# get_nrc_sentiment categorizes each sentence by eight emotions and two
# sentiments, returning a data frame
nrc_data <- get_nrc_sentiment(s_v)

# To subset the 'sad' sentences
# (note: > 0, not >= 0, as >= 0 would match every sentence)
sad_items <- which(nrc_data$sadness > 0)
s_v[sad_items]

# To view the emotions as a barplot
barplot(
  sort(colSums(prop.table(nrc_data[, 1:8]))),
  horiz = TRUE, cex.names = 0.7, las = 1,
  main = "Emotions in 'Letter the Fourth'",
  xlab = "Percentage",
  col = 1:8
)
```
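For longer texts, the vignette also suggests normalising the trajectory so that texts of different lengths can be compared. A minimal sketch using the package’s `get_percentage_values()`, which divides the sentence-level values into a fixed number of equal-sized bins and averages each bin (the `bins = 10` value here is just an illustration for so short a passage; the vignette works with much larger texts):

```r
library(syuzhet)

# Continuing from the sentiment_vector created above
pct_vals <- get_percentage_values(sentiment_vector, bins = 10)
plot(
  pct_vals, type = "l",
  xlab = "Narrative Time (percentage)",
  ylab = "Mean Emotional Valence"
)
```

Each point on this plot is the mean valence of one tenth of the text, which smooths out sentence-level noise at the cost of local detail.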
The sentiment vector – this assigns a value to each sentence in the text.
Visualising the sentiment vector as a line graph shows the fluctuations within the text:
Visualising the emotions within the text:
I have only started to explore this package and have applied it to a very short passage (44 sentences); while this shows what Syuzhet can do in a general way, it does not demonstrate its full capabilities. In addition, as I haven’t fully read up on the package and the thinking behind it, my analysis may well be plagued with errors.
However, these are my thoughts so far. Running a brief trial using three of the available methods highlights some of the difficulties of sentiment analysis. All three identified the same sentence as the most ‘negative’:
but each of the methods identified a different sentence as the most ‘positive’:
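One quick way to see how often the lexicons disagree is to line the three methods up side by side. This is a sketch, continuing from the `s_v` vector created above, and the disagreement measure (comparing the sign of each sentence’s score) is my own rough check rather than anything from the vignette:

```r
library(syuzhet)

# Score every sentence with each of the three lexicons
methods <- c("bing", "afinn", "nrc")
scores <- sapply(methods, function(m) get_sentiment(s_v, method = m))

# Proportion of sentences on which the lexicons disagree about
# whether the sentence is positive, negative, or neutral
mean(apply(sign(scores), 1, function(x) length(unique(x)) > 1))
```

A high proportion here would suggest that sentence-level scores should be treated with caution, even when the overall trajectories agree.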
This is something Jockers discusses further in his blog post ‘My Sentiments (Exactly?)‘, highlighting that sentiment analysis is difficult for humans, as well as machines:
However, he also points out:
It seems that a combination of close and ‘distant’ reading, what Mueller calls ‘scaled reading’, is likely to be of most use if analysis at the sentence level is desired. Having only relatively recent and limited experience of programming in R, I have found the Syuzhet package very straightforward to use and am looking forward to working with it again soon.
UPDATE: 3rd April 2015
There is a great deal of academic discussion surrounding the methods discussed here. As I read further I will add another post exploring the core points and including a reading list.