
Mar 10 2016

Beyond the Word Cloud

Possibly the most common entry-level visualisation in computational textual analysis is the word cloud. There are multiple online tools (the best known probably being Wordle) that allow you to create them in all sorts of styles. A word cloud can be very useful for identifying frequent terms at a glance, or for indicating possible themes.

Voyant Tools, created by Stéfan Sinclair and Geoffrey Rockwell, has its own version, Cirrus, among a range of other analytical tools, and a stop list can be applied to it.

Austen’s Mansfield Park Chapters 1-3

This example shows the opening three chapters of Austen’s Mansfield Park with the English stop list applied. It is possible to identify some of the key characters and to get a sense of possible themes. We could create a cloud for the opening three chapters of Edgeworth’s Patronage and visually compare the two,

Edgeworth’s Patronage Chapters 1-3

but what does this really reveal about the similarities and differences between the two texts?

While word clouds indicate the frequency of words through size, the colour and spatial position of each word are decorative rather than functional. Jacob Harris and Robert Hein have written critical pieces on the lure of the word cloud and the false sense of achievement this type of visualisation can inspire while failing to address the question ‘So what?’. However, the benefits, possible developments and applications of this type of visualisation have also started to be explored. Quim Castellà and Charles Sutton discuss the use of multiple word clouds in their article ‘Word Storms: Multiples of Word Clouds for Visual Comparison of Documents’, identifying three levels of analysis: “a quick impression of the topics in the corpus”, “compare and contrast documents”, and “an impression of a single document”. Glen Coppersmith and Erin Kelly, meanwhile, use “dynamic wordclouds” to investigate the contents of a corpus and “vennclouds” to compare the relationship between corpora in their paper ‘Dynamic Wordclouds and Vennclouds for Exploratory Data Analysis’.

As my research uses R programming, I decided to explore whether it is possible to create an improved word cloud using R. This is something that both Drew Conway and Rolf Fredheim have explored in their blogs, each proposing possible solutions. I was particularly interested in Fredheim’s solution because it incorporated statistical significance; however, I was initially unable to get the code to work. I set myself two main tasks:

    1. To create a working algorithm
    2. To see whether it was possible to develop the algorithm further
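
For comparison, the conventional starting point can be reproduced in a few lines of R using the tm and wordcloud packages. This is a minimal sketch rather than part of the scripts discussed below; the file name matches the Austen sample used later.

require(tm)
require(wordcloud)

# Read the text and build a corpus
txt <- readLines('Austen_1814_MP_3ch.txt')
corp <- Corpus(VectorSource(txt))

# Basic cleaning and an English stop list
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("english"))

# Term frequencies, then the cloud itself
tdm <- TermDocumentMatrix(corp)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100)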

word_difference_map recreates Fredheim’s code with a few changes and corrections. I have added comments to explain what I have changed and why, and also to help me understand what each piece of code is doing. I have included part of the code here minus most of the comments; a full copy including comments can be found in my GitHub repo. The algorithm reads in two texts, processes them and applies a stop list, then calculates the word frequencies for the two text samples.

txt1 <- readLines('Austen_1814_MP_3ch.txt')
txt2 <- readLines('Edgeworth_1814_P_3ch.txt')
# Load tm package
require(tm)

# Convert for use in tm package - creates a Volatile Corpus 
txt1c <- Corpus(VectorSource(txt1))
txt2c <- Corpus(VectorSource(txt2))

# Remove punctuation, whitespace, change to lower case, stem, apply stoplist, 
# remove numbers. Named entities not removed
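
# (These transformations are not shown in the post; a sketch of the steps the
# comment above describes, using standard tm functions. 'clean' is just an
# illustrative helper name.)
clean <- function(x) {
        x <- tm_map(x, content_transformer(tolower))
        x <- tm_map(x, removePunctuation)
        x <- tm_map(x, removeNumbers)
        x <- tm_map(x, removeWords, stopwords("english"))
        x <- tm_map(x, stemDocument)
        tm_map(x, stripWhitespace)
}
txt1c <- clean(txt1c)
txt2c <- clean(txt2c)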

# Create Document Term Matrix
dtm1 <- DocumentTermMatrix(txt1c) 
dtm2 <- DocumentTermMatrix(txt2c)

# Convert DTM to data frame
dtm1 <- data.frame(as.matrix(dtm1))
dtm2 <- data.frame(as.matrix(dtm2))

# Made v1 and v2 the author names
Austen <- as.matrix(sort(sapply(dtm1, "sum"), decreasing = T))
Edgeworth <- as.matrix(sort(sapply(dtm2, "sum"), decreasing = T))

# Removing missing values
Austen <- Austen[complete.cases(Austen),] # (the original used v2 here - presumably a typo)
Edgeworth <- Edgeworth[complete.cases(Edgeworth),]

words1 <- data.frame(Austen)
words2 <- data.frame(Edgeworth)

# Merge the two tables by row names
wordsCompare <- merge(words1, words2, by="row.names", all = T)
# Replace NA with 0
wordsCompare[is.na(wordsCompare)] <- 0

Going beyond the calculations needed to create a word cloud, the algorithm calculates the proportion, z score and difference, allowing these to be used in the creation of the finished visualisation.

wordsCompare$prop <- wordsCompare$Austen/sum(wordsCompare$Austen) 
wordsCompare$prop2 <- wordsCompare$Edgeworth/sum(wordsCompare$Edgeworth)
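
# (The line defining the $dif column is missing from the post as published, but
# the plots and the dif filter below need it. A plausible reconstruction is the
# percentage difference between the two proportions, running from -100, a word
# found only in Edgeworth, to 100, a word found only in Austen - which is what
# the later filter at +/-99 removes.)
wordsCompare$dif <- 100 * (wordsCompare$prop - wordsCompare$prop2) /
        (wordsCompare$prop + wordsCompare$prop2)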

# Broke down the z score formula a little to understand how it worked
a <- wordsCompare$prop
b <- wordsCompare$prop2
c <- wordsCompare$Austen
d <- wordsCompare$Edgeworth
e <- sum(c)
f <- sum(d)

# z score formula (a two-proportion z-test) - adds column for z scores
p <- ((e * a) + (f * b)) / (e + f) # pooled proportion of each word across both samples
wordsCompare$z <- (a - b) / (sqrt(p * (1 - p)) * sqrt((e + f) / (e * f)))

# Keep data of moderate significance - confidence level of 95%
wordsCompare <- subset(wordsCompare, abs(z) > 1.96)

# Order words according to significance 
wordsCompare <- wordsCompare[order(abs(wordsCompare$z), decreasing = T),]

# Plot the data points
require(ggplot2)
ggplot(wordsCompare, aes(z, dif)) + geom_point()

wordsCompare$z2 <- 1 # adds a column of 1s
wordsCompare$z2[abs(wordsCompare$z) >= 1.96] <- 0
# Sets $z2 to 0 where abs($z) is greater than or equal to 1.96 - all remaining
# rows are, since those below 1.96 were already removed. Try 2.58.
wordsCompare$z2[abs(wordsCompare$z) >= 2.58] <- 0

# Fixed this bit - it should be $dif not $z 
wordsCompare <- wordsCompare[wordsCompare$dif >-99 & wordsCompare$dif <99,]

wordsCompare <- wordsCompare[order(abs(wordsCompare$prop2+wordsCompare$prop),
                                   decreasing = T),]

# Plot 
ggplot(head(wordsCompare, 50), aes(dif, log(abs(Austen + Edgeworth)),
                                   size = (Austen + Edgeworth),
                                   label = Row.names, colour = z2)) +
        geom_text(fontface = 2, alpha = .8) +
        scale_size(range = c(3, 12)) +
        ylab("log number of mentions") +
        xlab("Percentage difference between samples \n <----------More in Edgeworth --------|--------More in Austen----------->") +
        geom_vline(xintercept=0, colour  = "red", linetype=2)+
        theme_bw() + theme(legend.position = "none") +
        ggtitle("Differences in Terms Used by \n Austen and Edgeworth")

Plot showing Z Score and Difference

Word Difference Map

The algorithm has two main outputs: a scatterplot, which shows the relationship between z score and difference, and the word difference map itself.

Getting the algorithm to work took time and a fair bit of trial and error. However, several things occurred to me:

    1. Applying the stop list may remove potentially interesting differences
    2. Removing the less significant words may make the plot less cluttered, but also removes the similarities
    3. The scatterplot is used as a check, but there is an opportunity to make this a more useful visualisation
    4. At a glance, the significance of the colours is not clear to someone who has not seen the code

My updated version (word_diff_map2) keeps all the terms, as I chose not to apply a stop list; I also chose not to stem the terms.
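
In terms of preprocessing, this simply means dropping the stop list and stemming steps when cleaning the corpora; something along these lines (again a sketch, with an illustrative helper name):

clean2 <- function(x) {
        x <- tm_map(x, content_transformer(tolower))
        x <- tm_map(x, removePunctuation)
        x <- tm_map(x, removeNumbers)
        tm_map(x, stripWhitespace)
}
txt1c <- clean2(txt1c)
txt2c <- clean2(txt2c)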

The main changes I made to the code were:

wordsCompare$z2 <- 0 # insignificant terms
wordsCompare$z2[abs(wordsCompare$z) >= 1.96] <- 1 # significant at 95% confidence 
wordsCompare$z2[abs(wordsCompare$z) >= 2.58] <- 2 # significant at 99% confidence
wordsCompare$z2[abs(wordsCompare$z) >= 3.08] <- 3 # significant at 99.79%

and

# Plot by z score and difference highlighting min and max outlying words
ggplot(wordsCompare, aes(z, dif, colour=z2, label=Row.names)) + geom_point()+
        scale_colour_gradientn(name = "Z Score",
                               labels = c("Not significant", "1.96", "2.58", "3.08"),
                               colours = rainbow(4)) +
        geom_text(aes(label = ifelse(wordsCompare$z == max(wordsCompare$z),
                                     as.character(Row.names), '')),
                  hjust = 0, vjust = 0) +
        geom_text(aes(label = ifelse(wordsCompare$z == min(wordsCompare$z),
                                     as.character(Row.names), '')),
                  hjust = 0, vjust = 0) +
        ylab("Difference")+
        xlab("Z Score")+
        theme(panel.background = element_rect(colour = "pink"))+
        ggtitle("Relationship between z score and percentage difference \n in Austen and Edgeworth")

Updated version of Scatterplot

The points on the scatterplot were highlighted according to their statistical significance, and labels were included to indicate the terms with the highest and lowest z scores. For consistency, I used the same colours for the main word difference map and added a key to both plots.
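
The three z2 cut-offs are simply two-tailed critical values of the standard normal distribution for the corresponding confidence levels, which can be checked in base R:

qnorm(0.975)    # 1.96  -> 95% confidence
qnorm(0.995)    # 2.58  -> 99% confidence
qnorm(0.99895)  # ~3.08 -> 99.79% confidence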

Updated version of Word Difference Map

An interesting point that the new version of the algorithm highlighted was the difference between the authors’ use of feminine and masculine terms in the opening three chapters. Edgeworth uses “his”, “him”, “father” and “himself” more frequently, whereas Austen uses “she”, “her”, “Mrs”, “lady” and “sister” more.

The next steps will be to try the code on a larger section of text and to see whether some of the areas explored by Coppersmith and Kelly can be incorporated.
