Aug 02 2017

On Word Embeddings and Fish – Letters of 1916

Letters of 1916 is a public humanities project run by Maynooth University and directed by Professor Susan Schreibman. The project is creating a crowd-sourced digital collection of letters written between 1st November 1915 and 31st October 1916. For my internship, I have been working on the project as a data scientist, analysing the collection.

Following my two earlier posts on Topic Modelling, I am moving on to a different type of vector space modelling: Word Embedding.

Like Topic Modelling, Word Embeddings have their origins in computer science, specifically methods of information retrieval. As before, I carry out my analysis using R, this time using the ‘wordVectors’ package created by Schmidt and Li (2015). This package uses a version of the Word2Vec algorithm originally created by Mikolov et al (2013) at Google. To visualise the results I have used both the built-in plot function, and an adapted plot created using ‘ggplot2’ (Wickham 2009) and ‘ggrepel’ (Slowikowski 2016).

Preparing the text
Word Embeddings can be created with relatively little pre-processing, although depending on the size and type of the corpus you may wish to experiment with stop word removal. The collection analysed below is a subset of the larger Letters of 1916 collection, totalling 1372 letters. The texts were stored as plain text files and were pre-processed using the prep_word2vec function, which takes a collection of plain text documents and creates a single plain text file. You can also opt to convert the text to lower case, which I chose to do here.
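To make the preparation step concrete, here is a minimal sketch in Python (not the R code used for this analysis, and not a reimplementation of prep_word2vec). It only illustrates the general idea of merging a folder of plain text files into one lower-cased file. The function name combine_texts, the file names and the two sample "letters" are all invented for the illustration.

```python
import tempfile
from pathlib import Path


def combine_texts(source_dir, destination, lowercase=True):
    """Concatenate every .txt file in source_dir into one file, one document per line."""
    with open(destination, "w", encoding="utf-8") as out:
        for path in sorted(Path(source_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            if lowercase:
                text = text.lower()
            # Collapse line breaks and repeated spaces so each letter sits on one line.
            out.write(" ".join(text.split()) + "\n")


# Demonstration with two made-up letters in a temporary folder (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    letters = Path(tmp, "letters")
    letters.mkdir()
    Path(letters, "letter_a.txt").write_text("Dear Mother,\nThe Rising has begun.", encoding="utf-8")
    Path(letters, "letter_b.txt").write_text("Send HERRINGS if you can.", encoding="utf-8")

    combined_path = Path(tmp, "combined.txt")
    combine_texts(letters, combined_path)
    combined = combined_path.read_text(encoding="utf-8")

print(combined)
```

The sketch leaves punctuation untouched. Any tokenisation that the real function performs is outside its scope.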

Creating Word Embeddings

Word Embeddings are a useful addition to the tools that can be used to explore text collections, as Ben Schmidt notes: “For digital humanists, they merit attention because they allow a much richer exploration of the vocabularies or discursive spaces implied by massive collections of texts than most other reductions out there” (2015 ‘Word Embeddings’). Unlike a Topic Model, which provides an answer to ‘what topics are in this text collection?’, a Word Embedding allows us to ask ‘what is being said about this topic/word in this corpus?’.

The first step is to train the Word Embedding model. The vector space model maps the words in a corpus into a multi-dimensional space, in which position represents the semantic and syntactic relationships between words. Each word is encoded as a vector of length n, where n is the chosen number of dimensions (the package calls this setting ‘vectors’). The dimensions are created by the computer, and what they specifically represent is still an open question. As Levy (2015) puts it, “this is a highly debated topic in the NLP/ML community, so my scientifically accurate answer is that we don’t yet know”. The model I created has 300 vectors (that is, 300 dimensions), a window of 12, negative sampling of 5, and uses the default skip-gram option. I will go into more detail about how Word Embedding works in a future post.
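As a rough illustration of what the ‘window’ setting means, the Python sketch below lists the (target, context) word pairs that a skip-gram style model would see for a short sentence. It is a simplification written for this post, not the Word2Vec implementation: it uses a fixed window on both sides and does no training, and the function name skipgram_pairs and the sample sentence are invented.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Made-up sentence, with a window of 1 to keep the output short.
tokens = "the rebels held the post office".split()
pairs = skipgram_pairs(tokens, window=1)

for target, context in pairs:
    print(target, "->", context)
```

With a window of 12, as in the model described above, each word would be paired with up to twelve neighbours on either side, so words that regularly appear in the same stretch of a letter end up with similar vectors.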

Changing the number of vectors will change the model produced, for example:

100 Vector Model – Ten words closest to ‘rising’

300 Vector Model – Ten words nearest to ‘rising’

These two images show the ten words nearest to a vector for ‘rising’ using a model with 100 vectors and a model with 300. Although the differences are subtle, the larger number of vectors creates a more nuanced interpretation.

Visualising the Results

Once the model has been created you can start to explore it. One way is to use the closest_to function to create a list of the words nearest to a chosen target word. For example, if we wanted to see the words closest to ‘rising’ in the letters we would get this:

10 Words Closest to ‘rising’
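A nearest-word lookup of this kind can be sketched as a ranking by cosine similarity. The Python below is an illustration only and assumes cosine similarity as the measure of closeness. The three-dimensional vectors are invented toy values, not taken from the Letters of 1916 model, and nearest_words is a hypothetical helper rather than the package's function. Unlike some implementations, it leaves the target word out of its own results.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(vectors, target, n=10):
    """Rank every other word by its cosine similarity to the target word."""
    target_vec = vectors[target]
    scored = [
        (word, cosine(target_vec, vec))
        for word, vec in vectors.items()
        if word != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Invented toy vectors (hypothetical values, three dimensions instead of 300).
toy_vectors = {
    "rising": [0.9, 0.1, 0.0],
    "rebels": [0.8, 0.2, 0.1],
    "rebellion": [0.85, 0.05, 0.1],
    "herrings": [0.1, 0.9, 0.2],
    "weather": [0.0, 0.2, 0.9],
}

neighbours = nearest_words(toy_vectors, "rising", n=2)
for word, score in neighbours:
    print(word, round(score, 3))
```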

We can view a greater number of words near to the term ‘rising’ by viewing them as a plot. The built-in plot function uses t-SNE (t-distributed stochastic neighbour embedding) to reduce the dimensions for each word to a point that can be plotted in 2D. The built-in plot returns a result like this:

T-SNE plot Terms closest to ‘rising’

Unfortunately, the terms are overlapping in places which makes the plot quite hard to read. I decided to create a custom plot to try to solve this problem. I used the ‘ggplot2’ and ‘ggrepel’ packages which allowed me to mark the position of each word with a point, and to offset the label to improve readability.

Close up of Custom Plot

Analysing the Results

The previous image is a close-up from the plot of the 500 words closest to ‘rising’. One of the advantages of this method of text exploration is that it can confirm things we know to be in the texts: for example, this cluster contains several terms which are linked to the Rising, such as ‘rebels’, ‘seditious’ and ‘speeches’. It can also highlight the unexpected, or as Nicholas and Herman (2009:22) call it, “serendipitous discovery”. This provides a way of searching a text while reducing the problems of search itself: “Search is a form of data mining, but a strangely focused form that only shows you what you already know to expect” (Underwood 2014:66).

In the cluster above we can see the word ‘shark’, which seems rather out of place. To examine the context of a word, a Key Word in Context (KWIC) search (such as the one Jockers describes in his 2014 book Text Analysis with R for Students of Literature) can be used. This reveals that this is a reference, in a single letter, to the sinking of HMS Shark at the Battle of Jutland. The letter provides a detailed account of the sinking of HMS Shark and the fate of its captain and crew – something that could easily be overlooked in

KWIC results for ‘shark’
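The idea behind a KWIC search is simple enough to sketch in a few lines. The Python below is a minimal illustration, not the R code from Jockers's book or the code used for this post: it splits the text into words and returns each occurrence of the keyword together with a few words on either side. The function name kwic and the sample sentence are made up for the example.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    tokens = re.findall(r"[\w']+", text.lower())
    keyword = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Invented sample text (hypothetical, not a quotation from the collection).
sample = "We heard that the Shark was lost in the battle. No shark was seen again."
results = kwic(sample, "shark", width=2)

for left, word, right in results:
    print(f"{left:>12} | {word} | {right}")
```

Because the context is returned alongside each hit, a rare word that looks out of place in a plot can be traced back to the passages it came from.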

In the same cluster, towards the top right, we have the terms ‘preserved’ and ‘herrings’ – what, we might ask, do herrings have to do with the 1916 Easter Rising?


This highlights the importance of the human element in examining the output of tools like this. Computers are great for crunching numbers and creating plots, but when it comes to spotting patterns there is nothing better than a human. With a little knowledge about the Easter Rising, we can spot words which seem to fit into a pattern and those that seemingly do not. Again, we use the KWIC to explore the context of our ‘herrings’ to see if we can identify how they are linked to the Rising.

KWIC results for ‘herrings’

This time our references come from four different letters. The last two references come from letters written by soldiers in WWI, talking about the food they have at the Front and the food they would like their families to send. However, the remaining five references are from two letters regarding the conditions for post-Rising internment prisoners at Frongoch Prison in Wales. This leads us to an interesting potential research question, ‘what were the conditions like for internment prisoners?’, as well as some initial sources for further investigation.

A final point to note about our fishy terms is that neither term is particularly frequent: ‘shark’ appears only 6 times in the corpus and ‘herrings’ only 7 times. The chance of a “serendipitous discovery” for these low frequency terms would be fairly low if we were using close reading alone. This highlights the usefulness of combining Word Embedding and close reading to examine large collections of texts.
