Letters of 1916 is a public humanities project run by Maynooth University and directed by Professor Susan Schreibman. The project is creating a crowd-sourced digital collection of letters written between 1st November 1915 and 31st October 1916. For my internship, I have been working on the project as a data scientist, analysing the collection.
One of the most popular tools for exploring text collections is topic modelling, a method for uncovering the latent themes in a collection of documents, most often using Latent Dirichlet Allocation (LDA). In simple terms, “Topic modeling is a way of extrapolating backward from a collection of documents to infer the discourses (“topics”) that could have generated them” (Underwood, 2012).
Identifying Number of Topics
A challenge when creating topic models is determining the optimal number of topics. The R package ‘ldatuning’ (Murzintcev, 2015), which evaluates candidate models for the corpus against four separate metrics, was used to identify an appropriate number of topics: 30.
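A search of this kind might be sketched as follows; the document-term matrix `dtm`, the candidate range, and the seed are illustrative placeholders rather than the project's actual settings:

```r
library(ldatuning)

# dtm: a DocumentTermMatrix built from the letters (placeholder here)
result <- FindTopicsNumber(
  dtm,
  topics  = seq(from = 10, to = 50, by = 5),   # candidate topic counts
  metrics = c("Griffiths2004", "CaoJuan2009",
              "Arun2010", "Deveaud2014"),      # the four metrics
  method  = "Gibbs",
  control = list(seed = 77)
)

# Plot all four metrics against the number of topics and look for
# where they converge or level off (in this corpus, around k = 30)
FindTopicsNumber_plot(result)
```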
The topic model was created in R using the ‘topicmodels’ package (Grün and Hornik, 2011), an implementation of Latent Dirichlet Allocation using Gibbs sampling. The resulting topics were visualised as a series of word clouds (available at http://sarajkerr.com/Dataviz/intern/images/Letters1916_new.gif).
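A minimal sketch of the model fit, again assuming a prepared document-term matrix `dtm`; the seed and sampler settings are illustrative, not the project's exact configuration:

```r
library(topicmodels)

# Fit a 30-topic LDA model with Gibbs sampling
lda_model <- LDA(dtm, k = 30, method = "Gibbs",
                 control = list(seed = 1916, burnin = 1000, iter = 1000))

# The top terms per topic supply the words (and their weights)
# that feed the word-cloud visualisations
top_terms <- terms(lda_model, 10)
top_terms[, 1:3]   # inspect the first three topics
```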
Interactive Topic Visualisations Using LDAvis
The topics were also visualised using ‘LDAvis’ (Sievert and Shirley, 2015), in which the most distinctive terms in a topic can be explored interactively by adjusting the relevance metric λ (here set to 0.5). The image below illustrates Topic 14 (Rebellion): the red bars indicate the frequency of each term within the topic, while the blue bars indicate its frequency in the corpus as a whole.
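Bridging from ‘topicmodels’ to ‘LDAvis’ requires extracting the topic-term and document-topic distributions from the fitted model. A sketch, assuming `lda_model` and `dtm` as in the earlier steps:

```r
library(LDAvis)
library(topicmodels)

post <- posterior(lda_model)   # posterior distributions from the Gibbs fit

json <- createJSON(
  phi            = post$terms,            # topic-term distributions
  theta          = post$topics,           # document-topic distributions
  doc.length     = slam::row_sums(dtm),   # tokens per letter
  vocab          = colnames(post$terms),
  term.frequency = slam::col_sums(dtm)    # corpus-wide term counts
)

# Opens the interactive visualisation in a browser;
# the λ slider re-ranks each topic's terms by relevance
serVis(json)
```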
The topics identified by the topic model highlight a number of interesting themes within the collection: Topics 11 and 22 refer to prison; Topic 14 to rebellion; Topic 15 to official correspondence; Topic 17 to Roger Casement; Topic 18 to letters to Lady Clonbrock; Topic 21 to legal matters; Topic 23 to the murders of Sheehy Skeffington, Dickson, and MacIntyre; and Topic 30 to prisoners of war.
While the topic models are informative, they have a relatively narrow focus, which limits the opportunities for “serendipitous discovery” (Nicholas and Herman, 2009: 22).
In my next post I will explore vector space models, also known as word embeddings.