Jul 06 2017

Topic Modelling: PoS Tagging – Letters of 1916

Letters of 1916 is a public humanities project run by Maynooth University and directed by Professor Susan Schreibman. The project is creating a crowd-sourced digital collection of letters written between 1st November 1915 and 31st October 1916. For my internship, I have been working on the project as a data scientist, analysing the collection.

In my previous post I discussed topic modelling the Letters of 1916 collection. Before moving on to a post on word embeddings, I thought I would explore some additional points. Topic modelling is not an exact science; there is a certain amount of trial and error involved. This means that some of the topics extracted from a text collection can be overly influenced by frequent words or by the texts themselves. Matthew Jockers demonstrated some of these problems in his blog post “Secret” Recipe for Topic Modelling, showing that character names, and the effect of modelling complete novels, can result in topics which reflect the text rather than its themes. While stop word lists can address the issue of names and some of the more commonly used words, this is sometimes not enough.

Part of Speech Tagging (PoS)
In his blog post, Jockers suggests using Part of Speech tagging as an additional preprocessing step: keeping only the nouns reduces the noise from other parts of speech. He does, however, add the caveat: “I think this is a good way to capture thematic information; it certainly does not capture such things as affect (i.e. attitudes towards the theme) or other nuances that may be very important to literary analysis and interpretation”.

To apply this additional processing step to the Letters of 1916 we need a part of speech tagger, in this case the freely available TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/). The .txt files are first passed through TreeTagger, which tags each word with its part of speech. The words tagged as nouns (NN) and plural nouns (NNS) are then extracted and saved into a new file under the original letter ID. These noun-only texts are processed using the ‘tm’ package in R, with numbers, punctuation and stop words removed, and a Document Term Matrix is created. As before, I used ‘ldatuning’ to identify the optimum number of topics. This time the result was a little less clear, although 30 still appears to be the best option.
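The noun-extraction step described above can be sketched in a few lines. This is only an illustration in Python, not the code used for the project (which worked with TreeTagger and R). It assumes TreeTagger's usual output of one token per line, tab-separated as token, tag, lemma. The function names, the sample sentence and the letter ID are hypothetical.

```python
from collections import Counter

# Tags kept in the post: NN (noun) and NNS (plural noun).
NOUN_TAGS = {"NN", "NNS"}


def extract_nouns(tagged_text):
    """Return lower-cased tokens tagged NN or NNS.

    Assumes TreeTagger-style output: one token per line,
    tab-separated as token, tag, lemma.
    """
    nouns = []
    for line in tagged_text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        token, tag = parts[0], parts[1]
        if tag in NOUN_TAGS:
            nouns.append(token.lower())
    return nouns


def term_counts(nouns_by_letter):
    """Count noun frequencies per letter (a toy stand-in for a document-term matrix)."""
    return {letter_id: Counter(nouns) for letter_id, nouns in nouns_by_letter.items()}


# Hypothetical tagged sentence, written by hand for illustration.
sample = (
    "The\tDT\tthe\n"
    "prisoners\tNNS\tprisoner\n"
    "were\tVBD\tbe\n"
    "sent\tVVN\tsend\n"
    "to\tTO\tto\n"
    "the\tDT\tthe\n"
    "camp\tNN\tcamp\n"
    ".\tSENT\t."
)

nouns = extract_nouns(sample)
print(nouns)  # ['prisoners', 'camp']

# "letter_0001" is a made-up ID standing in for the original letter ID.
counts = term_counts({"letter_0001": nouns})
print(counts["letter_0001"]["camp"])  # 1
```

Filtering on the tag column rather than the lemma column keeps the surface forms found in the letters. Whether to model tokens or lemmas is a separate choice that the post does not specify.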

Topic Modelling Noun-Only Corpus

Letters of 1916 Topics – Nouns

What this preprocessing step seems to achieve is to refine the words in the topics, making the theme for each group of letters easier to identify. For some of the topics the difference is relatively minor, as we can see in the Internment topic:

Internment Topic – Full Corpus

Internment Topic – Nouns Corpus

For other topics, the focus on the nouns helps to clarify the topic further. This is evident in the topic I have called Letters Before Death:

Letters Before Death Topic – Full Corpus

Letters Before Death Topic – Nouns Corpus

What seems to be clear is that examining the topics created from the full text corpus, as well as those from the nouns-only corpus, may prove useful in understanding the collection.
