Nov 05 2016

Getting Organised – Referencing

September marked the halfway point of my PhD. So, as it is time to get things organised and focus on writing, I have decided to write a series of posts about my workflow. This is partly to help me clarify and streamline my processes, but also in case it is of use to anyone else.

My first post is about referencing. Referencing is one of those things you have to do, but it is all too easy to leave to the last minute because, quite frankly, it is boring. I have spent many sleepless nights trying to sort out a bibliography started far too late, especially during my undergraduate degree, when there was no laptop to help. Luckily there are now lots of tools available which can make this task relatively painless.

Reference Management
One of the absolute essentials, for me at least, is a reference management tool. I use Mendeley, partly because I have been using it since my MA, but mostly because it is free and has desktop and web versions which sync. This means I am confident that my library of references is safe and backed up (I also back up to an external hard drive, because you can never have too many backup copies – yes, I am slightly paranoid about losing my work).

References can be added simply by dragging a PDF file into the document list, or entered manually. The authors’ names appear on the left-hand side and the details of a selected document appear on the right-hand side, which allows you to check the details and correct them if needed. You can view your references as a table or as citations, and you can choose from a number of different referencing styles. There is also a notes section, but to be honest I never use this. If PDF files are saved into Mendeley you can open and read them in the desktop app or via the web and iPhone versions – great for reading on the go.

However, where Mendeley really comes into its own is in organising your library of references. Over the course of my MA and PhD studies I have amassed a huge library of references, over 600 and still growing. I have created a Thesis folder in Mendeley, with a sub-folder for each chapter. Each reference I use in a particular chapter is added to the appropriate folder; this breaks the references I have to check into smaller chunks, and each folder can also double as a reading list for its chapter.

So, references are organised and checked for accuracy, but this still doesn’t solve the dreary task of creating a bibliography. The solution I found combines LaTeX (which I will write about in my next post) and a BibTeX file created from my Mendeley library.

Creating a BibTeX File
To create a BibTeX file from your library, you need to go to the preferences tab. Go to the BibTeX tab and tick ‘Enable BibTeX syncing’.

Mendeley Preferences Tab

There are three options; I go with ‘Create one BibTeX file for my whole library’ so I don’t have to worry about whether my references are in a particular folder. Browse and select where you want the file to be saved, then click ‘Apply’ and ‘Save’. And that is it: as you add to your Mendeley library, your BibTeX file is automatically updated with your new references.

Citation Keys
The other thing you need to do is make sure ‘citation key’ is ticked under ‘Document Details’. When you select a document you will see that in the details on the right a citation key has been created (circled in red here),

Mendeley Desktop with Citation Key circled

the default is Author/Date, but you can change it to whatever works best for you. This little shorthand reference to the document is the key to creating a bibliography and in-text citations with minimal effort using LaTeX.

Oct 03 2016

Jane Austen in Vector Space – Presentation at JADH

In September, I presented a paper which discussed the application of vector space models to a corpus of Jane Austen’s published novels at the Japanese Association for Digital Humanities Conference in Tokyo.

The paper was titled ‘Jane Austen in Vector Space: Applying vector space models to 19th century literature’ and outlined some of the findings from my pilot study applying data mining techniques to Austen’s novels.

The advent of distant and scaled reading techniques within literary studies has enabled the exploration of texts in a manner which “defamiliarize…making them unrecognizable in a way…that helps scholars identify features they might not otherwise have seen” (Clement, Tanya. “Text Analysis, Data Mining and Visualisations in Literary Scholarship.” MLA Commons | Literary Studies in the Digital Age. Oct. 2013. Web.). Topic modelling is, perhaps, the most popular of these tools for Digital Humanists who wish to transform texts and view them through a different lens. However, the application of ‘word2vec’ (an algorithm which represents words as points in space, and the meanings and relationships between them as vectors) has the potential to be of even greater use. It can work effectively on a smaller corpus and can be applied to full texts, whereas, as Jockers has noted (“‘Secret’ Recipe for Topic Modeling Themes.” matthewjockers.net. 12 Apr. 2013. Web.), topic modelling is more effective when working with a large, noun-only corpus. In addition, ‘word2vec’ allows the exploration of discourses surrounding a theme. Rather than asking ‘which topics or themes are in this corpus of texts?’, the application of the ‘word2vec’ algorithm allows us to ask ‘what does the corpus say about this theme?’.
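
For a flavour of the kind of query this makes possible, here is a minimal sketch in R – not the pipeline used for the paper – assuming the CRAN word2vec package and a hypothetical plain-text file of the novels (austen_novels.txt). It trains a small model and asks which words sit closest to a theme word in the vector space.

# Minimal sketch (not the code used for the paper)
library(word2vec)   # CRAN package, assumed to be installed

# Hypothetical file containing the plain text of the novels
txt <- tolower(readLines("austen_novels.txt"))
txt <- txt[nzchar(txt)]

# Train a small skip-gram model: each word becomes a point in a 100-dimensional space
model <- word2vec(x = txt, type = "skip-gram", dim = 100, iter = 20)

# Ask what the corpus says about a theme by finding its nearest neighbours
predict(model, newdata = c("marriage", "fortune"), type = "nearest", top_n = 10)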

Links can be found here to the conference Proceedings, Slides and a draft of the presentation.

Jul 14 2016

GST1 – 1: Author Seminar

GST 1 is a module at Maynooth University which aims to improve research skills and employability. To gain 5 ECTS for this module you need to attend 6 sessions and produce a diary entry or set of notes for each one.

Author Seminar: Scientific Journals, Peer Review and How to Write a Great Research Paper

This session was presented by Rupal Malde from Elsevier and therefore some of the content is skewed towards Elsevier’s practices.

The Publishing Cycle

The accepted article is assigned a DOI which follows it throughout the production process, during which it passes through the preprint, accepted manuscript, document proof, published, and electronic stages.

From submission to acceptance takes an average of 21 weeks; however, it is fine to email after three months to ask what is happening with a paper.

Choosing the Journal
There are a number of metrics which can indicate the rank and prestige of a journal. These metrics vary by field, and the Humanities are not represented as strongly as STEM subjects.

  • SJR is the SCImago Journal Rank which indicates the prestige of the journal
  • SNIP is the Source Normalized Impact per Paper

For Elsevier journals there is a tool (linked here) which takes your title and abstract, with an option to narrow the subject field. The results show impact ratings, acceptance rate, editorial time, publication time and open access fees.

Reasons for Rejection

  • Doesn’t match the aims and scope of the journal
  • It is incomplete or does not follow the journal specific structure
  • Data and statistics are inaccurate
  • Over-confident conclusions which can’t be justified
  • It’s incomprehensible
  • It’s boring

Getting Accepted
The manuscript needs to be accurate, concise, clear and objective. It is more likely to be accepted if:

  • It covers an important issue
  • It develops a framework
  • It leads to new questions
  • The methods are appropriate
  • The methods are rigorous and the data support the conclusions
  • Connections are made to prior work with accurate referencing
  • The article tells a good story and has a short snappy title

Finding Relevant Information
It is important to keep up to date, in particular with peer-reviewed information. A keyword alert can be set up in Scopus; the latest content includes articles in press. The content covers about 5,000 publishers, 22,000 journals, 80,000 conferences, and 400 book series. The results can be exported to Mendeley.

The use of this big data means that results can be analysed, allowing trends in the field to be identified. Metrics for individual articles can also be examined. Articles written by you can be imported into ORCID.

Getting Noticed – Promoting your Research
There are three main stages to promoting your research:

  1. Prepare
  2. Promote
  3. Monitor

Prepare
Being aware of how searches work is useful as it allows you to prepare your publications carefully. You need to be consistent with your name.

Keywords are a major part of SEO and should be included, where possible, in the title, highlights, image captions, and abstract. A strong keyword is descriptive but not too broad or technical.

ScienceDirect (Elsevier) offers AudioSlides – a five-minute snippet about the article which can be shared widely. Graphical Abstracts can also be used; these are shown in the table of contents and keyword searches (but not currently in the PDF).

Promote
Be prepared to network!

Elsevier has a free app called Poster in My Pocket which uses QR codes to allow Elsevier conference attendees to access a copy of posters they are interested in. Although this is an interesting development, it is limited to Elsevier conferences, most of which are science-focused.

Look for opportunities to use internal and external media to promote your article. Share links to your article – Elsevier provides a customised short link with free access. Link to this from the University website to boost SEO. Some publishers offer a number of free downloads (e.g. 25).

Create an online CV – make it clear and include links to your work (e.g. LinkedIn and Mendeley).

Use innovations in publishing, for example open data, computer code, interactive data visualisation, or multimedia presentations to make your article more interesting or interactive.

Be aware of new journal types. Micro-article journals (e.g. MethodsX, SoftwareX, Data in Brief) allow you to focus on a subset of a larger article, or an extension – this can be cited.

Monitor
Check that your links and details are up to date on a regular basis. Check your social media and email accounts for responses to your work.

Mar 10 2016

Beyond the Word Cloud

Possibly the most common entry-level visualisation in computational textual analysis is the word cloud. There are multiple online tools (the best known probably being Wordle) which allow you to create them in all sorts of styles. A word cloud can be very useful for identifying frequent terms at a glance, or for indicating possible themes.
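
The same kind of cloud can be drawn directly in R; a minimal sketch, assuming the tm and wordcloud packages and the chapter file used later in this post:

# Basic word cloud in R (a baseline sketch, not the tool used for the images below)
library(tm)
library(wordcloud)

# Build term frequencies for the opening chapters
txt <- readLines("Austen_1814_MP_3ch.txt")
corp <- Corpus(VectorSource(txt))
tdm <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                               stopwords = TRUE))
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Size is mapped to frequency; colour and position carry no information
set.seed(42)
wordcloud(names(freqs), freqs, max.words = 100, random.order = FALSE)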

Voyant Tools, created by Stéfan Sinclair and Geoffrey Rockwell, has its own word cloud tool, Cirrus, among a range of other analytical tools, and a stop list can be applied to it.

Austen’s Mansfield Park Chapters 1-3

This example shows the opening three chapters of Austen’s Mansfield Park with the English stop list applied. It is possible to identify some of the key characters and some possible themes. We could create a cloud for the opening three chapters of Edgeworth’s Patronage and visually compare the two,
Edgeworth’s Patronage Chapters 1-3

but what does this really reveal about the similarities and differences between the two texts?

While word clouds indicate the frequency of words through size, the colour and spatial position of the words are decorative rather than functional. Jacob Harris and Robert Hein have written critical pieces on the lure of the word cloud and the false sense of achievement this type of visualisation can inspire while failing to address the question ‘So what?’. However, the benefits, possible developments and applications of this type of visualisation have also started to be explored. Quim Castellà and Charles Sutton discuss the use of multiple word clouds in their article ‘Word Storms: Multiples of Word Clouds for Visual Comparison of Documents’, and identify three levels of analysis: “a quick impression of the topics in the corpus”, “compare and contrast documents”, and “an impression of a single document”. Glen Coppersmith and Erin Kelly, meanwhile, use “dynamic wordclouds” to investigate the contents of a corpus and “vennclouds” to compare the relationship between corpora in their paper ‘Dynamic Wordclouds and Vennclouds for Exploratory Data Analysis’.

As my research uses R programming, I decided to explore whether it is possible to create an improved word cloud using R. This is something that both Drew Conway and Rolf Fredheim have explored in their blogs, proposing possible solutions. I was particularly interested in Fredheim’s solution as it incorporated statistical significance; however, I was initially unable to get the code to work. I set myself two main tasks:

    1. To create a working algorithm
    2. To see whether it was possible to develop the algorithm further

word_difference_map recreates Fredheim’s code with a few changes and corrections. I have added comments to explain what I have changed and why, and also to help me understand what each piece of code is doing. I have included part of the code here, minus most of the comments; a full copy including comments can be found in my GitHub repo. The algorithm reads in two texts, processes them and applies a stop list, then calculates the word frequencies from the two text samples.

txt1 <- readLines('Austen_1814_MP_3ch.txt')
txt2 <- readLines('Edgeworth_1814_P_3ch.txt')
# Load tm package
require(tm)

# Convert for use in tm package - creates a Volatile Corpus 
txt1c <- Corpus(VectorSource(txt1))
txt2c <- Corpus(VectorSource(txt2))

# Remove punctuation, whitespace, change to lower case, stem, apply stoplist, 
# remove numbers. Named entities not removed
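# (The transformation calls themselves are not shown in the post; the helper
# below is a plausible reconstruction of the steps listed above using tm_map.
# Stemming requires the SnowballC package.)
clean_corpus <- function(x) {
        x <- tm_map(x, content_transformer(tolower))
        x <- tm_map(x, removePunctuation)
        x <- tm_map(x, removeNumbers)
        x <- tm_map(x, removeWords, stopwords("english"))
        x <- tm_map(x, stripWhitespace)
        tm_map(x, stemDocument)
}
txt1c <- clean_corpus(txt1c)
txt2c <- clean_corpus(txt2c)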

# Create Document Term Matrix
dtm1 <- DocumentTermMatrix(txt1c) 
dtm2 <- DocumentTermMatrix(txt2c)

# Convert DTM to data frame
dtm1 <- data.frame(as.matrix(dtm1))
dtm2 <- data.frame(as.matrix(dtm2))

# Made v1 and v2 the author names
Austen <- as.matrix(sort(sapply(dtm1, "sum"), decreasing = T)
                [1:length(dtm1)], colnames = count)
Edgeworth <- as.matrix(sort(sapply(dtm2, "sum"), decreasing = T)
                [1:length(dtm2)], colnames = count)

# Removing missing values
Austen <- Austen[complete.cases(Austen),] # - in original used v2 ?typo
Edgeworth <- Edgeworth[complete.cases(Edgeworth),]

words1 <- data.frame(Austen)
words2 <- data.frame(Edgeworth)

# Merge the two tables by row names
wordsCompare <- merge(words1, words2, by="row.names", all = T)
# Replace NA with 0
wordsCompare[is.na(wordsCompare)] <- 0

Going beyond the calculations needed to create a word cloud, the algorithm calculates the proportion, z score and difference, allowing these to be used in the creation of the finished visualisation.

wordsCompare$prop <- wordsCompare$Austen/sum(wordsCompare$Austen) 
wordsCompare$prop2 <- wordsCompare$Edgeworth/sum(wordsCompare$Edgeworth)

# Broke down the z score formula a little to understand how it worked
a <- wordsCompare$prop
b <- wordsCompare$prop2
c <- wordsCompare$Austen
d <- wordsCompare$Edgeworth
e <- sum(c)
f <- sum(d)

# z score formula - adds column for z scores
wordsCompare$z <- (a - b) / ((sqrt(((sum(c) * a) + (sum(d) * b)) / (sum(c) + 
                       sum(d)) * (1 - ((sum(c) * a) + (sum(d) * b)) / (sum(c) +
                        sum(d))))) * (sqrt((sum(c) + sum(d)) / (sum(c) *
                                        sum(d)))))
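
# (The $dif column used in the plots below - the percentage difference between
# the two samples - is not shown in the post; a plausible reconstruction:)
wordsCompare$dif <- 100 * (a - b) / (a + b)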

# Keep data of moderate significance - confidence level of 95%
wordsCompare <- subset(wordsCompare, abs(z) > 1.96)

# Order words according to significance 
wordsCompare <- wordsCompare[order(abs(wordsCompare$z), decreasing = T),]

# Plot the data points
require(ggplot2)
ggplot(wordsCompare, aes(z, dif)) + geom_point()

wordsCompare$z2 <- 1 # adds a column of 1s 
wordsCompare$z2[abs(wordsCompare$z) >= 1.96] <- 0
# Makes $z2 0 if $z is greater than or equal to 1.96 - all are so are replaced 
# with 0. Those less than 1.96 were already removed. Try 2.58.
wordsCompare$z2[abs(wordsCompare$z) >= 2.58] <- 0

# Fixed this bit - it should be $dif not $z 
wordsCompare <- wordsCompare[wordsCompare$dif >-99 & wordsCompare$dif <99,]

wordsCompare <- wordsCompare[order(abs(wordsCompare$prop2+wordsCompare$prop),
                                   decreasing = T),]

# Plot 
ggplot(head(wordsCompare, 50), aes(dif, log(abs(Austen + Edgeworth)),size = (Austen + Edgeworth),label=Row.names, colour = z2))+
        geom_text(fontface = 2, alpha = .8) +
        scale_size(range = c(3, 12)) +
        ylab("log number of mentions") +
        xlab("Percentage difference between samples \n <----------More in Edgeworth --------|--------More in Austen----------->") +
        geom_vline(xintercept=0, colour  = "red", linetype=2)+
        theme_bw() + theme(legend.position = "none") +
        ggtitle("Differences in Terms Used by \n Austen and Edgeworth")

Plot showing Z Score and Difference

Word Difference Map

The algorithm has two main outputs: a scatterplot which shows the relationship between z score and difference, and the word difference map itself.

Getting the algorithm to work took time and a fair bit of trial and error. However, several things occurred to me:

    1. Applying the stop list may remove potentially interesting differences
    2. Removing the less significant words may make the plot less cluttered, but also removes the similarities
    3. The scatterplot is used as a check, but there is an opportunity to make this a more useful visualisation
    4. At a glance, the significance of the colours is not clear to someone who has not seen the code

My updated version (word_diff_map2) keeps all the terms: I chose not to use a stop list, and I also chose not to stem the terms.

The main changes I made to the code were:

wordsCompare$z2 <- 0 # insignificant terms
wordsCompare$z2[abs(wordsCompare$z) >= 1.96] <- 1 # significant at 95% confidence 
wordsCompare$z2[abs(wordsCompare$z) >= 2.58] <- 2 # significant at 99% confidence
wordsCompare$z2[abs(wordsCompare$z) >= 3.08] <- 3 # significant at 99.79%

and

# Plot by z score and difference highlighting min and max outlying words
ggplot(wordsCompare, aes(z, dif, colour=z2, label=Row.names)) + geom_point()+
        scale_colour_gradientn(name="Z Score", labels=c("Not significant", "1.96", "2.58", "3.08"),colours=rainbow(4))+
        geom_text(aes(label=ifelse(wordsCompare$z == max(wordsCompare$z), as.character(Row.names),'')), hjust=0, vjust=0) +
        geom_text(aes(label=ifelse(wordsCompare$z == min(wordsCompare$z), as.character(Row.names),'')), hjust=0, vjust=0) +
        ylab("Difference")+
        xlab("Z Score")+
        theme(panel.background = element_rect(colour = "pink"))+
        ggtitle("Relationship between z score and percentage difference \n in Austen and Edgeworth")

Updated version of Scatterplot

The points on the scatterplot were highlighted according to their statistical significance, and labels were included to indicate the terms with the highest and lowest z scores. For consistency I used the same colours in the main word difference map, and added a key to both plots.

Updated version of Word Difference Map

An interesting point that the new version of the algorithm highlighted was the difference between the authors’ use of feminine and masculine terms in the opening three chapters. Edgeworth uses “his”, “him”, “father” and “himself” more frequently, whereas Austen uses “she”, “her”, “Mrs”, “lady” and “sister” more.

The next steps will be to try the code with a larger section of text and to see whether some of the areas explored in Coppersmith and Kelly can be incorporated.

Dec 18 2015

Outreach: Letters of 1916 – What lies beneath

An ongoing social media presence is an important part of many crowdsourced humanities projects. This can be used to promote the project, engage a wider range of contributors and provide a channel for collaboration between academics and other interested parties.

“It is possible to suggest that beyond being a tool for writing and communicating, microblogging platforms may serve as foundations for building or enhancing a community of practice.” (Ross 35)

Planning and Hosting a Twitter Chat

Leading on from my previous blog post which explored how the Letters of 1916 project uses Twitter, I planned and hosted a Twitter chat focusing on the challenges faced by teachers using digital resources in the classroom. I chose this particular topic as the letters are a fantastic resource and, as we saw from the enthusiasm surrounding the 1916 in Transition project, teachers are keen to use them once they know what is available and how to access them.

The purpose of the chat was to:

  • promote Letters 1916 to a wide range of teachers
  • present a range of examples from the project
  • suggest ways the letters could be used in lessons
  • address any concerns that teachers might have in using digital resources in the classroom

    The topic of the chat was launched via a blog post on the Letters of 1916 site. As part of the preparation for the Twitter chat I identified groups, for example teachers and archivists, who may be interested in the use of digital resources in the classroom, as well as some of the relevant hashtags (e.g. #edchatie, #ukedchat). These were used in promotional tweets in the week leading up to the date of the chat (Wednesday 25th November). Some of the tweets were sent live; others were scheduled in advance (using Tweetdeck) to promote the chat to as broad a range of people as possible (see Storify ‘Initial Announcements’ and ‘Advertising the chat’).

    In addition to more general advertising, I also decided to target specific English and History teachers who use Twitter. This had a mixed reception: some, but not all, of the targeted teachers responded to the invitation – one of them had, coincidentally, been using family letters from World War 1 in lessons.

    On the day of the chat a series of tweets were sent out reminding Twitter users of the time and focus of the chat; these became more frequent as the 7pm start time drew closer (see Storify ‘Final Advertising’).

    The final part of the preparation was to draft and schedule a number of tweets; some of these were saved to a Google Document – accessible by the Letters of 1916 team – while others were used to promote some of the topics the letters covered. The latter were scheduled at 15-minute intervals from 7:05 and also acted as markers for the analysis carried out after the chat. The chat itself was busy, although the number of contributors was relatively low (see Storify ‘The Chat: 7pm – 8pm’).

    Analysing the Twitter Chat

    Although the number of actual contributors seemed disappointing – something one of the contributors themselves remarked on – this is not entirely unexpected. In his 2006 article, Nielsen states that:

    User participation often more or less follows a 90-9-1 rule:

  • 90% of users are lurkers
  • 9% of users contribute from time to time
  • 1% of users participate a lot and account for most contributions.
    (Nielsen)
    He goes on to say that:

    The first step to dealing with participation inequality is to recognise that it will always be with us. It’s existed in every online community and multi-user service that has ever been studied. (Nielsen)

    Looking at the number of tweets and types of interaction is relatively straightforward, and proved to be very interesting. I downloaded the Twitter Analytics for my tweets (Twitter only allow you to access your own tweets for free) during the chat; this data can be saved as a .csv file. The analysis below was carried out using R.

    This first section of code reads in the .csv file and creates a data frame containing just the data for the Twitter Chat.

    # Upload the saved Twitter data .csv file into R
    twit <- read.csv("twitter_26.csv")
    # Use head() to view the top few lines and then select the relevant lines
    head(twit)
    twt <- twit[,1:22]
    # Use the glob2rx() function to create a regular expression selecting only the relevant dates
    grx <-glob2rx("2015-11-25*")
    # Use with(), grepl() and the regex to select the tweets from the correct date
    x <- with(twt, twt[grepl(grx, time), ])
    # Reverse the order of the data so it runs from earliest to latest then select the desired time range
    x2 <- x[68:1,]
    y <- which(x2$time == "2015-11-25 19:00 +0000")
    y2 <- which(x2$time == "2015-11-25 20:00 +0000") 
    chat <- x2[y:y2,]
    # Create a reduced data frame of the core numerical data
    chatT <- data.frame(tweet.No = factor(1:47, levels = 1:47),
                       imp = chat$impressions, eng = chat$engagements, 
                       rt = chat$retweets, like = chat$likes, rep = chat$replies, 
                       ht = chat$hashtag.clicks, email = chat$email.tweet, 
                       mv = chat$media.views, me = chat$media.engagements)
    

    The reduced data frame covers the 47 tweets I sent during the course of the Twitter chat. The first thing I wanted to find out was how many people interacted with the tweets. Twitter Analytics calls this ‘Engagements’ and defines it as “Total number of times a user has interacted with a Tweet. This includes all clicks anywhere on the Tweet (including hashtags, links, avatar, username and Tweet expansion), retweets, replies, follows and likes”.

    # Load the ggplot2 graphics package
    library(ggplot2)
    # Create a graph showing Engagement by Tweet
    ggplot(data=chatT, aes(x=tweet.No, y=eng, fill=tweet.No)) +
            geom_bar(colour="black", stat="identity") +
            guides(fill=FALSE) + ggtitle("Engagement by Tweet")
    
    

    Engagement

    The graph shows the engagements for each tweet, and we can see that tweets 1 (the welcome tweet), 15 and 26 (both tweets including images which show some of the topics the letters cover) gained the most engagements. The engagements can be subdivided into visible engagement with the chat, through replies and retweets (a total of 54 responses), and activity which was hidden (129 clicks, likes etc.).
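
    This split can be recovered directly from the chatT data frame; a minimal sketch, assuming replies and retweets count as visible responses and the remaining engagements as hidden activity:

    # Visible responses (replies + retweets) versus hidden activity
    active <- sum(chatT$rep) + sum(chatT$rt)
    hidden <- sum(chatT$eng) - active
    c(active = active, hidden = hidden)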

    The next graph shows the number of ‘Impressions’, which Twitter defines as “Number of times users saw the Tweet on Twitter”.

    # Create a graph showing Impressions by Tweet
    ggplot(data=chatT, aes(x=tweet.No, y=imp, fill=tweet.No)) +
            geom_bar(colour="black", stat="identity") +
            guides(fill=FALSE) + ggtitle("Impressions by Tweet") 
    
    

    Impressions

    This graph is particularly interesting as it highlights that the number of ‘lurkers’ who saw the tweets is far higher than the number who actively engaged with them. This reinforces Nielsen’s notion of ‘participation inequality’.

    Exploring the peaks on the two graphs can highlight a number of areas which could help improve interaction in a Twitter chat. A number of the peaks on both graphs correspond to tweets containing images, which suggests that images are a key factor in gaining an audience’s attention and encouraging them to comment. Other peaks may indicate the interests of the users, for example issues of technology and bandwidth, an unusual or cryptic letter extract (tweet 27: “Postcard in Irish translated “I am here again and missing Dublin. The angel didn’t meet me. He had probably left early. 1/2 #AskLetters1916”), and references to specific uses of letters in teaching (tweet 46: “English teachers could Letters1916 be used to demonstrate language register – personal, business etc? or change over time? #AskLetters1916”).
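
    Rather than reading the peaks off the charts, the top tweets can also be pulled straight from the chatT data frame; a minimal sketch:

    # Tweets with the highest engagements and impressions
    head(chatT[order(chatT$eng, decreasing = TRUE), c("tweet.No", "eng", "imp")], 5)
    head(chatT[order(chatT$imp, decreasing = TRUE), c("tweet.No", "eng", "imp")], 5)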

    References:

    “Analytics.” Twitter. Twitter, n.d.

    Nielsen, Jakob. “Participation inequality: The 90-9-1 rule for social features.” Nielsen Norman Group. 9 Oct. 2006. https://www.nngroup.com/articles/participation-inequality/. 18 Dec. 2015.

    Ross, Claire. “Social Media for Digital Humanities and Community Engagement.” Digital humanities in practice. Ed. Claire Warwick, Melissa Terras, and Julianne Nyhan. London: Facet Publishing, 2012. 23–45.

    Oct 26 2015

    Letters of 1916 – The Role of Twitter

    The Letters of 1916 project is the first public humanities project in Ireland. Its purpose is to provide a snapshot of ‘a year in a life’ in Ireland between 1st November 1915 and 31st October 1916. This is a period which covers the 1916 Rising as well as several major events in World War I (including the Battle of the Somme). However, although these major events occur during the period, the project aims to not only focus on letters which shed light onto these events, but also those which provide an insight into the everyday lives of people living in Ireland at this time.

    Crowdsourcing

    For an outline of the origins and initial definition of crowdsourcing see my earlier post here.

    Crowdsourcing is a growing part of cultural heritage and humanities projects. Ridge’s definition of crowdsourcing as “an emerging form of engagement with cultural heritage that contributes towards a shared, significant goal or research area by asking the public to undertake tasks that cannot be done automatically, in an environment where the tasks, goals (or both) provide inherent rewards for participation” is relevant for the Letters of 1916. The Letters project relies upon the public for two core aspects – the submission and upload of relevant letters, and the transcription of letters; without the crowd, these tasks would be too time-consuming for a small group of researchers to complete. The input of volunteers means that a much broader range of letters has been sourced and transcribed, allowing the project to move towards the creation of a digital scholarly edition of letters available to the public and researchers alike.

    Theimer refers to this type of project as an example of Archive 2.0.

    “Today, archivists see their primary role as facilitating rather than controlling access. Using social media tools, archivists even invite user contributions and participation in describing, commenting, and re-using collections, creating so-called collaborative archives” (Theimer)

    Social Media

    While many websites have built-in forums to encourage collaboration, these are ultimately closed groups, likely to be accessed only by those who already use the site. This can be very helpful for on-task, on-site collaboration, or for sites like Zooniverse which already have considerable footfall, but a smaller public humanities project must go further. The need for collaboration across a wide group of volunteers, as well as the need to publicise projects, has led to the adoption of social media tools, for example Twitter and Facebook.

    “Social participatory media, such as social networks, blogs and podcasts, are increasingly attracting the attention of academic researchers and educational institutions. Because of the ease of use, social media offers the opportunity for powerful information sharing, collaboration, participation and community engagement” (Ross). This is not just a matter of sending out a few tweets; a well planned and organised social media strategy is needed.

    @Letters1916 – The Project’s Use of Twitter

    The Letters project began tweeting from its @Letters1916 account just prior to the official launch on 27th September 2013. From a zero base, the account now (as of today, 26th October 2015) has 3,594 followers and has sent an impressive 5,977 tweets. Over the past two years it is possible to see the continued attraction of followers (which has remained fairly steady at around 1,700 per year) and the increase in the number of tweets (from 2,013 per year to 3,860 per year, as seen in the chart below). Over the past year the account has tweeted an impressive 74 times a week, or 322 times a month.

    #AskLetters1916 – Direct interaction

    Dunn and Hedges, in their 2012 ‘Crowd-Sourcing Scoping Study’, emphasise the importance and benefits of engaging the crowd. From relatively early in the project, a regular question and answer session was organised via Twitter using the hashtag #AskLetters1916. The purpose of this type of interaction is to engage not only existing collaborators, but also those who have an interest in the subject. The additional implementation of a range of widely used hashtags, for example #EasterRising, #Ireland, #Irishhistory and #edchatie, helps target Twitter users with similar interests. The sessions are publicised in advance and hosted by a member of the Letters team whose role it is to encourage discussion.

    Initially the #AskLetters1916 session was held weekly as an open forum; however, this proved too vague to attract a significant number of interested followers. As a result the decision was made to make this a monthly session with a specific focus (for example ‘Women in 1916’ or ‘Digital Resources in the Classroom’) – there have now been 14 targeted sessions, which have been much more successful. In addition to the live discussions, the tweets are also curated and made available on Storify, allowing those who were unable to take part to access and read them at a later date. The use of synchronous and asynchronous methods of communication enables the participation of volunteers from multiple locations and time zones.

    Special Events

    There have been several special events related to the Letters project which have been used as opportunities to bring the project to an even wider audience. Updates regarding the project, including social media figures are provided annually on the anniversary of the initial launch of the project (see infographics for 2014 and 2015).

    There have been two annual collaborations with Irish secondary teachers, leading to the creation of several lesson plans which use materials from the Letters project. In August 2015, over 20 lesson plans were created during a three-day workshop run alongside the Irish Military Archives and the Bureau of Military History. The workshop was allocated the hashtag #teach1916 and participants were encouraged to use the tag in any tweets they sent – a very successful strategy, as the tag trended in Ireland on 5th August. The lesson plans from the 2015 workshop can be found here.

    Day to Day Tweets

    While using the Twitter account to publicise special events or host monthly chats has provided some excellent interaction with a range of collaborators, one of the key elements of running a social media campaign is the necessity of regular tweets. The @Letters1916 account has two main types of regular tweet (outside retweeting related issues): requests for volunteers to contribute letters, and a regular #Onthisday letter enabling Twitter users to see an individual letter from the project. This regular interaction with the public “entails a greater level of effort, time and intellectual input from an individual than just socially engaging” (Holley).

    The images below show the activity of Letters 1916, #Letters1916 and #teach1916 for the past week, and the top themes in the tweets over the same period. The analysis comes from the free search in a tool called Talkwalker. This demonstrates the variety and dedication needed to maintain and engage the crowd via social media.

    Bibliography

    Dunn, Stuart, and Mark Hedges. “Crowd-Sourcing Scoping Study: Engaging the Crowd with Humanities Research.” Centre for e-Research, King’s College London. http://crowds.cerch.kcl.ac.uk/wp-uploads/2012/12/Crowdsourcingconnected-communities.pdf (2012).

    Holley, Rose. “Crowdsourcing: How and Why Should Libraries Do It?” D-Lib Magazine 16.3/4 (2010).

    Ridge, Mia. “Frequently Asked Questions about Crowdsourcing in Cultural Heritage.” Open Objects. 3 June 2012. Web. 27 Sept. 2015.

    Ross, Claire. “Social Media for Digital Humanities and Community Engagement.” Digital Humanities in Practice. Ed. Claire Warwick, Melissa Terras, and Julianne Nyhan. London: Facet Publishing, 2012. 23–46.

    Theimer, Kate. “What is the Meaning of Archives 2.0?.” The American Archivist 74.1 (2011): 58-68.

    Jul 10 2015

    Marine Lives – R and The Silver Ships – Frequencies

    In this second blog post on the Marine Lives Three Silver Ships project (see the first post here), I look at how to identify the folio pages in the HCA 13/70 Depositions which mention the three ships (Salvador, Sampson and Saint George).

    Processing and Calculating Raw Frequencies

    Using the .txt file downloaded in the last post, I started by processing the text. To write the code used in this section, I adapted code from Matthew Jockers’ (2014) book Text Analysis with R for Students of Literature. First, using the phrase “This page is for the annotation of HCA 13/70 f.x” as a marker, I broke the text into its individual pages, named them and calculated the raw frequencies of the words.

    text.v <- scan("HCA_13.70.txt", what = "character", sep = "\n")
    
    start.v <- which(text.v == "This page is for the annotation of HCA 13/70 f.1r.")
    end.v <- length(text.v)
    HCA_13.70.lines.v <- text.v[start.v : end.v]
    
    # Use grep to break the text into folios
    folio.positions.v <- grep("^This page is for the annotation of \\w", 
                              HCA_13.70.lines.v)
    
    # Add an additional line to text as an end marker 
    HCA_13.70.lines.v <- c(HCA_13.70.lines.v, "END")
    last.position.v <- length(HCA_13.70.lines.v)
    folio.positions.v <-c(folio.positions.v, last.position.v)
    
    # Extract the text on each page and calculate the frequency count of each
    # word type
    
    # Create folio page names
    f.v <- rep(seq(1:501), each=2)
    p.v <- c("r", "v")
    page.v <- rep(p.v, times=501)
    
    folName.l <- vector(mode = "list", length = 1002)
    for (i in 1:1002) {
            folName.l[[i]] <- c("HCA_13/70_f.", f.v[i], page.v[i])
            folName.l[[i]] <- paste(folName.l[[i]], collapse = "")
    }
    
    
    # Create empty list containers
    folio.raws.l <- list()
    folio.freqs.l <- list ()
    
    for (i in 1:length(folio.positions.v)) {
            if (i != length(folio.positions.v)) {
                    folio.title <- folName.l[[i]]
                    #folio.title <- HCA_13.70.lines.v[folio.positions.v[i]]
                    start <- folio.positions.v[i]+1
                    end <- folio.positions.v[i+1]-1
                    folio.lines.v <- HCA_13.70.lines.v[start:end]
                    folio.words.v <- tolower(paste(folio.lines.v, collapse = " "))
                    folio.words.l <- strsplit(folio.words.v, "\\W")
                    folio.word.v <- unlist(folio.words.l)
                    folio.word.v <- folio.word.v[which(folio.word.v != "")]
                    folio.freqs.t <- table(folio.word.v)
                    folio.raws.l[[folio.title]] <- folio.freqs.t
            }
    }
    
    

    Plotting Frequencies

    As a test, I looked at the frequencies of two words I knew were in the corpus, ‘goods’ and ‘captaine’, and plotted them on a bar chart.

    # testing with 'goods' and 'captaine' 
    goods.l <- lapply(folio.raws.l, '[', 'goods')
    goods.m <-do.call(rbind, goods.l)
    
    captaine.l <- lapply(folio.raws.l, '[', 'captaine')
    captaine.m <- do.call(rbind, captaine.l)
    
    goods.v <- goods.m[,1]
    captaine.v <- captaine.m[,1]
    
    goods.captaine.m <- cbind(goods.v, captaine.v)
    dim(goods.captaine.m)
    
    colnames(goods.captaine.m) <- c("goods", "captaine")
    
    barplot(goods.captaine.m, beside=T, col="grey")
    
    Raw Frequencies of ‘goods’ and ‘captaine’

    The next step was to search for the three ships – I chose to search only for one spelling variation – ‘salvador’, ‘sampson’ and ‘george’. 

    # Ship names (n.b. all instances of "george" may not be references to the St George)
    
    salvador.l <- lapply(folio.raws.l, '[', 'salvador')
    salvador.m <-do.call(rbind, salvador.l)
    
    sampson.l <- lapply(folio.raws.l, '[', 'sampson')
    sampson.m <- do.call(rbind, sampson.l)
    
    george.l <- lapply(folio.raws.l, '[', 'george')
    george.m <- do.call(rbind, george.l)
    
    salvador.v <- salvador.m[,1]
    sampson.v <- sampson.m[,1]
    george.v <- george.m[,1]
    
    # This creates a matrix of the mentions of ship names and the page
    ships.m <- cbind(salvador.v, sampson.v, george.v)
    dim(ships.m)
    
    # A plot of the references to the ships by folio
    colnames(ships.m) <- c("salvador", "sampson", "george")
    
    barplot(ships.m, beside=T, col = "grey")
    
    

    This produced a matrix of the number of mentions of each ship per folio, and a bar chart which indicates that the references to the ships are grouped in a few of the depositions. Obviously, the search results for ‘george’ are skewed by deponents with the name ‘George’, which can be seen from the increased frequency.

    Raw Frequencies of ‘salvador’, ‘sampson’ and ‘george’

    Identifying Key Folios

    Having identified the raw frequencies, I converted the matrix to a data frame, changed the NA values to 0 and then subsetted the results to show only the page names which have one or more mentions of one of the ships. Finally, I subsetted this data frame to highlight the pages which have the highest number of mentions (>= 10).

    # Converting matrix to dataframe and replacing NA with 0
    ships.df <- as.data.frame(ships.m)
    
    ships.df[is.na(ships.df)] <- 0
    
    # Calculate the mentions per page and subset for those above 0
    ships.df$mentions <- rowSums(ships.df)
    
    # All three ships
    ships.mention.df <- subset(ships.df, mentions > 0)
    
    ships.high.mentions.df <- subset(ships.df, mentions >= 10)
    

    This data frame indicates that 10 of the pages have a high number of references to one or more of the ships’ names:

    High mentions of one or more of the ship names by folio

    In my next post, I will look at the use of KWIC (Key Word in Context) to search for references to the ships.

    Jul 09 2015

    Marine Lives – R and The Silver Ships – Extracting Data

    Within the larger Marine Lives project there are a number of smaller sub-projects which focus on specific areas of interest. One of these smaller projects is the Three Silver Ships project:

    “Three large ships (The Salvador, the Sampson and the Saint George) of supposedly Lubeck and Hamburg build and ownership were captured by the English in 1652 with highly valuable cargos of bullion. The ships were on their way from Cadiz with bullion from the Spanish West Indies going northwards. It was disputed in court as to whether the ships were bound legally for the Spanish Netherlands, or illegally for Amsterdam.” Marine Lives

    The purpose of the project is to identify relevant references to cases involving the Three Silver Ships in the various depositions and papers and consolidate this in a wiki. At the moment, this is being done manually and, as the depositions have multiple, two-sided pages, this means a large number of pages to search.

    As my PhD is using R to analyse novels, I thought it would be interesting to see whether I could apply my rookie programming skills to the problem. As a caveat, I have only been using R for about a year and am still very much a beginner, so there may well be more straightforward and elegant ways of working – please feel free to comment as any advice will be gratefully received!

    The first challenge was to extract the text, the bulk of which are transcriptions, from the individual wiki pages. This had several stages:

    Creating a list of URLs

    As each page is made up of a collection number (in this case HCA 13/70), a folio number (e.g. f.1), and whether the page is recto or verso (r or v), I started by creating a number sequence for the folios and a second sequence for the page. These were combined in a list with the main part of the URL and then collapsed to make a working URL.

    # Create a sequence for the folio numbers
    f.v <- rep(seq(1:501), each=2)
    
    # Create a sequence for the pages recto/verso
    p.v <- c("r", "v")
    page.v <- rep(p.v, times=501)
    
    # Create a list of URLs
    folio.l <- vector(mode = "list", length = 1002)
    for (i in 1:1002) {
           folio.l[[i]] <- c("http://www.marinelives.org/wiki/HCA_13/70_f.",f.v[i],
                             page.v[i] , "_Annotate")
           folio.l[[i]] <- paste(folio.l[[i]], collapse = "")
    }
    

     

    Extracting the text from the wiki pages

    To extract the text I used the package boilerpipeR (Mario Annau, 2015). I used the DefaultExtractor function as I found that the LargestContentExtractor excluded some parts of the transcriptions.

    # Extract the text from the wiki pages and save as .txt file
    library(boilerpipeR)
    library(RCurl)
    
    # This function extracts the text from a wiki page
    textExtract <- function(number) {
            url <- folio.l[[number]]
            content <- getURL(url)
    
            extract <- DefaultExtractor(content) 
            text.v <- extract
            return(text.v)
    }
    
    # Create vector to hold extracted information and fill using loop
    # This is a large file (approx 5 Mb) so will take some time to run
    x.v <- vector(length = 1002)  
    for (i in 1:1002) {
            x.v[i] <- textExtract(i)
    }
    
    # Put resulting text into a named file
    HCA_13.70.v <- x.v
    

    Saving the file as a .txt file

    The final step was to save the resulting file as a .txt file, allowing me to access it offline and keeping a copy available for use without having to go through the download process again.

    # Check that working directory has been set then save as .txt file
    
    write(HCA_13.70.v, file = "HCA_13.70.txt")
    
    

    In my next post I will discuss preparing the text for analysis.

    Jul 08 2015

    Marine Lives – Summer Training Programme

    Marine Lives is a digital project to transcribe Admiralty court records dating from the 1650s. The project site is a wiki which enables transcribers to work collaboratively. The British Library has invited the project to be part of the UK Web Archive. The Marine Lives Wiki can be found here.

    The Summer Training programme is a 10 week programme which brings together volunteers from around the world, both experts and amateurs, who work with a coordinator to learn how to transcribe and work on the site.

    Our first week introduced us to the site and taught us how to create pages, upload images, create links and produce special formats (eg bold and italic) within the wiki.

    The documents being transcribed are divided into 3 main sections: Act Books, Depositions, and Personal Answers, as well as several smaller categories. There are also a series of glossaries which cover different specialist areas, e.g. commodities, legal and marine.

    The markup used for editing the wiki is similar to the markup I have used with other computer programs:

    Bold: 3 ‘ either side of the word

    Italic: 2 ‘ either side of the word

    Underline: <u> text </u>

    Headers: = header size 1 =

    == header size 2 ==

    === header size 3 ===

    ==== header size 4 ====

    To include an image: [[File: filename|sizepx|thumb|position|display text]]

    Each page for transcription is made up of a high quality image and a transcription. The name of each page has several key elements: the parent volume (HCA 13/63), the folio number (f.455), whether it is the right-hand (recto) or left-hand (verso) page, and ‘Annotate’.

    Key metadata is included when the page is created or edited, for example whether an image has been uploaded, whether it has been transcribed, by whom and the date.

    My First Transcription

    Having learnt the basics of wiki editing and created our biographies for the site, we now move on to transcription. My first page for transcription is HCA 13/124.

    This is a page from HCA Personal Answers and is made up of responses from “Christ: Collman” and “William Hargrave” from 1651.

    One of the key challenges with any transcription of handwritten material is deciphering the handwriting itself; it is not unlike translating from a foreign language, in that you use the context of what you have already transcribed to make sense of the words which are unclear. In addition, as these are Admiralty legal documents, there is a quantity of jargon and stock phrases from the legal and maritime worlds which makes the process tricky for the beginner.

    Apr 02 2015

    Sentiment Analysis – Further Down the ‘R’abbit Hole

    “Curiouser and curiouser!” cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good English). – Lewis Carroll, Alice’s Adventures in Wonderland

    It seems rather strange to think that, just under eight months ago, I had not written any computer code (I’m not including little bits of BASIC from the ’80s), and yet lines of code or the blinking cursor of Terminal no longer instil a sense of rising panic. Although programming has a very steep learning curve, it is relatively easy to gain a basic understanding, and, with this, the confidence to experiment.

    R has rapidly become my favourite programming language, so I was interested to follow a link from Scott Weingart’s blog post ‘Not Enough Perspectives Pt. 1’ to Matthew Jockers’ new R package ‘Syuzhet’. As this is an area I hope to research as part of my PhD I decided to give it a try, using the ‘Introduction to the Syuzhet Package‘ (Jockers, 2015) as a guide. I used a short text from Jane Austen’s juvenilia – ‘LETTER the FOURTH From a YOUNG LADY rather impertinent to her friend’. I removed the speech marks from the text as this causes problems with the code.

    The code:

    
    # Experiment based on http://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html
    # 'Introduction to the Syuzhet Package' Jockers 20-2-2015
    
    # Having installed the syuzhet package from CRAN, access it using library:
    library(syuzhet)
    
    # Input a text, for longer texts use get_text_as_string()
    # This text is from Austen's Juvenilia, from Project Gutenberg
    example_text <- "We dined yesterday with Mr Evelyn where we were introduced to a very
    agreable looking Girl his Cousin...
    #[I haven't included the whole text I used - it can be viewed here  #http://www.gutenberg.org/files/1212/1212-h/1212-h.htm#link2H_4_0036]
    This was an answer I did not expect - I was quite silenced, and never felt so awkward in my Life - -."
    
    # Use get_sentences to create a character vector of sentences
    s_v <- get_sentences(example_text)
    
    # Check that all is well!
    class(s_v)
    str(s_v)
    head(s_v)
    
    # Use get_sentiment to assess the sentiment of each sentence. This function
    # takes the character vector and one of four possible extraction methods
    sentiment_vector <- get_sentiment(s_v, method = "bing")
    sentiment_vector
    
    # The different methods give slightly different results - same text different method
    afinn_s_v <- get_sentiment(s_v, method = "afinn")
    afinn_s_v
    
    # An estimate of the "overall emotional valence" of the passage or text
    sum(sentiment_vector)
    
    # To calculate "the central tendency, the mean emotional valence"
    mean(sentiment_vector)
    
    # A summary of the emotions in the text
    summary(sentiment_vector)
    
    # To visualise this using a line plot
    plot(sentiment_vector, type = "l", main = "Plot Trajectory 'LETTER the FOURTH From a YOUNG LADY'", xlab = "Narrative Time", ylab = "Emotional Valence") 
    abline(h = 0, col = "red")
    
    # To extract the sentence with the most negative emotional valence
    negative <- s_v[which.min(sentiment_vector)]
    negative
    
    # and to extract the most positive sentence
    positive <- s_v[which.max(sentiment_vector)]
    positive
    
    # Use get_nrc_sentiment to categorize each sentence by eight emotions and two
    # sentiments and returns a data frame
    nrc_data <- get_nrc_sentiment(s_v)
    
    # To subset the 'sad' sentences
    sad_items <- which(nrc_data$sadness > 0)
    s_v[sad_items]
    
    # To view the emotions as a barplot
    barplot(sort(colSums(prop.table(nrc_data[, 1:8]))), horiz = T, cex.names = 0.7,
    las = 1, main = "Emotions in 'Letter the Fourth'", xlab = "Percentage",
    col = 1:8)
    
    

    The Results:

    The sentiment vector – this assigns a value to each sentence in the text.

    [1] 0 2 1 -1 -1 0 0 0 0 0 0 2 0 0 -1 0 1 -2 0 0 1 1 4 0 0 -2 -1
    [28] -3 0 -2 1 0 0 2 1 -2 1 -1 0 -1 0 -1 0 -1

    Visualising the sentiment vector as a line graph shows the fluctuations within the text:

    Plot trajectory of ‘LETTER the FOURTH From a YOUNG LADY’

    Visualising the emotions within the text:

    Emotions in ‘Letter the Fourth’

     

    My Thoughts

    I have only started to explore this package and have applied it to a very short passage (44 sentences); while this shows what Syuzhet can do in a general way, it does not demonstrate its full capabilities. In addition, as I haven’t fully read up on the package and the thinking behind it, my analysis may well be plagued with errors.

    However, these are my thoughts so far. Running a brief trial using three of the available methods highlights some of the difficulties of sentiment analysis. While all three identified the same sentence as the most ‘negative’:

    [1] “I dare say not Ma’am, and have no doubt but that any\nsufferings you may have experienced could arise only from the cruelties\nof Relations or the Errors of Freinds.”

    each of the methods identified a different sentence as being the most ‘positive’:

    bing – [1] “Perfect Felicity is not the property of Mortals, and no one has a right\nto expect uninterrupted Happiness.”

    afinn – [1] “I was extremely pleased with her\nappearance, for added to the charms of an engaging face, her manner and\nvoice had something peculiarly interesting in them.”

    nrc – [1] “I recovered myself however in a few moments and\nlooking at her with all the affection I could, My dear Miss Grenville\nsaid I, you appear extremely young – and may probably stand in need of\nsome one’s advice whose regard for you, joined to superior Age, perhaps\nsuperior Judgement might authorise her to give it.”
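
    For reference, a comparison along these lines can be sketched with get_sentiment and which.max (a minimal sketch, not the exact code I ran):

    # Compare the most 'positive' sentence returned by each scoring method
    methods <- c("bing", "afinn", "nrc")
    sapply(methods, function(m) s_v[which.max(get_sentiment(s_v, method = m))])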

    This is something Jockers discusses further in his blog post ‘My Sentiments (Exactly?)‘, highlighting that sentiment analysis is difficult for humans, as well as machines:

    This human coding business is nuanced. Some sentences are tricky. But it’s not the sarcasm or the irony or the metaphor that is tricky. The really hard sentences are the ones that are equal parts positive and negative sentiment. (Matthew Jockers)

    However, he also points out:

    One thing I learned was that tricky sentences, such as the one above, are usually surrounded by other sentences that are less tricky. (Matthew Jockers)

    It seems that a combination of close and ‘distant’ reading, what Mueller calls ‘scaled reading’, is likely to be of most use if analysis at the sentence level is desired. Having only a relatively recent and limited experience of programming in R, I have found using the Syuzhet package very straightforward and am looking forward to using it again very soon.

     

    UPDATE: 3rd April 2015

    There is a great deal of academic discussion surrounding the methods discussed here. As I read further I will add another post exploring the core points and including a reading list.
