Jul 10 2015

Marine Lives – R and The Silver Ships – Frequencies

In this second blog post on the Marine Lives Three Silver Ships project (see the first post here), I look at how to identify the folio pages in the HCA 13/70 Depositions that mention the three ships (Salvador, Sampson and Saint George).

Processing and Calculating Raw Frequencies

Using the .txt file downloaded in the last post, I started by processing the text. To write the code used in this section, I adapted code from Matthew Jockers’ (2014) book Text Analysis with R for Students of Literature. First, using the phrase “This page is for the annotation of HCA 13/70 f.x” as a marker, I broke the text into its individual pages, named them and calculated the raw frequencies of the words.

text.v <- scan("HCA_13.70.txt", what = "character", sep = "\n")

start.v <- which(text.v == "This page is for the annotation of HCA 13/70 f.1r.")
end.v <- length(text.v)
HCA_13.70.lines.v <- text.v[start.v : end.v]

# Use grep to break the text into folios
folio.positions.v <- grep("^This page is for the annotation of \\w", 
                          HCA_13.70.lines.v)

# Add an additional line to text as an end marker 
HCA_13.70.lines.v <- c(HCA_13.70.lines.v, "END")
last.position.v <- length(HCA_13.70.lines.v)
folio.positions.v <-c(folio.positions.v, last.position.v)

# Extract the text on each page and calculate the frequency count of each
# word type

# Create folio page names
f.v <- rep(seq(1:501), each=2)
p.v <- c("r", "v")
page.v <- rep(p.v, times=501)

# Pre-allocate a list with one slot per folio page
folName.l <- vector("list", 1002)
for (i in 1:1002) {
        folName.l[[i]] <- c("HCA_13/70_f.", f.v[i], page.v[i])
        folName.l[[i]] <- paste(folName.l[[i]], collapse = "")
}

# Create empty list containers
folio.raws.l <- list()
folio.freqs.l <- list ()

for (i in 1:length(folio.positions.v)) {
        if (i != length(folio.positions.v)) {
                folio.title <- folName.l[[i]]
                #folio.title <- HCA_13.70.lines.v[folio.positions.v[i]]
                start <- folio.positions.v[i]+1
                end <- folio.positions.v[i+1]-1
                folio.lines.v <- HCA_13.70.lines.v[start:end]
                folio.words.v <- tolower(paste(folio.lines.v, collapse = " "))
                folio.words.l <- strsplit(folio.words.v, "\\W")
                folio.word.v <- unlist(folio.words.l)
                folio.word.v <- folio.word.v[which(folio.word.v != "")]
                folio.freqs.t <- table(folio.word.v)
                folio.raws.l[[folio.title]] <- folio.freqs.t
        }
}
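
The page-splitting step can be tried out on a few made-up lines before running it on the full transcription. The sketch below is only an illustration: the toy sentences and the names toy.lines.v, bounds.v and toy.raws.l are my own assumptions and are not part of the project code.

```r
# Toy stand-in for the transcription; the real text comes from HCA_13.70.txt
toy.lines.v <- c(
  "This page is for the annotation of HCA 13/70 f.1r.",
  "The Salvador carried goods and silver.",
  "This page is for the annotation of HCA 13/70 f.1v.",
  "The captaine of the Sampson sold the goods."
)

# Marker lines open each page; an extra "END" line closes the last page
toy.lines.v <- c(toy.lines.v, "END")
marker.v <- grep("^This page is for the annotation of \\w", toy.lines.v)
bounds.v <- c(marker.v, length(toy.lines.v))

# Collect one table of raw word counts per page
toy.raws.l <- list()
for (i in seq_len(length(bounds.v) - 1)) {
  page.lines.v <- toy.lines.v[(bounds.v[i] + 1):(bounds.v[i + 1] - 1)]
  words.v <- unlist(strsplit(tolower(paste(page.lines.v, collapse = " ")), "\\W"))
  words.v <- words.v[words.v != ""]
  toy.raws.l[[paste0("page_", i)]] <- table(words.v)
}

# Raw count of 'goods' on the first toy page
toy.raws.l[["page_1"]][["goods"]]
```

Each list element is a table of word types for one page, which is the same shape as the entries in folio.raws.l.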

Plotting Frequencies

As a test, I looked at the frequencies of two words I knew were in the corpus, ‘goods’ and ‘captaine’, and plotted them on a bar chart.

# testing with 'goods' and 'captaine' 
goods.l <- lapply(folio.raws.l, '[', 'goods')
goods.m <-do.call(rbind, goods.l)

captaine.l <- lapply(folio.raws.l, '[', 'captaine')
captaine.m <- do.call(rbind, captaine.l)

goods.v <- goods.m[,1]
captaine.v <- captaine.m[,1]

goods.captaine.m <- cbind(goods.v, captaine.v)

colnames(goods.captaine.m) <- c("goods", "captaine")

barplot(goods.captaine.m, beside=T, col="grey")

Raw Frequencies of ‘goods’ and ‘captaine’
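
It is worth seeing what the lookup returns for a page that does not contain the word at all. The sketch below uses two made-up frequency tables (toy.raws.l is an assumed name, not project data) and sapply in place of lapply plus rbind, purely to keep the example short.

```r
# Two toy pages: 'captaine' appears on the first page only
toy.raws.l <- list(
  page_1 = table(c("goods", "goods", "captaine")),
  page_2 = table(c("goods", "ship"))
)

# Look the word up in every table; a page without the word yields NA, not 0
captaine.v <- sapply(toy.raws.l, function(freqs.t) as.integer(freqs.t["captaine"]))
goods.v <- sapply(toy.raws.l, function(freqs.t) as.integer(freqs.t["goods"]))

captaine.v
```

Pages without the word come back as NA rather than 0, so they need to be converted before any totals are calculated.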

The next step was to search for the three ships. I chose to search for only one spelling variant of each name: ‘salvador’, ‘sampson’ and ‘george’.

# Ship names (n.b. not every instance of "george" is a reference to the St George)

salvador.l <- lapply(folio.raws.l, '[', 'salvador')
salvador.m <-do.call(rbind, salvador.l)

sampson.l <- lapply(folio.raws.l, '[', 'sampson')
sampson.m <- do.call(rbind, sampson.l)

george.l <- lapply(folio.raws.l, '[', 'george')
george.m <- do.call(rbind, george.l)

salvador.v <- salvador.m[,1]
sampson.v <- sampson.m[,1]
george.v <- george.m[,1]

# This creates a matrix of the mentions of ship names and the page
ships.m <- cbind(salvador.v, sampson.v, george.v)

# A plot of the references to the ships by folio
colnames(ships.m) <- c("salvador", "sampson", "george")

barplot(ships.m, beside=T, col = "grey")

This produced a matrix holding the number of mentions of each ship name on each page, and a bar chart which indicates that the references to the ships are clustered in a few of the depositions. The results for ‘george’ are inflated by deponents named ‘George’, which shows up as a higher frequency than for the other two names.


Raw Frequencies of ‘salvador’, ‘sampson’ and ‘george’
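
One possible way to reduce the ‘george’ problem, which I have not applied to the results above, is to count ‘george’ only when the preceding word is ‘saint’ or ‘st’. The sketch below runs on a made-up sentence; the sentence, the variable names and the choice of the two prefixes are all assumptions, and the spellings in the depositions may well differ.

```r
# Made-up sentence mixing a deponent called George with the ship's name
toy.page.v <- tolower("George Smith deposed that the Saint George and the St George were seized.")

words.v <- unlist(strsplit(toy.page.v, "\\W"))
words.v <- words.v[words.v != ""]

# Positions of every 'george', and the word immediately before each one
george.idx.v <- which(words.v == "george")
prev.v <- ifelse(george.idx.v > 1, words.v[pmax(george.idx.v - 1, 1)], "")

# Keep only the mentions that follow an assumed ship prefix
all.george.n <- length(george.idx.v)
ship.george.n <- sum(prev.v %in% c("saint", "st"))
```

Here all.george.n counts every ‘george’ while ship.george.n counts only those that follow one of the assumed prefixes.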

Identifying Key Folios

Having identified the raw frequencies, I converted the matrix to a data frame, changed the NA values to 0 and then subsetted the results to show only the pages which have one or more mentions of one of the ships. Finally, I subsetted this data frame to highlight the pages which have the highest number of mentions (10 or more).

# Converting matrix to dataframe and replacing NA with 0
ships.df <- as.data.frame(ships.m)

ships.df[is.na(ships.df)] <- 0

# Calculate the mentions per page and subset for those above 0
ships.df$mentions <- rowSums(ships.df)

# All three ships
ships.mention.df <- subset(ships.df, mentions > 0)

# Pages with 10 or more mentions in total
ships.high.mentions.df <- subset(ships.df, mentions >= 10)

This data frame indicates that 10 of the pages have a high number of references to one or more of the ship names:

High mentions of one or more of the ship names by folio

Mentions of one or more of the ship names by folio
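
The NA replacement and the two subsetting steps can be checked on a small made-up matrix. The counts below are invented for illustration and the toy.* names are assumptions; only the sequence of operations mirrors the code used on ships.m.

```r
# Invented counts for three pages; NA means the word is absent from the page
toy.m <- cbind(
  salvador = c(12, NA, 3),
  sampson  = c(NA, NA, 1),
  george   = c(2, NA, NA)
)
rownames(toy.m) <- c("HCA_13/70_f.1r", "HCA_13/70_f.1v", "HCA_13/70_f.2r")

# Same sequence as above: data frame, NA to 0, row totals, subsets
toy.df <- as.data.frame(toy.m)
toy.df[is.na(toy.df)] <- 0
toy.df$mentions <- rowSums(toy.df)

toy.any.df <- subset(toy.df, mentions > 0)
toy.high.df <- subset(toy.df, mentions >= 10)

rownames(toy.high.df)
```

Replacing NA with 0 before calling rowSums matters, because a single NA in a row would otherwise make that row's total NA.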

In my next post, I will look at the use of KWIC (Key Word in Context) to search for references to the ships.
