In this second blog post on the Marine Lives Three Silver Ships project (see the first post here), I look at how to identify the folio pages in the HCA 13/70 Depositions that mention the three ships (Salvador, Sampson and Saint George).
Processing and Calculating Raw Frequencies
Using the .txt file downloaded in the last post, I started by processing the text. To write the code used in this section, I adapted code from Matthew Jockers’ (2014) book Text Analysis with R for Students of Literature. First, using the phrase “This page is for the annotation of HCA 13/70 f.x” as a marker, I broke the text into its individual pages, named them and calculated the raw frequencies of the words.
    # Read the transcription, one element per line
    text.v <- scan("HCA_13.70.txt", what = "character", sep = "\n")
    start.v <- which(text.v == "This page is for the annotation of HCA 13/70 f.1r.")
    end.v <- length(text.v)
    HCA_13.70.lines.v <- text.v[start.v:end.v]

    # Use grep to break the text into folios
    folio.positions.v <- grep("^This page is for the annotation of \\w", HCA_13.70.lines.v)

    # Add an additional line to text as an end marker
    HCA_13.70.lines.v <- c(HCA_13.70.lines.v, "END")
    last.position.v <- length(HCA_13.70.lines.v)
    folio.positions.v <- c(folio.positions.v, last.position.v)

    # Create folio page names (f.1r, f.1v, f.2r, ...)
    f.v <- rep(1:501, each = 2)
    p.v <- c("r", "v")
    page.v <- rep(p.v, times = 501)
    folName.l <- vector("list", 1002)
    for (i in 1:1002) {
      folName.l[[i]] <- paste0("HCA_13/70_f.", f.v[i], page.v[i])
    }

    # Create empty list containers
    folio.raws.l <- list()
    folio.freqs.l <- list()

    # Extract the text on each page and calculate the frequency count of each
    # word type
    for (i in 1:length(folio.positions.v)) {
      if (i != length(folio.positions.v)) {
        folio.title <- folName.l[[i]]
        # folio.title <- HCA_13.70.lines.v[folio.positions.v[i]]
        start <- folio.positions.v[i] + 1
        end <- folio.positions.v[i + 1] - 1
        folio.lines.v <- HCA_13.70.lines.v[start:end]
        folio.words.v <- tolower(paste(folio.lines.v, collapse = " "))
        folio.words.l <- strsplit(folio.words.v, "\\W")
        folio.word.v <- unlist(folio.words.l)
        folio.word.v <- folio.word.v[which(folio.word.v != "")]
        folio.freqs.t <- table(folio.word.v)
        folio.raws.l[[folio.title]] <- folio.freqs.t
      }
    }
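To make the page-splitting step easier to follow, here is a minimal sketch of the same idea on a four-line toy text. The toy lines, and the names `toy.lines.v`, `marker.v`, `bounds.v` and `toy.raws.l`, are made up for illustration and are not part of the project code. Unlike the code above, this sketch takes each page name from the marker line itself.

```r
# Toy stand-in for the downloaded transcription (hypothetical content)
toy.lines.v <- c(
  "This page is for the annotation of HCA 13/70 f.1r.",
  "The Sampson was laden with goods",
  "This page is for the annotation of HCA 13/70 f.1v.",
  "the goods of the Salvador and the goods of the Sampson"
)

# Locate the marker lines and append a position one past the end of the text
marker.v <- grep("^This page is for the annotation of", toy.lines.v)
bounds.v <- c(marker.v, length(toy.lines.v) + 1)

toy.raws.l <- list()
for (i in seq_along(marker.v)) {
  # Lines between this marker and the next one belong to page i
  # (assumes every page has at least one line of text)
  page.lines.v <- toy.lines.v[(bounds.v[i] + 1):(bounds.v[i + 1] - 1)]
  words.v <- unlist(strsplit(tolower(paste(page.lines.v, collapse = " ")), "\\W"))
  words.v <- words.v[words.v != ""]
  # Name each page after the folio reference in its marker line
  page.name <- sub("^.*(f\\.[0-9]+[rv]).*$", "\\1", toy.lines.v[marker.v[i]])
  toy.raws.l[[page.name]] <- table(words.v)
}

# Frequency of 'goods' on the second toy page
toy.raws.l[["f.1v"]]["goods"]
```

Each element of `toy.raws.l` is a frequency table for one page, which is the same shape as `folio.raws.l` in the full script.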
Plotting Frequencies
As a test, I looked at the frequency of two words I knew were in the corpus, ‘goods’ and ‘captaine’, and plotted their frequencies by page on a bar chart.
    # testing with 'goods' and 'captaine'
    goods.l <- lapply(folio.raws.l, '[', 'goods')
    goods.m <- do.call(rbind, goods.l)
    captaine.l <- lapply(folio.raws.l, '[', 'captaine')
    captaine.m <- do.call(rbind, captaine.l)

    goods.v <- goods.m[, 1]
    captaine.v <- captaine.m[, 1]
    goods.captaine.m <- cbind(goods.v, captaine.v)
    dim(goods.captaine.m)

    colnames(goods.captaine.m) <- c("goods", "captaine")
    barplot(goods.captaine.m, beside = TRUE, col = "grey")
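The `lapply(..., '[', 'goods')` pattern looks up one word in every page's frequency table, and gives `NA` for pages where the word does not occur. As a rough sketch of how that lookup could be wrapped up for reuse, the hypothetical helper `count_term` below (not part of the original script) returns 0 instead of `NA` for such pages. The toy tables stand in for `folio.raws.l`.

```r
# Hypothetical helper: count one word across a list of per-page frequency tables
count_term <- function(raws.l, term) {
  vapply(
    raws.l,
    function(freqs.t) {
      hit <- freqs.t[term]  # NA when the word is absent from the page
      if (is.na(hit)) 0L else as.integer(hit)
    },
    integer(1)
  )
}

# Toy tables standing in for folio.raws.l (made-up words and counts)
toy.raws.l <- list(
  "HCA_13/70_f.1r" = table(c("goods", "goods", "captaine")),
  "HCA_13/70_f.1v" = table(c("ship", "goods"))
)

count_term(toy.raws.l, "goods")
count_term(toy.raws.l, "captaine")
```

The result is a named integer vector with one entry per page, so several such vectors can be passed straight to `cbind()` to build a matrix for plotting.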
The next step was to search for the three ships. I chose to search for only one spelling variant of each name: ‘salvador’, ‘sampson’ and ‘george’.
    # Ship names (n.b. all instances of "george" may not be references to the St George)
    salvador.l <- lapply(folio.raws.l, '[', 'salvador')
    salvador.m <- do.call(rbind, salvador.l)
    sampson.l <- lapply(folio.raws.l, '[', 'sampson')
    sampson.m <- do.call(rbind, sampson.l)
    george.l <- lapply(folio.raws.l, '[', 'george')
    george.m <- do.call(rbind, george.l)

    salvador.v <- salvador.m[, 1]
    sampson.v <- sampson.m[, 1]
    george.v <- george.m[, 1]

    # This creates a matrix of the mentions of ship names and the page
    ships.m <- cbind(salvador.v, sampson.v, george.v)
    dim(ships.m)

    # A plot of the references to the ships by folio
    colnames(ships.m) <- c("salvador", "sampson", "george")
    barplot(ships.m, beside = TRUE, col = "grey")
This produced a matrix holding the frequency of each ship name on each page, and a bar chart which indicates that the references to the ships are clustered in a few of the depositions. The search results for ‘george’ are skewed by deponents named ‘George’, which shows in the higher frequency for that term.
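Searching for a single spelling will miss any variant spellings in the transcription. One possible way to fold variants together, sketched here under assumptions, is to sum the counts of every word type that matches a regular expression. The helper `count_pattern`, the pattern and the toy spellings are all hypothetical; I have not checked which variants actually occur in HCA 13/70.

```r
# Hypothetical helper: sum the counts of all word types matching a pattern
count_pattern <- function(raws.l, pattern) {
  vapply(
    raws.l,
    function(freqs.t) {
      matches <- grepl(pattern, names(freqs.t))
      as.integer(sum(freqs.t[matches]))  # sum of an empty selection is 0
    },
    integer(1)
  )
}

# Toy tables with made-up spellings, standing in for folio.raws.l
toy.raws.l <- list(
  p1 = table(c("sampson", "samson", "goods")),
  p2 = table(c("goods"))
)

# Counts 'sampson' and 'samson' together
count_pattern(toy.raws.l, "^samp?son$")
```

A pattern like this cannot separate the ship Saint George from deponents called George, so the ‘george’ counts would still need checking in context.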
Identifying Key Folios
Having identified the raw frequencies, I converted the matrix to a data frame, changed the NA values to 0 and then subset the results to show only the pages which have one or more mentions of one of the ships. Finally, I subset this data frame to highlight the pages which have the highest number of mentions (10 or more).
    # Converting matrix to dataframe and replacing NA with 0
    ships.df <- as.data.frame(ships.m)
    ships.df[is.na(ships.df)] <- 0

    # Calculate the mentions per page and subset for those above 0
    ships.df$mentions <- rowSums(ships.df)

    # All three ships
    ships.mention.df <- subset(ships.df, mentions > 0)
    ships.high.mentions.df <- subset(ships.df, mentions >= 10)
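As a small worked example of these steps, the sketch below runs the same NA replacement, row totals and subsetting on a three-page toy matrix. The counts and the short page names are invented for illustration and are not results from the depositions.

```r
# Toy matrix of per-page counts (made-up values); NA means the word is absent
toy.m <- cbind(
  salvador = c(NA, 3, NA),
  sampson  = c(NA, 8, 1),
  george   = c(2, NA, NA)
)
rownames(toy.m) <- c("f.1r", "f.1v", "f.2r")

# Same steps as above: data frame, NA to 0, total mentions per page
toy.df <- as.data.frame(toy.m)
toy.df[is.na(toy.df)] <- 0
toy.df$mentions <- rowSums(toy.df)

# Pages with 10 or more mentions
rownames(subset(toy.df, mentions >= 10))

# Optional extra: rank the pages from most to fewest mentions
toy.df[order(toy.df$mentions, decreasing = TRUE), ]
```

Note that `rowSums()` is called before any other column is added, so the total only covers the three ship columns.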
This data frame indicates that 10 of the pages have a high number of references to one or more of the ships' names.
In my next post, I will look at the use of KWIC (Key Word in Context) to search for references to the ships.