Jul 09 2015

Marine Lives – R and The Silver Ships – Extracting Data

Within the larger Marine Lives project there are a number of smaller sub-projects which focus on specific areas of interest. One of these smaller projects is the Three Silver Ships project:

“Three large ships (The Salvador, the Sampson and the Saint George) of supposedly Lubeck and Hamburg build and ownership were captured by the English in 1652 with highly valuable cargos of bullion. The ships were on their way from Cadiz with bullion from the Spanish West Indies going northwards. It was disputed in court as to whether the ships were bound legally for the Spanish Netherlands, or illegally for Amsterdam.” Marine Lives

The purpose of the project is to identify relevant references to cases involving the Three Silver Ships in the various depositions and papers and to consolidate them in a wiki. At the moment this is being done manually and, as the depositions run to hundreds of two-sided folios, that means a large number of pages to search.

As my PhD uses R to analyse novels, I thought it would be interesting to see whether I could apply my rookie programming skills to the problem. As a caveat, I have only been using R for about a year and am still very much a beginner, so there may well be more straightforward and elegant ways of working – please feel free to comment, as any advice will be gratefully received!

The first challenge was to extract the text, the bulk of which is transcriptions, from the individual wiki pages. This had several stages:

Creating a list of URLs

As each page URL is made up of a collection number (in this case HCA 13/70), a folio number (e.g. f.1), and whether the page is recto or verso (r or v), I started by creating a number sequence for the folios and a second sequence for the page sides. These were combined in a list with the main part of the URL and then collapsed to make a working URL for each page.

# Create a sequence for the folio numbers (each folio has two pages)
f.v <- rep(1:501, each = 2)

# Create a sequence for the pages recto/verso
p.v <- c("r", "v")
page.v <- rep(p.v, times = 501)

# Create a list of URLs
folio.l <- vector("list", 1002)
for (i in 1:1002) {
        folio.l[[i]] <- c("http://www.marinelives.org/wiki/HCA_13/70_f.", f.v[i],
                          page.v[i], "_Annotate")
        folio.l[[i]] <- paste(folio.l[[i]], collapse = "")
}
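As an aside, since R's string functions are vectorised, the same set of URLs can be built without a loop using paste0 (just a sketch of an alternative, producing a character vector rather than a list):

```r
# Build all 1002 URLs in one vectorised call (no loop needed)
f.v <- rep(1:501, each = 2)
page.v <- rep(c("r", "v"), times = 501)
folio.v <- paste0("http://www.marinelives.org/wiki/HCA_13/70_f.",
                  f.v, page.v, "_Annotate")
folio.v[1]
# "http://www.marinelives.org/wiki/HCA_13/70_f.1r_Annotate"
```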


Extracting the text from the wiki pages

To extract the text I used the package boilerpipeR (Mario Annau, 2015). I used the DefaultExtractor function as I found that the LargestContentExtractor excluded some parts of the transcriptions.

# Extract the text from the wiki pages and save as .txt file

library(RCurl)        # for getURL
library(boilerpipeR)  # for DefaultExtractor

# This function extracts the text from a wiki page
textExtract <- function(number) {
        url <- folio.l[[number]]
        content <- getURL(url)
        extract <- DefaultExtractor(content)
        return(extract)
}

# Create vector to hold extracted text and fill using loop
# This is a large file (approx 5 Mb) so will take some time to run
x.v <- vector(length = 1002)
for (i in 1:1002) {
        x.v[i] <- textExtract(i)
}

# Put resulting text into a named vector
HCA_13.70.v <- x.v
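With the pages in a single character vector, a first rough pass at finding Silver Ships references is possible with grepl (a sketch; the ship-name patterns below are my own guesses, and spellings vary considerably in the depositions):

```r
# Flag pages that mention any of the three ships (patterns are illustrative)
ships <- c("Salvador", "Sampson", "Saint George")
pattern <- paste(ships, collapse = "|")
hits <- grepl(pattern, HCA_13.70.v, ignore.case = TRUE)
which(hits)  # indices of pages containing a possible reference
```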

Saving the file as a .txt file

The final step was to save the resulting vector as a .txt file, giving me an offline copy that can be reused without having to go through the download process again.

# Check that working directory has been set then save as .txt file

write(HCA_13.70.v, file = "HCA_13.70.txt")

In my next post I will discuss preparing the text for analysis.



    • Rowan on July 9, 2015 at 17:45

    Nice!! Can’t wait to see what this produces…

    I imported the existing data into the wiki, and am in the process of (slowly) setting it up – the transcriptions and other fields are now marked up in a vaguely semantic sense, and I’m keen to explore how best to exploit this in terms of searching and export.

    Were you aware of/did you try the supposedly built in RDF export (https://semantic-mediawiki.org/wiki/Help:RDF_export) before scraping? What were the limitations?

    Let me know if you’d like rawer access to the data, I’m sure we can set something up.

    1. Hi, I didn’t use anything from the semantic-mediawiki as I really have no idea how it works. I have been learning R for the past year and used boilerpipeR to scrape the text from the URLs. There are other options in the package, but I stuck to the default. There are probably much more elegant/efficient ways of doing this, but I’m feeling my way along, and I figure things can be tweaked once there is a basic framework set up.

