As I have been delving deeper into the technicalities of a corpus-based approach to literature, it has become increasingly evident that I will need to get my head around some fairly complex statistical analysis. The more I read about this type of analysis, the more references to R I find. To be able to fully understand the articles in this field, I will need to get to grips with statistics as well as at least one of the computer methods used to produce this type of analysis. In addition, as I am a bit of a control freak, I want to be able to carry out my own statistical analysis (as far as is possible) and this seems to mean learning how to program using R.
I downloaded R and RStudio, following the instructions on the Coursera ‘Data Scientist’s Toolbox’ course, and found two sites which allowed me to go through the basics: Datacamp and Tryr (being from the West Country I love the pirate references on Tryr!). I have also signed up for the Coursera course in R programming. This is quite an exciting prospect as it is a world away from my areas of expertise. Although it is hard work, I think it will be worth it to be able to run my own analyses and to know exactly how the results are achieved, rather than having to rely upon someone else. Below are my notes from following the tutorials on Datacamp and Tryr:
Expressions: a simple instruction written after the prompt (>). It could be a line of text (written in speech marks) or a simple arithmetic expression (2+3). In this way, R can be used as a simple calculator. The response is written on the next line, indicated by [1].
Logical (Boolean) values: expressions which return TRUE or FALSE. T and F are shorthand.
Variables: These allow you to store a value or object which can be accessed later, e.g. a value for x. When x is typed, R replaces it with the assigned value. Values can be assigned either by writing x = 4 or x <- 4; the arrow form is the more conventional in R.
Data types:
- Numerics – decimal values
- Integers – whole numbers (written with an L suffix, e.g. 4L)
- Logical – Boolean values
- Characters – text or string
Checking a variable type: type class(x), where x is the variable name written without speech marks – this allows you to make sure that you are working with the right type of variable for the calculation you are trying to carry out.
Functions: As in a spreadsheet, you can use functions, e.g. sum(x,y,z). The values (arguments) need to be in parentheses.
- sum – adds the given values
- rep – repeats the value (used with the argument times)
- sqrt – square root
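The basics above can be tried out in a few lines. This is only a small sketch; the variable names (x, y, total, copies, root) and the pirate phrases are my own placeholders rather than anything from the tutorials.

```r
# Expressions: R evaluates what you type and prints the result after [1].
2 + 3                      # arithmetic
"Arr, matey!"              # a character string

# Variables: both assignment forms work; <- is the conventional one.
x <- 4
y = 2.5

# Data types, checked with class().
class(y)                   # "numeric"
class(4L)                  # "integer" (the L suffix marks an integer)
class(TRUE)                # "logical"
class("text")              # "character"

# Functions take their arguments in parentheses.
total  <- sum(1, 3, 5)                 # 9
copies <- rep("Yo ho!", times = 3)     # the same string three times
root   <- sqrt(16)                     # 4
```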
Help: to get help for a particular function type help(function_name), e.g. help(sum); example(function_name) runs examples of the function in use. Note that R is case-sensitive, so Example() will not work.
Scripts: For simple and short scripts, it is easy to type the commands as required. However, for longer and more complex sets of commands, it is possible to save the commands as a plain text file (‘x.R’) which can be executed later. To run the script you type source("x.R").
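To see source() at work without creating a file by hand, the sketch below writes a tiny script to a temporary file and then runs it. The file contents and the names script_path, greeting and answer are illustrative assumptions.

```r
# Write a two-line script to a temporary file (a stand-in for a file such as x.R).
script_path <- tempfile(fileext = ".R")
writeLines(c("greeting <- 'Ahoy'",
             "answer <- sqrt(16) + 38"),
           script_path)

# source() runs every line of the file as if it had been typed at the prompt.
source(script_path)

greeting
answer
```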
A vector is a list of values (a one-dimensional array). It can hold numeric, logical or character data. A vector is created using c(x,y,z), where c means combine. All the values in a vector must share one type; if types are mixed, R converts them to a common type (for example, numbers become text when combined with a string). To name the elements of a vector use names(vector_a)=c("a","b","c"). Alternatively, create a vector with the item names and then assign that vector, e.g. item_names_vector=c("a","b","c") then names(vector_a)=item_names_vector. The names can be used to access the values or to change them: vector_a["a"] would bring up the value associated with a; to change the value: vector_a["a"]<-42.
For example, vectors could be created for authors and their works to identify where specific pieces of data have originated, so I could have an ‘Austen’ vector which included each of the novels.
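A named vector along these lines might look like the sketch below. The numbers are made up purely for illustration (they are not real counts from any corpus), and austen and mixed are placeholder names.

```r
# Made-up counts for illustration only; these are not real corpus figures.
austen <- c(120, 95, 143)
names(austen) <- c("emma", "persuasion", "pride")

austen["emma"]            # look a value up by name
austen["emma"] <- 42      # replace it by name
austen

# Mixing types does not fail: here everything is converted to character.
mixed <- c(1, "two", TRUE)
class(mixed)              # "character"
```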
To add vectors and assign to a total: total_vector=vector_a+vector_b. To add the values within a vector: sum(vector_a). To select elements of a vector: vector_a[n], where the number (the array index) indicates the value at that position; in R the array indices start at 1. To select multiple elements: vector_a[c(1,3)]; or a consecutive set of elements: vector_a[2:5] (2:5 is called a sequence vector). You can use square brackets to assign a new value within a vector or to add new values, e.g. vector_a[2]<-"biscuit" would change the second value in the vector to "biscuit". To set ranges of values: vector_a[4:6]<-c(x,y,z). Another way of creating a series of numbers is the seq function: seq(3,9) would return all the numbers from 3 to 9; however, it is more flexible as it can be used for increments other than 1 by following the second number with the desired increment, e.g. seq(3,9,0.5). To get the average of a vector: mean(vector_a); to get the average of selected elements: mean(vector_a[c(1,2)]).
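The indexing rules are easier to follow with concrete numbers. A minimal sketch, with invented values:

```r
vector_a <- c(10, 20, 30, 40, 50, 60)
vector_b <- c(1, 2, 3, 4, 5, 6)

total_vector <- vector_a + vector_b   # element-by-element addition
sum(vector_a)                         # 210

vector_a[2]                           # second element (indices start at 1)
vector_a[c(1, 3)]                     # first and third elements
vector_a[2:5]                         # elements two to five

vector_a[2] <- 25                     # replace a single value
vector_a[4:6] <- c(41, 51, 61)        # replace a range of values

seq(3, 9)                             # 3 4 5 6 7 8 9
seq(3, 9, 2)                          # 3 5 7 9

mean(vector_a)                        # mean of all six values
mean(vector_a[c(1, 2)])               # mean of the first two: 17.5
```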
Comparing in R:
- < less than
- > greater than
- <= less than or equal to
- >= greater than or equal to
- == equal to
- != not equal to
To select by comparison: selection_vector=vector_a>3 (returns TRUE for each item greater than 3 and FALSE for the rest); this can then be used to pick out only those items above 3: above_3=vector_a[selection_vector].
NA Values: if a value isn’t known it can be replaced with NA. R recognises this and will return NA for calculations that involve it. You can instruct R to ignore NA values, e.g. sum(vector_a, na.rm=TRUE); the default is na.rm=FALSE.
To see the values in a vector: print(vector_a).
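A short sketch of selection by comparison and of NA handling, with invented values (with_gap is a placeholder name):

```r
vector_a <- c(1, 5, 2, 8, 3)

selection_vector <- vector_a > 3        # FALSE TRUE FALSE TRUE FALSE
above_3 <- vector_a[selection_vector]   # 5 8

with_gap <- c(4, NA, 6)
sum(with_gap)                           # NA, because one value is unknown
sum(with_gap, na.rm = TRUE)             # 10
```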
Bar Graphs: the barplot function draws a bar chart with a single vector’s values – barplot(vector_a). If names have been assigned to the vector values these will be displayed as labels. The mean of the vector can be worked out with mean(vector_a) and added to the bar chart as a horizontal line with abline(h=mean(vector_a)); using v instead of h would create a vertical line. The median can be calculated and added in the same way. To work out the standard deviation: sd(vector_a). By storing the mean and standard deviation in variables you can add lines to a bar chart indicating the mean and 1 standard deviation above and below.
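Put together, the bar chart with its reference lines could be sketched as follows. The values are made up, and the line types (lty) are my own addition to tell the lines apart.

```r
# Made-up values for illustration.
vector_a <- c(a = 3, b = 7, c = 5, d = 13)

barplot(vector_a)                       # the names become the bar labels

mean_a <- mean(vector_a)                # 7
sd_a   <- sd(vector_a)

abline(h = mean_a)                      # horizontal line at the mean
abline(h = mean_a + sd_a, lty = 2)      # one standard deviation above
abline(h = mean_a - sd_a, lty = 2)      # one standard deviation below
abline(h = median(vector_a), lty = 3)   # the median, added the same way
```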
Scatter Plots: the plot function takes two vectors, one for the x-axis and one for the y-axis: plot(vector_a, vector_b) – the first vector is the x-axis.
A matrix is a collection of elements (of the same type) arranged in rows and columns (a two-dimensional array). The matrix function has three main arguments: the elements to be arranged, byrow (TRUE fills the matrix row by row; the default, FALSE, fills it column by column), and nrow (how many rows are used). E.g. matrix(1:15, byrow=TRUE, nrow=3). To name columns: colnames(matrix_a)=c("a","b","c"); to name rows: rownames(matrix_a)=c("a","b","c"). To calculate the total for each row: rowSums(matrix_a); for each column: colSums(matrix_a). To add columns or rows: cbind merges matrices and/or vectors by column, e.g. matrix_c=cbind(matrix_a,matrix_b,vector_a); and rbind merges matrices and/or vectors by row. To select elements from a matrix: matrix_a[row,column], e.g. matrix_a[1,2] – this would select the element on the first row, second column. To select a whole row: matrix_a[row,]; a whole column: matrix_a[,column]. To change a vector into a matrix, set its dimensions with dim, e.g. dim(vector_a)<-c(rows,columns).
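A small worked example of these matrix operations, assuming the byrow argument of matrix() and placeholder row and column names (wider and taller are also my own names):

```r
# 3 rows and 5 columns, filled row by row.
matrix_a <- matrix(1:15, byrow = TRUE, nrow = 3)
rownames(matrix_a) <- c("r1", "r2", "r3")
colnames(matrix_a) <- c("a", "b", "c", "d", "e")

rowSums(matrix_a)          # 15 40 65
colSums(matrix_a)          # 18 21 24 27 30

matrix_a[1, 2]             # 2
matrix_a[2, ]              # whole second row
matrix_a[, 3]              # whole third column

wider  <- cbind(matrix_a, total = rowSums(matrix_a))   # add a column
taller <- rbind(matrix_a, total = colSums(matrix_a))   # add a row

# Reshape a vector into 2 rows and 4 columns (dim fills column by column).
vector_a <- 1:8
dim(vector_a) <- c(2, 4)
```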
Matrix Plotting
To create a contour map: contour(matrix_a). To create a 3D perspective plot: persp(matrix_a). To alter the vertical expansion use expand, e.g. persp(matrix_a,expand=0.2). To create a heat map: image(matrix_a).
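These three plots can be tried on any numeric matrix. The sketch below assumes the volcano matrix of elevations that ships with R's built-in datasets package, used here as a stand-in for matrix_a.

```r
# volcano is a built-in numeric matrix of elevations.
contour(volcano)                 # contour map
persp(volcano, expand = 0.2)     # 3D perspective plot, vertically flattened
image(volcano)                   # heat map
```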
Factors are a statistical data type used to store categorical variables, i.e. variables with a fixed number of categories, such as the novels by Jane Austen: Austen=c("sense", "pride", "emma", "mansfield", "abbey", "persuasion"). To categorise the vector, use factor: JANovels=factor(Austen). This creates levels from the unique values; each value is stored as an integer reference to its level, and the underlying integers can be viewed using as.integer(JANovels).
If you create a plot to explore aspects of the factor, you can use different plotting characters for each level by using pch, e.g. plot(vector_a, vector_b, pch=as.integer(JANovels)). A legend can be added using legend and the levels function, e.g. legend("topright", levels(JANovels), pch=1:length(levels(JANovels))).
If the variable has a natural order (it is an ordinal variable, e.g. low, medium, high) the order of the levels can be set when creating the factor, listing the levels from lowest to highest: factor(vector_a, ordered=TRUE, levels=c("low", "medium", "high")). The summary function, when used with a factor, gives a count of the items in each level.
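The factor notes above can be sketched as follows. The coordinates x and y and the ratings vector are made up for illustration; seq_along(levels(...)) is used as an equivalent of 1:length(levels(...)).

```r
Austen <- c("sense", "pride", "emma", "mansfield", "abbey", "persuasion")
JANovels <- factor(Austen)

levels(JANovels)           # the unique values, sorted alphabetically
as.integer(JANovels)       # 6 5 2 3 1 4

# Made-up coordinates, one point per novel, with a different symbol per level.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(3, 1, 4, 1, 5, 9)
plot(x, y, pch = as.integer(JANovels))
legend("topright", levels(JANovels), pch = seq_along(levels(JANovels)))

# An ordinal variable: the levels are listed from lowest to highest.
ratings <- c("high", "low", "medium", "low")
rating_factor <- factor(ratings, ordered = TRUE,
                        levels = c("low", "medium", "high"))
rating_factor[1] > rating_factor[2]   # TRUE: "high" ranks above "low"
summary(rating_factor)                # counts per level
```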
A data frame is a bit like an Excel spreadsheet: it arranges linked pieces of data in columns, with one row for each set of values. Because every column has the same number of rows, adding a row adds a value to every column, which keeps the whole in sync. Unlike a matrix, where all the entries need to be of the same data type, a data frame allows a different data type in each column. To create a data frame: frame_a<-data.frame(vector_a,vector_b,factor_a). To access a column: frame_a[[2]] or frame_a[["vector_b"]] – both would return the same information; the first method is shorter, but the second is clearer. An alternative method is frame_a$vector_b.
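A minimal sketch of creating a data frame and reaching one column in three ways, with invented contents:

```r
vector_a <- c(10, 20, 30)
vector_b <- c("x", "y", "z")
factor_a <- factor(c("low", "high", "low"))

frame_a <- data.frame(vector_a, vector_b, factor_a)

frame_a[[2]]               # second column, by position
frame_a[["vector_b"]]      # the same column, by name
frame_a$vector_b           # the same column, shorthand
```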
Loading Data Frames
R has the capability to load external files, e.g. .csv (comma separated values) and .txt files. To load a CSV file into a data frame: read.csv("file.csv"). For files that use separators other than commas, e.g. a text file using tabs, use the read.table function: read.table("file.txt", sep="\t") – this would read a TXT file where the values are separated by tabs ("\t" is the tab character). The header argument can indicate that the first line holds the column headers: read.table("file.txt", sep="\t", header=TRUE).
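To try the loading functions without any real data to hand, the sketch below first writes two tiny files to a temporary location and then reads them back. The file contents and the names csv_path, txt_path, from_csv and from_txt are invented for illustration.

```r
# Create small stand-in files in a temporary location.
csv_path <- tempfile(fileext = ".csv")
txt_path <- tempfile(fileext = ".txt")
writeLines(c("title,chapters", "alpha,10", "beta,20"), csv_path)
writeLines(c("title\tchapters", "alpha\t10", "beta\t20"), txt_path)

# read.csv treats the first line as the header by default.
from_csv <- read.csv(csv_path)

# read.table needs the separator and the header spelled out.
from_txt <- read.table(txt_path, sep = "\t", header = TRUE)
```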
Merging Data Frames
To merge two data frames where they have a common column: merge(x=frame_a, y=frame_b). By default only the rows whose value in the common column appears in both data frames are kept.
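A sketch of a merge on a shared column, with invented contents (the column names title, chapters and year are placeholders):

```r
frame_a <- data.frame(title = c("alpha", "beta", "gamma"),
                      chapters = c(10, 20, 30))
frame_b <- data.frame(title = c("alpha", "beta", "delta"),
                      year = c(1901, 1902, 1903))

# merge() joins on the column the two data frames share (title).
merged <- merge(x = frame_a, y = frame_b)
merged   # only alpha and beta appear in both, so two rows remain
```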
Exploring Data Frames
The head() and tail() functions allow you to see the top and bottom sections of your data frame: head(frame_a). The str() function tells you the number of observations, the number of variables, the variables’ names and types, and the first few observations: str(frame_a) – this is a useful way of getting an overview of a new data set. To create a subset of a data frame: subset_a=subset(frame_a, subset=vector_a>n); the condition can refer to the column by name. To order the information in a data frame by a particular column use the function order(): decrease=order(frame_a$vector_a, decreasing=TRUE). This returns the row positions in sorted order. To create a new data frame using this ordering: frame_b=frame_a[decrease,].
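The exploring functions can be sketched on a small invented data frame:

```r
frame_a <- data.frame(vector_a = c(5, 12, 8, 20),
                      vector_b = c("w", "x", "y", "z"))

head(frame_a, 2)           # first two rows
tail(frame_a, 2)           # last two rows
str(frame_a)               # structure: observations, variables, types

# Keep only the rows where vector_a is greater than 8.
subset_a <- subset(frame_a, subset = vector_a > 8)

# order() returns row positions, largest value first: 4 2 3 1.
decrease <- order(frame_a$vector_a, decreasing = TRUE)
frame_b  <- frame_a[decrease, ]
```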
A list can hold items of different kinds, such as vectors, matrices and data frames. To create a list you use the list() function: list_a=list(vector_a, matrix_b, frame_c). To name the items while creating the list: list_a=list(vectorname=vector_a, matrixname=matrix_b, framename=frame_c).
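A short sketch of an unnamed and a named list, with invented contents (list_named is my own name for the second list):

```r
vector_a <- 1:3
matrix_b <- matrix(1:4, nrow = 2)
frame_c  <- data.frame(id = 1:2)

list_a     <- list(vector_a, matrix_b, frame_c)
list_named <- list(vectorname = vector_a,
                   matrixname = matrix_b,
                   framename  = frame_c)

list_a[[1]]                # an item by position
list_named$matrixname      # an item by name
```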
To test for correlation between two vectors: cor.test(vector_a,vector_b); or between two columns of a data frame: cor.test(frame_a$a, frame_a$b). This will provide the correlation estimate, the p-value and other information.
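As a sketch, the result of cor.test can be stored and its parts read off by name. The paired values below are made up for illustration, and result is a placeholder name.

```r
# Made-up paired measurements for illustration.
vector_a <- c(1, 2, 3, 4, 5, 6)
vector_b <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)

result <- cor.test(vector_a, vector_b)
result$estimate            # the correlation coefficient
result$p.value             # the p-value
```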
To estimate a likely value for b where we have data for a but incomplete data for b, fit a linear model: estimate=lm(response ~ predictor), e.g. estimate=lm(frame_a$b ~ frame_a$a).
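A sketch of fitting a linear model and using it to estimate a missing value. The data are invented so that b is exactly 2a + 1, and the model is written with the data argument (lm(b ~ a, data=frame_a)) rather than with frame_a$b ~ frame_a$a, because predict() can then match the column name a in new data.

```r
# Invented data in which b is exactly 2 * a + 1.
frame_a <- data.frame(a = c(1, 2, 3, 4, 5),
                      b = c(3, 5, 7, 9, 11))

estimate <- lm(b ~ a, data = frame_a)
coef(estimate)             # intercept and slope

# Estimate b for a value of a that has no recorded b.
predict(estimate, newdata = data.frame(a = 6))
```

With these values the prediction for a = 6 comes out at 13 (up to rounding), as the made-up relationship implies.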
ggplot2 is a graphics package. It is installed with install.packages("ggplot2"). To use a package: library(ggplot2). Once it is installed you can get help: help(package="ggplot2"). This package makes it simpler to create more attractive plots, including the use of colour, without some of the complexities of the basic plotting functions.