My initial foray into the world of programming seemed to go fairly well: at least, I managed to get my head around the basics and haven’t run screaming from my laptop. There were a few areas which I found trickier than others, but I think some of that is because I don’t have a maths or computing background, so I have a pretty steep learning curve.
My understanding is that R is a programming language and environment that you can use to sort, group and analyse data, from straightforward summaries right up to complex modelling.
There are several types of ‘object’ in R: vectors, lists, matrices, factors and data frames. These objects hold the data, and they have attributes: class (for example character, numeric, integer, logical or complex), names, dimensions, length and so on. Functions are used to manipulate the data; they are written functionname(object).
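A minimal sketch of an object, its attributes and a function call. The vector name heights and its values are made up for illustration.

```r
# A numeric vector; 'heights' and its values are invented for this example
heights <- c(1.62, 1.75, 1.80)

class(heights)    # "numeric"
length(heights)   # 3

# Names are an attribute that can be added after the vector is created
names(heights) <- c("ann", "bob", "cat")
attributes(heights)

# A function call has the form functionname(object)
average <- mean(heights)
```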
Naming vectors and other items carefully will make the data much easier to explore, so this needs some thought as well as careful recording. The environment window of RStudio lists values, their type and their content, which is very useful. I feel that I am starting to grasp what R is capable of, and why I might use it, but my practical knowledge is not yet up to where it needs to be.
Having completed Coursera’s Data Scientist’s Toolbox course, which introduced me to R, RStudio, Git and GitHub (all available free on the internet), the next step is the R Programming course. Both courses are part of the Data Science Specialization run by Johns Hopkins University.
Although I had covered a large proportion of the initial swirl content in my previous post, it has been helpful to go over the basic functions. This time round, I feel a bit more confident in the use of terminology and the structure of arguments and functions.
R uses NA to indicate missing values and will return NA as the result of almost any calculation that has NA as one of its inputs. To identify the NA values in a data set you use the is.na() function. Care has to be taken with logical expressions if NA is a possible value, as a comparison involving NA returns NA rather than TRUE or FALSE. NaN, standing for ‘Not a Number’, is different: it is the result of a numeric calculation with no defined answer, such as 0/0.
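A short sketch of how NA behaves, using a made-up vector x. NaN appears when a numeric calculation has no defined result, such as 0 / 0.

```r
# 'x' is an invented example vector with one missing value
x <- c(4, NA, 7)

x + 1          # 5 NA 8: the NA carries through the calculation
mean(x)        # NA
is.na(x)       # FALSE TRUE FALSE

# A comparison involving NA gives NA, not TRUE or FALSE
x > 5          # FALSE NA TRUE

# NaN is the result of an undefined numeric calculation
0 / 0          # NaN
is.na(0 / 0)   # TRUE: is.na() treats NaN as missing as well
is.nan(NA)     # FALSE: but NA is not NaN
```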
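The is.na() function can be used to remove NA values: bad <- is.na(x) and then x[!bad]. To remove NA from several objects at once, or from a data frame, use complete.cases(), which returns a logical vector marking the positions (or rows) that have no missing values.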
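A sketch of both approaches. The vectors x and y and the data frame df are invented for the example.

```r
# Invented example data
x <- c(1, 2, NA, 4, NA)
y <- c("a", NA, "c", "d", NA)

# Remove NA from a single vector
bad <- is.na(x)
x[!bad]                        # 1 2 4

# Keep only the positions where neither vector is missing
good <- complete.cases(x, y)
x[good]                        # 1 4
y[good]                        # "a" "d"

# The same idea on a data frame: keep only the complete rows
df <- data.frame(x = x, y = y)
clean <- df[complete.cases(df), ]
```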
To select elements from a vector use square brackets with a vector index – [vector index]. There are several types of vector index: logical vectors, positive integers, negative integers, character strings.
! gives the negation of a logical expression, so if is.na() identifies the values which are NA, !is.na() identifies the values which are not NA. By subsetting the data to create a vector which includes only the non-NA items, we can avoid the possible problems.
Positive integers can be used to pick out specific elements within the vector, e.g. the 3rd and 5th: x[c(3,5)]. Negative integers return everything except the specified elements, e.g. all except the 3rd and 5th: x[c(-3,-5)], which can also be written x[-c(3,5)].
To subset using names you need to remember to use quotation marks and to wrap the names in c() inside the square brackets: x[c("name1", "name2")].
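The four kinds of vector index side by side, as a minimal sketch. The named vector x is made up for illustration.

```r
# An invented named vector
x <- c(first = 10, second = 20, third = 30, fourth = 40, fifth = 50)

x[x > 25]                 # logical vector: third, fourth, fifth
x[c(3, 5)]                # positive integers: third and fifth
x[-c(3, 5)]               # negative integers: everything except those two
x[c("first", "fourth")]   # character strings, combined with c()
```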
Matrices and Data Frames
The dim() function tells us the dimensions of an object or can be used to set the dimensions – for example to change a vector into columns and rows, and therefore change it into a matrix.
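A small sketch of dim() used both ways, on a made-up vector v:

```r
# An invented vector of twelve values
v <- 1:12
dim(v)               # NULL: a plain vector has no dimensions

# Setting the dimensions turns it into a matrix with 3 rows and 4 columns
dim(v) <- c(3, 4)
dim(v)               # 3 4
is.matrix(v)         # TRUE

# The values are filled in column by column, so row 2 of column 3 is 8
v[2, 3]              # 8
```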
To extract a single element from a list or data frame you use double square brackets, [[ ]]; they can only select one element at a time. A $ is used to extract an element by name.
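A sketch of the difference between single brackets, double brackets and $. The list person and the data frame df are invented for the example.

```r
# An invented list with two named elements
person <- list(name = "Ann", scores = c(8, 9, 7))

person["scores"]      # single bracket: still a list, holding one element
person[["scores"]]    # double bracket: the numeric vector itself
person$scores         # $ with the name gives the same vector

# The same works on the columns of a data frame
df <- data.frame(id = 1:3, score = c(8, 9, 7))
df[["score"]]
df$score
```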
To subset a matrix use x[row, column]. When the result is a single row, column or element this returns a vector by default; to keep it as a matrix use x[row, column, drop = FALSE].
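A minimal sketch of what drop = FALSE changes, using a made-up matrix m:

```r
# An invented matrix with 2 rows and 3 columns, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)

m[1, 2]                      # 3: a single element
m[1, ]                       # 1 3 5: the first row, returned as a vector
dim(m[1, ])                  # NULL: the dimensions have been dropped

row_as_matrix <- m[1, , drop = FALSE]
dim(row_as_matrix)           # 1 3: still a matrix, with one row
```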
The Working Directory
It is important to know which directory you are working in, and therefore where your files will be saved. To find out your current directory type getwd(). To set the working directory in RStudio use the Session menu (Set Working Directory, then Choose Directory), or use the setwd() function.
Any file you want R to read by its name alone will need to be in this directory; otherwise you have to give its path. To read a csv (comma separated values) file you use read.csv("filename.csv"). To check the contents of the directory use dir().
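A sketch that puts these functions together. It assumes a temporary folder and a made-up file called example.csv, so that it can be run without touching any real data.

```r
# Remember the current working directory so it can be restored afterwards
old_dir <- getwd()

# Work in R's temporary folder (an assumption made to keep the example safe)
setwd(tempdir())

# Create a small invented csv file to read back in
write.csv(data.frame(id = 1:3, score = c(8, 9, 7)),
          "example.csv", row.names = FALSE)

dir()                                  # the listing should include "example.csv"
scores <- read.csv("example.csv")      # found because it is in the working directory

# A file outside the working directory can be read by giving its path
setwd(old_dir)
scores_again <- read.csv(file.path(tempdir(), "example.csv"))
```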