Hello people! Welcome to the Introduction to R course 2020 organized by the STRI community. Some of you are already familiar with R and RStudio, but for others this is probably one of the first times using them, so we will start from the beginning. Throughout the document you will find some examples; we highly recommend running them on your own computer. Thanks for joining, and we hope this will be useful for your future projects :)

General index

  1. R and RStudio
  2. Scripts
  3. Objects and functions
  4. Importing and exporting data
  5. Manipulating data

1. R and RStudio

R is a computational system which consists of two main elements: a programming language, and an environment where you can program or visualize information. If you want to read more about the history, legal aspects, and some basic definitions of R, you can take a look at the Frequently Asked Questions on the R webpage.

If this is your first time interacting with a command line or learning a computer language, R may look a bit scary and the learning curve may be steep. But do not worry. Like everything in life, getting exposed to something new multiple times and practicing will get you there! And trust us, it is rewarding! At the end of this introductory course, we hope you will be more comfortable writing basic commands, plotting some graphs, using basic stats in R, and especially, more comfortable exploring whatever you want or need to do in R.

First, let's take a look at the R interface. Try opening R on your computer. There may be some differences between operating systems and R versions, but in general it should look like Figure 1. As you can see, the interface gives little information about how to use it and it's not very intuitive. Luckily, there is a program called RStudio, which provides a nicer and more intuitive interface. From now on we are going to work exclusively in RStudio, because it has very handy extra options compared to plain R.

Figure 1. R interface.

The usual working display of the RStudio interface consists of a Menu bar located at the very top-left part of the screen and the following four windows (Figure 2):

  1. On the bottom-left corner, you will find the Console window (Figure 2, blue box). Here you can type commands and see their output.

  2. The Code editor window (Figure 2, red box) keeps a record of your work and comments. You can run lines of code from here as if they were in the console.

  3. The Environment and History window (Figure 2, purple box) is located at the top-right corner. The Environment tab shows you a list of the stored elements that you have created and the data imported during your session. On the History tab you will see all the commands you have previously run.

  4. The Files, Plots, Packages, Help window (Figure 2, yellow box) is very versatile; you can do many different things in it. On the Files tab you can create new folders, delete files and move through your computer's folders. It also gives you the option to set your working directory. The Plots tab is where you are going to visualize your plots. It has some handy options, such as letting you move through your previous plots. The Packages tab allows you to install, update and load packages. The Help tab opens a page describing the purpose and usage of R elements.

Figure 2. RStudio interface.

2. Scripts

Using the command line in the console window to write your entire code can be inconvenient, especially if your code is long, needs constant editing, or you want to add notes. It is recommended to write your code in the code editor window instead. For this task, you first need to open a new file in it. There are different types of files, but one is pretty standard and easy to use: the script. A script file looks like a plain text file and is where you are going to write, modify and save your code and notes. To open a new script, go to the menu bar in the top-left part of the screen and click on the New File icon (it’s located below the File tab, and looks like a white paper sheet with a “+” symbol). Alternatively you can press Ctrl+Shift+N on Windows and Linux, or Cmd+Shift+N on Mac.

Try writing this simple addition in your new script, like this:

1+1

Now you are ready to run a line of code. First, in your script, click on the line that you want to run. Second, press Ctrl+Enter on your keyboard if you have Windows or Linux, or Cmd+Return if you have a Mac. As an alternative you can click on the Run button, located at the top-right of the script. Now if you look at your console, you will see the output, which is the result of 1+1 … 2.

You can add comments to your script by typing the hash symbol (#) followed by the comment. When you press Run, R reads the selected line in your script. If it finds a hash, it ignores everything after it on that line, so the comment is not executed. So make sure your hash and comment come after your code if they share the same line. Try running the following arithmetic operations and comments in your script:

# This is a comment

2.5*4 # Multiplication
## [1] 10
9/3   # Division
## [1] 3
3^2   # Exponentiation
## [1] 9

To avoid losing time and effort, make sure to save your script regularly (ProTip: RStudio sometimes crashes, almighty pets may pee on laptops, zombie apocalypses may happen, etc). Save your script by clicking on the Save current icon (looks like a single floppy disk) located in the menu bar below the View tab. In case you have several scripts open, you can save them all at the same time by clicking on the Save all open icon (looks like two floppy disks) located next to Save current. Alternatively you can explore the saving options on the File tab on the menu bar.

3. Objects and Functions

R is an object-oriented language. That means R stores and interprets data and their attributes in a programming structure called an object.

The way to create a new object is to write the name of the new object and use the assignment operator (<-) to assign a value to it. So let's save data into an object! Create an object called a and store the number 5.5 in it, as follows:

a <- 5.5

Now, R recognizes a as an object, and knows the value 5.5 and other attributes are stored in a. To see the content of an object, you can write the name of the object in your script and run it. Try to do it:

a
## [1] 5.5

Try creating these other objects:

b <- 3L # The capital L next to a number forces R to store the number as an integer

c <- "remember to stay at home" # R treats text inside quotation marks as character data.

d <- TRUE # R recognizes TRUE and FALSE as logical values. R is case sensitive.
# Try writing and running d <- true and see what happens.

Now let’s use a function to see what type of data is within objects a, b, c and d. A function is a group of statements that performs a specific task. For this purpose we will use the function class(). First, write the function name in your script, and then write the object name inside the parentheses. Try the following with all the objects previously created:

class(a)
## [1] "numeric"
class(b)
## [1] "integer"
class(c)
## [1] "character"
class(d)
## [1] "logical"

Even though there are several data types in R, those four types in your output are the most common ones.

It is very important to know what type of data you have, because R will not perform some actions on an object with the incorrect data type. For example, using your previous objects a and c, try the following operations:

# Addition:

a+9 # Because a is a numeric object it will give you a numeric result as an output :)

c+9 # It gives an error because object c is type character


# Square root using the sqrt function:

sqrt(a) # No problem

sqrt(c) # Error again
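If you want to read the error message without your script stopping, one option is to wrap the failing operation in tryCatch(). This is only a small sketch: the object name txt is a stand-in for the character object c created above, and msg is just an illustrative name.

```r
# Stand-in for the character object created earlier
txt <- "remember to stay at home"

# tryCatch() runs the expression; if it fails, the error is handed to our function
msg <- tryCatch(
  txt + 9,
  error = function(e) conditionMessage(e)  # return the error text instead of stopping
)

msg               # the text of the error message
is.numeric(txt)   # FALSE: this is why the arithmetic fails
is.numeric(5.5)   # TRUE
```

Checking an object with is.numeric() before doing arithmetic is a quick way to avoid this kind of error.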

So there are a couple of very natural questions to ask at this point:

  1. How do I know what kind of data type a function needs in order to run?

Well, luckily there is a function called help(), with ? as a shortcut. Once you run this function, a help text will appear in the Help tab (bottom-right corner). In the help text you will find the description of the function (what the function does), the usage (what objects and arguments are needed, and in which order, to fill the function parentheses), and the arguments (the definition of the elements described in the usage section). You will also find some examples and references.

Try using the function help to open the help text of the functions sum, sqrt and help itself. Take a look at the different help texts and explore how they are structured.

help("sum")
?sum # As a short alternative you can use the question mark (same output)

help("sqrt")

help("help")

  2. What if I somehow have the incorrect data type for a function?

No problem. You can change between data types with the functions as.numeric(), as.integer(), as.character(), as.logical() (there are many other functions to change between data types). Now run the code below and compare the data type of both objects.

z <- "25" # Create a new object. Because of the quotation, it is stored as a character.

class(z) # Check what data type it is
## [1] "character"
z_numeric <- as.numeric(z) # Change it to numeric

class(z_numeric) # Check what type of data it is now
## [1] "numeric"
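It may also help to see what the other conversion functions do, and what happens when a conversion is not possible. The values below are made up for illustration, and bad is just an illustrative object name.

```r
as.integer(5.9)      # drops the decimal part (it does not round): 5
as.character(25)     # "25"
as.logical("TRUE")   # TRUE
as.numeric(TRUE)     # 1

# Text that does not look like a number cannot be converted:
# R returns NA (a missing value) and prints a warning
bad <- suppressWarnings(as.numeric("twenty five"))
bad                  # NA
is.na(bad)           # TRUE
```

So after converting, it is worth checking the result for NA values before using it in calculations.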

So far we know that when we create an object, R understands that the object contains data and some extra information, such as the type of data. But there is another piece of information that R knows: the data structure. R recognizes several data structures. Let’s start with the most basic one, the vector. A vector is a group of elements of the same data type. For example, all our previous objects were vectors with one element (also known as scalars). To put together two or more elements into a vector we can use the c() (combine) function. Try creating the following vector objects by manually typing different values, separated by commas, inside the function parentheses:

n_insects <-c(1,2,4,6,5,8,9)

week <- c("monday","tuesday","wednesday", "thursday", "friday")

weekend <- c("saturday","sunday")

If you look at the Environment window, you will find the new vector objects that you have created. In comparison to the previous objects (the example scalars), these new vectors display more information. The environment shows the data type of the elements of the vector, the number of elements inside, and the first observations of each vector.
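Because a vector can hold only one data type, R quietly converts the elements when you mix types inside c(). The short sketch below shows this behaviour; the object names mixed1 and mixed2 are only illustrative.

```r
mixed1 <- c(1, 2, "three")     # numbers and text together
class(mixed1)                  # "character": the numbers became "1" and "2"

mixed2 <- c(1.5, TRUE, FALSE)  # numbers and logical values together
class(mixed2)                  # "numeric": TRUE became 1 and FALSE became 0
mixed2
## [1] 1.5 1.0 0.0

length(mixed1)                 # number of elements: 3
```

If a vector that should be numeric shows up as character, a stray piece of text among the values is a likely cause.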

You can also create vectors using randomly generated data. Try using the runif() function. This function draws random values from a uniform distribution. In the following example we will request 9 randomly selected values, ranging from 1 to 10:

monkey_strength <- runif(9,1,10)
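Random functions such as runif() give different numbers on every run. If you want to get the same numbers again (for example, to share a reproducible script), you can fix the random seed with set.seed() first. In this sketch the seed 2020 and the object names are arbitrary choices.

```r
set.seed(2020)                             # any whole number works as a seed
first_try <- runif(9, min = 1, max = 10)

set.seed(2020)                             # same seed again
second_try <- runif(9, min = 1, max = 10)

identical(first_try, second_try)           # TRUE: same seed, same "random" numbers
range(first_try)                           # smallest and largest value, both between 1 and 10
```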

Well, at this point we have several vector objects created. Let’s select some elements from the vector. To explore inside the vector we use square brackets ([ ]). This method works well to make a quick subset from a vector. To use it, write the object name and add the brackets right after. Fill the brackets with the number or statement that you are looking for, for example:

n_insects <-c(1,2,4,6,5,8,9) # Create the object n_insects again :)

# n_insects has 7 values

n_insects[3]      # Find the third value of the vector
## [1] 4
n_insects[c(3,7)] # Find the 3rd AND the 7th value
## [1] 4 9
n_insects[3:7]    # Find FROM the 3rd TO the 7th value 
## [1] 4 6 5 8 9
n_insects[n_insects>3] # values greater than 3
## [1] 4 6 5 8 9
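A few more things you can put inside the square brackets are sketched below: negative numbers to leave elements out, and conditions combined with & (AND). The which() function returns the positions rather than the values.

```r
n_insects <- c(1, 2, 4, 6, 5, 8, 9)

n_insects[-1]                             # everything except the 1st value
## [1] 2 4 6 5 8 9
n_insects[n_insects > 3 & n_insects < 9]  # greater than 3 AND smaller than 9
## [1] 4 6 5 8
which(n_insects > 3)                      # positions (not values) greater than 3
## [1] 3 4 5 6 7
n_insects > 3                             # the logical vector that does the selecting
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
```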

Mini homework!

  1. Explore the seq() and rep() functions using help() and try to create some custom vectors. Try to play with the different options and arguments. For example:
sequence1 <- seq(from=0, to=20, by=2)

replicate1 <- rep(c(2,5,7), times=3) # Numbers
replicate2 <- rep(c("balboa","panama","atlas"),times=5) # Characters
  2. Use this space below to play and explore your new vector objects using these functions: length(), mean(), sd(), sum(), table(), min(), max(), head(), round().

The next data structure is the matrix. A matrix is similar to a vector but has two dimensions: it is a group of elements of the same data type arranged in rows and columns.

You can create a matrix with the matrix() function. Fill the parentheses with a vector, followed by the number of rows and the number of columns. For matrices you have to make sure the number of elements in the vector matches the number of rows multiplied by the number of columns.

Try making your own matrices as follows and look at the output:

m <- matrix(c(1,2,3,4,5,6,7,8,9,10,11,12), nrow= 3, ncol=4) # Typed manually

n<-matrix(1:12, nrow = 3, ncol= 4) # Generate numbers

twelve <- runif(12,1,12) # Creating a vector object to fill a matrix
  
p<-matrix(twelve, nrow = 3, ncol= 4) # Using values from another object
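By default matrix() fills the values column by column. If you would rather fill row by row, there is a byrow argument. The sketch below compares both, and uses dim() and length() to check that rows times columns matches the number of elements; the object names are only illustrative.

```r
by_col <- matrix(1:12, nrow = 3, ncol = 4)                # default: fill column by column
by_row <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)  # fill row by row instead

by_row
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

dim(by_row)      # number of rows and columns
## [1] 3 4
length(by_row)   # total number of elements = 3 * 4
## [1] 12
```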

If you created those matrices, you will see them in the Environment window. Do you see something new? At the very right of the window you will find some white squares next to the matrix objects. Click on any matrix object and a new window will automatically show up displaying your data nicely. Alternatively you can do this with the View() function.

Similar to vectors, you can find elements contained in the matrix using square brackets [ ]. The difference is that we now have two dimensions, so we have to fill the brackets with two pieces of information: first the desired row, and then the desired column, separated by a comma. Let’s try to find some values:

n<-matrix(1:12, nrow = 3, ncol= 4) # Create a matrix
n
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
n[1,4] # Value on the 1st row and the 4th column
## [1] 10
n[3 ,1] # Value on the 3rd row and the 1st column
## [1] 3
n[1:2,4] # Values from the 1st TO the 2nd row in the 4th column
## [1] 10 11
n[2,2:4] # Values in the 2nd row FROM the 2nd to the 4th column
## [1]  5  8 11
n[2:3,2:4] # Values FROM the 2nd TO the 3rd row in the 2nd TO the 4th column
##      [,1] [,2] [,3]
## [1,]    5    8   11
## [2,]    6    9   12
n[3,] # All the elements in the 3rd row, since column is not specified
## [1]  3  6  9 12
n[,2] # All the elements in the 2nd column, since row is not specified
## [1] 4 5 6
# Try to play with all these options, rows and columns.
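The tricks used for vectors also work on matrices. As a quick sketch, you can pick non-adjacent columns with c(), leave rows out with a negative number, or select values with a condition:

```r
n <- matrix(1:12, nrow = 3, ncol = 4)

n[, c(1, 4)]   # all rows, only the 1st and 4th columns
##      [,1] [,2]
## [1,]    1   10
## [2,]    2   11
## [3,]    3   12
n[n > 6]       # all values greater than 6 (returned as a vector)
## [1]  7  8  9 10 11 12
n[-1, ]        # all rows except the 1st
##      [,1] [,2] [,3] [,4]
## [1,]    2    5    8   11
## [2,]    3    6    9   12
```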

The next data structure is the data frame. You have probably heard about this structure before, since it is one of the most popular among people importing data from Excel. A data frame looks like a matrix, but usually each column represents a variable, and different variables may have different data types.

So, let’s create a data frame using the data.frame() function. But first we have to make some vectors to fill the data frame. Each vector is going to become a column in the data frame, so make sure the vectors have the same length. Try the following:

# Create new vectors (toy dataset)

age<-as.integer(rnorm(12,7,1))
weight_kg<- rnorm(12,20,3)
combat_power<-runif(12,100,200)
THC_content<-rep(c("high","very_high","critical"),times=4)
sample_month<-rep(c("february","november"),each=6)

# Put them together
kindergarten<-data.frame(age,weight_kg,combat_power,THC_content,sample_month)

*Disclaimer: I am using a couple of random number generator functions, runif() and rnorm(), to create the vectors (as in the previous example). These functions generate new random numbers every time you run them, so you will most probably get different numbers than I did in this example. But don’t worry, everything should run and the output should look very similar.

Now try to explore the data. Sometimes data frames have so many observations that viewing them all at once is difficult. Use the function head() to get a “preview” of the first values of each column and the column names. We can get some extra information about the data frame object using the str() function, and some descriptive statistics for each column using summary(). Try running this:
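To see why the vectors need matching lengths, here is a small sketch with two made-up vectors of lengths 3 and 2. The names x, y, msg and ok are hypothetical, and tryCatch() is used only to capture the error message instead of stopping the script.

```r
# Two hypothetical vectors with different lengths (3 and 2)
x <- c(10, 20, 30)
y <- c("a", "b")

length(x) == length(y)   # FALSE

# data.frame() stops with an error; tryCatch() lets us keep the message
msg <- tryCatch(
  data.frame(x, y),
  error = function(e) conditionMessage(e)
)
msg

# With equal lengths it works
ok <- data.frame(x, y = c("a", "b", "c"))
nrow(ok)                 # 3
```

Note that data.frame() repeats a shorter vector when the longer length is an exact multiple of it, so checking length() yourself is the safest habit.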

head(kindergarten) # Exploration of the first values of the data frame
##   age weight_kg combat_power THC_content sample_month
## 1   9  17.83230     180.2525        high     february
## 2   7  17.48679     137.2351   very_high     february
## 3   7  23.05719     119.0767    critical     february
## 4   5  20.77948     152.4622        high     february
## 5   7  19.92055     154.8681   very_high     february
## 6   7  23.09386     187.0476    critical     february
str(kindergarten)  # Structure of the data frame
## 'data.frame':    12 obs. of  5 variables:
##  $ age         : int  9 7 7 5 7 7 6 7 6 7 ...
##  $ weight_kg   : num  17.8 17.5 23.1 20.8 19.9 ...
##  $ combat_power: num  180 137 119 152 155 ...
##  $ THC_content : chr  "high" "very_high" "critical" "high" ...
##  $ sample_month: chr  "february" "february" "february" "february" ...
summary(kindergarten) # Some descriptive stats for each column
##       age          weight_kg      combat_power   THC_content       
##  Min.   :5.000   Min.   :12.28   Min.   :119.1   Length:12         
##  1st Qu.:6.000   1st Qu.:19.40   1st Qu.:137.1   Class :character  
##  Median :7.000   Median :21.45   Median :151.3   Mode  :character  
##  Mean   :6.667   Mean   :21.27   Mean   :152.8                     
##  3rd Qu.:7.000   3rd Qu.:23.30   3rd Qu.:169.1                     
##  Max.   :9.000   Max.   :27.57   Max.   :187.0                     
##  sample_month      
##  Length:12         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

The output of str(kindergarten) gives a lot of information about the data frame. It starts at the top-left by stating that this is a data.frame object with 12 observations (rows) and 5 variables (columns). Below that, it shows the name and the data type of each column. In this output the text columns are stored as character (chr). In R versions before 4.0.0, data.frame() converted text columns by default into another data type called Factor, which is a categorical variable with levels that correspond to each of the categories.

You can select individual columns using the dollar symbol ($). To use it, write the name of the data frame object, add the dollar symbol, and write the name of the column. For example:
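To get a feeling for what a Factor is, you can convert a character vector yourself with as.factor(). This is a small sketch with made-up values; thc and thc_factor are illustrative names.

```r
thc <- c("high", "very_high", "critical", "high", "very_high", "critical")
class(thc)            # "character"

thc_factor <- as.factor(thc)
class(thc_factor)     # "factor"
levels(thc_factor)    # the categories: "critical" "high" "very_high"
nlevels(thc_factor)   # 3
table(thc_factor)     # how many observations fall in each category
```

By default the levels are sorted alphabetically, not in the order in which they first appear in the data.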

kindergarten$age # Selecting the column called "age" from "kindergarten"
##  [1] 9 7 7 5 7 7 6 7 6 7 6 6

Once you select an individual column, it behaves pretty much like a vector. You can use functions on the column, or you can create a new vector with the information from the column. Try this:

mean(kindergarten$age)
## [1] 6.666667
age_new=kindergarten$age # New vector object with the column data

age_sqrt=sqrt(kindergarten$age) # New vector object with sqrt values of age

As with matrices, you can find specific values using square brackets. Try using the matrix examples on the kindergarten data frame. Also play with this conditional syntax (remember, inside the brackets you have to specify row and column):

#Find the rows with age>6 using all the columns
kindergarten[kindergarten$age>6,] 
##    age weight_kg combat_power THC_content sample_month
## 1    9  17.83230     180.2525        high     february
## 2    7  17.48679     137.2351   very_high     february
## 3    7  23.05719     119.0767    critical     february
## 5    7  19.92055     154.8681   very_high     february
## 6    7  23.09386     187.0476    critical     february
## 8    7  21.44036     136.8767   very_high     november
## 10   7  21.46826     140.0155        high     november
#Find all the rows with the THC_content equals to very_high using all the columns
kindergarten[kindergarten$THC_content=="very_high",] 
##    age weight_kg combat_power THC_content sample_month
## 2    7  17.48679     137.2351   very_high     february
## 5    7  19.92055     154.8681   very_high     february
## 8    7  21.44036     136.8767   very_high     november
## 11   6  27.56750     128.6563   very_high     november
#Find all the rows with sample_month equals to february only in column 2
kindergarten[kindergarten$sample_month=="february",2]
## [1] 17.83230 17.48679 23.05719 20.77948 19.92055 23.09386
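Conditions can also be combined. The sketch below uses a small made-up data frame called pets (fixed values, so your output should match) to show & (AND), | (OR), and how to count the rows that meet a condition.

```r
# Small hypothetical data frame with fixed values
pets <- data.frame(
  age    = c(9, 7, 5, 7, 6, 8),
  weight = c(17.8, 17.5, 20.8, 19.9, 27.6, 21.4),
  month  = c("february", "february", "february", "november", "november", "november")
)

# AND (&): both conditions must be TRUE
pets[pets$age > 6 & pets$month == "february", ]
##   age weight    month
## 1   9   17.8 february
## 2   7   17.5 february

# OR (|): at least one condition must be TRUE
pets[pets$age > 8 | pets$weight > 25, ]
##   age weight    month
## 1   9   17.8 february
## 5   6   27.6 november

# Count how many rows match a condition
sum(pets$age > 6)
## [1] 4
```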

4. Importing and exporting data

One of the most convenient things to do before importing or exporting data in R is to set a working directory. This will be the default path on your computer that R uses to find the files you import and to save the files you generate.

In your script you can select the working directory using the function setwd(). Fill the parentheses of the function with the path of the folder that you want to establish as the working directory and run that line. Just as an EXAMPLE, mine looks like this:

setwd("C:/Users/Ernesto/Desktop/R_course_2020/Tutorial1")

YOURS WILL BE A DIFFERENT PATH, depending on where on your computer you want to establish your working directory. Just make sure to put the path in quotation marks and to use forward slashes (/), not backslashes (\). If you keep the code that sets the working directory at the beginning of your script, it will be very easy to run it every time you use the script.
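You can always check which working directory R is currently using with getwd(). In the sketch below, tempdir() (a temporary folder that R creates for every session) stands in for your own folder, so the example runs on any computer; old_wd is just an illustrative name.

```r
old_wd <- getwd()   # remember where R is currently working
old_wd

setwd(tempdir())    # tempdir() stands in for your own folder path
getwd()             # now points to the temporary folder
list.files()        # files R can see in the current working directory

setwd(old_wd)       # go back to the original working directory
```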

Alternatively you can go to the menu bar, click on Session then click Set Working Directory and then click on Choose Directory. This will open a window that will allow you to choose a working directory folder in a more intuitive way.

To open data in R, first make sure the files you want to open are in your working directory.

There are several functions to open or read files. The function you have to choose depends on the format of your file. Two very common formats for importing data into R are .csv (comma-separated values) and .txt (text file), so let’s use them as examples.

If your file is in .csv format, you can use the function read.csv(). Fill the parentheses with the name of the file. Remember to write the file name in quotation marks and add the extension .csv at the end. Try this:

plant_data1<- read.csv("iris_example.csv")

If your file is in .txt format, you can use the function read.table(). Fill the parentheses with the name of the file. Remember to write the file name in quotation marks and add the extension .txt at the end. Try this:

plant_data2<-read.table("iris_example.txt")

I recommend taking a look at read.csv() and read.table() using the help function. There are a couple of arguments that I think are worth seeing, for example header and sep.

After you read both files, it is a good idea to check that there is nothing weird. As we did with the data frames in the previous chapter, try to use the functions head() or View(), and str().

If everything is fine with the data, you can save it. To save it as .csv you can use the function write.csv(), and to save it as .txt you can use the function write.table(). Fill the parentheses with the name of the data frame object that you want to save, and then add in quotation marks the name of the final file with the respective file extension. Check your working directory; the files should be there.
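As an illustration of why the arguments matter, the sketch below writes a tiny made-up text file to a temporary location (so it does not depend on iris_example.txt) and reads it with and without header = TRUE. The file content and object names are hypothetical.

```r
# Write a tiny hypothetical text file so the example is self-contained
tmp_file <- tempfile(fileext = ".txt")
writeLines(c("species length", "setosa 5.1", "virginica 6.3"), tmp_file)

# Without header = TRUE, the first line of this file is treated as data
no_header <- read.table(tmp_file)
names(no_header)           # "V1" "V2"
nrow(no_header)            # 3

# With header = TRUE, the first line becomes the column names
with_header <- read.table(tmp_file, header = TRUE)
names(with_header)         # "species" "length"
nrow(with_header)          # 2
class(with_header$length)  # "numeric"
```

If your column names show up as V1, V2, ... after importing, a missing header = TRUE is a likely reason.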

# csv format
write.csv(plant_data1, "iris_example_plant_data.csv")

# txt format
write.table(plant_data2, "iris_example_plant_data.txt")
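One detail worth knowing: by default write.csv() also writes the row names as an extra first column. If you do not want that, you can use the row.names argument. The sketch below uses a made-up data frame and temporary files that stand in for real ones.

```r
toy <- data.frame(species = c("setosa", "virginica"), length = c(5.1, 6.3))

with_names    <- tempfile(fileext = ".csv")  # temporary files stand in for real ones
without_names <- tempfile(fileext = ".csv")

write.csv(toy, with_names)                        # default: row names are written too
write.csv(toy, without_names, row.names = FALSE)  # skip the row names

names(read.csv(with_names))      # "X" "species" "length" (extra column of row names)
names(read.csv(without_names))   # "species" "length"
```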

5. Manipulating Data

There are many functions to manipulate data in R. Several of them are not available as part of the default R functions, and the packages that contain them need to be downloaded. Since I am not covering R packages in this tutorial, I will show you how to use only default R functions.

First, open the file toy_dataset.csv and look at it.

# Open the dataset using read.csv(). Use help() and look at the "sep" argument 
xdata<-read.csv("toy_dataset.csv",sep=";")

# Look at the object structure
str(xdata)
## 'data.frame':    12 obs. of  5 variables:
##  $ edad        : int  6 5 8 7 7 5 5 6 6 5 ...
##  $ weight_kg   : num  22.4 17.7 18.8 18.9 29.1 ...
##  $ combat_power: num  132 186 186 170 130 ...
##  $ THC_content : chr  "high" "very_high" "critical" "high" ...
##  $ food_type   : chr  "Fritanga" "Fritanga" "salad" "SAlad" ...
# Check the data
View(xdata)

If you look closely, you will find a couple of problems here:

  1. The variable food_type should have 2 categories, fritanga and salad. However, if you look at the distinct values of the column, you will see that food_type has 5 different values (use the function unique() on xdata$food_type to see all of them; levels() only works if the column is stored as a factor). It seems that someone mixed lower and upper case while typing the data. We need to fix that; changing all the values to lower case will work.

  2. The column name of the first variable seems to be in another language (Spanish). Changing the name to English will make the data frame more homogeneous, and it will improve the workflow (can you imagine each of the column names in a different language?).

Try fixing the data frame following these steps:

# 1. fix the column food_type
xdata$food_type<-tolower(xdata$food_type) # tolower() changes to lowercase 
xdata$food_type<-as.factor(xdata$food_type) # Make sure it is a factor afterwards

# Look at the structure again
str(xdata) # Seems to work :)
## 'data.frame':    12 obs. of  5 variables:
##  $ edad        : int  6 5 8 7 7 5 5 6 6 5 ...
##  $ weight_kg   : num  22.4 17.7 18.8 18.9 29.1 ...
##  $ combat_power: num  132 186 186 170 130 ...
##  $ THC_content : chr  "high" "very_high" "critical" "high" ...
##  $ food_type   : Factor w/ 2 levels "fritanga","salad": 1 1 2 2 1 1 2 2 1 1 ...
# 2. Change the discordant column name

colnames(xdata) # gives you a vector with the column names of your data object
## [1] "edad"         "weight_kg"    "combat_power" "THC_content"  "food_type"
colnames(xdata)[1] # gives you the 1st column name of your data object
## [1] "edad"
colnames(xdata)[1]<-"age" # assigns a new name to the selected column

# Look at the data
head(xdata) # Also seems to work
##   age weight_kg combat_power THC_content food_type
## 1   6  22.39628     131.7923        high  fritanga
## 2   5  17.69336     186.0856   very_high  fritanga
## 3   8  18.84001     185.8586    critical     salad
## 4   7  18.89893     170.1426        high     salad
## 5   7  29.06381     129.7942   very_high  fritanga
## 6   5  18.80187     196.4630    critical  fritanga

Super! Now our data frame looks better.
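If you do not have toy_dataset.csv at hand, the same two fixes can be tried on made-up values. In this sketch food and toy are hypothetical objects; the last part renames a column by its current name instead of its position, which keeps working even if the column order changes.

```r
# Hypothetical values mimicking the typing problem
food <- c("Fritanga", "fritanga", "salad", "SAlad", "Salad", "salad")

unique(food)           # 5 different spellings
length(unique(food))   # 5

food_clean <- tolower(food)
unique(food_clean)     # "fritanga" "salad"
table(food_clean)      # counts per category

# Renaming a column by its current name instead of its position
toy <- data.frame(edad = c(6, 5, 8), weight_kg = c(22.4, 17.7, 18.8))
colnames(toy)[colnames(toy) == "edad"] <- "age"
colnames(toy)          # "age" "weight_kg"
```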

Imagine that we need to create a new column containing the ratio between the columns combat_power and weight_kg (that is, ratio = values of combat_power / values of weight_kg). We also need a column containing the square root of each value in the combat_power column. Let's create those columns as follows:

# This takes the values of both columns row by row and performs the division
xdata$ratio<-xdata$combat_power/xdata$weight_kg

# Square root for each value in the column combat_power
xdata$sqrt_combat_power<-sqrt(xdata$combat_power)

# Look at the two new columns
head(xdata)
##   age weight_kg combat_power THC_content food_type     ratio sqrt_combat_power
## 1   6  22.39628     131.7923        high  fritanga  5.884563          11.48008
## 2   5  17.69336     186.0856   very_high  fritanga 10.517252          13.64132
## 3   8  18.84001     185.8586    critical     salad  9.865098          13.63300
## 4   7  18.89893     170.1426        high     salad  9.002763          13.04387
## 5   7  29.06381     129.7942   very_high  fritanga  4.465835          11.39273
## 6   5  18.80187     196.4630    critical  fritanga 10.449121          14.01653

There are many ways in R to add columns or rows. Use help() to take a look at the functions cbind() and rbind(). Try using those functions to modify a data frame.

Now our data frame is getting bigger. Sometimes we do not need all the columns from our data set. Imagine you only need a few of them. The function subset() helps you extract parts of your data frame. For example, imagine that we want a data frame with only the columns age, THC_content, and food_type. In the function parentheses, first write your data frame object, followed by the subsetting arguments (use help() to see the different possible arguments for this function, and a couple of handy examples). Try the following:
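As a starting point for cbind() and rbind(), here is a minimal sketch on a made-up data frame (toy, new_rows and the values are hypothetical): cbind() binds columns side by side, and rbind() stacks rows.

```r
toy <- data.frame(age = c(6, 5, 8), weight_kg = c(22.4, 17.7, 18.8))

# cbind() adds columns: the new vector needs one value per row
toy_wide <- cbind(toy, food_type = c("fritanga", "fritanga", "salad"))
dim(toy_wide)   # 3 rows, 3 columns

# rbind() adds rows: the new rows need the same column names
new_rows <- data.frame(age = c(7, 5), weight_kg = c(18.9, 29.1))
toy_long <- rbind(toy, new_rows)
dim(toy_long)   # 5 rows, 2 columns
```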

sub_xdata<-subset(xdata, select=c("age","THC_content","food_type")) 
sub_xdata
##    age THC_content food_type
## 1    6        high  fritanga
## 2    5   very_high  fritanga
## 3    8    critical     salad
## 4    7        high     salad
## 5    7   very_high  fritanga
## 6    5    critical  fritanga
## 7    5        high     salad
## 8    6   very_high     salad
## 9    6    critical  fritanga
## 10   5        high  fritanga
## 11   6   very_high     salad
## 12   6    critical     salad

Try other options. For example, select only the part of the data frame where the values in the column food_type are salad, or only the individuals with age greater than 6:

salad_xdata<-subset(xdata,food_type=="salad") 
salad_xdata
##    age weight_kg combat_power THC_content food_type    ratio sqrt_combat_power
## 3    8  18.84001     185.8586    critical     salad 9.865098          13.63300
## 4    7  18.89893     170.1426        high     salad 9.002763          13.04387
## 7    5  22.17809     123.8727        high     salad 5.585364          11.12981
## 8    6  15.92443     128.9611   very_high     salad 8.098319          11.35610
## 11   6  18.74021     178.8402   very_high     salad 9.543124          13.37311
## 12   6  18.09651     133.1754    critical     salad 7.359172          11.54016
older6_xdata<-subset(xdata,age>6) 
older6_xdata
##   age weight_kg combat_power THC_content food_type    ratio sqrt_combat_power
## 3   8  18.84001     185.8586    critical     salad 9.865098          13.63300
## 4   7  18.89893     170.1426        high     salad 9.002763          13.04387
## 5   7  29.06381     129.7942   very_high  fritanga 4.465835          11.39273
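subset() can also filter rows and choose columns in a single call. The sketch below uses a small made-up data frame (toy) with fixed values, and shows the equivalent selection with square brackets for comparison.

```r
toy <- data.frame(
  age         = c(6, 5, 8, 7, 7, 5),
  THC_content = c("high", "very_high", "critical", "high", "very_high", "critical"),
  food_type   = c("fritanga", "fritanga", "salad", "salad", "fritanga", "fritanga")
)

# Rows AND columns at once: salad eaters older than 6, keeping two columns
by_subset <- subset(toy, food_type == "salad" & age > 6, select = c(age, food_type))
by_subset
##   age food_type
## 3   8     salad
## 4   7     salad

# The same selection written with square brackets
by_brackets <- toy[toy$food_type == "salad" & toy$age > 6, c("age", "food_type")]
by_brackets
```

Both approaches give the same rows here; subset() is simply shorter to type because you do not need to repeat the data frame name.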