Please leave a new GitHub issue if you have questions, problems or suggestions

1 Introduction

1.1 Before we begin

Do you have:

  • R and RStudio downloaded from Software Center?
  • the folder I sent this morning containing the data?

Good.

Hello.

  • Who am I?
  • Who are you?
  • Why are we here?
  • What can you do with R?
  • What do you want to be able to do?

(Note: these are not existential questions.)

1.2 Who is this guide for?

Follow this guide if:

  • you’ve never heard of R
  • you’ve used SQL but not R
  • you’ve never coded before
  • you want to use R but don’t consider yourself a ‘programmer’

Dip into this guide if you:

  • have used R but never made use of something called RStudio
  • want to learn more about RStudio and how it works
  • have used R but never made use of something called the ‘tidyverse’
  • want to learn the best way to set up an R project to improve reproducibility

1.3 Disclaimers:

  • this guide is a basic introduction and is in no way exhaustive
  • there’s usually more than one way to do things in R – I’ve kept things simple here
  • there are probably errors and spelling mistakes, etc

This document was originally written with a very specific public sector audience in mind and may contain references not relevant to you. See the Further Reading section at the bottom of this document if you want to find some other resources.

2 What’s the problem?

2.1 Workflow

A typical analytical workflow in our department might involve SQL, Excel and Word. Typical steps might be:

  1. Query a database with SQL code using SQL Server Management Studio
  2. QA this code
  3. Copy and paste the output into Excel
  4. Process the data in Excel
  5. Produce outputs (tables, plots, etc) manually in Excel
  6. QA your Excel file(s)
  7. Copy and paste outputs into a Word document
  8. QA the Word document
  9. You notice an error
  10. Debug somehow (go back to step 1?)

There are three main reasons why this isn’t ideal. It’s:

  • got a high chance of producing errors
  • difficult to reproduce your work (what order were the steps in your workflow?)
  • time consuming (many steps, lots of wasted time)

So, let’s discuss what we mean by ‘errors’. This is mostly a problem with spreadsheets and with moving data in and out of them.

In terms of reproducibility, you don’t have a record of the order in which you did things, so it’s not easy to backtrack on mistakes. A lot of documentation and commenting is required within and across multiple files to ensure that the workflow can be replicated, and in practice that documentation often doesn’t exist. If you write reproducible code, it may also be easier to automate it. This in turn can help free up time for other, perhaps less trivial, tasks. For example, the Reproducible Analytical Pipeline (RAP) approach helps reduce error and speed up the process of producing official statistics.

Obviously the process takes time because you have to copy-paste values from place to place and perform quality assurance across all the files in your workflow. But there’s also the time needed to remember how you did the analysis when you’re asked to make changes long after you’ve forgotten how the process works.

2.2 The bottom line

Our analytical work has a direct impact on policy decisions and therefore it affects young people, parents, learners, schools, teachers and many others.

Above all, humans cannot be trusted. Let’s minimise the chance of errors, speed things up and make it easy on our future selves by minimising the chance of doing it wrong in the first place. This means breaking away from spreadsheet addiction.

2.3 R is the answer

What might an optimal analytical workflow look like in R?

  1. Run your code

This is simple. R is end-to-end: you can get data in at one end from files or a database and pump it out the other in a report or app, while also having automated testing built in. All from the same script. You also have the opportunity to more easily version your work using tools such as Git and GitHub.

2.4 But what is R?

R is just another tool for data analysis, in the same way that Excel and SQL are tools for data analysis.

Put simply, R lets you read, wrangle and analyse data and create outputs such as graphics, documents and interactive apps. R is a coding language, which means you use it to write instructions for the computer to perform. This allows for fine control of what you want to do.

You can think of R as a place where data is abstracted away and the instructions are brought to the forefront, whereas spreadsheets are where data is at the forefront and the instructions are abstracted away (I heard this somewhere but can’t remember the source; let me know).

RStudio is simply a very useful interface for R that provides a whole bunch of useful bells and whistles.

What’s great about R? It’s:

  • free
  • available on our work laptops via Software Center
  • open-source and cross-platform (you can download it for Windows, Mac and Linux machines)
  • established and has many high-quality extensions available (‘packages’)
  • got a big and active community, both in the department (e.g. Coffee & Coding) and online (e.g. the RStudio Community)
  • got a lot of in-built help files
  • got a wealth of articles and help online (e.g. the R bloggers feed and via StackOverflow)
  • got excellent statistical and graphics capabilities in particular
  • supported by the suite of RStudio tools, which make documentation, teaching and dissemination much easier

I could go on.

2.5 Should I stop using all other tools?

R is not always the answer. I’m not telling you that we must do things in any particular way. For example, you have an urgent request for the minister due in five minutes and you don’t have the experience to do it in R. Excel may be good enough. That’s absolutely fine. The argument here is that we should move towards a more reproducible model, so that when the minister comes back wanting to tweak your calculation you can be confident that you can remember what you did and how you did it.

3 Project working

Let’s assume you’re starting a new piece of work. Your life will be much easier if you manage the structure of your project from the start, rather than creating a horrible file dump of various data sets, code and documentation that you have no chance of untangling in a few months’ time.

3.1 RStudio Projects

We’re going to start by creating an ‘RStudio Project’ (capital ‘P’).

Why do this? Well, it makes your work more:

  • organised – all the code, data, outputs, etc, are stored in one place (a single project folder)
  • reproducible – your code can be re-run from scratch to produce the same outputs every time
  • transferable – you can pass the entire project folder to someone else and they’ll be able to run it on their own machine; the filepaths you specify in your code assume the home folder is the project folder, so you can write something like data/dataset.csv rather than file/path/on/my/personal/machine/that/you/cannot/access.csv

To set up an RStudio Project:

  1. Open RStudio (the icon is a white R inside a blue circle; see top of this document)
  2. File > New Project…
  3. New Directory > New Project
  4. Give your project a meaningful name in the ‘Directory Name’ box
  5. Browse for the filepath where your R Project folder will be placed
  6. Click ‘Create Project’ and RStudio will open your project (note the project name in the top right)

This process creates a directory – a folder on your machine or shared drive that you choose – containing an RStudio Project file with the extension (suffix) ‘.Rproj’. The directory is the ‘home’ of your project and will house all the files and code that you need. Opening the .Rproj file will open your RStudio Project as you last left it with the scripts you were working on.

To access your R Project in future, navigate to the project folder and double-click your R Project file, which has the .Rproj extension (e.g. your-project.Rproj).

3.2 Directory layout

So, your project directory contains an RStudio Project (.Rproj) file, but let’s now fill it with some basic folders that we’ll need to complete our project. This helps keep things organised and can help prevent mishaps like accidentally deleting raw data.

This way of organising projects is adapted from something like ‘Designing projects’ by Rich Fitzjohn at Macquarie University.

The basic arrangement of files and folders would be something like:

  • data for raw, untouched, read-only data sets
  • figs for any graphics you produce (could also be maps or something else)
  • output for data files processed from the raw data
  • separate script files (with extension .R) to be executed in the labelled order (more on this in the next section)
  • the .Rproj file
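
If you’d rather set up these folders with code than by clicking around, here’s a minimal sketch in base R. It assumes your working directory is the project folder (which it will be if you opened the .Rproj file); the folder names are the ones listed above.

```r
# Folders to create inside the current working directory
# (assumed here to be your RStudio Project folder)
folders <- c("data", "figs", "output")

for (folder in folders) {
  # showWarnings = FALSE keeps things quiet if the folder already exists
  dir.create(folder, showWarnings = FALSE)
}

# Check what's in the project folder now
list.files()
```

Running this a second time is harmless: folders that already exist are left alone.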

4 The RStudio interface

Don’t be alarmed by the RStudio interface. There are lots of buttons and tabs, but we’ll be restricting ourselves to a relatively small subset of these to begin with.

4.1 Layout

RStudio is split into three panes when you open it for the first time, each of which has a few tabs. We care about a few of these tabs right now:

Left pane:

  • the console tab where outputs are displayed (you can also directly type code into the console, but your code won’t be saved)

Upper-right pane:

  • the Environment tab that fills when you create saved objects
  • the History tab for seeing and rerunning any previous commands

Lower-right pane:

  • the Files tab from which you can open files (it defaults to your home folder where the .Rproj is stored for this project)
  • the Plots tab for viewing plot outputs that you’ve created
  • the Help tab for searching for help with R packages and functions

4.2 Start a script

Open a new file with File > New File > R script, or in the top left of RStudio click the button with a ‘+’ in a green circle on a white square, then click ‘R Script’.

A new pane will appear with a new scripting tab. It’s blank. You type the code into this space and run it. The inputs and results are displayed in the console below once the script has been executed. This is not too dissimilar to what you get in SQL Server Management Studio, for example.

You can have more than one scripting tab open at once. Usually you would have one script per process. For example, one for reading and manipulating data (e.g. 01_read-data), one for modelling (e.g. 02_model) and one for plotting (e.g. 03_plot), i.e. sensible names with a number that indicates the order to execute the code. This will improve reproducibility.

Start your script with some useful information. Anything prefixed with a hash (#) will be recognised as a comment and won’t be executed as code. For example:
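
The example header isn’t shown here, so this is only a sketch of the kind of thing you might write. Every detail in it (title, author, date) is a placeholder for you to replace.

```r
# Title:   Pokemon data exploration (placeholder)
# Purpose: read, inspect and summarise the Pokemon dataset
# Author:  Your Name (placeholder)
# Date:    YYYY-MM-DD (placeholder)

# 1. Read the data --------------------------------------------------------
```

Nothing happens when you run these lines, because R skips everything after a #.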

You can copy-paste or type the code from this document into your R script as we go along. Remember to add comments with # to say what you’re doing and to break your script up into sections.

4.3 Execute code

Type 1 + 1 into your scripting window (upper left pane). To ‘run’ the code, make sure your cursor is on the line containing the code and use the keyboard shortcut ‘Control + Enter’ to execute it (alternatively, click the ‘Run’ button in the top right of the scripting window). This will only run the line your cursor is on, or whatever code you’ve highlighted; it won’t continue running the whole script.

## [1] 2

You should have got the answer 2. The number in square brackets is the position (index) of the first item shown on that line of output; only one item is returned here, so it’s [1].


CHALLENGE!

Save your script with a sensible name.

Hint: File > Save, or Control + S. You’ll be prompted to save the file in your home folder (the one containing your R Project file).


This is good, but ideally we want to store values to help simplify our code. We do this by making ‘objects’. An object can be a single number, a list of strings, a table of data, a plot, or many other things. You create an object by assigning a name to your values. You do this with the ‘assignment arrow’, <-, which is basically akin to “into an object named the thing on the left, save the thing on the right”.

For example, we can assign 1 + 1 to the object name my_num with <-. Execute the following code:
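
A one-line sketch of that assignment:

```r
# Save the result of 1 + 1 into an object named my_num
my_num <- 1 + 1
```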

Hm. Nothing printed out in the console. Instead the object is now in your environment – see the top right pane in RStudio. You are now free to refer to this object by name in your script. For example, you can now print the contents of this object to the console with the line print(my_num) or explore it with the environment pane.

## [1] 2

Storing one value is fine. But objects can be used to store more than that. This next chunk of code creates a ‘vector’, where several values in the brackets have been combined together with the c() command. In this example I’ve created some character strings, each bound within a pair of quotation marks (""). Numbers don’t need to be in quotation marks (unless they’ve been stored as text).
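
The chunk itself isn’t shown here; a sketch that would give the output below, using the object name my_vector that this guide refers to later:

```r
# Combine three character strings into one vector with c()
my_vector <- c("Pichu", "Pikachu", "Raichu")

# Print the contents of the vector to the console
print(my_vector)
```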

## [1] "Pichu"   "Pikachu" "Raichu"

You can see what ‘class’ your vector is at any time with the class() function.
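
For instance, a sketch that checks the class of my_num (recreated here so the example runs on its own):

```r
# Recreate the object from earlier, then ask R what class it is
my_num <- 1 + 1
class(my_num)
```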

## [1] "numeric"

The vector my_num is composed of numbers only and so is ‘numeric’, but my_vector is composed entirely of character strings:

## [1] "character"

So we’ve created objects composed of both single values and vectors. You can think of these as being zero-dimensional and one-dimensional. The next step would be two dimensions: a table. Tables of data with rows and columns are called ‘data frames’ in R and are effectively a bunch of vectors of the same length stuck together. Consider this:
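
The code for this isn’t shown here, so here’s a sketch that would build the table printed below. The object name my_df is a placeholder of mine; the values are taken from the printed table.

```r
# Three vectors of the same length, bound together as named columns
my_df <- data.frame(
  species = c("Pichu", "Pikachu", "Raichu"),
  number = c(172, 25, 26),
  location = c("Johto", "Kanto", "Kanto")
)

# Print the data frame to the console
print(my_df)
```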

##   species number location
## 1   Pichu    172    Johto
## 2 Pikachu     25    Kanto
## 3  Raichu     26    Kanto

Can you see how this is three vectors (species, number and location) of the same length (3 values) arranged into columns? The function data.frame() binds these vectors together into (surprise) a data frame.

## [1] "data.frame"

Aha!


CHALLENGE!

Create a sensibly-named data frame object with three sensibly-named columns:

  • one for animals
  • one for a cuteness score
  • one for a ferocity score

Now print it.


4.4 Functions

You’ve been using functions already: print(), class(), data.frame() and c().

Theory: a function is a reproducible unit of code that performs a given task, such as reading a data file or fitting a model. Functions prevent you from copy-pasting your code multiple times, which could lead to errors and makes for unwieldy, unreadable code. If you can help it, Don’t Repeat Yourself.

Functions are written as the function name followed by brackets. The brackets contain the arguments – the items you need to provide to the function for it to work. One argument might be a filepath to some data, another might describe the colour of points to be plotted. They’re separated by commas.

So a generic function might look like this:
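
The original example isn’t shown here. In this sketch, example_function(), its arguments and my_data are all made up, and the function is defined first only so that the call underneath can actually run.

```r
# A made-up function, defined only so the call below works
example_function <- function(data, colour, size) {
  paste("Plotting", length(data), "points in", colour, "at size", size)
}

# Some made-up data to pass in
my_data <- c(1, 2, 3)

# The general pattern: function name, then brackets containing
# the arguments, separated by commas
example_function(
  data = my_data,
  colour = "red",
  size = 2
)
```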

Note that you can break the function over several lines. You can put your cursor on any of these lines and run it. You don’t have to highlight the whole thing.

You can type a question mark followed by a function name to learn about its arguments in a help file that will appear in the bottom right pane. For example, ?plot. Try it, but don’t worry about the content for now.

Aside: you don’t necessarily need to write the argument name and an equals sign. For example, if the first argument expected by example_function() is data (you can find out by running ?example_function) you can write example_function(my_data) instead of example_function(data = my_data). It’s good practice to write the argument names though: it’ll help you and others to understand your code and avoid confusion. For example, specifying the arguments x = vector_x and y = vector_y in a plot function might make it clearer which axis is which when checking your code.


CHALLENGE!

It’s good practice to reset R every so often.

Why might we do this?

Hit the keyboard shortcut Control + Shift + F10 to restart R in RStudio.


4.5 Packages

Functions can be bundled into packages. A bunch of packages are pre-installed with R, but there are thousands more available for download. These packages extend the basic capabilities of R.

Packages can be installed to your computer using the install.packages() function. This automatically fetches and downloads packages from the Comprehensive R Archive Network (CRAN).

Here are three packages that we’re going to use in this session:
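
The installation code isn’t shown here, and only two of the three packages (readr and dplyr) are named later in this section, so this sketch covers just those two. It also skips anything you’ve already installed, which is my own addition rather than something this guide prescribes.

```r
# Only readr and dplyr are named in this guide;
# add the third package to this vector once you know it
packages <- c("readr", "dplyr")

# Work out which of these aren't on your computer yet
missing_packages <- packages[!packages %in% rownames(installed.packages())]

# Fetch the missing ones from CRAN (needs an internet connection)
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}
```

The simplest form is just install.packages("readr"), with the package name in quotation marks.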

You only need to run the installation function once for each package. The package is installed to your computer once you’ve done this and you only need to ‘remind’ RStudio where to find the package using the library() function in future.

Now that we have the readr and dplyr packages installed, we can load them with the library() function so we can use them.
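
A sketch of those two calls (note that the package name doesn’t need quotation marks inside library()):

```r
library(readr)  # functions for reading data files such as CSVs
library(dplyr)  # functions for manipulating data frames
```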

## Warning: package 'readr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Sometimes a package will print a message in the console after loading. This is usually fine and only a problem in very specific circumstances. For example, you might be told that the package was developed using a newer version of R, or perhaps that a function from that package ‘conflicts’ with another already-installed function (usually because the two functions have the same name).

Okay, let’s get hold of some data!

5 Get data in and look at it

5.1 Read the data

Let’s use a dataset that I collected myself. It contains information about organisms I collected from exotic locations spanning the globe from Napoli to Hastings. It’s a file containing a data set of about 700 Pokemon – captured on the Pokemon Go app – with their characteristics data. It’s the very best dataset, like no dataset ever was.

If I haven’t given you the dataset already as a Comma-Separated Values (CSV) file, you can download it from the internet via GitHub. Save it to a folder named ‘data’ in your R Project folder.

And then you can read it in as follows:
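
The read-in code isn’t shown here, so this is a sketch using read_csv() from readr. The filename pokemon_example.csv is made up, and the first part of the block writes a tiny three-row stand-in file (values copied from the output printed in this guide) purely so the example runs on its own. With the real file in your data folder you’d only need the final line, with the path swapped for your own.

```r
library(readr)

# Stand-in file so this sketch runs by itself: three rows copied
# from the output in this guide, saved under a made-up filename
dir.create("data", showWarnings = FALSE)
writeLines(
  c(
    "species,combat_power,hit_points,weight_kg,weight_bin,height_m,height_bin,fast_attack,charge_attack",
    "krabby,51,15,5.82,normal,0.36,normal,mud_shot,vice_grip",
    "geodude,85,23,20.88,normal,0.37,normal,rock_throw,rock_tomb",
    "venonat,129,38,20.40,extra_small,0.92,normal,confusion,poison_fang"
  ),
  "data/pokemon_example.csv"
)

# Read the CSV into a data frame object called pokemon
pokemon <- read_csv(file = "data/pokemon_example.csv")
```

The path is relative to the project folder, which is why data/... is enough.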

## Parsed with column specification:
## cols(
##   species = col_character(),
##   combat_power = col_integer(),
##   hit_points = col_integer(),
##   weight_kg = col_double(),
##   weight_bin = col_character(),
##   height_m = col_double(),
##   height_bin = col_character(),
##   fast_attack = col_character(),
##   charge_attack = col_character()
## )

This function loads the data from the CSV file at the filepath provided (it’s in our ‘data’ folder). It prints a note to the console to tell you the columns that have been read in and also what the data type of each one is. For example, combat_power = col_integer() tells us that the data in this column has been read as integers.

But where is this data? How do we know it’s actually been read in?

If you look at the ‘Environment’ tab in the top-right pane of RStudio, you’ll see our object ‘pokemon’ is there. Helpfully, we’re told it has dimensions of 696 rows and 9 columns.

5.2 Data inspection

The first thing we should do is look at the data to check for anomalies. There are a number of ways to do this.

You can take a look at information about your data frame using:
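
The output below looks like it comes from glimpse(), so here’s a sketch of that call. The small pokemon object is a stand-in (three rows and four columns copied from the output in this guide) so that the example runs without the real file.

```r
library(dplyr)

# Stand-in for the full pokemon data frame
pokemon <- tibble(
  species = c("krabby", "geodude", "venonat"),
  combat_power = c(51L, 85L, 129L),
  hit_points = c(15L, 23L, 38L),
  weight_kg = c(5.82, 20.88, 20.40)
)

# Show the dimensions, column names, data types and first few values
glimpse(pokemon)
```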

## Observations: 696
## Variables: 9
## $ species       <chr> "krabby", "geodude", "venonat", "parasect", "eev...
## $ combat_power  <int> 51, 85, 129, 171, 172, 131, 96, 11, 112, 156, 12...
## $ hit_points    <int> 15, 23, 38, 32, 37, 320, 21, 10, 30, 35, 26, 38,...
## $ weight_kg     <dbl> 5.82, 20.88, 20.40, 19.20, 4.18, 11.20, 3.49, 36...
## $ weight_bin    <chr> "normal", "normal", "extra_small", "extra_small"...
## $ height_m      <dbl> 0.36, 0.37, 0.92, 0.87, 0.25, 0.48, 0.27, 0.80, ...
## $ height_bin    <chr> "normal", "normal", "normal", "normal", "normal"...
## $ fast_attack   <chr> "mud_shot", "rock_throw", "confusion", "bug_bite...
## $ charge_attack <chr> "vice_grip", "rock_tomb", "poison_fang", "x-scis...

Immediately this tells us that there are 696 rows and 9 columns. Column names are then listed with the data type and first few examples. This information is also available in the environment tab in the upper-right pane. Click the little blue arrow to have this information drop down.

Another way of expressing this is to simply print() to the console. The output is displayed in table format, but is truncated to fit the console window (this prevents you from printing millions of rows to the console!).

## # A tibble: 696 x 9
##    species      combat_power hit_points weight_kg weight_bin  height_m
##    <chr>               <int>      <int>     <dbl> <chr>          <dbl>
##  1 krabby                 51         15      5.82 normal         0.360
##  2 geodude                85         23     20.9  normal         0.370
##  3 venonat               129         38     20.4  extra_small    0.920
##  4 parasect              171         32     19.2  extra_small    0.870
##  5 eevee                 172         37      4.18 extra_small    0.250
##  6 voltorb               131        320     11.2  normal         0.480
##  7 shellder               96         21      3.49 normal         0.270
##  8 staryu                 11         10     36.4  normal         0.800
##  9 nidoran_male          112         30      9.49 normal         0.510
## 10 poliwag               156         35     11.2  normal         0.580
## # ... with 686 more rows, and 3 more variables: height_bin <chr>,
## #   fast_attack <chr>, charge_attack <chr>

If you want to see the whole dataset you could use the View() function:

This opens up a read-only tab in the script window that displays your data in full. You can scroll around and order the columns by clicking the headers. This doesn’t affect the underlying data at all.

You can also access this by clicking the little image of a table to the right of the object in the environment pane (upper-right).

5.3 Quick summary

You can get very quick summary statistics with the summary() function. The function provides a quick summary of each column depending on its data type (integer, character, etc). This is pretty basic, but we’ll do something more impressive later.
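
A sketch of the call, again with a small stand-in for pokemon (three rows copied from the output in this guide) so it runs on its own:

```r
# Stand-in for the full pokemon data frame
pokemon <- data.frame(
  species = c("krabby", "geodude", "venonat"),
  combat_power = c(51L, 85L, 129L),
  hit_points = c(15L, 23L, 38L),
  stringsAsFactors = FALSE
)

# Numeric columns get min, quartiles, mean and max;
# character columns just get their length and class
summary(pokemon)
```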

##    species           combat_power      hit_points       weight_kg      
##  Length:696         Min.   :  10.0   Min.   : 10.00   Min.   :  0.050  
##  Class :character   1st Qu.:  76.0   1st Qu.: 23.00   1st Qu.:  2.795  
##  Mode  :character   Median : 160.0   Median : 33.00   Median :  6.440  
##                     Mean   : 206.1   Mean   : 37.42   Mean   : 15.053  
##                     3rd Qu.: 286.0   3rd Qu.: 47.00   3rd Qu.: 20.163  
##                     Max.   :1636.0   Max.   :320.00   Max.   :492.040  
##   weight_bin           height_m       height_bin        fast_attack       
##  Length:696         Min.   :0.2000   Length:696         Length:696        
##  Class :character   1st Qu.:0.3100   Class :character   Class :character  
##  Mode  :character   Median :0.5050   Mode  :character   Mode  :character  
##                     Mean   :0.6544                                        
##                     3rd Qu.:0.8900                                        
##                     Max.   :9.5200                                        
##  charge_attack     
##  Length:696        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

6 Manipulating the data frame

We’re going to use a number of sensibly-named functions from the dplyr package to do our data manipulation. These functions take verbs – not too dissimilar to SQL verbs – as their names. This makes it easy to understand what they’re doing.

dplyr is part of a suite of packages within what is called ‘the Tidyverse’. These packages are all written with the same thoughts in mind (e.g. the first argument of all the functions is the data, function names are sensible and written in snake_case, the code is optimised to run quickly, etc).

The tidyverse aims to make things simpler and faster for R coders.

6.1 Select

Firstly, we can select() columns of interest. There’s the first sensible function name. You’ll notice that a lot of them are verbs to make it clear that the code is actively doing something.
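
The code isn’t shown here; judging by the column order in the output below, it would be something like this sketch. The pokemon object is a three-row stand-in copied from this guide’s output.

```r
library(dplyr)

# Stand-in for the full pokemon data frame
pokemon <- tibble(
  species = c("krabby", "geodude", "venonat"),
  combat_power = c(51L, 85L, 129L),
  hit_points = c(15L, 23L, 38L)
)

# Data first, then the columns to keep, in the order you want them
select(pokemon, hit_points, species)
```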

## # A tibble: 696 x 2
##    hit_points species     
##         <int> <chr>       
##  1         15 krabby      
##  2         23 geodude     
##  3         38 venonat     
##  4         32 parasect    
##  5         37 eevee       
##  6        320 voltorb     
##  7         21 shellder    
##  8         10 staryu      
##  9         30 nidoran_male
## 10         35 poliwag     
## # ... with 686 more rows

Note that the order you select the columns is the order they’ll appear in when they print.

And we can choose not to include certain columns by prefixing with - (hyphen/minus).
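
The output below is missing combat_power, hit_points, weight_bin and fast_attack, so the call was presumably something like this sketch (the two-row pokemon object is a stand-in copied from this guide’s output):

```r
library(dplyr)

# Stand-in for the full pokemon data frame, with all nine columns
pokemon <- tibble(
  species = c("krabby", "geodude"),
  combat_power = c(51L, 85L),
  hit_points = c(15L, 23L),
  weight_kg = c(5.82, 20.88),
  weight_bin = c("normal", "normal"),
  height_m = c(0.36, 0.37),
  height_bin = c("normal", "normal"),
  fast_attack = c("mud_shot", "rock_throw"),
  charge_attack = c("vice_grip", "rock_tomb")
)

# Drop four columns and keep everything else
select(pokemon, -combat_power, -hit_points, -weight_bin, -fast_attack)
```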

## # A tibble: 696 x 5
##    species      weight_kg height_m height_bin charge_attack
##    <chr>            <dbl>    <dbl> <chr>      <chr>        
##  1 krabby            5.82    0.360 normal     vice_grip    
##  2 geodude          20.9     0.370 normal     rock_tomb    
##  3 venonat          20.4     0.920 normal     poison_fang  
##  4 parasect         19.2     0.870 normal     x-scissor    
##  5 eevee             4.18    0.250 normal     body_slam    
##  6 voltorb          11.2     0.480 normal     discharge    
##  7 shellder          3.49    0.270 normal     bubble_beam  
##  8 staryu           36.4     0.800 normal     bubble_beam  
##  9 nidoran_male      9.49    0.510 normal     body_slam    
## 10 poliwag          11.2     0.580 normal     body_slam    
## # ... with 686 more rows

That can be quite laborious, so there are some special functions we can use inside the select function to help us out.

For example, selecting columns starting with a particular string:
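
A sketch using the starts_with() helper, with a small stand-in for pokemon copied from this guide’s output:

```r
library(dplyr)

# Stand-in for the full pokemon data frame
pokemon <- tibble(
  species = c("krabby", "geodude", "venonat"),
  weight_kg = c(5.82, 20.88, 20.40),
  weight_bin = c("normal", "normal", "extra_small"),
  height_bin = c("normal", "normal", "normal")
)

# Keep only the columns whose names begin with "weight"
select(pokemon, starts_with("weight"))
```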

## # A tibble: 696 x 2
##    weight_kg weight_bin 
##        <dbl> <chr>      
##  1      5.82 normal     
##  2     20.9  normal     
##  3     20.4  extra_small
##  4     19.2  extra_small
##  5      4.18 extra_small
##  6     11.2  normal     
##  7      3.49 normal     
##  8     36.4  normal     
##  9      9.49 normal     
## 10     11.2  normal     
## # ... with 686 more rows

Or any columns containing a given string.
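
A sketch using the contains() helper, which matches the string anywhere in the column name (same kind of stand-in data as before):

```r
library(dplyr)

# Stand-in for the full pokemon data frame
pokemon <- tibble(
  species = c("krabby", "geodude", "venonat"),
  weight_kg = c(5.82, 20.88, 20.40),
  weight_bin = c("normal", "normal", "extra_small"),
  height_bin = c("normal", "normal", "normal")
)

# Keep only the columns with "bin" somewhere in their names
select(pokemon, contains("bin"))
```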

## # A tibble: 696 x 2
##    weight_bin  height_bin
##    <chr>       <chr>     
##  1 normal      normal    
##  2 normal      normal    
##  3 extra_small normal    
##  4 extra_small normal    
##  5 extra_small normal    
##  6 normal      normal    
##  7 normal      normal    
##  8 normal      normal    
##  9 normal      normal    
## 10 normal      normal    
## # ... with 686 more rows

CHALLENGE!

Create an object called my_selection that uses the select() function to store from pokemon the species column and any columns that end with "attack".


There’s more information in the help file if you type ?select.

6.2 Filter

Now for subsetting the data by its rows.

We’re going to make use of some common logical operators for subsetting our data by certain conditions:

  • == – equals
  • != – not equals
  • %in% – match to several things listed with c()
  • >, <, <=, >= – greater/less than (or equal to)
  • & – ‘and’
  • | – ‘or’

Let’s start by filtering for one particular species.
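
The output below contains only jigglypuff rows, so the call would be something like this sketch (the four-row pokemon object is a stand-in copied from this guide’s output):

```r
library(dplyr)

# Stand-in for the full pokemon data frame
pokemon <- tibble(
  species = c("krabby", "jigglypuff", "staryu", "jigglypuff"),
  combat_power = c(51L, 221L, 11L, 156L),
  hit_points = c(15L, 93L, 10L, 80L)
)

# Keep only the rows where species is exactly "jigglypuff"
# (note the double equals sign for 'is equal to')
filter(pokemon, species == "jigglypuff")
```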

## Warning: package 'bindrcpp' was built under R version 3.4.4
## # A tibble: 11 x 9
##    species    combat_power hit_points weight_kg weight_bin  height_m
##    <chr>             <int>      <int>     <dbl> <chr>          <dbl>
##  1 jigglypuff          221         93      7.04 extra_large    0.560
##  2 jigglypuff          156         80      6.83 normal         0.550
##  3 jigglypuff          349        119      3.57 extra_small    0.420
##  4 jigglypuff           10         22      4.92 normal         0.440
##  5 jigglypuff          188         94      6.56 normal         0.520
##  6 jigglypuff           33         39      7.14 extra_large    0.580
##  7 jigglypuff           56         51      5.55 normal         0.490
##  8 jigglypuff           66         51      8.13 extra_large    0.600
##  9 jigglypuff          289        111      5.02 normal         0.440
## 10 jigglypuff          348        119      4.91 normal         0.470
## 11 jigglypuff          486        146      4.90 normal         0.440
## # ... with 3 more variables: height_bin <chr>, fast_attack <chr>,
## #   charge_attack <chr>

Now everything except for one species.
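
The species that was excluded to give the output below isn’t named, so this sketch drops jigglypuff purely for illustration; expect different row counts from the ones shown.

```r
library(dplyr)

# Stand-in for the full pokemon data frame
pokemon <- tibble(
  species = c("krabby", "jigglypuff", "staryu", "jigglypuff"),
  combat_power = c(51L, 221L, 11L, 156L),
  hit_points = c(15L, 93L, 10L, 80L)
)

# Keep every row where species is NOT "jigglypuff"
filter(pokemon, species != "jigglypuff")
```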

## # A tibble: 610 x 9
##    species      combat_power hit_points weight_kg weight_bin  height_m
##    <chr>               <int>      <int>     <dbl> <chr>          <dbl>
##  1 krabby                 51         15      5.82 normal         0.360
##  2 geodude                85         23     20.9  normal         0.370
##  3 venonat               129         38     20.4  extra_small    0.920
##  4 parasect              171         32     19.2  extra_small    0.870
##  5 eevee                 172         37      4.18 extra_small    0.250
##  6 voltorb               131        320     11.2  normal         0.480
##  7 shellder               96         21      3.49 normal         0.270
##  8 staryu                 11         10     36.4  normal         0.800
##  9 nidoran_male          112         30      9.49 normal         0.510
## 10 poliwag               156         35     11.2  normal         0.580
## # ... with 600 more rows, and 3 more variables: height_bin <chr>,
## #   fast_attack <chr>, charge_attack <chr>

Now filtering to include three species only.
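
Only staryu and psyduck are visible in the truncated output below, so the third species in this sketch (jigglypuff) is my assumption. The pokemon object is a small stand-in copied from this guide’s output.

```r
library(dplyr)

# Stand-in for the full pokemon data frame
pokemon <- tibble(
  species = c("krabby", "staryu", "psyduck", "jigglypuff"),
  combat_power = c(51L, 11L, 97L, 221L),
  hit_points = c(15L, 10L, 26L, 93L)
)

# Keep rows where species matches any of the names listed in c()
filter(pokemon, species %in% c("staryu", "psyduck", "jigglypuff"))
```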

## # A tibble: 39 x 9
##    species combat_power hit_points weight_kg weight_bin  height_m
##    <chr>          <int>      <int>     <dbl> <chr>          <dbl>
##  1 staryu            11         10      36.4 normal         0.800
##  2 psyduck           97         26      26.0 extra_large    0.900
##  3 psyduck           41         17      23.6 normal         0.910
##  4 staryu           225         25      36.4 normal         0.730
##  5 staryu           154         23      18.8 extra_small    0.590
##  6 staryu            11         10      18.9 extra_small    0.680
##  7 staryu           260         29      44.2 extra_large    0.850
##  8 psyduck           44         19      23.4 normal         0.720
##  9 staryu           112         19      28.1 normal         0.780
## 10 staryu           144         23      50.4 extra_large    0.970
## # ... with 29 more rows, and 3 more variables: height_bin <chr>,
## #   fast_attack <chr>, charge_attack <chr>

We can work with numbers too.
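
The exact conditions aren’t shown. The rows printed below all have combat_power above 900 and hit_points below 100, so this sketch assumes those two thresholds; treat them as guesses.

```r
library(dplyr)

# Stand-in for the full pokemon data frame
pokemon <- tibble(
  species = c("krabby", "voltorb", "magmar", "pinsir"),
  combat_power = c(51L, 131L, 936L, 1184L),
  hit_points = c(15L, 320L, 70L, 84L)
)

# Both conditions must be true for a row to be kept
filter(pokemon, combat_power > 900 & hit_points < 100)
```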

## # A tibble: 7 x 9
##   species combat_power hit_points weight_kg weight_bin height_m height_bin
##   <chr>          <int>      <int>     <dbl> <chr>         <dbl> <chr>     
## 1 gyarad~          955         94     177.  normal         5.58 normal    
## 2 magmar           936         70      40.4 normal         1.16 normal    
## 3 magmar           991         73      31.1 extra_sma~     1.23 normal    
## 4 magmar           963         75      42.5 normal         1.28 normal    
## 5 pinsir          1184         84      68.1 normal         1.61 normal    
## 6 fearow           954         83      40.6 normal         1.20 normal    
## 7 electa~          962         74      39.0 extra_lar~     1.26 normal    
## # ... with 2 more variables: fast_attack <chr>, charge_attack <chr>

CHALLENGE!

Filter the pokemon dataframe to include species rows that:

  • are the species “abra”, “chansey”, or “bellsprout”
  • and have between 100 and 400 combat_power (hint: you’ll need an &)
  • and less than 100 hit_points

How many Pokemon are in this subset?


6.3 Mutate

Now to create new columns. We use mutate() because we’re mutating our dataframe – we’re budding a new column where there wasn’t one before. Often you’ll be creating new columns based on the content of columns that already exist, or you can fill the entire column with one thing.

For now, we’re going to create column names without spaces. It’s easier.
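
The code isn’t shown here, but the output below can be reproduced by multiplying combat_power by hit_points (51 × 15 = 765 in the first row), so here’s a sketch along those lines. The object name pokemon_subset is mine, and pokemon is a three-row stand-in copied from this guide’s output.

```r
library(dplyr)

# Stand-in for the full pokemon data frame
pokemon <- tibble(
  species = c("krabby", "geodude", "venonat"),
  combat_power = c(51L, 85L, 129L),
  hit_points = c(15L, 23L, 38L)
)

# Keep three columns to make the result easier to read
pokemon_subset <- select(pokemon, species, combat_power, hit_points)

# Add three new columns
mutate(
  pokemon_subset,
  power_index = combat_power * hit_points,  # built from two existing columns
  caught = 1,                               # the same number in every row
  area = "kanto"                            # the same text in every row
)
```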

## # A tibble: 696 x 6
##    species      combat_power hit_points power_index caught area 
##    <chr>               <int>      <int>       <int>  <dbl> <chr>
##  1 krabby                 51         15         765     1. kanto
##  2 geodude                85         23        1955     1. kanto
##  3 venonat               129         38        4902     1. kanto
##  4 parasect              171         32        5472     1. kanto
##  5 eevee                 172         37        6364     1. kanto
##  6 voltorb               131        320       41920     1. kanto
##  7 shellder               96         21        2016     1. kanto
##  8 staryu                 11         10         110     1. kanto
##  9 nidoran_male          112         30        3360     1. kanto
## 10 poliwag               156         35        5460     1. kanto
## # ... with 686 more rows

So we’ve created three new columns: power_index, which is combat_power multiplied by hit_points; caught, which is filled with 1 for every row; and area, which is filled with kanto for every row.
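
As a minimal sketch, a mutate() call that builds columns like these might look as follows. The three rows are copied from the output above into a made-up table called toy_pokemon; the original code isn't shown, so treat the exact call as an assumption.

```r
library(dplyr)

# A small stand-in for the pokemon data (values taken from the output above)
toy_pokemon <- tibble(
  species      = c("krabby", "geodude", "venonat"),
  combat_power = c(51L, 85L, 129L),
  hit_points   = c(15L, 23L, 38L)
)

toy_mutated <- mutate(
  toy_pokemon,
  power_index = combat_power * hit_points,  # built from existing columns
  caught      = 1,                          # the same number in every row
  area        = "kanto"                     # the same text in every row
)
toy_mutated
```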

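Note that if you pass a vector to mutate(), the vector won’t be ‘recycled’ for each row of your dataset. Only a single value is repeated down the column; a longer vector has to have one element per row, otherwise you’ll get an error. In other words, putting mutate(new_column = c(1, 2)) won’t result in a ‘1’ in row 1, a ‘2’ in row 2, a ‘1’ in row 3 and so on. To do this, you can use the base R function transform(new_column = c(1, 2)), which recycles the vector as long as the number of rows is a multiple of its length.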
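
Here is a small sketch of two ways to get an alternating column, on a made-up four-row table. Using rep_len() with n() inside mutate() is my suggestion rather than something from the original guide.

```r
library(dplyr)

# Made-up table with four rows
toy_pokemon <- tibble(species = c("krabby", "geodude", "venonat", "parasect"))

# rep_len() repeats c(1, 2) until it is as long as the table;
# n() gives the number of rows in the current data
toy_alternating <- mutate(toy_pokemon, new_column = rep_len(c(1, 2), n()))

# Base R's transform() recycles the short vector for you
toy_transformed <- transform(toy_pokemon, new_column = c(1, 2))
```

Both give 1, 2, 1, 2 down the new column. The rep_len() version also copes when the number of rows isn't a multiple of the vector length.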

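You can also fill a new column conditionally with an if_else() statement, which takes a test, a value to use where the test is true and a value to use where it’s false: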

## # A tibble: 696 x 3
##    hit_points species      common
##         <int> <chr>        <chr> 
##  1         15 krabby       no    
##  2         23 geodude      no    
##  3         38 venonat      no    
##  4         32 parasect     no    
##  5         37 eevee        yes   
##  6        320 voltorb      no    
##  7         21 shellder     no    
##  8         10 staryu       yes   
##  9         30 nidoran_male no    
## 10         35 poliwag      no    
## # ... with 686 more rows
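
The condition used to produce that output isn't shown, so this is a sketch with a hypothetical rule on made-up data: a species counts as common if it appears in a short list that I've invented.

```r
library(dplyr)

toy_pokemon <- tibble(
  species    = c("krabby", "eevee", "voltorb", "staryu"),
  hit_points = c(15L, 37L, 320L, 10L)
)

# Hypothetical rule: "yes" if the species is in our list, "no" otherwise
toy_common <- mutate(
  toy_pokemon,
  common = if_else(species %in% c("eevee", "staryu"), "yes", "no")
)
toy_common
```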

And we can get more nuanced by using a case_when() statement (you may have seen this in SQL). This prevents us writing nested if_else() statements to specify multiple conditions.

## # A tibble: 696 x 3
##    hit_points species      common     
##         <int> <chr>        <chr>      
##  1         15 krabby       no         
##  2         23 geodude      no         
##  3         38 venonat      no         
##  4         32 parasect     no         
##  5         37 eevee        less_common
##  6        320 voltorb      no         
##  7         21 shellder     no         
##  8         10 staryu       common     
##  9         30 nidoran_male no         
## 10         35 poliwag      no         
## # ... with 686 more rows
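
A sketch of the same idea with case_when(), again with invented conditions on made-up data. Each line is a condition, a tilde and the value to use; conditions are checked from top to bottom, and the final TRUE line catches everything that's left.

```r
library(dplyr)

toy_pokemon <- tibble(
  species    = c("krabby", "eevee", "voltorb", "staryu"),
  hit_points = c(15L, 37L, 320L, 10L)
)

# Hypothetical categories: the species names chosen here are assumptions
toy_categorised <- mutate(
  toy_pokemon,
  common = case_when(
    species == "staryu" ~ "common",
    species == "eevee"  ~ "less_common",
    TRUE                ~ "no"           # fallback for all other rows
  )
)
toy_categorised
```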

CHALLENGE!

Create a new dataframe object that takes the pokemon data and adds a column containing Pokemon body-mass index (BMI).

Hint: BMI is weight over height squared (you can square a number by writing ^2 after it).

Now use a case_when() to categorise Pokemon:

  • Underweight = <18.5
  • Normal weight = 18.5–24.9
  • Overweight = 25–29.9
  • Obesity = BMI of 30 or greater

Note that these are BMI groups for humans. And that BMI has many limitations!


6.4 Arrange

This does what it says on the tin. This alters the order of the rows in your table according to some column specification.

## # A tibble: 696 x 9
##    species combat_power hit_points weight_kg weight_bin  height_m
##    <chr>          <int>      <int>     <dbl> <chr>          <dbl>
##  1 diglett           79         10     0.790 normal         0.200
##  2 pidgey           254         44     0.820 extra_small    0.210
##  3 rattata           23         11     1.52  extra_small    0.220
##  4 pidgey           229         43     0.850 extra_small    0.220
##  5 weedle            17         13     2.25  extra_small    0.220
##  6 spearow          296         47     0.690 extra_small    0.220
##  7 spearow           89         26     1.06  extra_small    0.220
##  8 pidgey           256         46     0.820 extra_small    0.230
##  9 rattata           64         17     2.70  normal         0.230
## 10 diglett           64         10     1.05  extra_large    0.230
## # ... with 686 more rows, and 3 more variables: height_bin <chr>,
## #   fast_attack <chr>, charge_attack <chr>

And in reverse order (tallest first):

## # A tibble: 696 x 9
##    species  combat_power hit_points weight_kg weight_bin  height_m
##    <chr>           <int>      <int>     <dbl> <chr>          <dbl>
##  1 onix              299         38    192.   normal          9.52
##  2 gyarados          955         94    177.   normal          5.58
##  3 pidgey             76         26      1.25 extra_small     2.50
##  4 ekans             206         35     11.6  extra_large     2.46
##  5 lapras           1636        161    163.   extra_small     2.22
##  6 snorlax           300         85    492.   normal          2.11
##  7 dratini           298         42      4.40 extra_large     2.08
##  8 dratini           332         44      4.75 extra_large     1.99
##  9 ekans              95         24      5.20 normal          1.93
## 10 dratini           316         40      3.13 normal          1.91
## # ... with 686 more rows, and 3 more variables: height_bin <chr>,
## #   fast_attack <chr>, charge_attack <chr>
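
Both outputs above are ordered by height_m. A minimal sketch on a made-up table: arrange() sorts from smallest to largest by default, and wrapping the column in desc() reverses the order.

```r
library(dplyr)

toy_pokemon <- tibble(
  species  = c("krabby", "onix", "diglett"),
  height_m = c(0.36, 9.52, 0.20)
)

# Shortest first (ascending is the default)
shortest_first <- arrange(toy_pokemon, height_m)

# Tallest first
tallest_first <- arrange(toy_pokemon, desc(height_m))
```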

CHALLENGE!

What happens if you arrange by a column containing characters rather than numbers? For example, the species column.


6.5 Join

Again, another verb that mirrors what you can find in SQL. There are several types of join, but we’re going to focus on the most common one: the left_join(). This keeps every row of one table – x – and adds the matching information from another – y – by some key matching variable of our choice.

Let’s start by reading in a lookup table that provides some extra information about our species.

## Parsed with column specification:
## cols(
##   species = col_character(),
##   pokedex_number = col_integer(),
##   type1 = col_character(),
##   type2 = col_character()
## )
## Observations: 801
## Variables: 4
## $ species        <chr> "bulbasaur", "ivysaur", "venusaur", "charmander...
## $ pokedex_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ type1          <chr> "grass", "grass", "grass", "fire", "fire", "fir...
## $ type2          <chr> "poison", "poison", "poison", NA, NA, "flying",...

Now we’re going to join this new data to our pokemon data. The key for matching these is the species column, which exists in both datasets.

## Observations: 696
## Variables: 12
## $ species        <chr> "krabby", "geodude", "venonat", "parasect", "ee...
## $ combat_power   <int> 51, 85, 129, 171, 172, 131, 96, 11, 112, 156, 1...
## $ hit_points     <int> 15, 23, 38, 32, 37, 320, 21, 10, 30, 35, 26, 38...
## $ weight_kg      <dbl> 5.82, 20.88, 20.40, 19.20, 4.18, 11.20, 3.49, 3...
## $ weight_bin     <chr> "normal", "normal", "extra_small", "extra_small...
## $ height_m       <dbl> 0.36, 0.37, 0.92, 0.87, 0.25, 0.48, 0.27, 0.80,...
## $ height_bin     <chr> "normal", "normal", "normal", "normal", "normal...
## $ fast_attack    <chr> "mud_shot", "rock_throw", "confusion", "bug_bit...
## $ charge_attack  <chr> "vice_grip", "rock_tomb", "poison_fang", "x-sci...
## $ pokedex_number <int> 98, 74, 48, 47, 133, 100, 90, 120, 32, 60, 46, ...
## $ type1          <chr> "water", "rock", "bug", "bug", "normal", "elect...
## $ type2          <chr> NA, "ground", "poison", "grass", NA, NA, NA, NA...
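
A minimal sketch of a left_join() on two small made-up tables that stand in for the pokemon data and the lookup table. The object names and the unmatched species are invented for illustration.

```r
library(dplyr)

# x: the table whose rows we want to keep
toy_pokemon <- tibble(
  species      = c("krabby", "geodude", "not_in_lookup"),
  combat_power = c(51L, 85L, 100L)
)

# y: the lookup table with extra information
toy_pokedex <- tibble(
  species        = c("bulbasaur", "geodude", "krabby"),
  pokedex_number = c(1L, 74L, 98L),
  type1          = c("grass", "rock", "water")
)

# Every row of toy_pokemon is kept; columns from toy_pokedex are added
# wherever the species matches, and filled with NA where it doesn't
toy_joined <- left_join(toy_pokemon, toy_pokedex, by = "species")
toy_joined
```

Notice that bulbasaur doesn't appear in the result, because it isn't in the left-hand table.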

CHALLENGE!

Try right_join() instead of left_join(). What happens? And what about anti_join()?


6.6 Other verbs

This document does not contain an exhaustive list of other functions within the same family as select(), filter(), mutate(), arrange() and *_join(). There are other functions that will be useful for your work and other ways of manipulating your data. For example, the stringr package helps with dealing with data in strings (text, for example).

6.7 Pipes

Alright great, we’ve seen how to manipulate our dataframe a bit. But we’ve been doing it one discrete step at a time, so your script might end up looking something like this:
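
A sketch of that step-by-step style, using a small made-up table (the object names are invented for illustration):

```r
library(dplyr)

toy_pokemon <- tibble(
  species    = c("krabby", "geodude", "venonat"),
  weight_kg  = c(5.82, 20.88, 20.40),
  weight_bin = c("normal", "normal", "extra_small")
)

# Each step saves a new object that only exists to feed the next step
pokemon_selected <- select(toy_pokemon, -weight_bin)
pokemon_filtered <- filter(pokemon_selected, weight_kg > 15)
pokemon_mutated  <- mutate(pokemon_filtered, organism = "pokemon")
```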

In other words, you might end up creating lots of intermediate variables and cluttering up your workspace and filling up memory.

You could do all this in one step by nesting each function inside the others, but that would be super messy and hard to read. Instead we’re going to ‘pipe’ data from one function to the next. The pipe operator – %>% – says ‘take what’s on the left and pass it through to the next function’.

So you can do it all in one step:

## Parsed with column specification:
## cols(
##   species = col_character(),
##   combat_power = col_integer(),
##   hit_points = col_integer(),
##   weight_kg = col_double(),
##   weight_bin = col_character(),
##   height_m = col_double(),
##   height_bin = col_character(),
##   fast_attack = col_character(),
##   charge_attack = col_character()
## )
## Observations: 204
## Variables: 8
## $ species       <chr> "geodude", "venonat", "parasect", "staryu", "ven...
## $ combat_power  <int> 85, 129, 171, 11, 137, 256, 234, 157, 140, 246, ...
## $ hit_points    <int> 23, 38, 32, 10, 38, 64, 33, 49, 56, 42, 45, 34, ...
## $ weight_kg     <dbl> 20.88, 20.40, 19.20, 36.41, 41.23, 30.20, 73.81,...
## $ height_m      <dbl> 0.37, 0.92, 0.87, 0.80, 1.26, 0.84, 1.52, 0.94, ...
## $ fast_attack   <chr> "rock_throw", "confusion", "bug_bite", "water_gu...
## $ charge_attack <chr> "rock_tomb", "poison_fang", "x-scissor", "bubble...
## $ organism      <chr> "pokemon", "pokemon", "pokemon", "pokemon", "pok...

This reads as:

  • for the object named pokemon_piped, assign (<-) the contents of a CSV file read with read_csv()
  • then select out some columns
  • then filter on a variable
  • then add a column

See how this is like a recipe?
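
As a sketch, the recipe might be written like this. The original starts from read_csv(), but the file path isn't shown here, so this version starts from a small made-up table instead; the dropped column is also an assumption.

```r
library(dplyr)

toy_pokemon <- tibble(
  species    = c("krabby", "geodude", "venonat"),
  weight_kg  = c(5.82, 20.88, 20.40),
  weight_bin = c("normal", "normal", "extra_small")
)

pokemon_piped <- toy_pokemon %>%   # take the data
  select(-weight_bin) %>%          # then select out a column
  filter(weight_kg > 15) %>%       # then filter on a variable
  mutate(organism = "pokemon")     # then add a column
pokemon_piped
```

No intermediate objects are left lying around, and the steps read from top to bottom in the order they happen.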

Did you notice something? We didn’t have to keep calling the dataframe object in each function call. For example, we used filter(weight_kg > 15) rather than filter(pokemon, weight_kg > 15) because the data argument was piped in. The functions mentioned above all accept the data that’s being passed into them because they’re part of the Tidyverse. (Note that this is not true for all functions, but we can talk about that later.)

Here’s another simple example using the dataframe we built earlier:

##   species number
## 1   Pichu    172
## 2  Raichu     26

Nice and easy to read.


CHALLENGE!

Write a pipe recipe that creates a new dataframe called my_poke that takes the pokemon dataframe and:

  • select()s only the species and combat_power columns
  • left_join()s the pokedex dataframe by species
  • filter()s by those with a type1 that’s ‘normal’

7 Summaries

Assuming we’ve now wrangled our data using the dplyr functions, we can do some quick, readable summarisation that’s way better than the summary() function.

So let’s use our knowledge – and some new functions – to get the top 5 pokemon by count.

## # A tibble: 5 x 2
##   species     n
##   <chr>   <int>
## 1 pidgey     86
## 2 rattata    78
## 3 drowzee    64
## 4 spearow    42
## 5 zubat      35

The order of your functions is important – remember it’s like a recipe. Don’t crack the eggs on your cake just before serving. Do it near the beginning somewhere, I guess (I’m not much of a cake maker).
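
One way to get a 'top n by count' table, sketched on made-up data. The original code isn't shown, so using count() followed by head() is an assumption about the approach rather than the author's exact recipe.

```r
library(dplyr)

toy_pokemon <- tibble(
  species = c("pidgey", "rattata", "pidgey", "zubat", "pidgey", "rattata")
)

top_species <- toy_pokemon %>%
  count(species, sort = TRUE) %>%  # one row per species, biggest count first
  head(2)                          # then keep the top rows (top 2 here)
top_species
```

Swap the order of the steps and you'd get something different: taking the first rows before counting would only count the species in those rows.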

There’s also a specific summarise() function that allows you to, well… summarise.

## # A tibble: 16 x 3
##    type1    count mean_cp
##    <chr>    <int>   <dbl>
##  1 fire        16    510.
##  2 fairy        5    412.
##  3 <NA>         3    390.
##  4 electric    12    373.
##  5 fighting     1    358.
##  6 grass       17    357.
##  7 dragon       4    326.
##  8 psychic     70    301.
##  9 ice          7    275.
## 10 ground       7    214.
## 11 water      157    192.
## 12 rock         9    190.
## 13 bug         63    185.
## 14 ghost       12    170.
## 15 poison      59    168.
## 16 normal     254    157.
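
A sketch of how a table like this could be built, on made-up data: group the rows with group_by(), then collapse each group to one row with summarise(). The column names match the output above, but the code itself is my reconstruction.

```r
library(dplyr)

toy_pokemon <- tibble(
  type1        = c("fire", "fire", "water", "water", "water"),
  combat_power = c(500, 520, 200, 180, 190)
)

type_summary <- toy_pokemon %>%
  group_by(type1) %>%                   # one group per type
  summarise(
    count   = n(),                      # number of rows in each group
    mean_cp = mean(combat_power)        # average combat power in each group
  ) %>%
  arrange(desc(mean_cp))                # highest average first
type_summary
```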

Note that you can group by more than one thing as well. We can group on the weight_bin category within the type1 category, for example.

## # A tibble: 40 x 4
## # Groups:   type1 [?]
##    type1    weight_bin  mean_weight count
##    <chr>    <chr>             <dbl> <int>
##  1 bug      extra_large       29.1      9
##  2 bug      extra_small        8.98    16
##  3 bug      normal            10.6     38
##  4 dragon   extra_large        4.58     2
##  5 dragon   normal             2.95     2
##  6 electric extra_large       18.7      3
##  7 electric extra_small        5.74     2
##  8 electric normal            18.5      7
##  9 fairy    extra_large        9.47     2
## 10 fairy    normal             7.96     3
## # ... with 30 more rows
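
Grouping by two columns works the same way: just list both in group_by(). A sketch on made-up data, with ungroup() added at the end so that later steps don't inherit the grouping (that last step is my addition).

```r
library(dplyr)

toy_pokemon <- tibble(
  type1      = c("bug", "bug", "bug", "dragon"),
  weight_bin = c("normal", "normal", "extra_large", "normal"),
  weight_kg  = c(10, 12, 30, 3)
)

weight_summary <- toy_pokemon %>%
  group_by(type1, weight_bin) %>%       # weight_bin within type1
  summarise(
    mean_weight = mean(weight_kg),
    count       = n()
  ) %>%
  ungroup()
weight_summary
```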

8 Plot the data

We’re going to keep this very short and dangle it like a rare candy in front of your nose. We’ll revisit this in more depth in a later session. For now, we’re going to use a package called ggplot2 to create some simple charts.


CHALLENGE!

Remember how to use packages? Install ggplot2 and load it from the library.


The ‘gg’ in ‘ggplot2’ stands for ‘grammar of graphics’. This is a way of thinking about plotting as having a ‘grammar’: elements that can be applied in succession to create a plot. This is ‘the idea that you can build every graph from the same few components’: a data set, geoms (marks representing data points), a co-ordinate system and some other things.

The ggplot() function from the ggplot2 package is how you create these plots. You build up the graphical elements using the + rather than a pipe. Think about it as placing down a canvas and then adding layers on top.

ggplot plays nicely with the pipe – it’s part of the Tidyverse – so we can create recipes that combine data reading, data manipulation and plotting all in one go. Let’s do some manipulation before plotting and then introduce some new elements to our plot that simplify the theme and change the labels.
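
A sketch of such a recipe on made-up data: summarise with dplyr, pipe the result into ggplot(), then add layers with +. The chart type, labels and theme are assumptions for illustration, since the original plot isn't shown.

```r
library(dplyr)
library(ggplot2)

toy_pokemon <- tibble(
  type1        = c("fire", "fire", "water", "water", "bug"),
  combat_power = c(500, 520, 200, 180, 150)
)

cp_plot <- toy_pokemon %>%
  group_by(type1) %>%
  summarise(mean_cp = mean(combat_power)) %>%  # manipulate first...
  ggplot(aes(x = type1, y = mean_cp)) +        # ...then lay down the canvas
  geom_col() +                                 # add bars
  labs(x = "Type", y = "Mean combat power") +  # change the labels
  theme_minimal()                              # simplify the theme

cp_plot
```

Note the switch from %>% to + once you reach ggplot(). Mixing them up is a very common slip.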

How about a dotplot? Coloured by type1?
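
A sketch on made-up data, reading ‘dotplot’ as a scatter plot made with geom_point() (that reading, and the choice of axes, are assumptions). Mapping colour to type1 inside aes() gives each type its own colour and adds a legend.

```r
library(ggplot2)

toy_pokemon <- data.frame(
  type1        = c("fire", "water", "water", "bug"),
  combat_power = c(500, 200, 180, 150),
  hit_points   = c(60, 35, 30, 25)
)

dot_plot <- ggplot(toy_pokemon, aes(x = combat_power, y = hit_points)) +
  geom_point(aes(colour = type1))  # one dot per row, coloured by type

dot_plot
```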


CHALLENGE!

Create a boxplot of hit_points for Pokemon with a type1 of ‘normal’, ‘poison’, ‘ground’ or ‘water’, with one box per type.


Simple, but relatively effective. We’ll look next time at plotting in more depth. For example, yes: you can use Pokemon sprites as your plotting points. And why stop there? You can also use specific Pokemon typing colours, sprite colour palettes and theme your barplot like a Pokemon first generation HP bar. Cool, eh?

9 Further reading

9.1 Tutorials

9.2 Help/tips and tricks