Please leave a new GitHub issue if you have questions, problems or suggestions
Do you have:
Good.
Hello.
(Note: these are not existential questions.)
Follow this guide if:
Dip into this guide if you:
This document was originally written with a very specific public sector audience in mind and may contain references not relevant to you. See the Further Reading section at the bottom of this document if you want to find some other resources.
A typical analytical workflow in our department might involve SQL, Excel and Word. Typical steps might be:
There are three main reasons why this isn’t ideal. It’s:
So, let’s discuss what we mean by ‘errors’. This is mostly a problem with spreadsheets and moving data in and out of them. You:
In terms of reproducibility, you don’t have a record of the order in which things were done, so it’s not easy to backtrack on mistakes. A lot of documentation and commenting is required within and across multiple files to ensure that the workflow can be replicated, and in practice this is often missing. If you write reproducible code, it may also be easier to automate it. This in turn can help free up time for other, perhaps less trivial, tasks. For example, the Reproducible Analytical Pipeline (RAP) approach helps reduce error and speed up the process of producing official statistics.
Obviously the process takes time because you have to copy-paste values from place to place and perform quality assurance across all the files in your workflow. But there’s also the time needed to remember how you did the analysis when you’re asked to make changes long after you’ve forgotten how the process works.
Our analytical work has a direct impact on policy decisions and therefore it affects young people, parents, learners, schools, teachers and many others.
Above all, humans cannot be trusted. Let’s minimise the chance of errors, speed things up and make it easy on our future selves by minimising the chance of doing it wrong in the first place. This means breaking away from spreadsheet addiction.
What might an optimal analytical workflow look like in R?
This is simple. R is end-to-end: you can get data in at one end from files or a database and pump it out the other in a report or app, while also having automated testing built in. All from the same script. You also have the opportunity to more easily version your work using tools such as Git and GitHub.
R is just another tool for data analysis, in the same way that Excel and SQL are tools for data analysis.
Put simply, R lets you read, wrangle and analyse data and create outputs such as graphics, documents and interactive apps. R is a coding language, which means you use it to write instructions for the computer to perform. This allows for fine control of what you want to do.
You can think of R as a place where data is abstracted away and the instructions are brought to the forefront, whereas spreadsheets are where data is at the forefront and the instructions are abstracted away (I heard this somewhere but can’t remember the source; let me know).
RStudio is simply a very useful interface for R that provides a whole bunch of useful bells and whistles.
What’s great about R? It’s:
I could go on.
R is not always the answer. I’m not telling you that we must do things in any particular way. For example, you have an urgent request for the minister due in five minutes and you don’t have the experience to do it in R. Excel may be good enough. That’s absolutely fine. The argument here is that we should move towards a more reproducible model, so that when the minister comes back wanting to tweak your calculation you can be confident that you can remember what you did and how you did it.
Let’s assume you’re starting a new piece of work. Your life will be much easier if you manage the structure of your project from the start, rather than creating a horrible file dump of various data sets, code and documentation that you have no chance of untangling in a few months’ time.
We’re going to start by creating an ‘RStudio Project’ (capital ‘P’).
Why do this? Well, it makes your work more:
data/dataset.csv
rather than file/path/on/my/personal/machine/that/you/cannot/access.csv
To set up an RStudio Project:
This process creates a directory – a folder on your machine or shared drive that you choose – containing an RStudio Project file with the extension (suffix) ‘.Rproj’. The directory is the ‘home’ of your project and will house all the files and code that you need. Opening the .Rproj file will open your RStudio Project as you last left it with the scripts you were working on.
To access your R Project in future, navigate to the project folder and double-click your R Project file, which has the .Rproj extension (e.g. your-project.Rproj).
So, your project directory contains an RStudio Project (.Rproj) file, but let’s now fill it with some basic folders that we’ll need to complete our project. This helps keep things organised and can help prevent mishaps like accidentally deleting raw data.
The organisation of projects here borrows from something like ‘designing projects’ by Rich Fitzjohn at Macquarie University.
The basic arrangement would be something like:
The files and folders are:
Don’t be alarmed by the RStudio interface. There’s lots of buttons and tabs, but we’ll be restricting ourselves to a relatively small subset of these to begin with.
RStudio is split into three panes when you first open it:
Each of which has a few tabs. We care about a few of these tabs right now:
Left pane:
Upper-right pane:
Lower-right pane:
Open a new file with File > New File > R script, or in the top left of RStudio click the button with a ‘+’ in a green circle on a white square, then click ‘R Script’:
A new pane will appear with a new scripting tab. It’s blank. You type the code into this space and run it. The inputs and results are displayed in the console below once the script has been executed. This is not too dissimilar to what you get in SQL Server Management Studio, for example.
You can have more than one scripting tab open at once. Usually you would have one script per process. For example, one for reading and manipulating data (e.g. 01_read-data), one for modelling (e.g. 02_model) and one for plotting (03_plot), i.e. sensible names with a number that indicates the order to execute the code. This will improve reproducibility.
Start your script with some useful information. Anything prefixed with a hash (#) will be recognised as a comment and won’t be executed as code. For example:
# Title: Sensational training script
# Purpose: To inspire new R users
# Name: Matt Dray
# Date: Jan 2018
You can copy-paste or type the code from this document into your R script as we go along. Remember to add comments with # to say what you’re doing and to break your script up into sections.
Type 1 + 1 into your scripting window (upper left pane). To ‘run’ the code, make sure your cursor is on the line containing the code and use the keyboard shortcut ‘Control + Enter’ to execute it (alternatively, click the ‘Run’ button in the top right of the scripting window). This runs only the current line, or whatever code you’ve highlighted; it won’t continue running the whole script.
## [1] 2
You should have got the answer 2. The number in square brackets, [1], is the position of the first item shown on that line of output; here there’s only one item.
CHALLENGE!
Save your script with a sensible name.
Hint: File > Save, or Control + S. You’ll be prompted to save the file in your home folder (the one containing your R Project file).
This is good, but ideally we want to store values to help simplify our code. We do this by making ‘objects’. An object can be a single number, a list of strings, a table of data, a plot, or many other things. You create an object by assigning a name to your values. You do this with the ‘assignment arrow’, <-, which is basically akin to “into an object named the thing on the left, save the thing on the right”.
For example, we can assign 1 + 1 to the object name my_num with <-. Execute the following code:
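The assignment line itself seems to have dropped out of this page. Going by the sentence above, it would be the following:

```r
# Store the result of 1 + 1 in an object called my_num
my_num <- 1 + 1
```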
Hm. Nothing printed out in the console. Instead the object is now in your environment – see the top right pane in RStudio. You are now free to refer to this object by name in your script. For example, you can now print the contents of this object to the console with the line print(my_num) or explore it with the environment pane.
## [1] 2
Storing one value is fine. But objects can be used to store more than that. This next chunk of code creates a ‘vector’, where several values in the brackets have been combined together with the c() command. In this example I’ve created some character strings, each bound within a pair of quotation marks (""). Numbers don’t need to be in quotation marks (unless they’ve been stored as text).
my_vector <- c("Pichu", "Pikachu", "Raichu") # combine some values
print(my_vector) # have a look at what the object contains
## [1] "Pichu" "Pikachu" "Raichu"
You can see what ‘class’ your vector is at any time with the class() function.
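The call that produced the output below isn’t shown. Given the explanation that follows, it was presumably class() applied to my_num:

```r
my_num <- 1 + 1   # the object created earlier
class(my_num)     # ask R what kind of object this is
```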
## [1] "numeric"
The vector my_num is composed of numbers only and so is ‘numeric’, but my_vector is composed entirely of character strings:
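Again the call is missing from the page; by the same logic it would be:

```r
my_vector <- c("Pichu", "Pikachu", "Raichu")  # the vector created earlier
class(my_vector)
```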
## [1] "character"
So we’ve created objects composed of both single values and vectors. You can think of these as being zero-dimensional and one-dimensional. The next step would be two dimensions: a table. Tables of data with rows and columns are called ‘data frames’ in R and are effectively a bunch of vectors of the same length stuck together. Consider this:
my_df <- data.frame(
species = c("Pichu", "Pikachu", "Raichu"),
number = c(172, 25, 26),
location = c("Johto", "Kanto", "Kanto")
)
print(my_df)
## species number location
## 1 Pichu 172 Johto
## 2 Pikachu 25 Kanto
## 3 Raichu 26 Kanto
Can you see how this is three vectors (species, number and location) of the same length (3 values) arranged into columns? The function data.frame() binds these vectors together into (surprise) a data frame.
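You can check this for yourself. The call behind the output below isn’t shown on the page, but it would presumably be class() again:

```r
my_df <- data.frame(
  species = c("Pichu", "Pikachu", "Raichu"),
  number = c(172, 25, 26),
  location = c("Johto", "Kanto", "Kanto")
)
class(my_df)  # confirm that the object is a data frame
```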
## [1] "data.frame"
Aha!
CHALLENGE!
Create a sensibly-named data frame object with three sensibly-named columns:
Now print it.
You’ve been using functions already: print(), class(), data.frame() and c().
Theory: a function is a reproducible unit of code that performs a given task, such as reading a data file or fitting a model. Functions prevent you from copy-pasting your code multiple times, which could lead to errors and makes for unwieldy, unreadable code. If you can help it, Don’t Repeat Yourself.
Functions are written as the function name followed by brackets. The brackets contain the arguments – the items you need to provide to the function for it to work. One argument might be a filepath to some data, another might describe the colour of points to be plotted. They’re separated by commas.
So a generic function might look like this:
# don't run this, it doesn't do anything!
function_name(
data = my_data,
colour = "red",
option = 5
)
Note that you can break the function over several lines. You can put your cursor on any of these lines and run it. You don’t have to highlight the whole thing.
You can type a question mark followed by a function name to learn about its arguments in a help file that will appear in the bottom right pane. For example, ?plot(). Try it, but don’t worry about the content for now.
Aside: you don’t necessarily need to write the argument name and an equals sign. For example, if the first argument expected by example_function() is data (you can find out by running ?example_function()) you can write example_function(my_data) instead of example_function(data = my_data). It’s good practice to write the argument names though; it’ll help you and others to understand your code and avoid confusion. For example, specifying the arguments x = vector_x and y = vector_y in a plot function might make it clearer which axis is which when checking your code.
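To see this with a real function, here is a small sketch using seq() from base R (my choice of example, not one from the original session). Its first three arguments are from, to and by, so both calls below do the same thing:

```r
# Arguments matched by position: from, to, by
by_position <- seq(1, 10, 2)

# The same call with the arguments named explicitly
by_name <- seq(from = 1, to = 10, by = 2)

# Both give the same sequence of odd numbers
identical(by_position, by_name)
```

The named version is longer, but you don’t have to remember the argument order to read it.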
CHALLENGE!
It’s good practice to reset R every so often.
Why might we do this?
Hit the keyboard shortcut Control + Shift + F10 for RStudio to reset.
Functions can be bundled into packages. A bunch of packages are pre-installed with R, but there are thousands more available for download. These packages extend the basic capabilities of R.
Packages can be installed to your computer using the install.packages() function. This automatically fetches and downloads packages from the Comprehensive R Archive Network (CRAN).
Here are the packages that we’re going to use in this session:
install.packages(pkgs = "readr") # for reading data into R
install.packages(pkgs = "dplyr") # for manipulating data
You only need to run the installation function once for each package. The package is installed to your computer once you’ve done this; in future you only need to load it with the library() function at the start of each new R session.
So now we have the readr and dplyr packages installed we can call them with the library() function so we can use them.
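The loading calls appear to be missing here. Given the messages printed below, they would be:

```r
library(readr)  # for reading data into R
library(dplyr)  # for manipulating data
```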
## Warning: package 'readr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Sometimes a package will print a message in the console after loading. This is usually fine and only a problem in very specific circumstances. For example, you might be told that the package was developed using a newer version of R, or perhaps that a function from that package ‘conflicts’ with another already-installed function (usually because the two functions have the same name).
Okay, let’s get hold of some data!
Let’s use a dataset that I collected myself. It contains information about organisms I collected from exotic locations spanning the globe from Napoli to Hastings. It’s a file containing a data set of about 700 Pokemon – captured on the Pokemon Go app – with their characteristics data. It’s the very best dataset, like no dataset ever was.
If I haven’t given you the dataset already as a Comma-Separated Values (CSV) file, you can download it from the internet via GitHub. Save it to a folder named ‘data’ in your R Project folder like so:
download.file(
url = "https://raw.githubusercontent.com/mwdray/datasets/master/pokemon_go_captures.csv",
destfile = "data/pokemon_go_captures.csv" # where to save it to
)
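One thing to watch: download.file() won’t create the ‘data’ folder for you, so the download fails if the folder is missing. A small sketch to run before the download, assuming your working directory is the project folder:

```r
# Create the 'data' folder only if it doesn't already exist
if (!dir.exists("data")) {
  dir.create("data")
}
```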
And then you can read it in as follows:
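The call itself seems to have dropped out here. Going by the script shown later in this guide, it’s read_csv() from the readr package. The file.exists() check is my own addition, so that the snippet doesn’t fall over if you haven’t downloaded the file yet:

```r
library(readr)

csv_path <- "data/pokemon_go_captures.csv"

# Only try to read the file if it has actually been downloaded
if (file.exists(csv_path)) {
  pokemon <- read_csv(file = csv_path)
}
```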
## Parsed with column specification:
## cols(
## species = col_character(),
## combat_power = col_integer(),
## hit_points = col_integer(),
## weight_kg = col_double(),
## weight_bin = col_character(),
## height_m = col_double(),
## height_bin = col_character(),
## fast_attack = col_character(),
## charge_attack = col_character()
## )
This function loads the data from the CSV file at the filepath provided (it’s in our ‘data’ folder). It prints a note to the console to tell you the columns that have been read in and also what the data type of each one is. For example, combat_power = col_integer() tells us that the data in this column has been read as integers.
But where is this data? How do we know it’s actually been read in?
If you look at the ‘Environment’ tab in the top-right pane of RStudio, you’ll see our object ‘pokemon’ is there. Helpfully, we’re told it has dimensions of 696 rows and 9 columns.
The first thing we should do is look at the data to check for anomalies. There are a number of ways to do this.
You can take a look at information about your data frame using:
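The function name is missing from the page, but the layout of the output below matches glimpse() from dplyr, which is also used later in this guide. Here is a sketch on a few rows copied from that output, so that it runs without the full file:

```r
library(dplyr)

# First four rows of three columns, copied from the printed output
pokemon_sample <- tibble(
  species = c("krabby", "geodude", "venonat", "parasect"),
  combat_power = c(51L, 85L, 129L, 171L),
  hit_points = c(15L, 23L, 38L, 32L)
)

# One line per column: its name, its type and the first few values
glimpse(pokemon_sample)
```

On the full data the call would be glimpse(pokemon).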
## Observations: 696
## Variables: 9
## $ species <chr> "krabby", "geodude", "venonat", "parasect", "eev...
## $ combat_power <int> 51, 85, 129, 171, 172, 131, 96, 11, 112, 156, 12...
## $ hit_points <int> 15, 23, 38, 32, 37, 320, 21, 10, 30, 35, 26, 38,...
## $ weight_kg <dbl> 5.82, 20.88, 20.40, 19.20, 4.18, 11.20, 3.49, 36...
## $ weight_bin <chr> "normal", "normal", "extra_small", "extra_small"...
## $ height_m <dbl> 0.36, 0.37, 0.92, 0.87, 0.25, 0.48, 0.27, 0.80, ...
## $ height_bin <chr> "normal", "normal", "normal", "normal", "normal"...
## $ fast_attack <chr> "mud_shot", "rock_throw", "confusion", "bug_bite...
## $ charge_attack <chr> "vice_grip", "rock_tomb", "poison_fang", "x-scis...
Immediately this tells us that there are 696 rows and 9 columns. Column names are then listed with the data type and first few examples. This information is also available in the environment tab in the upper-right pane. Click the little blue arrow to have this information drop down.
Another way of inspecting the data is to simply print() it to the console. The output is displayed in table format, but is truncated to fit the console window (this prevents you from printing millions of rows to the console!).
## # A tibble: 696 x 9
## species combat_power hit_points weight_kg weight_bin height_m
## <chr> <int> <int> <dbl> <chr> <dbl>
## 1 krabby 51 15 5.82 normal 0.360
## 2 geodude 85 23 20.9 normal 0.370
## 3 venonat 129 38 20.4 extra_small 0.920
## 4 parasect 171 32 19.2 extra_small 0.870
## 5 eevee 172 37 4.18 extra_small 0.250
## 6 voltorb 131 320 11.2 normal 0.480
## 7 shellder 96 21 3.49 normal 0.270
## 8 staryu 11 10 36.4 normal 0.800
## 9 nidoran_male 112 30 9.49 normal 0.510
## 10 poliwag 156 35 11.2 normal 0.580
## # ... with 686 more rows, and 3 more variables: height_bin <chr>,
## # fast_attack <chr>, charge_attack <chr>
If you want to see the whole dataset you could use the View() function:
This opens up a read-only tab in the script window that displays your data in full. You can scroll around and order the columns by clicking the headers. This doesn’t affect the underlying data at all.
You can also access this by clicking the little image of a table to the right of the object in the environment pane (upper-right).
You can get very quick summary statistics with the summary() function. The function provides a quick summary of each column depending on its data type (integer, character, etc). This is pretty basic, but we’ll do something more impressive later.
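The call isn’t shown on the page; on the full data it would presumably be summary(pokemon). Here is a self-contained sketch on a few rows copied from the earlier output:

```r
# A small stand-in for the full data, using rows from the printed output
pokemon_sample <- data.frame(
  species = c("krabby", "geodude", "venonat", "parasect"),
  combat_power = c(51L, 85L, 129L, 171L),
  hit_points = c(15L, 23L, 38L, 32L),
  stringsAsFactors = FALSE
)

# Numeric columns get quartiles and a mean; character columns get a length
pokemon_summary <- summary(pokemon_sample)
print(pokemon_summary)
```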
## species combat_power hit_points weight_kg
## Length:696 Min. : 10.0 Min. : 10.00 Min. : 0.050
## Class :character 1st Qu.: 76.0 1st Qu.: 23.00 1st Qu.: 2.795
## Mode :character Median : 160.0 Median : 33.00 Median : 6.440
## Mean : 206.1 Mean : 37.42 Mean : 15.053
## 3rd Qu.: 286.0 3rd Qu.: 47.00 3rd Qu.: 20.163
## Max. :1636.0 Max. :320.00 Max. :492.040
## weight_bin height_m height_bin fast_attack
## Length:696 Min. :0.2000 Length:696 Length:696
## Class :character 1st Qu.:0.3100 Class :character Class :character
## Mode :character Median :0.5050 Mode :character Mode :character
## Mean :0.6544
## 3rd Qu.:0.8900
## Max. :9.5200
## charge_attack
## Length:696
## Class :character
## Mode :character
##
##
##
We’re going to use a number of sensibly-named functions from the dplyr package to do our data manipulation. These functions take verbs – not too dissimilar to SQL verbs – as their names. This makes it easy to understand what they’re doing.
dplyr is part of a suite of packages within what is called ‘the Tidyverse’. These packages are all written with the same thoughts in mind (e.g. the first argument of all the functions is the data, function names are sensible and written in snake_case, the code is optimised to run quickly, etc).
The tidyverse aims to make things simpler and faster for R coders.
Firstly, we can select() columns of interest. There’s the first sensible function name. You’ll notice that a lot of them are verbs to make it clear that the code is actively doing something.
# save as an object for later
pokemon_hp <- select(
pokemon, # the first argument is always the data
hit_points, # the other arguments are column names you want to keep
species
)
print(pokemon_hp)
## # A tibble: 696 x 2
## hit_points species
## <int> <chr>
## 1 15 krabby
## 2 23 geodude
## 3 38 venonat
## 4 32 parasect
## 5 37 eevee
## 6 320 voltorb
## 7 21 shellder
## 8 10 staryu
## 9 30 nidoran_male
## 10 35 poliwag
## # ... with 686 more rows
Note that the order you select the columns is the order they’ll appear in when they print.
And we can choose not to include certain columns by prefixing with - (hyphen/minus).
select(
pokemon, # data frame first
-hit_points, -combat_power, -fast_attack, -weight_bin # columns to drop
)
## # A tibble: 696 x 5
## species weight_kg height_m height_bin charge_attack
## <chr> <dbl> <dbl> <chr> <chr>
## 1 krabby 5.82 0.360 normal vice_grip
## 2 geodude 20.9 0.370 normal rock_tomb
## 3 venonat 20.4 0.920 normal poison_fang
## 4 parasect 19.2 0.870 normal x-scissor
## 5 eevee 4.18 0.250 normal body_slam
## 6 voltorb 11.2 0.480 normal discharge
## 7 shellder 3.49 0.270 normal bubble_beam
## 8 staryu 36.4 0.800 normal bubble_beam
## 9 nidoran_male 9.49 0.510 normal body_slam
## 10 poliwag 11.2 0.580 normal body_slam
## # ... with 686 more rows
That can be quite laborious, so there are some special functions we can use inside the select function to help us out.
For example, selecting columns starting with a particular string:
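The call behind the output below has been lost. The output keeps the two columns beginning with ‘weight’, which is what the starts_with() helper does. Here is a sketch on a few rows copied from the earlier printout:

```r
library(dplyr)

# A few rows copied from the printed output earlier in this guide
pokemon_sample <- tibble(
  species = c("krabby", "geodude", "venonat", "parasect"),
  weight_kg = c(5.82, 20.88, 20.40, 19.20),
  weight_bin = c("normal", "normal", "extra_small", "extra_small"),
  height_m = c(0.36, 0.37, 0.92, 0.87),
  height_bin = c("normal", "normal", "normal", "normal")
)

# Keep only the columns whose names start with "weight"
weight_cols <- select(pokemon_sample, starts_with("weight"))
print(weight_cols)
```

On the full data the equivalent would be select(pokemon, starts_with("weight")).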
## # A tibble: 696 x 2
## weight_kg weight_bin
## <dbl> <chr>
## 1 5.82 normal
## 2 20.9 normal
## 3 20.4 extra_small
## 4 19.2 extra_small
## 5 4.18 extra_small
## 6 11.2 normal
## 7 3.49 normal
## 8 36.4 normal
## 9 9.49 normal
## 10 11.2 normal
## # ... with 686 more rows
Or any columns containing a given string.
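Again the call is missing. The output below keeps the two ‘_bin’ columns, which fits the contains() helper. The string "bin" is my guess at what was used:

```r
library(dplyr)

# A few rows copied from the printed output earlier in this guide
pokemon_sample <- tibble(
  species = c("krabby", "geodude", "venonat", "parasect"),
  weight_kg = c(5.82, 20.88, 20.40, 19.20),
  weight_bin = c("normal", "normal", "extra_small", "extra_small"),
  height_m = c(0.36, 0.37, 0.92, 0.87),
  height_bin = c("normal", "normal", "normal", "normal")
)

# Keep any column whose name contains "bin" anywhere in it
bin_cols <- select(pokemon_sample, contains("bin"))
print(bin_cols)
```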
## # A tibble: 696 x 2
## weight_bin height_bin
## <chr> <chr>
## 1 normal normal
## 2 normal normal
## 3 extra_small normal
## 4 extra_small normal
## 5 extra_small normal
## 6 normal normal
## 7 normal normal
## 8 normal normal
## 9 normal normal
## 10 normal normal
## # ... with 686 more rows
CHALLENGE!
Create an object called my_selection that uses the select() function to store from pokemon the species column and any columns that end with "attack".
More information in the help file if you type ?select.
Now for subsetting the data by its rows.
We’re going to make use of some common logical operators for subsetting our data by certain conditions:
== – equals
!= – not equals
%in% – match to several things listed with c()
>, <, <=, >= – greater/less than (or equal to)
& – ‘and’
| – ‘or’
Let’s start by filtering for one particular species.
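The filter() call isn’t shown on the page. The output below contains only jigglypuff rows, so it presumably tested species with ==. Here is a sketch on a few rows copied from the printed outputs:

```r
library(dplyr)

# Two jigglypuff rows and two others, copied from the printed outputs
pokemon_sample <- tibble(
  species = c("krabby", "jigglypuff", "geodude", "jigglypuff"),
  combat_power = c(51L, 221L, 85L, 156L),
  hit_points = c(15L, 93L, 23L, 80L)
)

# Keep only the rows where species is exactly "jigglypuff"
jigglypuff_only <- filter(pokemon_sample, species == "jigglypuff")
print(jigglypuff_only)
```

On the full data this would be filter(pokemon, species == "jigglypuff").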
## Warning: package 'bindrcpp' was built under R version 3.4.4
## # A tibble: 11 x 9
## species combat_power hit_points weight_kg weight_bin height_m
## <chr> <int> <int> <dbl> <chr> <dbl>
## 1 jigglypuff 221 93 7.04 extra_large 0.560
## 2 jigglypuff 156 80 6.83 normal 0.550
## 3 jigglypuff 349 119 3.57 extra_small 0.420
## 4 jigglypuff 10 22 4.92 normal 0.440
## 5 jigglypuff 188 94 6.56 normal 0.520
## 6 jigglypuff 33 39 7.14 extra_large 0.580
## 7 jigglypuff 56 51 5.55 normal 0.490
## 8 jigglypuff 66 51 8.13 extra_large 0.600
## 9 jigglypuff 289 111 5.02 normal 0.440
## 10 jigglypuff 348 119 4.91 normal 0.470
## 11 jigglypuff 486 146 4.90 normal 0.440
## # ... with 3 more variables: height_bin <chr>, fast_attack <chr>,
## # charge_attack <chr>
Now everything except for one species.
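The page doesn’t say which species was dropped to get the output below, so the one in this sketch is only a stand-in. The pattern is the same whichever species you choose:

```r
library(dplyr)

# A few rows copied from the printed outputs
pokemon_sample <- tibble(
  species = c("krabby", "jigglypuff", "geodude", "jigglypuff"),
  combat_power = c(51L, 221L, 85L, 156L),
  hit_points = c(15L, 93L, 23L, 80L)
)

# != means 'not equal to', so this drops every jigglypuff row
# (jigglypuff is an illustrative choice, not the species used originally)
not_jigglypuff <- filter(pokemon_sample, species != "jigglypuff")
print(not_jigglypuff)
```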
## # A tibble: 610 x 9
## species combat_power hit_points weight_kg weight_bin height_m
## <chr> <int> <int> <dbl> <chr> <dbl>
## 1 krabby 51 15 5.82 normal 0.360
## 2 geodude 85 23 20.9 normal 0.370
## 3 venonat 129 38 20.4 extra_small 0.920
## 4 parasect 171 32 19.2 extra_small 0.870
## 5 eevee 172 37 4.18 extra_small 0.250
## 6 voltorb 131 320 11.2 normal 0.480
## 7 shellder 96 21 3.49 normal 0.270
## 8 staryu 11 10 36.4 normal 0.800
## 9 nidoran_male 112 30 9.49 normal 0.510
## 10 poliwag 156 35 11.2 normal 0.580
## # ... with 600 more rows, and 3 more variables: height_bin <chr>,
## # fast_attack <chr>, charge_attack <chr>
Now filtering to include three species only.
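The call is missing here too. The output below shows staryu and psyduck, but the third species isn’t visible, so eevee in this sketch is a placeholder of mine:

```r
library(dplyr)

# A few rows copied from the printed outputs
pokemon_sample <- tibble(
  species = c("staryu", "psyduck", "krabby", "eevee"),
  combat_power = c(11L, 97L, 51L, 172L),
  hit_points = c(10L, 26L, 15L, 37L)
)

# %in% keeps rows whose species matches any value in the vector
# (the third species, "eevee", is a placeholder)
three_species <- filter(
  pokemon_sample,
  species %in% c("staryu", "psyduck", "eevee")
)
print(three_species)
```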
## # A tibble: 39 x 9
## species combat_power hit_points weight_kg weight_bin height_m
## <chr> <int> <int> <dbl> <chr> <dbl>
## 1 staryu 11 10 36.4 normal 0.800
## 2 psyduck 97 26 26.0 extra_large 0.900
## 3 psyduck 41 17 23.6 normal 0.910
## 4 staryu 225 25 36.4 normal 0.730
## 5 staryu 154 23 18.8 extra_small 0.590
## 6 staryu 11 10 18.9 extra_small 0.680
## 7 staryu 260 29 44.2 extra_large 0.850
## 8 psyduck 44 19 23.4 normal 0.720
## 9 staryu 112 19 28.1 normal 0.780
## 10 staryu 144 23 50.4 extra_large 0.970
## # ... with 29 more rows, and 3 more variables: height_bin <chr>,
## # fast_attack <chr>, charge_attack <chr>
We can work with numbers too.
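The original conditions aren’t shown. Every row in the output below has combat_power above 900 and hit_points below 100, so this sketch uses those thresholds as a guess:

```r
library(dplyr)

# A few rows copied from the printed outputs
pokemon_sample <- tibble(
  species = c("gyarados", "magmar", "krabby", "lapras"),
  combat_power = c(955L, 936L, 51L, 1636L),
  hit_points = c(94L, 70L, 15L, 161L)
)

# Both conditions must be true for a row to be kept
# (the thresholds 900 and 100 are assumptions, not from the original)
strong_but_fragile <- filter(
  pokemon_sample,
  combat_power > 900 & hit_points < 100
)
print(strong_but_fragile)
```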
## # A tibble: 7 x 9
## species combat_power hit_points weight_kg weight_bin height_m height_bin
## <chr> <int> <int> <dbl> <chr> <dbl> <chr>
## 1 gyarad~ 955 94 177. normal 5.58 normal
## 2 magmar 936 70 40.4 normal 1.16 normal
## 3 magmar 991 73 31.1 extra_sma~ 1.23 normal
## 4 magmar 963 75 42.5 normal 1.28 normal
## 5 pinsir 1184 84 68.1 normal 1.61 normal
## 6 fearow 954 83 40.6 normal 1.20 normal
## 7 electa~ 962 74 39.0 extra_lar~ 1.26 normal
## # ... with 2 more variables: fast_attack <chr>, charge_attack <chr>
CHALLENGE!
Filter the pokemon dataframe to include species rows that:
combat_power (hint: you’ll need an &)
hit_points
How many Pokemon are in this subset?
Now to create new columns. We use mutate() because we’re mutating our dataframe – we’re budding a new column where there wasn’t one before. Often you’ll be creating new columns based on the content of columns that already exist, or you can fill the entire column with one thing.
For now, we’re going to create column names without spaces. It’s easier.
# we're going to subset by columns first
pokemon_power_hp <- select( # create new object by subsetting our data set
pokemon, # data
species, combat_power, hit_points # columns to keep
)
# now to mutate with some extra information
mutate(
pokemon_power_hp, # our new, subsetted data frame
power_index = combat_power * hit_points, # new column from old ones
caught = 1, # new column will fill entirely with number
area = "kanto" # will fill entirely with this text
)
## # A tibble: 696 x 6
## species combat_power hit_points power_index caught area
## <chr> <int> <int> <int> <dbl> <chr>
## 1 krabby 51 15 765 1. kanto
## 2 geodude 85 23 1955 1. kanto
## 3 venonat 129 38 4902 1. kanto
## 4 parasect 171 32 5472 1. kanto
## 5 eevee 172 37 6364 1. kanto
## 6 voltorb 131 320 41920 1. kanto
## 7 shellder 96 21 2016 1. kanto
## 8 staryu 11 10 110 1. kanto
## 9 nidoran_male 112 30 3360 1. kanto
## 10 poliwag 156 35 5460 1. kanto
## # ... with 686 more rows
So we’ve created a new column, caught, that’s filled for every row with 1 and another new column filled with kanto for every row.
Note that if you pass a vector of more than one value to mutate(), the vector won’t be ‘recycled’ for each row of your dataset. In other words putting mutate(new_column = c(1, 2)) won’t result in a ‘1’ in row 1, a ‘2’ in row 2, a ‘1’ in row 3 and so on. To do this, you can use transform(new_column = c(1, 2)).
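Here is a small sketch of that behaviour on four rows copied from the printed output. Staying within mutate() by repeating the values with rep_len() is my own suggestion rather than something from the original session:

```r
library(dplyr)

# Four rows copied from the printed output
pokemon_sample <- tibble(
  species = c("krabby", "geodude", "venonat", "parasect"),
  hit_points = c(15L, 23L, 38L, 32L)
)

# A length-2 vector isn't recycled over 4 rows, so mutate() raises an error
attempt <- tryCatch(
  mutate(pokemon_sample, new_column = c(1, 2)),
  error = function(e) "error"
)

# Repeating the values explicitly to the number of rows does work
recycled <- mutate(
  pokemon_sample,
  new_column = rep_len(c(1, 2), length.out = n())
)
print(recycled)
```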
You can mutate a little more easily with an if_else() statement:
mutate(
pokemon_hp,
common = if_else(
condition = species %in% c( # if this condition is met...
"pidgey", "rattata", "drowzee",
"spearow", "magikarp", "weedle",
"staryu", "psyduck", "eevee"
),
true = "yes", # ...fill column with this string
false = "no" # ...otherwise fill it with this string
)
)
## # A tibble: 696 x 3
## hit_points species common
## <int> <chr> <chr>
## 1 15 krabby no
## 2 23 geodude no
## 3 38 venonat no
## 4 32 parasect no
## 5 37 eevee yes
## 6 320 voltorb no
## 7 21 shellder no
## 8 10 staryu yes
## 9 30 nidoran_male no
## 10 35 poliwag no
## # ... with 686 more rows
And we can get more nuanced by using a case_when() statement (you may have seen this in SQL). This prevents us writing nested if_else() statements to specify multiple conditions.
mutate(
pokemon_hp, # data
common = case_when(
species %in% c("pidgey", "rattata", "drowzee") ~ "very_common",
species == "spearow" ~ "pretty_common",
species %in% c("magikarp", "weedle", "staryu", "psyduck") ~ "common",
species == "eevee" ~ "less_common",
TRUE ~ "no"
)
)
## # A tibble: 696 x 3
## hit_points species common
## <int> <chr> <chr>
## 1 15 krabby no
## 2 23 geodude no
## 3 38 venonat no
## 4 32 parasect no
## 5 37 eevee less_common
## 6 320 voltorb no
## 7 21 shellder no
## 8 10 staryu common
## 9 30 nidoran_male no
## 10 35 poliwag no
## # ... with 686 more rows
CHALLENGE!
Create a new dataframe object that takes the pokemon data and adds a column containing Pokemon body-mass index (BMI).
Hint: BMI is weight over height squared (you can square a number by writing ^2 after it).
Now use a case_when() to categorise Pokemon:
Note that these are BMI groups for humans. And that BMI has many limitations!
The arrange() function does what it says on the tin: it alters the order of the rows in your table according to some column specification.
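The call behind the output below isn’t shown. The rows there run from shortest to tallest, which suggests arrange() on the height_m column. Here is a sketch on four rows copied from the earlier printout:

```r
library(dplyr)

# Four rows copied from the printed output
pokemon_sample <- tibble(
  species = c("krabby", "geodude", "venonat", "parasect"),
  height_m = c(0.36, 0.37, 0.92, 0.87)
)

# Sort rows by height, smallest first (ascending is the default)
shortest_first <- arrange(pokemon_sample, height_m)
print(shortest_first)
```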
## # A tibble: 696 x 9
## species combat_power hit_points weight_kg weight_bin height_m
## <chr> <int> <int> <dbl> <chr> <dbl>
## 1 diglett 79 10 0.790 normal 0.200
## 2 pidgey 254 44 0.820 extra_small 0.210
## 3 rattata 23 11 1.52 extra_small 0.220
## 4 pidgey 229 43 0.850 extra_small 0.220
## 5 weedle 17 13 2.25 extra_small 0.220
## 6 spearow 296 47 0.690 extra_small 0.220
## 7 spearow 89 26 1.06 extra_small 0.220
## 8 pidgey 256 46 0.820 extra_small 0.230
## 9 rattata 64 17 2.70 normal 0.230
## 10 diglett 64 10 1.05 extra_large 0.230
## # ... with 686 more rows, and 3 more variables: height_bin <chr>,
## # fast_attack <chr>, charge_attack <chr>
And in reverse order (tallest first):
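For the reverse order, dplyr provides desc() to wrap the column name. I’m assuming that’s what was used for the output below:

```r
library(dplyr)

# Four rows copied from the printed output
pokemon_sample <- tibble(
  species = c("krabby", "geodude", "venonat", "parasect"),
  height_m = c(0.36, 0.37, 0.92, 0.87)
)

# desc() flips the sort so the largest values come first
tallest_first <- arrange(pokemon_sample, desc(height_m))
print(tallest_first)
```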
## # A tibble: 696 x 9
## species combat_power hit_points weight_kg weight_bin height_m
## <chr> <int> <int> <dbl> <chr> <dbl>
## 1 onix 299 38 192. normal 9.52
## 2 gyarados 955 94 177. normal 5.58
## 3 pidgey 76 26 1.25 extra_small 2.50
## 4 ekans 206 35 11.6 extra_large 2.46
## 5 lapras 1636 161 163. extra_small 2.22
## 6 snorlax 300 85 492. normal 2.11
## 7 dratini 298 42 4.40 extra_large 2.08
## 8 dratini 332 44 4.75 extra_large 1.99
## 9 ekans 95 24 5.20 normal 1.93
## 10 dratini 316 40 3.13 normal 1.91
## # ... with 686 more rows, and 3 more variables: height_bin <chr>,
## # fast_attack <chr>, charge_attack <chr>
CHALLENGE!
What happens if you arrange by a column containing characters rather than numbers? For example, the species column.
Again, another verb that mirrors what you can find in SQL. There are several types of join, but we’re going to focus on the most common one: the left_join(). This joins information from one table – y – to another – x – by some key matching variable of our choice.
Let’s start by reading in a lookup table that provides some extra information about our species.
## Parsed with column specification:
## cols(
## species = col_character(),
## pokedex_number = col_integer(),
## type1 = col_character(),
## type2 = col_character()
## )
## Observations: 801
## Variables: 4
## $ species <chr> "bulbasaur", "ivysaur", "venusaur", "charmander...
## $ pokedex_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ type1 <chr> "grass", "grass", "grass", "fire", "fire", "fir...
## $ type2 <chr> "poison", "poison", "poison", NA, NA, "flying",...
Now we’re going to join this new data to our pokemon data. The key for matching these is the species column, which exists in both datasets.
pokemon_join <- left_join(
x = pokemon, # to this table...
y = pokedex, # ...join this table
by = "species" # on this key
)
glimpse(pokemon_join)
## Observations: 696
## Variables: 12
## $ species <chr> "krabby", "geodude", "venonat", "parasect", "ee...
## $ combat_power <int> 51, 85, 129, 171, 172, 131, 96, 11, 112, 156, 1...
## $ hit_points <int> 15, 23, 38, 32, 37, 320, 21, 10, 30, 35, 26, 38...
## $ weight_kg <dbl> 5.82, 20.88, 20.40, 19.20, 4.18, 11.20, 3.49, 3...
## $ weight_bin <chr> "normal", "normal", "extra_small", "extra_small...
## $ height_m <dbl> 0.36, 0.37, 0.92, 0.87, 0.25, 0.48, 0.27, 0.80,...
## $ height_bin <chr> "normal", "normal", "normal", "normal", "normal...
## $ fast_attack <chr> "mud_shot", "rock_throw", "confusion", "bug_bit...
## $ charge_attack <chr> "vice_grip", "rock_tomb", "poison_fang", "x-sci...
## $ pokedex_number <int> 98, 74, 48, 47, 133, 100, 90, 120, 32, 60, 46, ...
## $ type1 <chr> "water", "rock", "bug", "bug", "normal", "elect...
## $ type2 <chr> NA, "ground", "poison", "grass", NA, NA, NA, NA...
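It can help to see what a left join does with rows that have no match in the lookup. Here is a minimal sketch using two made-up miniature tables (captures and lookup are hypothetical names, not the real data): every row of x is kept, unmatched rows get NA in the new columns, and rows that exist only in y are dropped.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables (not the real data)
captures <- data.frame(
  species = c("pidgey", "rattata", "missingno"),
  combat_power = c(86, 78, 10),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  species = c("pidgey", "rattata", "zubat"),
  type1 = c("normal", "normal", "poison"),
  stringsAsFactors = FALSE
)

# Keep every row of captures and add the matching columns from lookup
joined <- left_join(x = captures, y = lookup, by = "species")

nrow(joined)   # same number of rows as captures
joined$type1   # the species with no match in lookup gets NA
```

This is why the joined dataframe above still has 696 observations: the join adds columns to the captures, it doesn’t add or remove rows.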
CHALLENGE!
Try right_join() instead of left_join(). What happens? And what about anti_join()?
This document does not contain an exhaustive list of the functions in the same family as select(), filter(), mutate(), arrange() and *_join(). There are other functions that will be useful for your work and other ways of manipulating your data. For example, the stringr package helps with handling strings (text data).
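As a small taste of stringr, here is a sketch of three of its functions applied to a few values like those in the output above. The object names (attacks, cleaned and so on) are just for illustration.

```r
library(stringr)

# A few attack names like those in the glimpse() output above
attacks <- c("mud_shot", "rock_throw", "bug_bite")

cleaned <- str_replace_all(attacks, "_", " ")  # swap underscores for spaces
titled <- str_to_title("bulbasaur")            # capitalise the first letter
has_rock <- str_detect(attacks, "rock")        # which values contain "rock"?

cleaned
titled
has_rock
```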
Alright great, we’ve seen how to manipulate our dataframe a bit. But we’ve been doing it one discrete step at a time, so your script might end up looking something like this:
pokemon <- read_csv(file = "data/pokemon_go_captures.csv")
pokemon_select <- select(pokemon, -height_bin, -weight_bin)
pokemon_filter <- filter(pokemon_select, weight_kg > 15)
pokemon_mutate <- mutate(pokemon_filter, organism = "pokemon")
In other words, you might end up creating lots of intermediate variables, cluttering up your workspace and filling up memory.
You could do all this in one step by nesting each function inside the others, but that would be super messy and hard to read. Instead we’re going to ‘pipe’ data from one function to the next. The pipe operator, %>%, says ‘take what’s on the left and pass it through to the next function’.
So you can do it all in one step:
pokemon_piped <- read_csv(file = "data/pokemon_go_captures.csv") %>%
select(-height_bin, -weight_bin) %>%
filter(weight_kg > 15) %>%
mutate(organism = "pokemon")
## Parsed with column specification:
## cols(
## species = col_character(),
## combat_power = col_integer(),
## hit_points = col_integer(),
## weight_kg = col_double(),
## weight_bin = col_character(),
## height_m = col_double(),
## height_bin = col_character(),
## fast_attack = col_character(),
## charge_attack = col_character()
## )
## Observations: 204
## Variables: 8
## $ species <chr> "geodude", "venonat", "parasect", "staryu", "ven...
## $ combat_power <int> 85, 129, 171, 11, 137, 256, 234, 157, 140, 246, ...
## $ hit_points <int> 23, 38, 32, 10, 38, 64, 33, 49, 56, 42, 45, 34, ...
## $ weight_kg <dbl> 20.88, 20.40, 19.20, 36.41, 41.23, 30.20, 73.81,...
## $ height_m <dbl> 0.37, 0.92, 0.87, 0.80, 1.26, 0.84, 1.52, 0.94, ...
## $ fast_attack <chr> "rock_throw", "confusion", "bug_bite", "water_gu...
## $ charge_attack <chr> "rock_tomb", "poison_fang", "x-scissor", "bubble...
## $ organism <chr> "pokemon", "pokemon", "pokemon", "pokemon", "pok...
This reads as:
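to the object named pokemon_piped, assign (<-) the contents of a CSV file read with read_csv()
then select out the height_bin and weight_bin columns
then filter to the rows where weight_kg is over 15
then add a new column called organism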
See how this is like a recipe?
Did you notice something? We didn’t have to keep calling the dataframe object in each function call. For example, we used filter(weight_kg > 15) rather than filter(pokemon, weight_kg > 15) because the data argument was piped in. The functions mentioned above all accept the data that’s being passed into them because, like other Tidyverse functions, they take the data as their first argument. (Note that this is not true for all functions, but we can talk about that later.)
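To convince yourself that the pipe is only a different way of writing the same thing, here is a small sketch comparing a nested call with its piped equivalent. The toy dataframe is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  species = c("geodude", "krabby", "staryu"),
  weight_kg = c(20.88, 5.82, 36.41),
  stringsAsFactors = FALSE
)

# Nested: you have to read from the inside out
nested <- mutate(filter(toy, weight_kg > 15), organism = "pokemon")

# Piped: you read from top to bottom
piped <- toy %>%
  filter(weight_kg > 15) %>%
  mutate(organism = "pokemon")

identical(nested, piped)  # both approaches give the same dataframe
```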
Here’s another simple example using the dataframe we built earlier:
my_df <- data.frame(
species = c("Pichu", "Pikachu", "Raichu"),
number = c(172, 25, 26),
location = c("Johto", "Kanto", "Kanto")
)
my_df %>% # take the dataframe object...
select(species, number) %>% # ...then select these columns...
filter(number %in% c(172, 26)) # ...then filter on these values
## species number
## 1 Pichu 172
## 2 Raichu 26
Nice and easy to read.
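A quick aside on %in%, which we used in the filter above. It asks ‘is each value on the left found anywhere in the set on the right?’. That’s different from ==, which compares values position by position and recycles the shorter vector. A minimal sketch (in_set and pairwise are names made up for this example):

```r
number <- c(172, 25, 26)

# Is each number a member of the set? Checked one element at a time
in_set <- number %in% c(172, 26)

# == recycles c(172, 26) to c(172, 26, 172) and compares by position,
# which is rarely what you want (R also warns about the length mismatch)
pairwise <- suppressWarnings(number == c(172, 26))

in_set    # TRUE FALSE TRUE
pairwise  # TRUE FALSE FALSE
```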
CHALLENGE!
Write a pipe recipe that creates a new dataframe called my_poke that takes the pokemon dataframe and:
select()s only the species and combat_power columns
left_join()s the pokedex dataframe by species
filter()s by those with a type1 that’s ‘normal’
Assuming we’ve now wrangled our data using the dplyr functions, we can do some quick, readable summarisation that’s way better than the summary() function.
So let’s use our knowledge – and some new functions – to get the top 5 pokemon by count.
pokemon %>% # take the dataframe
group_by(species) %>% # group it by species
tally() %>% # tally up (count) the number of instances
arrange(desc(n)) %>% # arrange in descending order
slice(1:5) # and slice out the first five rows
## # A tibble: 5 x 2
## species n
## <chr> <int>
## 1 pidgey 86
## 2 rattata 78
## 3 drowzee 64
## 4 spearow 42
## 5 zubat 35
The order of your functions is important – remember it’s like a recipe. Don’t crack the eggs on your cake just before serving. Do it near the beginning somewhere, I guess (I’m not much of a cake maker).
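Here’s a minimal sketch of why the order matters, using a small made-up table of counts. Sorting and then slicing gives the top two species, whereas slicing and then sorting only sorts whichever two rows happened to come first.

```r
library(dplyr)

# Hypothetical counts in no particular order, for illustration only
counts <- data.frame(
  species = c("zubat", "pidgey", "spearow", "rattata"),
  n = c(35, 86, 42, 78),
  stringsAsFactors = FALSE
)

# Sort first, then take the top two rows
sorted_then_sliced <- counts %>%
  arrange(desc(n)) %>%
  slice(1:2)

# Take the first two rows, then sort them
sliced_then_sorted <- counts %>%
  slice(1:2) %>%
  arrange(desc(n))

sorted_then_sliced$species  # "pidgey" "rattata"
sliced_then_sorted$species  # "pidgey" "zubat"
```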
There’s also a specific summarise() function that allows you to, well… summarise.
pokemon_join %>% # take the dataframe
group_by(type1) %>% # group by variable
summarise( # summarise it by...
count = n(), # counting the number
mean_cp = round(mean(combat_power), 1) # and taking a mean to 1 dp
) %>%
arrange(desc(mean_cp)) # then organise in descending order of this column
## # A tibble: 16 x 3
## type1 count mean_cp
## <chr> <int> <dbl>
## 1 fire 16 510.
## 2 fairy 5 412.
## 3 <NA> 3 390.
## 4 electric 12 373.
## 5 fighting 1 358.
## 6 grass 17 357.
## 7 dragon 4 326.
## 8 psychic 70 301.
## 9 ice 7 275.
## 10 ground 7 214.
## 11 water 157 192.
## 12 rock 9 190.
## 13 bug 63 185.
## 14 ghost 12 170.
## 15 poison 59 168.
## 16 normal 254 157.
Note that you can group by more than one thing as well. We can group on the weight_bin category within the type1 category, for example.
pokemon_join %>%
group_by(type1, weight_bin) %>%
summarise(
mean_weight = mean(weight_kg),
count = n()
)
## # A tibble: 40 x 4
## # Groups: type1 [?]
## type1 weight_bin mean_weight count
## <chr> <chr> <dbl> <int>
## 1 bug extra_large 29.1 9
## 2 bug extra_small 8.98 16
## 3 bug normal 10.6 38
## 4 dragon extra_large 4.58 2
## 5 dragon normal 2.95 2
## 6 electric extra_large 18.7 3
## 7 electric extra_small 5.74 2
## 8 electric normal 18.5 7
## 9 fairy extra_large 9.47 2
## 10 fairy normal 7.96 3
## # ... with 30 more rows
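Notice the ‘Groups: type1’ line in that output. After summarising over two grouping variables, the result is still grouped by the first one, which can surprise you in later steps. A minimal sketch with a made-up toy dataframe, using group_vars() to inspect the grouping and ungroup() to remove it (recent versions of dplyr may also print a message about the grouping here):

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  type1 = c("bug", "bug", "bug", "fire"),
  weight_bin = c("normal", "normal", "extra_large", "normal"),
  weight_kg = c(10, 12, 30, 8),
  stringsAsFactors = FALSE
)

by_both <- toy %>%
  group_by(type1, weight_bin) %>%
  summarise(mean_weight = mean(weight_kg))

group_vars(by_both)           # "type1": one level of grouping remains
group_vars(ungroup(by_both))  # character(0): no grouping left
```

Adding ungroup() to the end of a recipe is a cheap way to make sure later steps work on the whole table.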
We’re going to keep this very short and dangle it like a rare candy in front of your nose. We’ll revisit this in more depth in a later session. For now, we’re going to use a package called ggplot2 to create some simple charts.
CHALLENGE!
Remember how to use packages? Install ggplot2 and load it from the library.
The ‘gg’ in ‘ggplot2’ stands for ‘grammar of graphics’. This is a way of thinking about plotting as having a ‘grammar’: elements that can be applied in succession to create a plot. This is ‘the idea that you can build every graph from the same few components’: a data set, geoms (marks representing data points), a co-ordinate system and some other things.
The ggplot() function from the ggplot2 package is how you create these plots. You build up the graphical elements using the + rather than a pipe. Think about it as placing down a canvas and then adding layers on top.
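To make the canvas-and-layers idea concrete, here is a minimal sketch that builds a plot one piece at a time. The toy dataframe holds a few height and weight values like those shown earlier, and the object names (canvas, with_points, with_labels) are made up for illustration.

```r
library(ggplot2)

# A few height and weight values, for illustration only
toy <- data.frame(
  height_m = c(0.36, 0.37, 0.92, 0.87),
  weight_kg = c(5.82, 20.88, 20.40, 19.20)
)

# Put down a blank canvas that knows about the data
canvas <- ggplot(data = toy)

# Add a layer of points, mapping columns to the x and y axes
with_points <- canvas +
  geom_point(aes(x = height_m, y = weight_kg))

# Add axis labels on top
with_labels <- with_points +
  labs(x = "Height (m)", y = "Weight (kg)")

with_labels  # printing the object draws the plot
```

In practice you’d usually chain all of this together with + in one go rather than saving each stage.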
ggplot plays nicely with the pipe (it’s part of the Tidyverse), so we can create recipes that combine data reading, data manipulation and plotting all in one go. Let’s do some manipulation before plotting and then introduce some new elements to our plot that simplify the theme and change the labels.
pokemon_join %>%
filter(type1 %in% c("fire", "water", "grass")) %>%
ggplot() +
geom_violin(aes(x = type1, y = combat_power)) +
theme_bw() +
labs(
title = "CP by type",
x = "Primary type",
y = "Combat power"
)
How about a scatter plot, with the points coloured by type1?
pokemon_join %>%
filter(type1 %in% c("fire", "water", "grass")) %>%
ggplot() +
geom_point(aes(x = pokedex_number, y = height_m, colour = type1))
CHALLENGE!
Create a boxplot for Pokemon with type1 of ‘normal’, ‘poison’, ‘ground’ and ‘water’ against their hit points.
Simple, but relatively effective. We’ll look next time at plotting in more depth. For example, yes: you can use Pokemon sprites as your plotting points. And why stop there? You can also use specific Pokemon typing colours, sprite colour palettes and theme your barplot like a Pokemon first generation HP bar. Cool, eh?
readr
, dplyr
and ggplot2
)