class: center, middle, inverse

<img src = 'img/drake-hex.png' width=20%>

# Reproducible workflows with {drake}

Bioinformatics London Meetup, 2020-01-30

[mattdray](https://twitter.com/mattdray)
[matt-dray](https://github.com/matt-dray)
[rostrum.blog](https://www.rostrum.blog/)

???

* Thank you for being here instead of the RStudio conference
* I have an entomology MSc and a PhD on ecosystem processes under environmental change (watching leaves rot, if you want to get technical)
* Over five years as a government analyst, much of it spent encouraging a switch to R from Excel-led processes
* (I'm talking today as me, not on behalf of my employer)

---
class: middle, center

<img src="img/dre-snoop.jpg" height=90%/>

???

* An apology: I'm sorry if you came here expecting a different Dr Dray

---
class: middle, center

<img src="img/drake.jpg" height=100%/>

???

* Another apology: I'm not talking about Canada's best rapper

---
class: middle, center

<img src="img/drake-hex.png" width="45%"/>

[{drake}](https://docs.ropensci.org/drake/) by [Will Landau](https://wlandau.github.io/)

???

* {drake}: the R package built by Will Landau for managing workflows and making your analysis pipeline more reproducible
* Its symbols are a brain and a hammer; I think because {drake} stops you from hammering out your own brain

---
class: middle

🙋 Do you:

--

* use R?

--

* use a workflow manager?

--

* use {drake}?

--

* have FOMO?

---
class: middle, inverse

# Reproducevangelism

???

* I don't want to condescend or dictate, I want to build from the general principle of 'reproducible is good'
* So let's define 'reproducibility'
* Very important in scientific fields given the reproducibility crisis

---
class: middle, center

![](img/turing-way-reproducibility.jpg)

From [The Turing Way](https://the-turing-way.netlify.com/introduction/introduction) by The Alan Turing Institute

???

* We could get into a heated debate about the definition
* The book is a great free resource for reproducibility in general
* I think we care about both (i) recreating outputs and (ii) updating with fresh data
* I'm going to refer to this top row as reproducibility throughout

---
class: middle

Can I recreate what you did:

--

* from scratch?

--

* on a different machine?

--

* in the future?

--

* without you present?

???

* Do I have all the code, files and documentation?
* Can I do it on another OS with dependencies missing?
* Will the code still work next year, or will dependency changes break it?
* Did you document it well enough? Is everything stored somewhere remotely?

---
class: middle, center

[Reproducible Analytical Pipelines](https://ukgovdatascience.github.io/rap-website/) (RAP)

<img src='img/rap_v5_hex.png' width='40%'>

[Can {drake} RAP?](https://www.rostrum.blog/2019/07/23/can-drake-rap/)

???

* In government we recognise the benefits of reproducible analysis and publications
* Literally this is a badge of pride in government
* We want to make sure the whole end-to-end process from data to publication is repeatable
* We want to build trust, but also make it easier to update publications as new data emerges
* I wrote a thing about how {drake} might be useful for RAP in government

---
class: middle

Can I trust these outputs?

--

Can _you_ trust these outputs?

???

* Working in a reproducible way benefits you, with the excellent by-product of also helping others

---
class: middle

Today's focus:

1. Make workflows reproducible
1. Use {drake}

???

* We could talk about this all day, but I'm focusing on one particular thing today
* If these both mean nothing to you, then great
* If you know about workflow reproducibility, maybe you haven't considered {drake}
* If you do both these things, then you should be here talking instead of me

---
class: heading-slide, middle, inverse

# 1. Make workflows reproducible

???

* I'm referring to analysis 'workflows', but you might say 'pipelines'
* As in you read, wrangle, model, plot and report data in some fashion
* The process of turning inputs to outputs
* This talk assumes that you do this in R, but the ideas are transferable

---
class: middle

R has [many reproducibility tools](https://annakrystalli.me/talks/r-in-repro-research-dc.html#1), like:

* [RStudio Projects](https://swcarpentry.github.io/r-novice-gapminder/02-project-intro/) to keep everything together
* [R Markdown](https://rmarkdown.rstudio.com/) for reproducible docs
* [packages](https://www.hvitfeldt.me/blog/usethis-workflow-for-package-development/) for reusable functions
* [{here}](https://github.com/jennybc/here_here) for relative filepaths
* [{renv}](https://rstudio.github.io/renv/articles/renv.html) for dependency management

???

* Some general tools, but also packages for very specific things like filepath management
* Of course, there are language-agnostic tools like Git and Docker too
* There are similar tools in other languages

---
class: middle

What about your analytical workflow itself?

How do you keep track of function, file and object relationships?

???

* Have you considered this before?
* You might be asking why you would even do this -- is it really a big deal?

---
class: middle

What if:

--

* you haven't recorded the steps?

--

* the interdependencies become complex?

--

* some steps are computationally intensive?

--

* something changes?

???

* You might be thinking 'no, what's the point?'
* What impact does a change to file 'A' have?
* What if it's time intensive?
* Do you re-run everything from scratch if there's a change?
* Maybe this isn't a big deal for a small analysis, but things can get out of hand quickly

---
class: middle, center

You can't remember this

<img src='img/frederik-aust.png' width='60%'>

???

* Here's a real analysis
* All of the files, functions and objects that result in one final output on the right
* The spaghetti is all the relationships
* How could you possibly remember the exact things to re-run if something changes?

---
class: middle

Maybe `01-read.R`, `02-wrangle.R`, etc?

???

* This doesn't scale well -- might be okay for smaller projects
* Implies the analysis is perfectly sequential -- but some objects from file `01` might be applied in `03` rather than `02`
* You'll still have to re-run from scratch sometimes

---
class: middle, center

Let a workflow manager handle it

<img src='img/awesome.png' width=95%>
[pditommaso/awesome-pipeline](https://github.com/pditommaso/awesome-pipeline)

???

* There are many workflow managers -- some general, some language-specific
* The one that most people might be familiar with is Make (capital 'M')
* It's language agnostic, but has some limitations -- it operates on files, for example, and requires you to learn Make itself
* This is a powerful thought technology

---
class: heading-slide, middle, inverse

# 2. Use {drake}

???

* If there are so many managers, then why {drake}?

---
class: middle

{drake} is compelling because it's:

--

* R-specific

--

* free

--

* part of [rOpenSci](https://ropensci.org/)

--

* under active development

--

* well documented, with plenty of examples

???

* For {drake}, targets are any arbitrary values in memory (they must be files in Make)
* Proprietary software might be okay when working alone or with specific collaborators who also have the tool, but what about the wider world?
* rOpenSci is 'carefully vetted, staff- and community-contributed R software tools that lower barriers to working with scientific data sources and data that support research applications on the web' -- requires peer review
* Will is active at responding to issues and developing new things
* Will has provided a lot of materials to explain the use of {drake}, as well as learning materials like a Shiny app and installable example projects (I've listed these on a slide at the end)

---
class: middle

At its simplest:

1. `drake_plan()`
1. `make()`
1. Change stuff
1. Go to 2

???

* The minimum you need to do is run two functions
* You create a 'plan' object that contains the functions, objects and data in your workflow; it's a bit like a recipe for generating your analysis
* Then you run `make()` on that object to run the steps and 'bake your cake'
* Let's go through a simple, arbitrary, trivial example
* We're not going to think too hard about folder structure in this example
* In fact, this workflow will be running from within this {xaringan} (RMarkdown slides)

---
class: middle, inverse

# Small demo

<img src='img/beaver.png'>

???

* Using the built-in `beaver1` and `beaver2` datasets
* You can see something similar in Chapter 3 of the {drake} user manual
* This example is run entirely within this {xaringan} presentation, giving you a flavour of its versatility

---
class: middle

```r
# Get from CRAN with install.packages()
library(drake)
library(dplyr)
library(ggplot2)
library(rphylopic)
```

???

* The first couple of steps look like a 'normal' analysis
* First we load the packages we want to use
* {dplyr} and {ggplot2} from the tidyverse suite
* {rphylopic} is a package by Scott Chamberlain and David Miller that taps into the Phylopic API so we can grab a royalty-free image to adorn our plot

---
class: middle

```r
b_plot <- function(data, image) {
  ggplot(data, aes(id, temp)) +
    geom_boxplot() +
    labs(title = "Weasel temperature") +
    add_phylopic(image)
}

b_table <- function(data) {
  group_by(data, id) %>%
    summarise(
      mean = mean(temp), sd = sd(temp),
      min = min(temp), max = max(temp)
    ) %>%
    ungroup()
}
```

???

* And then perhaps we've developed a couple of functions
* Here we have one that plots to particular specifications
* And one that produces a very simple table of descriptive statistics
* The beady-eyed among you may have spotted an error in my functions; keep that in mind

---
class: middle

```r
plan <- drake_plan(
  b1 = mutate(beaver1, id = "A"),
  b2 = mutate(beaver2, id = "B"),
  beavers = bind_rows(b1, b2),
  uid = "be8670c2-a5bd-4b44-88e8-92f8b0c7f4c6",
  png = image_data(uid, size = "512")[[1]],
  plot = b_plot(beavers, png),
  table = b_table(beavers),
  report = rmarkdown::render(
*   knitr_in("beavers-report.Rmd"),
    quiet = TRUE,
*   output_file = file_out("beavers-report.html")
  )
)
```

???

* We wrap the steps of the analysis in `drake_plan()` -- wrangling, fetching a phylopic image, building the outputs (a plot, table and RMarkdown report)
* This doesn't look too dissimilar to a regular script file
* `knitr_in()` and `file_out()` are special -- they ensure that reports are refreshed when a dependency changes

---
class: middle

```r
plan  # it's a dataframe
```

```
## # A tibble: 8 x 2
##   target  command
##   <chr>   <expr>
## 1 b1      mutate(beaver1, id = "A")                                            …
## 2 b2      mutate(beaver2, id = "B")                                            …
## 3 beavers bind_rows(b1, b2)                                                    …
## 4 uid     "be8670c2-a5bd-4b44-88e8-92f8b0c7f4c6"                               …
## 5 png     image_data(uid, size = "512")[[1]]                                   …
## 6 plot    b_plot(beavers, png)                                                 …
## 7 table   b_table(beavers)                                                     …
## 8 report  rmarkdown::render(knitr_in("beavers-report.Rmd"), quiet = TRUE, output_fil…
```

???

* What does the plan object look like?
* 'Dataframes in R for Make' == {drake}
* Targets are the things being acted upon and commands are the tasks
* This plan is the key to how {drake} 'remembers' your workflow: it knows the relationships between all the steps and the order they're in
* So far we haven't built anything, we've just created the recipe

---
class: middle

```r
drake::make(plan)  # build the targets
```

```
## target uid
## target b1
## target b2
## target png
## target beavers
## target plot
## target table
## target report
```

???

* To run the workflow, you use `make()`

---
class: middle

The [rendered output](https://matt-dray.github.io/drake-bioinformatics/weasel-report.html)

<iframe src="https://matt-dray.github.io/drake-bioinformatics/weasel-report.html" width="100%" height="450"></iframe>

???

* Here's what the RMarkdown output looks like after rendering to HTML
* The data was wrangled, a plot and table were produced and this was recorded in an R Markdown file
* You'll need an internet connection to see this

---
class: middle

```r
config <- drake_config(plan)  # a list
config$graph  # e.g. an igraph object
```

```
## IGRAPH 23d1bec DN-- 13 13 --
## + attr: name (v/c), imported (v/l)
## + edges from 23d1bec (vertex names):
## [1] n-OJWWC4TLMRXXO3R2HJZGK3TEMVZA   ->report
## [2] plot                             ->report
## [3] table                            ->report
## [4] p-MJSWC5TFOJZS24TFOBXXE5BOKJWWI  ->report
## [5] report                           ->p-MJSWC5TFOJZS24TFOBXXE5BONB2G23A
## [6] beavers                          ->plot
## [7] beavers                          ->table
## [8] b_table                          ->table
## + ... omitted several edges
```

---
class: middle

```r
vis_drake_graph(
  config,
  main = "", width = 800, height = 450,
  navigationButtons = FALSE, font_size = 15
)
```
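
A toy sketch in base R of how a build order can fall out of a dependency graph like this one. It is not how {drake} works internally; the `deps` list and `build_order()` are made up for illustration and only mirror the beaver plan's targets.

```r
# Hypothetical dependency list: each target maps to the targets it needs
deps <- list(
  b1      = character(0),
  b2      = character(0),
  uid     = character(0),
  beavers = c("b1", "b2"),
  png     = "uid",
  plot    = c("beavers", "png"),
  table   = "beavers",
  report  = c("plot", "table")
)

# Repeatedly 'build' every target whose dependencies are already built
build_order <- function(deps) {
  built <- character(0)
  while (length(built) < length(deps)) {
    waiting <- setdiff(names(deps), built)
    is_ready <- vapply(deps[waiting], function(d) all(d %in% built), logical(1))
    ready <- waiting[is_ready]
    if (length(ready) == 0) stop("Circular dependency detected")
    built <- c(built, ready)
  }
  built
}

ord <- build_order(deps)
ord  # every target appears after the targets it depends on
```

The point is only that the order comes from the relationships, not from the order you typed the steps in.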

???

* {drake} has a function that takes advantage of {visNetwork} to display the dependency graph
* You pass it the `config` object
* Other arguments here are aesthetic
* Targets: circle is object, square is file, triangle is function
* The arrows trace the dependencies

---
class: middle

```r
cached()  # targets in hidden .drake/ folder
```

```
## [1] "b1"      "b2"      "beavers" "plot"    "png"     "report"  "table"   "uid"
```

???

* You might be wondering where the targets 'live'
* They get cached -- stored in a folder
* That folder is hidden in the project directory as `.drake/`

---
class: middle

```r
readd(table)  # read from the cache
```

```
## # A tibble: 2 x 5
##   id     mean    sd   min   max
##   <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A      36.9 0.193  36.3  37.5
## 2 B      37.6 0.447  36.6  38.4
```

???

* How can you access objects in the cache?
* `readd()` and `loadd()` do this
* In fact, this is how the R Markdown file in this example fetches the plot and table

---
class: middle

```r
b_plot <- function(data, image) {
  ggplot(data, aes(id, temp)) +
    geom_boxplot() +
*   labs(title = "Beaver temperature") +  # fix!
    add_phylopic(image)
}
```

```r
outdated(config)  # check what's impacted
```

```
## [1] "plot"   "report"
```

???

* Remember I said there was a problem with one of my functions?
* Yes, the plot function used the word 'weasel' instead of 'beaver' in the title
* Let's go back and update the function
* This update means that the targets that depend on this function need to be updated
* These can be checked with a call to `outdated()`

---
class: middle

```r
vis_drake_graph(
  config,
  main = "", width = 800, height = 450,
  navigationButtons = FALSE, font_size = 15
)
```
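
A toy sketch in base R of why only some targets go out of date after a change. Again, this is an illustration rather than {drake}'s actual implementation; `deps` and `find_outdated()` are hypothetical names.

```r
# Hypothetical dependency list: each target maps to what it depends on,
# including the user-defined functions b_plot and b_table
deps <- list(
  beavers = c("b1", "b2"),
  png     = "uid",
  plot    = c("beavers", "png", "b_plot"),
  table   = c("beavers", "b_table"),
  report  = c("plot", "table")
)

# Walk downstream from the changed item, collecting everything that
# depends on it directly or indirectly
find_outdated <- function(deps, changed) {
  outdated <- character(0)
  frontier <- changed
  while (length(frontier) > 0) {
    is_hit <- vapply(deps, function(d) any(d %in% frontier), logical(1))
    frontier <- setdiff(names(deps)[is_hit], outdated)
    outdated <- c(outdated, frontier)
  }
  sort(outdated)
}

find_outdated(deps, "b_plot")
## [1] "plot"   "report"
```

Changing `b_plot` touches `plot` and then `report`, while `table` and everything upstream stay valid.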

???

* We can revisit the network graph to get a visual representation of the impact of our change
* You can now see that the colour of some targets has turned black
* These are the targets that are now out of date
* {drake} remembered this so you didn't have to

---
class: middle

```r
drake::make(plan)  # rebuild outdated targets
```

```
## target plot
## target report
```

???

* Of course, we can now re-`make()` the plan
* {drake} will only update the things that have been impacted by our change
* This is a trivial example -- consider how helpful this would be if you had even more targets and more complicated relationships, especially if you have steps that are computationally intensive

---
class: middle

The [rendered output](https://matt-dray.github.io/drake-bioinformatics/beavers-report.html), updated

<iframe src="https://matt-dray.github.io/drake-bioinformatics/beavers-report.html" width="100%" height="450"></iframe>

???

* And we can peek at the rendered R Markdown once more to check that our change to the plot function had its desired effect
* You'll need an internet connection to see this

---
class: middle

To recap:

--

1. `drake_plan()`

--

1. `make()`

--

1. Change stuff

--

1. Go to 2

???

* Let's recap those steps again
* Create a plan with targets and commands -- it's a dataframe
* Make that plan to run the analysis
* Something changes -- input data, function specifications, etc
* (You can check what's outdated and check the network graph)
* Now re-`make()`
* And you can do this over and over
* What gets done stays done

---
class: inverse, middle

# What now?

???

* We talked about reproducibility, about workflow managers, about {drake}
* What do I think you should do now?

---
class: middle

Try it

```
install.packages("drake")
library(drake)
drake_example("main")
```

???

* You can install an example structure from the package itself
* Otherwise there are many documents you can find on the web

---
class: middle

Check out official {drake} materials:

* rOpenSci [{drake} website](https://docs.ropensci.org/drake/)
* the [user manual](https://books.ropensci.org/drake/)
* an [rOpenSci community call](https://ropensci.org/commcalls/2019-09-24/)
* [learndrake](https://github.com/wlandau/learndrake) in the cloud
* [drakeplanner](https://wlandau.shinyapps.io/drakeplanner/_w_7935044f/) Shiny app
* launch [drake examples](https://github.com/wlandau/drake-examples) in the cloud
* source [on GitHub](https://github.com/ropensci/drake)
* the [CRAN listing](https://cloud.r-project.org/web/packages/drake/index.html)

---
class: middle

We didn't cover everything, like:

* high-performance computing
* target-creation history and recovery
* static and dynamic branching
* time prediction
* memory strategies
* trigger customisation

???

* Clearly there isn't enough time to go through all the great things about {drake}
* Luckily {drake} has rich documentation

---
class: middle

Revisit today's materials:

* [matt-dray.github.io/drake-bioinformatics/](https://matt-dray.github.io/drake-bioinformatics/#1)
* [github.com/matt-dray/drake-bioinformatics](https://github.com/matt-dray/drake-bioinformatics)
* [R script](https://github.com/matt-dray/drake-bioinformatics/blob/master/drake-beavers-workflow.R) of the code used

Plus:

* [Can {drake} RAP?](https://www.rostrum.blog/2019/07/23/can-drake-rap/) blog post
* run this example in Binder [![Launch Rstudio Binder](http://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/matt-dray/drake-egg-rap/master?urlpath=rstudio)

???

* Slides and source on GitHub, along with a single script file of the code used in the presentation
* Binder is a service that lets you launch a Notebook or RStudio instance in the cloud with data and code that you provide

---
class: inverse, middle

# Reproducible workflows with {drake}

1. Make workflows reproducible
1. Use {drake}

What gets done stays done!

[mattdray](https://twitter.com/mattdray)
[matt-dray](https://github.com/matt-dray)
[rostrum.blog](https://www.rostrum.blog/)

???

* I hope I've convinced you to consider a workflow manager
* And hopefully to take a look at {drake} as your tool of choice
* At worst, you might have learnt about {rphylopic}!
* I'm going to finish here, but we have time for some words from our special guest, Will Landau

---
class: middle

Sources

* Dr Not-Dray [via Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Snoop_Dogg_and_Dr._Dre.jpg)
* Canadian rapper [via Wikimedia Commons](https://commons.wikimedia.org/wiki/Category:Drake)
* {drake} [hex logo](https://camo.githubusercontent.com/44749362ca36c9e3295f2bcf18975d811564c121/68747470733a2f2f646f63732e726f70656e7363692e6f72672f6472616b652f7265666572656e63652f666967757265732f6c6f676f2e737667)
* {drake} hairball by [Frederik Aust](https://twitter.com/FrederikAust/status/1205103780938833921?s=20)
* _Castor canadensis_ from [PhyloPic](http://phylopic.org/image/b3dd721e-6084-4413-8300-44e10d8fd3ca/)