The quickest user guide imaginable

If you have an R file that you execute either by source()-ing it or by calling it from the command line with Rscript, and you want access to that file's path from inside the script, you can use the scriptloc() function:

script_path <- scriptloc()

Read on if you want to know why you might want to do this. Otherwise, you're all set: that's all there is to using scriptloc. Just heed the next warning and you're good to go.

The only thing that can mess up scriptloc

The way in which scriptloc works will be described in a different vignette, but the short version is that if you define any variable called ofile AFTER a script is sourced but BEFORE scriptloc() is called, then scriptloc() won't work. I know this seems like an arbitrary restriction, but it is a current limitation of the package. So, to be on the safe side, if you want scriptloc() to work reliably, don't define any variable called ofile.

The Project Path Problem

Handling file paths can be one of the most finicky parts of organizing a project. Consider a simple project structure that looks like so:

project
│
├── data.tsv   # (Raw data)
└── script.R   # (Script that does the analysis)

In the project directory, we have a file called data.tsv that contains information to be processed. We want our analysis to be reproducible, so we do the sensible thing and have a script.R file cataloging the steps we take to produce an output.tsv.

#' script.R
#' Parse our dataset and write out results
library(AwesomeRLib)
f    <- 'data.tsv'
x    <- read.table(f, sep = '\t', header = TRUE, stringsAsFactors = FALSE)
y    <- some_cool_function(x)
outf <- 'output.tsv'
write.table(y, outf, sep = '\t', row.names = FALSE, quote = FALSE)

Here’s a simple question: Where will the output be produced if we run script.R? This is a trick question because I’ve withheld crucial information — namely, the current working directory from which the script is being run. Consider the full view of the directory tree:

home
│
└── user
    │
    └── project
        │
        ├── data.tsv   # (Raw data)
        └── script.R   # (Script that does the analysis)

The absolute path of our project is /home/user/project/. If we start our R session from inside this folder, then the data.tsv being read has the absolute path /home/user/project/data.tsv, and the output file being written will have the absolute path /home/user/project/output.tsv. But if our R session has a different working directory, it won't be able to find data.tsv, or worse, it could open a completely different file from the one we intended. Because of this, it is common to see R scripts in the wild that look like this:
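To make the ambiguity concrete, here is a small self-contained sketch in base R (no scriptloc involved). It shows that the file a relative path refers to is determined entirely by the working directory at the moment of use:

```r
# A relative path is resolved against the session's working directory,
# not against the location of the script that mentions it.
old <- setwd(tempdir())  # simulate launching R from some unrelated directory

resolved <- normalizePath("data.tsv", mustWork = FALSE)
cat(resolved, "\n")      # some path inside tempdir(), NOT /home/user/project/data.tsv

setwd(old)               # restore the original working directory
```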

#' script.R
#' Parse our dataset and write out results
setwd('/home/user/project')
library(AwesomeRLib)
f    <- 'data.tsv'
x    <- read.table(f, sep = '\t', header = TRUE, stringsAsFactors = FALSE)
y    <- some_cool_function(x)
outf <- 'output.tsv'
write.table(y, outf, sep = '\t', row.names = FALSE, quote = FALSE)

Notice that the first line now has a call to setwd(), so that the correct file is being read in and the correct file is being written out. This is one way to solve the problem. Another, more cumbersome way to do this would be as follows:

#' script.R
#' Parse our dataset and write out results
library(AwesomeRLib)
f    <- '/home/user/project/data.tsv'
x    <- read.table(f, sep = '\t', header = TRUE, stringsAsFactors = FALSE)
y    <- some_cool_function(x)
outf <- '/home/user/project/output.tsv'
write.table(y, outf, sep = '\t', row.names = FALSE, quote = FALSE)

In this solution, we are very explicit about the paths of the files so that no mistake can be made about what exactly is happening. Again, this script “gets the job done”.

Having said that, both of the previous solutions are inelegant:

  • If you change the project folder's location, you must remember to update the absolute paths.
  • When sharing the project folder with others, they must update the paths before they run the scripts.

In the grand scheme of things, this seems like a minor problem — but we’ve only been considering a simple project. Consider even a slightly larger project that looks like so:

project-root
│
├── data
│   ├── dataset-01.tsv
│   ├── dataset-02.tsv
│   ├── dataset-03.tsv
│   ├── dataset-04.tsv
│   └── dataset-05.tsv
├── plot
│   ├── plot-01-dataset-description.png
│   ├── plot-02-interesting-variables.png
│   └── plot-03-cool-results.png
│
├── report.pdf
│
└── scripts
    ├── 01-clean-data.R
    ├── 02-process-data.R
    ├── 03-plot-data.R
    └── 04-generate-report.Rmd

The project is neatly organized so that the data and code are separated. But now, if the project folder moves, paths need to be updated in at least 4 scripts. There are other ways to tackle this (such as using a project-specific .Rprofile), but each adds one more layer of complexity.

A solution from BASH

In general, the only way to avoid this mess is to avoid using absolute paths. But as we’ve seen, relative paths are at the mercy of the working directory. There’s a lovely solution to this problem in the BASH world, wherein we are given tools to dynamically access the path of the file being run from within the file itself. Consider a simple project again that looks like so:

project
│
├── data.tsv
└── script.sh

#' script.sh

script_path="${BASH_SOURCE[0]}"
script_dir=$(dirname "$script_path")
data_path="${script_dir}/data.tsv"

some_cool_software "$data_path"

BASH_SOURCE[0] contains the path of the script currently being executed, and dirname extracts the directory portion of that path. The script_dir variable now points at the project root, and all other paths can be expressed relative to it, as the data_path example shows. All our paths are still relative to the script's location, but we never have to spell that location out: BASH fills it in for us.
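As a quick aside on dirname itself, in case it is unfamiliar: it simply strips the final component from a path. For example:

```shell
# dirname keeps everything up to, but not including, the last path component.
dirname /home/user/project/script.sh   # prints: /home/user/project
dirname script.sh                      # prints: .  (a bare file name has no directory part)
```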

Even in the case of a larger project with more subdirectories, this can still work:

project-root
│
├── data
│   ├── dataset-01.tsv
│   ├── dataset-02.tsv
│   ├── dataset-03.tsv
│   ├── dataset-04.tsv
│   └── dataset-05.tsv
├── plot
│   ├── plot-01-dataset-description.png
│   ├── plot-02-interesting-variables.png
│   └── plot-03-cool-results.png
│
├── report.pdf
│
└── scripts
    └── cool-bash-script.sh

#' cool-bash-script.sh
script_path="${BASH_SOURCE[0]}"
script_dir=$(dirname "$script_path")
projroot="${script_dir}/.."
data_dir="${projroot}/data"

Here, we've conveniently defined a projroot path from which all other paths are derived, which keeps the code easy to read once you get past the boilerplate lines at the top of the script. This solution ensures that:

  • We don’t need to muck about with absolute paths.
  • It doesn't matter where we invoke the script from: it automatically orients itself relative to its own location.
  • We can move the entire project folder anywhere, and the scripts will still identify the correct inputs and outputs.
  • We can share the project folder with anyone; they won’t have to muck about setting paths. The shared code is directly reproducible.

There are a few downsides to this system:

  1. This boilerplate needs to be at the top of every script we write.
  2. If we change the script’s location, we must update the other files’ relative paths.

Point [1] is unavoidable, and point [2] happens less often than the alternative of moving the project folder directly, making it less of a pain point in general. Nevertheless, I think both of these points are a small price to pay for complete reproducibility of path handling regardless of who runs the code and where it is run from. If you agree and want to implement a similar system in R, scriptloc can help you with it.

The scriptloc solution

You can do something very similar with scriptloc:

library(scriptloc)
script_path <- scriptloc()
script_dir  <- dirname(script_path)

This works regardless of whether you're executing your code using Rscript from the command line or source()-ing an R file from somewhere else. And that's it: that's all there is to using scriptloc! Now that you have access to script_dir, you can refer to other paths with respect to it (I recommend using the file.path function to build these paths; it works uniformly regardless of the OS on which the code is being run).
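For instance, with the larger project layout from earlier, a script living in scripts/ could derive all of its sibling paths from its own directory. This is just a sketch: the hard-coded script_dir stands in for what dirname(scriptloc()) would return at run time.

```r
# Stand-in for dirname(scriptloc()) inside scripts/01-clean-data.R:
script_dir <- "/home/user/project-root/scripts"

# Every other path is expressed relative to the script's own directory.
proj_root <- file.path(script_dir, "..")
data_file <- file.path(proj_root, "data", "dataset-01.tsv")
plot_file <- file.path(proj_root, "plot", "plot-01-dataset-description.png")

cat(data_file, "\n")  # /home/user/project-root/scripts/../data/dataset-01.tsv
```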

scriptloc works across any depth of execution

Assume that you have two files script-01.R and script-02.R in a project folder, and that the latter script calls the former:

project
│
├── script-01.R
└── script-02.R

#' script-01.R
library(scriptloc)
writeLines("Output of scriptloc within first script:")
writeLines(scriptloc())
writeLines("---------")

#' script-02.R
library(scriptloc)
source("script-01.R")
writeLines("Output of scriptloc within second script:")
writeLines(scriptloc())

If we run script-02.R from within the project directory, either using Rscript on the command line or by sourcing it interactively, the output will be:

Output of scriptloc within first script:
script-01.R
---------
Output of scriptloc within second script:
script-02.R

When script-02.R was run, it source()-d script-01.R, and at that point scriptloc() correctly identified that execution was inside script-01.R. When control returned to script-02.R, scriptloc() again correctly reported that execution was happening from script-02.R. In principle, you can have any depth of scripts calling other scripts, and scriptloc() will give you the correct path at every step.