The quickest user guide imaginable
If you have an R file that you're either executing by source-ing it or by calling it from the command line using Rscript, and you want to get access to the path of that file inside the script, you can use the scriptloc() function:
script_path <- scriptloc()
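For instance, here is a minimal sketch that reads a data file sitting next to the script (the file name data.tsv is only an assumption for illustration):
#' some-script.R
library(scriptloc)
script_path <- scriptloc()          # path of the current script
script_dir <- dirname(script_path)  # directory containing it
# Build a sibling path that works regardless of the working directory
x <- read.table(file.path(script_dir, "data.tsv"), sep = "\t", header = TRUE)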
Read on if you want to know why you might want to do this. Otherwise, you're all set, and that's all there is to using scriptloc. Just heed the next warning and you're good to go.
The only thing that can mess up scriptloc
The way in which scriptloc works will be described in a different vignette, but the short version is that if you define any variable called ofile AFTER a script is sourced but BEFORE scriptloc() is called, then it won't work. I know this is an arbitrary restriction, but it is a current limitation of the package. So, to be on the safe side, if you want scriptloc() to work reliably, don't define any variable called ofile.
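As a hedged sketch of what that failure mode could look like (the file names here are made up):
#' fragile-script.R
library(scriptloc)
ofile <- "results.txt"       # a variable named `ofile`...
script_path <- scriptloc()   # ...can make this return the wrong path or error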
The Project Path Problem
Handling file paths can be one of the most finicky parts of organizing a project. Consider a simple project structure that looks like so:
project
│
├── data.tsv # (Raw data)
└── script.R # (Script that does the analysis)
In the project directory, we have a file called data.tsv that contains information to be processed. We want our analysis to be reproducible, so we do the sensible thing and have a script.R file cataloging the steps we take to produce an output.tsv.
#' script.R
#' Parse our dataset and write out results
library(AwesomeRLib)
f <- 'data.tsv'
x <- read.table(f, sep = '\t', header = TRUE, stringsAsFactors = FALSE)
y <- some_cool_function(x)
outf <- 'output.tsv'
write.table(y, outf, sep = '\t', row.names = FALSE, quote = FALSE)
Here's a simple question: Where will the output be produced if we run script.R? This is a trick question because I've withheld crucial information, namely the current working directory from which the script is being run. Consider the full view of the directory tree:
home
│
└── user
│
└── project
│
├── data.tsv # (Raw data)
└── script.R # (Script that does the analysis)
The absolute path of our project is /home/user/project/. If we start our R session from inside this folder, then it is implied that the data.tsv being read has the absolute path /home/user/project/data.tsv, and the output file being written will have the absolute path /home/user/project/output.tsv. But if our R session has a different working directory, it won't be able to find data.tsv, or worse, it could be opening a completely different file than what we intended. Because of this, it is common to see R scripts in the wild that look like this:
#' script.R
#' Parse our dataset and write out results
setwd('/home/user/project')
library(AwesomeRLib)
f <- 'data.tsv'
x <- read.table(f, sep = '\t', header = TRUE, stringsAsFactors = FALSE)
y <- some_cool_function(x)
outf <- 'output.tsv'
write.table(y, outf, sep = '\t', row.names = FALSE, quote = FALSE)
Notice that the first line now has a call to setwd(), so that the correct file is being read in and the correct file is being written out. This is one way to solve the problem. Another, more cumbersome, way to do this would be as follows:
#' script.R
#' Parse our dataset and write out results
library(AwesomeRLib)
f <- '/home/user/project/data.tsv'
x <- read.table(f, sep = '\t', header = TRUE, stringsAsFactors = FALSE)
y <- some_cool_function(x)
outf <- '/home/user/project/output.tsv'
write.table(y, outf, sep = '\t', row.names = FALSE, quote = FALSE)
In this solution, we are very explicit about the paths of the files so that no mistake can be made about what exactly is happening. Again, this script “gets the job done”.
Having said that, both of the previous solutions are inelegant:
- If you change the project folder’s location, you must remember to update the absolute path.
- When sharing the project folder with others, they must update the path before they run it.
In the grand scheme of things, this seems like a minor problem — but we’ve only been considering a simple project. Consider even a slightly larger project that looks like so:
project-root
│
├── data
│ ├── dataset-01.tsv
│ ├── dataset-02.tsv
│ ├── dataset-03.tsv
│ ├── dataset-04.tsv
│ └── dataset-05.tsv
├── plot
│ ├── plot-01-dataset-description.png
│ ├── plot-02-interesting-variables.png
│ └── plot-03-cool-results.png
│
├── report.pdf
│
└── scripts
├── 01-clean-data.R
├── 02-process-data.R
├── 03-plot-data.R
└── 04-generate-report.Rmd
The project is neatly organized so that the data and code are separated. At the very least, this implies that you need to update paths in 4 scripts. There are other ways to tackle this (such as using a project-specific .Rprofile), but that adds one more layer of complexity.
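For completeness, here is a minimal sketch of that .Rprofile approach; the PROJROOT name is an assumption, and note that it only helps if R is started from the project root:
#' .Rprofile (at the project root; R sources it when started from this directory)
PROJROOT <- normalizePath('.')  # scripts then build paths from PROJROOT, e.g.:
# f <- file.path(PROJROOT, 'data', 'dataset-01.tsv')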
A solution from BASH
In general, the only way to avoid this mess is to avoid using absolute paths. But as we've seen, relative paths are at the mercy of the working directory. There's a lovely solution to this problem in the BASH world, wherein we are given tools to dynamically access the path of the file being run from within the file itself. Consider a simple project again that looks like so:
project
│
├── data.tsv
└── script.sh
#' script.sh
script_path="${BASH_SOURCE[0]}"
script_dir=$(dirname "$script_path")
data_path="${script_dir}/data.tsv"
some_cool_software "$data_path"
BASH_SOURCE[0] contains the path to the script being executed, and dirname allows for getting the path to the directory where it is stored. The script_dir variable is now the project root, and all paths can be expressed with respect to it, as seen with the data_path example. So now all our paths are relative to the script's location, but we don't manually specify it; BASH is smart enough to understand what we want without mentioning it explicitly.
Even in the case of a larger project with more subdirectories, this can still work:
project-root
│
├── data
│ ├── dataset-01.tsv
│ ├── dataset-02.tsv
│ ├── dataset-03.tsv
│ ├── dataset-04.tsv
│ └── dataset-05.tsv
├── plot
│ ├── plot-01-dataset-description.png
│ ├── plot-02-interesting-variables.png
│ └── plot-03-cool-results.png
│
├── report.pdf
│
└── scripts
└── cool-bash-script.sh
#' cool-bash-script.sh
script_path="${BASH_SOURCE[0]}"
script_dir=$(dirname "$script_path")
projroot="${script_dir}/.."
data_dir="${projroot}/data"
Here, we've conveniently defined a projroot path that can be used to define other paths based on it, making code easier to read once you get past the initial hump of the boilerplate lines at the top of the script. This solution ensures that:
- We don’t need to muck about with absolute paths.
- It doesn't matter where we invoke the script from; it automatically figures out how to orient itself based on its own location relative to the current working directory.
- We can move the entire project folder anywhere, and the scripts will automatically identify the correct inputs and output.
- We can share the project folder with anyone; they won’t have to muck about setting paths. The shared code is directly reproducible.
There are a few downsides to this system:
- This boilerplate needs to be at the top of every script we write.
- If we change the script’s location, we must update the other files’ relative paths.
Point [1] is unavoidable, and point [2] happens less often than the alternative of moving the project folder directly, making it less of a pain point in general. Nevertheless, I think both of these points are a small price to pay for complete reproducibility of path handling regardless of who runs the code and where it is run from. If you agree and want to implement a similar system in R, scriptloc can help you with it.
The scriptloc solution
You can do something extremely analogous with scriptloc:
library(scriptloc)
script_path <- scriptloc()
script_dir <- dirname(script_path)
This works regardless of whether you're executing your code using Rscript from the command line or source()-ing an R file from somewhere else. And that's it; that's all there is to using scriptloc! Now that you have access to script_dir, you can refer to other paths with respect to it (I recommend using the file.path function to build these paths; it works uniformly regardless of the OS on which the code is being run).
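For example, mirroring the earlier BASH sketch for the larger project layout (the file names are taken from the tree above; treat this as a sketch, not a prescription):
#' scripts/01-clean-data.R
library(scriptloc)
script_path <- scriptloc()               # .../scripts/01-clean-data.R
script_dir <- dirname(script_path)       # .../scripts
projroot <- file.path(script_dir, "..")  # one level up: the project root
f <- file.path(projroot, "data", "dataset-01.tsv")
x <- read.table(f, sep = "\t", header = TRUE, stringsAsFactors = FALSE)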
scriptloc works across any depth of execution
Assume that you have two files, script-01.R and script-02.R, in a project folder, and that the latter script calls the former:
project
│
├── script-01.R
└── script-02.R
#' script-01.R
writeLines("Output of scriptloc within first script:")
writeLines(scriptloc())
writeLines("---------")
#' script-02.R
library(scriptloc)
source("script-01.R")
writeLines("Output of scriptloc within second script:")
writeLines(scriptloc())
If we run script-02.R from within the project directory, either using Rscript on the command line or by sourcing it interactively, the output will be:
Output of scriptloc within first script:
script-01.R
---------
Output of scriptloc within second script:
script-02.R
When script-02.R was run, it called script-01.R by source-ing it, and the scriptloc() function correctly identified that it was within script-01.R at that point. When control came back to script-02.R, scriptloc() again correctly identified that execution was happening from script-02.R. In theory, you can have any depth of scripts calling other scripts, and scriptloc() will give you the correct path at every step.
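As a sketch of deeper nesting (script-03.R is a hypothetical addition to the same folder), a third script can source the second, and each scriptloc() call still reports the file it appears in:
#' script-03.R
library(scriptloc)
source("script-02.R")    # runs both scripts above, printing their paths
writeLines("Output of scriptloc within third script:")
writeLines(scriptloc())  # prints "script-03.R"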