The future package celebrates ten years on CRAN as of June 19, 2025. This is the second in a series of blog posts highlighting recent improvements to the futureverse ecosystem.
You can now use

```r
my_fcn <- function(...) {
  with(plan(multisession), local = TRUE)
  ...
}
```

to temporarily set a future backend for use in your function. This guarantees that any changes are undone when the function exits, even if there is an error or an interrupt.
But I really recommend not doing any of that, as I'll try to explain below.
The core design philosophy of futureverse is:
“The developer decides what to parallelize, the user decides where and how.”
This decoupling of intent (what to parallelize) and execution (how to do it) makes code written using futureverse flexible, portable, and easy to maintain.
Specifically, the developer controls what to parallelize by using `future()` or higher-level abstractions like `future_lapply()` and `future_map()` to mark code regions that may run concurrently. The code makes no assumptions about the compute environment and is therefore agnostic to which future backend is being used, e.g.

```r
y <- future_lapply(X, slow_fcn)
```

and

```r
y <- future_map(X, slow_fcn)
```
Note how there is nothing in those two function calls that specifies how they are parallelized, if at all. Instead, the end user (e.g., data analyst, HPC user, or script runner) controls the execution strategy by setting the future backend via `plan()`, e.g. the built-in sequential and multisession backends, and the future.callr and future.mirai backends. This allows the user to scale the same code from a notebook to an HPC cluster or cloud environment without changing the original code.
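To make this concrete, here is a minimal sketch of that separation, assuming the future.apply package and a placeholder `slow_fcn`; only the `plan()` calls belong to the user, the `future_lapply()` calls stay untouched:

```r
library(future.apply)

slow_fcn <- function(x) x^2  # placeholder for a slow task
X <- 1:8

# User choice 1: run sequentially in the current R session
plan(sequential)
y1 <- future_lapply(X, slow_fcn)

# User choice 2: run on two background R sessions
plan(multisession, workers = 2)
y2 <- future_lapply(X, slow_fcn)

# Same code, same results; only the backend differed
identical(y1, y2)
```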
We can find this design of decoupling intent and execution also in traditional R parallelization frameworks. In the parallel package we have `setDefaultCluster()`, which the user can call to control the default cluster used when none is explicitly specified. For that to work, the developer needs to make sure to rely on the default `cl = NULL`, either explicitly as in:

```r
y <- parLapply(cl = NULL, X, slow_fcn)
```

or implicitly¹, by making sure all arguments are named, as in:

```r
y <- parLapply(X = X, FUN = slow_fcn)
```

Unfortunately, this is rarely done; instead, `parLapply(cl, X, FUN)` is by far the most common way of using the parallel package, resulting in little to no control for the end user.
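For completeness, here is a sketch of how that user-side control could look with the parallel package, assuming the developer's code relies on the `cl = NULL` default (`slow_fcn` is a placeholder):

```r
library(parallel)

slow_fcn <- function(x) x^2  # placeholder for a slow task

# User side: register a default cluster once ...
cl <- makeCluster(2)
setDefaultCluster(cl)

# Developer side: rely on the default cluster via cl = NULL
y <- parLapply(cl = NULL, X = 1:4, fun = slow_fcn)

stopCluster(cl)
```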
The foreach package had greater success with this design philosophy. There the developer writes:

```r
y <- foreach(x = X) %dopar% { slow_fcn(x) }
```

with no option in that call to specify which parallel backend to use. Instead, the user typically controls the parallel backend via a so-called "dopar" foreach adapter, e.g. `doParallel::registerDoParallel()`, `doMC::registerDoMC()`, and `doFuture::registerDoFuture()`. Unfortunately, there are ways for the developer to write `foreach()` with `%dopar%` statements such that the code works only with a specific parallel backend². Regardless, it is clear from their designs that both of these packages shared the same fundamental design philosophy of decoupling intent and execution as is used in the futureverse. You can read more about this in the introduction of my H. Bengtsson (2021) article.
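As an illustration of the user side of that registration, the following sketch assumes the doFuture package, which routes `%dopar%` through whatever future backend is set via `plan()` (`slow_fcn` is a placeholder):

```r
library(doFuture)

# User side: route %dopar% through futureverse, then pick a backend
registerDoFuture()
plan(multisession, workers = 2)

# Developer side: backend-agnostic foreach code
slow_fcn <- function(x) x^2  # placeholder for a slow task
y <- foreach(x = 1:4) %dopar% { slow_fcn(x) }
```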
When writing scripts or R Markdown documents, I recommend putting code that controls the execution (e.g. `plan()`, `registerDoNnn()`, and `setDefaultCluster()`) at the very top, immediately after any `library()` statements. This is also where I, like many others, prefer to put global settings such as `options()` statements. This makes it easier for anyone to identify which settings are available and used by the script. It also avoids cluttering up the rest of the code with such details.
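For example, the top of such a script might look like this (a sketch; the package choices and settings are placeholders):

```r
library(future.apply)

# Settings: execution strategy and global options, all in one place
plan(multisession, workers = 4)
options(digits = 3)

# ... the rest of the script, free of parallelization details ...
```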
One practical advantage of the above decoupling design is that there is only one place where parallelization is controlled, instead of it being scattered throughout the code, e.g. as special parallel arguments to different function calls. This makes it easier for the end user, but also for the package developer who does not have to worry about what their APIs should look like and what arguments they should take.
That said, some package developers prefer to expose control of parallelization via special function arguments. If we search CRAN packages, we find arguments like `parallel = FALSE`, `ncores = 1`, and `cluster = NULL` that are then used internally to set up the parallel backend. If you write functions that take this approach, it is critical that you remember to set the backend only temporarily, which can be done via `on.exit()`, e.g.

```r
my_fcn <- function(xs, ncores = 1) {
  if (ncores > 1) {
    cl <- parallel::makeCluster(ncores)
    on.exit(parallel::stopCluster(cl))
    y <- parallel::parLapply(cl = cl, X = xs, fun = slow_fcn)
  } else {
    y <- lapply(xs, slow_fcn)
  }
  y
}
```
If you use futureverse, you can use:

```r
my_fcn <- function(xs, ncores = 1) {
  old_plan <- plan(multisession, workers = ncores)
  on.exit(plan(old_plan))
  y <- future_lapply(xs, slow_fcn)
  y
}
```
And, since future 1.40.0 (2025-04-10), you can achieve the same with a single line of code³:

```r
my_fcn <- function(xs, ncores = 1) {
  with(plan(multisession, workers = ncores), local = TRUE)
  y <- future_lapply(xs, slow_fcn)
  y
}
```
I hope that this addition lowers the risk of forgetting to undo any changes done by `plan()` inside functions. If you forget, then you may override what the user intends to use elsewhere. For instance, they might have set `plan(batchtools_slurm)` to run their R code across a Slurm high-performance-compute (HPC) cluster, but if you change the `plan()` inside your package function without undoing your changes, then the user is in for a surprise and maybe also hours of troubleshooting.
I still want to plead with package developers to avoid setting the future backend, even temporarily, inside their functions. There are other reasons for not doing this. For instance, if you provide users with an `ncores` argument for controlling the amount of parallelization, you risk locking the user into a specific parallel backend. A common pattern is to use `plan(multisession, workers = ncores)` as in the above examples. However, this prevents the user from taking advantage of other closely related parallel backends, e.g. `plan(callr, workers = ncores)` and `plan(mirai_multisession, workers = ncores)`. The future.callr backend runs each parallel task in a fresh R session that is shut down immediately afterward, which is beneficial when memory is the limiting factor. The future.mirai backend is optimized for low latency, meaning it can also parallelize shorter tasks, which might otherwise not be worth parallelizing. Also, contrary to multisession, these alternative backends can make use of all CPU cores available on modern hardware, e.g. 192- and 256-core machines. The multisession backend, which builds upon parallel PSOCK clusters, is limited to a maximum of 125 parallel workers, because each parallel worker consumes one R connection, and R can only have 125 connections open at any time. There are ways to increase this limit, but it still requires work. See `parallelly::availableConnections()` for more details on this problem and how to increase the maximum number of connections.
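If you want to inspect this limit on your own system, parallelly provides helpers for it; a quick sketch:

```r
# Total number of connections this build of R supports
# (128 in a default build, of which 125 are usable for
# things like PSOCK cluster workers)
parallelly::availableConnections()

# Number of connection slots currently not in use
parallelly::freeConnections()
```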
You can of course add another "parallel" argument to allow your users also to control which future backend to use, e.g. `backend = multisession` and `ncores = 1`. But this might not be sufficient: there are backends that take additional arguments, which you then also need to support in each of your functions. Finally, new backends will be implemented by others in the future (pun intended and not), and we can't predict what they will require.
Related to this, I am working on ways for (i) futureverse to choose among a set of parallel backends, not just one, (ii) based on resource specifications (e.g. memory needs and maximum run times) for specific future statements. This will give back some control to the developer over how and where execution happens, and give the end user more options to scale out to different types of compute resources. For instance, a `future_map()` call with a 192-GiB memory requirement may only be sent to "large-memory" backends and, if none is available, throw an instant error. Another example is a `future_map()` call with a 256-MiB memory and 5-minute runtime requirement; that is small enough to be sent to an AWS Lambda or Google Cloud Functions backend, if the user has specified such a backend.
In summary, I argue that it's better to let the user be in full control of the future backend, by letting them set it via `plan()`, preferably at the top of their scripts. If that is not possible, please make sure to use `with(plan(...), local = TRUE)`.
May the future be with you!
Henrik
1. If the argument `cl = NULL` of `parLapply()` had been the last argument instead of the first, then `parLapply(X, slow_fcn)`, which resembles `lapply(X, slow_fcn)`, would also have resulted in the default cluster being used.
2. `foreach()` takes backend-specific options (e.g. `.options.multicore`, `.options.parallel`, `.options.mpi`, and `.options.future`). The developer can use these to adjust the default behavior of a given foreach adapter. Unfortunately, when used (or rather, when needed), the code is no longer agnostic to the backend: what will happen if a foreach adapter is used that the developer did not anticipate?
3. The withr package has `with_nnn()` and `local_nnn()` functions for evaluating code with various settings temporarily changed. Following this lead, I was very close to adding `with_plan()` and `local_plan()` to future 1.40.0, but then I noticed that mirai supports `with(daemons(ncores), { ... })`. This works because `with()` is an S3 generic function. I like this approach, especially since it avoids adding more functions to the API. I added similar support for `with(plan(multisession, workers = ncores), { ... })`. More importantly, this allowed me to also add the `with(..., local = TRUE)` variant to be used inside functions, which makes it very easy to safely switch to a temporary future backend inside a function.