
    Setting Future Plans in R Functions — and Why You Probably Shouldn’t

JottR on R · 2025-06-25

[Figure: The 'future' hex-logo balloon wall]

    The future package celebrates ten years on CRAN as of June 19, 2025. This is the second in a series of blog posts highlighting recent improvements to the futureverse ecosystem.

    TL;DR

    You can now use

    my_fcn <- function(...) {
      with(plan(multisession), local = TRUE)
      ...
    } 
    

    to temporarily set a future backend for use in your function. This guarantees that any changes are undone when the function exits, even if there is an error or an interrupt.

    But, I really recommend not doing any of that, as I’ll try to explain below.

    Decoupling of intent to parallelize and how to execute it

    The core design philosophy of futureverse is:

    “The developer decides what to parallelize, the user decides where and how.”

    This decoupling of intent (what to parallelize) and execution (how to do it) makes code written using futureverse flexible, portable, and easy to maintain.

Specifically, the developer controls what to parallelize by using future() or higher-level abstractions like future_lapply() and future_map() to mark code regions that may run concurrently. The code makes no assumptions about the compute environment and is therefore agnostic to which future backend is being used, e.g.

    y <- future_lapply(X, slow_fcn)
    

    and

    y <- future_map(X, slow_fcn)
    

    Note how there is nothing in those two function calls that specifies how they are parallelized, if at all. Instead, the end user (e.g., data analyst, HPC user, or script runner) controls the execution strategy by setting the future backend via plan(), e.g., built-in sequential, built-in multisession, future.callr, and future.mirai backends. This allows the user to scale the same code from a notebook to an HPC cluster or cloud environment without changing the original code.
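For concreteness, here is a minimal sketch of the user's side of this contract. The function my_analysis() is hypothetical and stands for any code built on future_lapply() or future_map(); only the plan() call decides how it runs:

    library(future)

    ## Pick one backend for this session; the analysis code itself never changes.
    plan(sequential)                          # everything in the current R session
    # plan(multisession, workers = 4)         # four background R sessions
    # plan(future.callr::callr, workers = 4)  # a fresh R session per task
    # plan(future.mirai::mirai_multisession)  # low-latency mirai daemons

    y <- my_analysis(X)   # hypothetical function built on future_lapply()/future_map()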

This decoupling of intent and execution can also be found in traditional R parallelization frameworks. In the parallel package we have setDefaultCluster(), which the user can call to register a default cluster to be used when none is explicitly specified. For that to work, the developer needs to make sure the default cl = NULL is used, either explicitly as in:

    y <- parLapply(cl = NULL, X, slow_fcn)
    

or implicitly¹, by making sure all arguments are named, as in:

y <- parLapply(X = X, fun = slow_fcn)
    

    Unfortunately, this is rarely used - instead parLapply(cl, X, FUN) is by far the most common way of using the parallel package, resulting in little to no control for the end user.
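For completeness, here is a rough sketch of the end user's side of that mechanism (assuming X and slow_fcn are defined elsewhere):

    cl <- parallel::makeCluster(4)                    # e.g. four local workers
    parallel::setDefaultCluster(cl)                   # becomes the default when cl = NULL
    y <- parallel::parLapply(X = X, fun = slow_fcn)   # runs on the default cluster
    parallel::stopCluster(cl)
    parallel::setDefaultCluster(NULL)                 # unregister the default again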

    The foreach package had greater success with this design philosophy. There the developer writes:

    y <- foreach(x = X) %dopar% { slow_fcn(x) }
    

with no option in that call to specify which parallel backend to use. Instead, the user typically controls the parallel backend via a so-called "dopar" foreach adapter, e.g. doParallel::registerDoParallel(), doMC::registerDoMC(), and doFuture::registerDoFuture(). Unfortunately, there are ways for the developer to write foreach() with %dopar% statements such that the code works only with a specific parallel backend². Regardless, it is clear from their designs that both of these packages share the same fundamental design philosophy of decoupling intent and execution as is used in the futureverse. You can read more about this in the introduction of my H. Bengtsson (2021) article.
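As a rough sketch of the user's side here (again assuming X and slow_fcn exist), the developer's %dopar% code is untouched and the user picks the backend by registering an adapter:

    library(foreach)

    doParallel::registerDoParallel(cores = 4)   # a classic parallel-based adapter
    ## or route %dopar% through futureverse, so that plan() decides:
    # doFuture::registerDoFuture()
    # future::plan(future::multisession, workers = 4)

    y <- foreach(x = X) %dopar% { slow_fcn(x) }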

    When writing scripts or Rmarkdown documents, I recommend putting code that controls the execution (e.g. plan(), registerDoNnn(), and setDefaultCluster()) at the very top, immediately after any library() statements. This is also where I, like many others, prefer to put global settings such as options() statements. This makes it easier for anyone to identify which settings are available and used by the script. It also avoids cluttering up the rest of the code with such details.
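As a sketch of what that might look like (package choices, options, and worker counts are just placeholders):

    library(future)
    library(future.apply)

    options(warn = 1)                  # global settings, if any
    plan(multisession, workers = 4)    # execution strategy for this run
    # plan(sequential)                 # handy fallback when debugging

    ## ... the rest of the analysis code, unchanged ...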

    Straying away from the core design philosophy

    One practical advantage of the above decoupling design is that there is only one place where parallelization is controlled, instead of it being scattered throughout the code, e.g. as special parallel arguments to different function calls. This makes it easier for the end user, but also for the package developer who does not have to worry about what their APIs should look like and what arguments they should take.

That said, some package developers prefer to expose control of parallelization via special function arguments. If we search CRAN packages, we find arguments like parallel = FALSE, ncores = 1, and cluster = NULL that are then used internally to set up the parallel backend. If you write functions that take this approach, it is critical that you remember to set the backend only temporarily, which can be done via on.exit(), e.g.

my_fcn <- function(xs, ncores = 1) {
  if (ncores > 1) {
    cl <- parallel::makeCluster(ncores)
    on.exit(parallel::stopCluster(cl))  # always stop workers, even on error
    y <- parallel::parLapply(cl = cl, X = xs, fun = slow_fcn)
  } else {
    y <- lapply(xs, slow_fcn)
  }
  y
}
    

    If you use futureverse, you can use:

    my_fcn <- function(xs, ncores = 1) {
      old_plan <- plan(multisession, workers = ncores)
      on.exit(plan(old_plan))
      y <- future_lapply(xs, slow_fcn)
      y
    }
    

And, since future 1.40.0 (2025-04-10), you can achieve the same with a single line of code³:

    my_fcn <- function(xs, ncores = 1) {
      with(plan(multisession, workers = ncores), local = TRUE)
      y <- future_lapply(xs, slow_fcn)
      y
    }
    

I hope that this addition lowers the risk of forgetting to undo any changes made by plan() inside functions. If you forget, you may override what the user intends to use elsewhere. For instance, they might have set plan(batchtools_slurm) to run their R code across a Slurm high-performance-compute (HPC) cluster, but if you change the plan() inside your package function without undoing your changes, then the user is in for a surprise and maybe also hours of troubleshooting.
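To make that concrete, here is a small sketch of the user's perspective (assuming a Slurm cluster with future.batchtools configured, and xs defined): with the local = TRUE pattern above, their plan survives the call; without it, their backend would silently be replaced by yours.

    library(future)

    plan(future.batchtools::batchtools_slurm)   # the user's choice for their script
    y <- my_fcn(xs, ncores = 4)                 # switches to multisession only internally
    print(plan())                               # still the batchtools_slurm plan afterwards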

    But, please avoid switching future backends if you can

I still want to plead with package developers to avoid setting the future backend, even temporarily, inside their functions. There are other reasons for not doing this. For instance, if you provide users with an ncores argument for controlling the amount of parallelization, you risk locking the user into a specific parallel backend. A common pattern is to use plan(multisession, workers = ncores) as in the above examples. However, this prevents the user from taking advantage of other closely related parallel backends, e.g. plan(callr, workers = ncores) and plan(mirai_multisession, workers = ncores). The future.callr backend runs each parallel task in a fresh R session that is shut down immediately afterward, which is beneficial when memory is the limiting factor. The future.mirai backend is optimized for low latency, meaning it can also parallelize shorter tasks, which might otherwise not be worth parallelizing. Also, unlike multisession, these alternative backends can make use of all CPU cores available on modern hardware, e.g. 192- and 256-core machines. The multisession backend, which builds upon parallel PSOCK clusters, is limited to a maximum of 125 parallel workers, because each parallel worker consumes one R connection, and R can only have 125 connections open at any time. There are ways to increase this limit, but that requires extra work. See parallelly::availableConnections() for more details on this problem and how to increase the maximum number of connections.
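As a hedged illustration of what that flexibility looks like in practice (worker counts are arbitrary), these are the kinds of alternatives a user loses when multisession is hard-coded:

    library(future)

    ## Fresh R session per task; memory is returned after each task completes:
    plan(future.callr::callr, workers = 8)

    ## Low-latency daemons, not limited by R's ~125-connection cap:
    plan(future.mirai::mirai_multisession, workers = parallelly::availableCores())

    ## Inspect how many connections (and thus PSOCK workers) this R session allows:
    parallelly::availableConnections()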

You can of course add another "parallel" argument to also allow your users to control which future backend to use, e.g. backend = multisession and ncores = 1. But this might not be sufficient - there are backends that take additional arguments, which you then also need to support in each of your functions. Finally, new backends will be implemented by others in the future (pun intended and not), and we can't predict what they will require.
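If you do go down that path, a hedged sketch of such an API could look like the following; the argument names backend and ncores are hypothetical, not an established futureverse convention, and the code assumes future (>= 1.40.0) and future.apply are attached:

    my_fcn <- function(xs, backend = multisession, ncores = 1) {
      with(plan(backend, workers = ncores), local = TRUE)
      future_lapply(xs, slow_fcn)
    }

    ## e.g. y <- my_fcn(xs, backend = future.callr::callr, ncores = 4)

Even so, a backend that needs arguments beyond workers cannot be expressed through this signature.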

Related to this, I am working on ways for futureverse to (i) choose among a set of parallel backends, not just one, (ii) based on resource specifications (e.g. memory needs and maximum run times) for specific future statements. This will give back some control to the developer over how and where execution happens, and give the end user more options for scaling out to different types of compute resources. For instance, a future_map() call with a 192-GiB memory requirement may only be sent to "large-memory" backends and, if none is available, throw an instant error. Another example is a future_map() call with a 256-MiB memory and 5-minute runtime requirement - that is small enough to be sent to an AWS Lambda or Google Cloud Functions backend, if the user has specified such a backend.

In summary, I argue that it's better to let the user be in full control of the future backend by letting them set it via plan(), preferably at the top of their scripts. If that is not possible, please make sure to use with(plan(...), local = TRUE).

    May the future be with you!

    Henrik

    Reference

    • H. Bengtsson, A Unifying Framework for Parallel and Distributed Processing in R using Futures, The R Journal (2021) 13:2, pages 208-227 [abstract, PDF]

    1. If the argument cl = NULL of parLapply() had been the last argument instead of the first, then parLapply(X, slow_fcn), which resembles lapply(X, slow_fcn), would have also resulted in the default cluster being used.

    2. foreach() takes backend-specific options (e.g. .options.multicore, .options.parallel, .options.mpi, and .options.future). The developer can use these to adjust the default behavior of a given foreach adapter. Unfortunately, when used - or rather, when needed - the code is no longer agnostic to the backend - what will happen if a foreach adapter is used that the developer did not anticipate?

    3. The withr package has with_nnn() and local_nnn() functions for evaluating code with various settings temporarily changed. Following this lead, I was very close to adding with_plan() and local_plan() to future 1.40.0, but then I noticed that mirai supports with(daemons(ncores), { ... }). This works because with() is an S3 generic function. I like this approach, especially since it avoids adding more functions to the API. I added similar support for with(plan(multisession, workers = ncores), { ... }). More importantly, this allowed me to also add the with(..., local = TRUE) variant to be used inside functions, which makes it very easy to safely switch to a temporary future backend inside a function.
