A new major version of the curl package has been released to CRAN. This release brings both internal improvements and new user-facing functionality, in particular with respect to concurrent downloads. From the NEWS file:
curl 5.0.0
- New function multi_download() which supports concurrent downloads and resuming download for large files, while giving detailed progress information.
- Windows: updated libcurl to 7.84.0 + nghttp2
- Windows: default to CURLSSLOPT_NATIVE_CA when using openssl unless an envvar with CURL_CA_BUNDLE is set.
- Use the new optiontype API for type checking if available (libcurl 7.73.0)
The curl package is used by most other R packages for performing HTTP requests. Over 60% of rOpenSci packages directly or indirectly depend on curl for network interaction, hence improvements and bugs in curl have a big impact on the entire ecosystem.
The most exciting new feature is multi_download(): an advanced alternative to curl_download(). It can perform many requests concurrently, with nice progress updates and support for interrupting and resuming large files. This function does not error in case any of the individual requests fail; it returns a data frame with information about the status of each request.
pkg <- 'curl'
mirror <- 'https://cloud.r-project.org'
db <- available.packages(repos = mirror)
packages <- c(pkg, tools::package_dependencies(pkg, db = db, reverse = TRUE)[[pkg]])
versions <- db[packages,'Version']
urls <- sprintf("%s/src/contrib/%s_%s.tar.gz", mirror, packages, versions)
res <- curl::multi_download(urls)
all.equal(unname(tools::md5sum(res$destfile)), unname(db[packages, 'MD5sum']))
Above is a small example from the ?multi_download manual, which downloads all reverse dependencies of a given CRAN package. It downloads 316 files, 261.41 MB in total. On a fast server, the multi_download() part takes about 1 or 2 seconds.
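Because multi_download() reports failures in its return value instead of raising an error, it is worth checking the result after each batch. Below is a minimal sketch, assuming the result has the success, status_code, url and error columns described in ?multi_download. The helper name failed_downloads() and the example data frame are made up for illustration and are not part of the package.

```r
# Hypothetical helper: keep the rows of a multi_download() result that either
# did not complete or completed with an HTTP error status.
failed_downloads <- function(res) {
  completed  <- res$success %in% TRUE          # FALSE or NA counts as not completed
  http_error <- res$status_code %in% 400:599   # NA or 0 never matches
  res[!completed | http_error, c("url", "status_code", "error")]
}

# Made-up result for illustration only; in practice use
# res <- curl::multi_download(urls)
res <- data.frame(
  success     = c(TRUE, TRUE, FALSE),
  status_code = c(200L, 404L, 0L),
  url         = c("https://example.org/a",
                  "https://example.org/b",
                  "https://example.org/c"),
  error       = c(NA, NA, "Could not resolve host"),
  stringsAsFactors = FALSE
)

failed <- failed_downloads(res)
print(failed)
```

The URLs returned by such a helper could be passed to multi_download() again, for example with resume = TRUE for large files.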
The function scales well in terms of the number of requests. Below is an example, which downloads the DESCRIPTION file for the first 3000 CRAN packages. On a fast server (with HTTP/2 support) this again takes about 2 or 3 seconds.
mirror <- 'https://cloud.r-project.org'
pkgs <- row.names(available.packages(repos = mirror))[1:3000]
urls <- sprintf('%s/web/packages/%s/DESCRIPTION', mirror, pkgs)
files <- sprintf('descriptions/%s.txt', pkgs)
dir.create('descriptions', showWarnings = FALSE)
res <- curl::multi_download(urls, files)
This second example will especially benefit from HTTP/2 support, because there are many small files that can be multiplexed, whereas with HTTP/1.1 these need to be requested one after another.
The Windows binaries now use libcurl 7.84.0 with nghttp2 1.51.0. The latter brings support for HTTP/2, but only when using the OpenSSL TLS back-end, which is not (yet) the default. You can change this by setting the CURL_SSL_BACKEND environment variable in your ~/.Renviron file and then restarting R. The Windows vignette explains this in more detail.
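As an illustration, and assuming you want the OpenSSL back-end, the line in ~/.Renviron would look like CURL_SSL_BACKEND=openssl (the value name is taken from the Windows vignette as I remember it; please check it against your installed version). After restarting R, a quick sketch to see what is in effect:

```r
# What was requested via the environment (empty string if the variable is not set)
backend_env <- Sys.getenv("CURL_SSL_BACKEND")
print(backend_env)

# What libcurl itself reports about its TLS back-end(s)
ssl_info <- curl::curl_version()$ssl_version
print(ssl_info)
```

The variable has to be set before the curl package is loaded, which is why it belongs in ~/.Renviron and not in the middle of a script.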
To test if HTTP/2 is working you can perform a verbose request:
library(curl)
multi_download('https://httpbin.org/get', tempfile(), verbose = TRUE)
And the output will show HTTP/2 200 somewhere in the response:
...
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
< HTTP/2 200
...
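Without making a request, you can also ask libcurl whether it was built with HTTP/2 support at all. A small sketch using curl_version(), whose result includes an http2 flag:

```r
library(curl)

v <- curl_version()
print(v$version)  # libcurl version string
print(v$http2)    # TRUE if libcurl was built with HTTP/2 (nghttp2) support
```

Note that TRUE here only says the library is capable of HTTP/2. Whether a particular connection actually uses it still depends on the TLS back-end and on the server, so the verbose request remains the more reliable test.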
Right now OpenSSL is not the default, because the native Windows TLS back-end may be more robust, which has to do with the next topic.
As mentioned above, libcurl on Windows can use one of two SSL back-ends (for https): SecureChannel (the native Windows TLS implementation) or OpenSSL. OpenSSL is also used by most other operating systems and is therefore better tested, and moreover it supports HTTP/2. However, there was always a big limitation with OpenSSL on Windows: it required us to ship a ca-bundle with root certificates, which gets outdated quickly and may not work well on corporate networks that use custom SSL certificates.
This has now changed because libcurl has gained a new experimental option CURLSSLOPT_NATIVE_CA, which lets OpenSSL import the root certificates from the native Windows certificate store instead of a custom ca-bundle. The R package now enables this option by default when using the OpenSSL back-end. As a result, curl in R should support the same TLS connections regardless of which SSL back-end is in use. This might make OpenSSL once again the preferable option, and if this works well we may make it the default in a future version of the R package.
The final topic is mostly an internal change, but I’m pretty proud of it because it is based on functionality in libcurl that I proposed myself, and is now finally widely available.
At the curl-up 2020 conference I gave a presentation, "5 years of libcurl bindings for R", after which we had a discussion on potential improvements for language bindings, such as the R package. Eventually this led to the proposal of a new API that exposes a list of supported libcurl options and their types to the language binding. This is important so that when users in R set an option in new_handle(), it can be verified that the option is valid and has the correct type (e.g. string, number, vector), because passing invalid types to libcurl will result in a crash.
The proposal was merged later in 2020, and is now (2 years later) available in the stable versions of most operating systems. Version 5.0.0 of the R package uses this API when it is available, which makes the bindings safer to use.
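From the R side, the options that can be set on a handle are listed by curl_options(). The sketch below only shows this user-visible interface, namely listing options and what happens with an unknown option name. It does not demonstrate the internal type check itself, which depends on the libcurl version in use. The option name no_such_option is deliberately made up.

```r
library(curl)

# List the known options whose name matches a pattern
timeout_opts <- curl_options("timeout")
print(names(timeout_opts))

# A valid option with a value of the expected type
h <- new_handle(timeout = 10L)

# An unknown option name is rejected with an R error before it reaches libcurl
msg <- tryCatch(
  new_handle(no_such_option = 1),
  error = function(e) conditionMessage(e)
)
print(msg)
```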
It's been a journey, but with help from friends like @opencpu, today we landed a new API to libcurl to query it for details about "easy options". This should allow for better libcurl bindings in the future.
— daniel:// stenberg:// (@bagder) August 27, 2020
Suitable starting point for reading up: https://t.co/OlAWuBuDaR