Interoperability was a key theme in open-source data languages in 2023. Continued innovation in Arrow (a language-agnostic in-memory standard for data storage), growing adoption of Quarto (the language-agnostic heir apparent to R Markdown), and even pandas creator Wes McKinney joining Posit (the language-agnostic rebranding of RStudio) all illustrate the ongoing investment in breaking down barriers between different programming languages and paradigms.
Despite these advances in technical interoperability, individual developers will always face more friction than state-of-the-art tooling when moving between languages. Learning a new language is easy enough; programming 101 concepts like truth tables and control flow translate seamlessly. But the ergonomics of a language do not. The tips and tricks we learn to be hyper-productive in a primary language are comfortable, familiar, elegant, and effective. They just feel good. Working in a new language, developers often face a choice between forcing their favored workflows into a new tool where they may not “fit”, writing technically correct yet plodding code to get the job done, or approaching the new language as a true beginner to learn its “feel” from the ground up.
Fortunately, some of these higher-level paradigms have begun to bleed across languages, enriching previously isolated tribes and enabling developers to take their advanced skillsets with them across languages. For any R users who aim to upskill in python in 2024, recent tools and new versions of old favorites have made strides in converging the R and python data science stacks. In this post, I will overview some recommended tools that are truly pythonic while capturing the comfort and familiarity of some favorite R packages of the tidyverse variety.1
Just to be clear:
If you told me you liked New York’s Metropolitan Museum of Art, I might say that you might also like Chicago’s Art Institute. That doesn’t mean you should only go to the museum in Chicago or that you should never go to the Louvre in Paris. That’s not how recommendations (by human or recsys) work. This is an “opinionated” post in the sense of “I like this”, not in the sense of “you must do this”.
The tools I highlight below tend to have two competing features:

- ergonomics that feel familiar to R users, echoing the workflows of well-loved R packages
- first-class citizenship in the python ecosystem, rather than existence as a port
The former is important because otherwise there’s nothing tailored about these recommendations; the latter is important so users actually engage with the python language and community instead of dabbling around in its more peripheral edges. In short, these two principles exclude tools that are direct ports between languages with that as their sole or main benefit.2
For example, siuba and plotnine were written with the direct intent of mirroring R syntax. They have seen some success and adoption, but more niche tools come with liabilities. With smaller user bases, they tend to lag in pace of development, community support, prior art, StackOverflow questions, blog posts, conference talks, discussions, others to collaborate with, cachet in a portfolio, etc. Instead of enjoying the ergonomics of an old language or embracing the challenge of learning a new one, ports can sometimes force developers to invest energy into a “secret third thing”: learning tools that isolate them from both communities while facing inevitable snags by themselves.
When in Rome, do as the Romans do – but if you’re coming from the U.S. that doesn’t mean you can’t bring a universal adapter that can help charge your devices in European outlets.
With that preamble out of the way, below are a few recommendations for the most ergonomic tools for getting set up, conducting core data analysis, and communicating results.
To preview these recommendations:
- Set Up: pyenv for managing python installations (with VS Code as an IDE)
- Analysis: polars for data wrangling and seaborn for visualization
- Communication: great_tables for display tables and Quarto for reproducible reports
- Miscellaneous: pdm for environment and dependency management and ruff for linting and styling
Set Up

The first hurdle is often getting started – both in terms of installing the tools you’ll need and getting into a comfortable IDE to run them.

Before new users can even run their first print("hello world"), they face a range of installation options (system python, the python installer UI, Anaconda, Miniconda, etc.), each with its own kinks. These decisions are made harder in python since projects tend to have stronger dependencies on specific versions of the language, requiring one to switch between versions. For both of these reasons, I favor pyenv (or pyenv-win for those on Windows) for easily managing python installation(s) from the command line. While installing pyenv itself is technically a bit different than clicking through an installer, it’s similar in that it “just works” with just a few commands (a sample session follows the list):
- pyenv install --list: to see which python versions are available to install
- pyenv install <version number>: to install a specific version
- pyenv versions: to see which python versions are installed on your system
- pyenv global <version number>: to set one python version as the global default
- pyenv local <version number>: to set a python version to be used within a specific directory/project
As for an IDE, VS Code is a comfortable, well-supported home for python work.3

Analysis

As data practitioners know, we’ll spend most of our time on cleaning and wrangling. As such, R users may particularly struggle to abandon their favorite tools for exploratory data analysis like dplyr and ggplot2. Fans of those packages often appreciate how their functional paradigm helps achieve a “flow state”. Precise syntax may differ, but new developments in the python wrangling stack provide increasingly close analogs to some of these beloved Rgonomics.
While pandas is undoubtedly the best-known wrangling tool in the python space, I believe the growing polars project offers the best experience for a transitioning developer (along with other nice-to-have benefits like being dependency-free and blazingly fast). polars may feel more natural and less error-prone to R users for many reasons (see the sketch after this list):
- It offers piped, dplyr-like syntax such as select, filter, etc. and has demonstrated that the project values a clean API (e.g. recently renaming groupby to group_by)
- It does not rely on row indexes (which pandas sometimes alters in-place), so code is more idempotent and avoids a whole class of failure modes for new users
- It matches the high-level, semantic code of dplyr: making the transformation of multiple variables at once fast with column selectors, concisely expressing window functions, and working with nested data (or what dplyr calls “list columns”) with lists and structs
- Like dplyr with its many backends (e.g. dbplyr), polars can be used to write lazily-evaluated, optimized transformations, and its syntax is reminiscent of pyspark should users ever need to switch between the two
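To make that concrete, here is a minimal sketch of a polars pipeline; the data and column names are invented for illustration:

```python
import polars as pl

# Toy data, invented for illustration
df = pl.DataFrame({
    "species": ["adelie", "adelie", "gentoo", "gentoo"],
    "bill_length": [38.1, 39.5, 46.0, 47.2],
    "body_mass": [3750, 3800, 5000, 5200],
})

summary = (
    df
    .filter(pl.col("body_mass") > 3000)                   # like dplyr::filter()
    .with_columns(                                        # like dplyr::mutate()
        (pl.col("body_mass") / 1000).alias("body_mass_kg")
    )
    .group_by("species")                                  # like dplyr::group_by()
    .agg(pl.col("bill_length").mean().alias("avg_bill"))  # like dplyr::summarize()
)
print(summary)
```

Swapping df for df.lazy() and ending the chain with .collect() is all it takes to opt into the lazily-evaluated, query-optimized mode mentioned above.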
R users will also likely miss ggplot2 for visualization, both in terms of its intuitive and incremental API and the stunning graphics it can produce.
Fortunately, seaborn’s objects interface seems to strike a great balance, offering a similar workflow (seaborn cites ggplot2 as an inspiration) while bringing all the benefits of using an industry-standard tool.
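For a flavor of that incremental, layered style, here is a minimal sketch with the objects interface (load_dataset fetches seaborn’s example penguins data on first use):

```python
import seaborn.objects as so
from seaborn import load_dataset

penguins = load_dataset("penguins")  # downloads seaborn's example dataset

(
    so.Plot(penguins, x="bill_length_mm", y="body_mass_g", color="species")
    .add(so.Dot())                 # roughly geom_point()
    .add(so.Line(), so.PolyFit())  # roughly geom_smooth() with a polynomial fit
    .label(x="Bill length (mm)", y="Body mass (g)")
    .show()
)
```

Each .add() call layers another mark onto the plot, much like chaining + geom_*() calls in ggplot2.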
Communication

Historically, one possible dividing line between R and python has been framed as “python is good at working with computers; R is good at working with people”. While that framing is partially inspired by reductive takes that R is not production-grade, it is not without truth that R’s academic roots spurred it to overinvest in a rich “communication stack” for translating analytical outputs into human-readable, publishable outputs. Here, too, the gaps have begun to close.
For presentation-ready tables, great_tables brings R’s gt package to python. I’m more comfortable recommending this since it’s maintained by the same developer as the R version (to support long-term feature parity), backed by an institution and not just an individual (to ensure it’s not a short-lived hobby project), and the design feels like it does a good job balancing R inspiration with pythonic practices.
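A minimal sketch, assuming a recent great_tables release (the summary data is invented):

```python
import polars as pl
from great_tables import GT

# Invented summary data for illustration
df = pl.DataFrame({
    "species": ["adelie", "gentoo"],
    "avg_mass_kg": [3.70, 5.09],
})

(
    GT(df)
    .tab_header(title="Penguin body mass", subtitle="Average by species")  # like gt::tab_header()
    .fmt_number(columns="avg_mass_kg", decimals=2)                         # like gt::fmt_number()
)
```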
For reports, Quarto (the R Markdown heir mentioned in the introduction) is just as at home in python. While it renders qmd files, which are more like their .rmd predecessors, its renderer can also handle Jupyter notebooks to enable collaboration across team members with different preferences.
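For instance, a minimal qmd document with an executable python chunk might look like this:

````
---
title: "Penguin report"
format: html
---

Narrative text works just like it does in R Markdown.

```{python}
# an executable chunk, analogous to an R chunk in .rmd
1 + 1
```
````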
Miscellaneous

A few more tools may be helpful and familiar to R users who tend towards the more “developer” (versus “analyst”) side of the spectrum. These, in my mind, have even more varied pros and cons, but I’ll leave them for consideration:
For environment and dependency management, python has no shortage of options (virtualenv, conda, pip-tools, pipenv, poetry, and that doesn’t even scratch the surface), each with different pros and cons, and a phenomenal amount of ink/pixels has been spilled litigating these trade-offs. Putting all that aside, lately I’ve been favoring pdm because it prioritizes the features I care most about: auto-updating pyproject.toml, isolating dependencies from dependencies-of-dependencies, active development and error handling, and mostly just working pretty undramatically.
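As a sketch of the day-to-day workflow (the script name is invented):

```
pdm init                  # scaffold a project and its pyproject.toml
pdm add polars            # add a dependency; pyproject.toml and the lock file update automatically
pdm add -dG lint ruff     # add a dependency to a dev-only group
pdm run python report.py  # run a script inside the project's environment
```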
For linting and styling, ruff provides a range of options (think R’s lintr and styler) and offers a one-stop shop over what can be an overwhelming number of atomic tools in this space (isort, black, flake8, etc.). ruff is super fast, has a nice VS Code extension, and, while this class of tools is generally considered more advanced, I think linters can be a fantastic “coach” for new users about best practices.
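As a taste of that coaching, here is a hypothetical file with two issues ruff would flag by rule code (and can repair with ruff check --fix):

```python
# analysis.py (hypothetical): ruff would flag both issues below
import os  # F401: `os` imported but unused


def grams_to_kg(mass_g):
    label = "kg"  # F841: local variable `label` is assigned to but never used
    return mass_g / 1000
```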
Each recommendation here could itself be its own tutorial or post. In particular, I hope to showcase the Rgonomics of polars, seaborn, and great_tables in future posts.
1. Of course, languages have their own subcultures too. The tidyverse and data.table parts of the R world tend to favor different semantics and ergonomics. This post caters more to the former.
2. There is no doubt a place for language ports, especially for earlier-stage projects where no native language-specific standard exists. For example, I like Karandeep Singh’s lab’s work on a tidyverse for Julia, and I maintain my own dbtplyr package to port dplyr’s select helpers to dbt.
3. If anything, the one challenge of VS Code is the sheer number of set-up options, but to start out, you can see these excellent tutorials from Rami Krispin on recommended python and R configurations.