I have been using R since the 2010s. My background is in software engineering, and I've studied (and forgotten) as much statistics as required for a Master's degree in engineering. Therefore, I view R primarily from a programmer's perspective. Data analysis has always been an important part of my work: I worked as a researcher until 2016, and since then, as a software engineer, I have kept using data analytics in my projects, from traditional statistical analysis to AI and machine learning.
Initially, R was a mixed bag for me. While it excelled in statistical mathematics, its syntax and tools felt clumsy compared to general-purpose languages like Python. For example, simple things like filtering data required cumbersome syntax. I always felt that the language was designed by statisticians rather than programmers, which made it excellent in its own domain but a terrible programming language.
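To give an idea of what I mean by cumbersome, here is a small sketch of filtering in plain base R. The data frame and its column names are made up for illustration.

```r
# Hypothetical example data; the column names are invented for illustration.
measurements <- data.frame(
  sensor = c("a", "b", "a", "c", "b"),
  value  = c(1.5, NA, 3.2, 0.4, 2.8)
)

# Base R filtering: the data frame name is repeated for every column,
# and NA has to be handled explicitly, otherwise all-NA rows slip in.
high_values <- measurements[
  !is.na(measurements$value) & measurements$value > 1,
]

# subset() is shorter and treats NA as FALSE, but it is still a one-off
# call rather than a step in a readable chain.
high_values_alt <- subset(measurements, value > 1)

print(high_values)
```

Nothing here is wrong as such, but the repetition and the trailing comma inside the brackets are easy to get wrong and hard to read at a glance.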
In the spring of 2024, I had to delve into R more deeply and was pleasantly surprised by the direction the language and its tools had taken. The language itself has changed, but more importantly its libraries and ecosystem have been steadily improving over the past years. For a programmer, the two biggest quality-of-life improvements are the pipe operator and the Tidyverse.
The Tidyverse, a collection of R packages designed for data science, has significantly improved the user experience. These packages share high-level design principles and similar programming interfaces, making R more cohesive and user-friendly.
The native pipe operator |>, introduced in R 4.1 and similar in spirit to Unix pipes and Java Streams, simplifies the chaining of function calls and makes the code more readable and easier to reason about.
Example: Pipe Operator and Tidyverse
library(dplyr)

# Schematic pipeline: data, condition and old_var are placeholders
result <- data |>
  filter(condition) |>
  mutate(new_var = old_var * 2) |>
  summarize(mean_value = mean(new_var))
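The schematic pipeline above can be made concrete with a data set that ships with R. This is only a sketch: the choice of mtcars, the four-cylinder filter and the unit conversion are my own illustration, and it assumes the dplyr package is installed.

```r
library(dplyr)

# mtcars is a data set that ships with R.
result <- mtcars |>
  filter(cyl == 4) |>                # keep four-cylinder cars
  mutate(kpl = mpg * 0.425) |>       # miles per gallon to km per litre (approx.)
  summarize(mean_kpl = mean(kpl))    # collapse to a single summary row

print(result)
```

Each step takes a data frame and returns a data frame, which is what makes the chain easy to read from top to bottom.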
ggplot2, part of the Tidyverse, uses the "grammar of graphics" to create complex and flexible visualizations with simple syntax, enhancing the quality and ease of data visualization.
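As a rough illustration of the layered style, the sketch below builds a scatter plot with a trend line from the built-in mtcars data set. The data set and the chosen variables are my own example, and it assumes ggplot2 is installed.

```r
library(ggplot2)

# Each "+" adds one component of the plot: data and aesthetic mappings,
# a point geometry, a fitted linear trend per group, and labels.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")

# Printing the object renders the plot; ggsave() could write it to a file.
print(p)
```

The plot is an ordinary object, so it can be stored, extended with more layers, or passed to a function before it is ever drawn.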
Tidymodels is the Tidyverse-style framework for modeling and machine learning: a collection of packages that follow the same design principles. It is designed to guide the user towards best practices, making model training more straightforward and reliable.
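A minimal sketch of what this looks like in practice, assuming the tidymodels meta-package is installed. The data set, the formula and the 80/20 split are arbitrary choices for illustration, not a recommended modeling setup.

```r
library(tidymodels)

set.seed(42)  # make the random split reproducible

# Hold out a test set before touching the model.
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test  <- testing(split)

# The model type and the engine are declared separately from fitting.
model <- linear_reg() |>
  set_engine("lm") |>
  fit(mpg ~ wt + hp, data = train)

# Predictions come back as a tibble with a .pred column.
predictions <- predict(model, new_data = test) |>
  bind_cols(test)

# Evaluate on the held-out data only.
print(rmse(predictions, truth = mpg, estimate = .pred))
```

The split-first workflow is the kind of nudge towards good practice I mean: the API makes it natural to keep training and test data apart.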
Shiny is a powerful tool for building interactive web applications quickly, akin to Python's Dash. It allows for low-code rapid prototyping and deployment of data-driven applications.
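To show how little code a prototype needs, here is a minimal sketch of a Shiny app, assuming the shiny package is installed. The input name, the title and the use of the built-in faithful data set are my own choices for illustration.

```r
library(shiny)

# UI: one slider and one plot.
ui <- fluidPage(
  titlePanel("Histogram prototype"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

# Server: the plot re-renders automatically whenever input$bins changes.
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$waiting, breaks = input$bins,
         main = "Waiting time between eruptions", xlab = "Minutes")
  })
}

app <- shinyApp(ui, server)

# Start the app only in an interactive session, so a script run does not block.
if (interactive()) runApp(app)
```

There is no HTML, JavaScript or routing code here; the reactive link between the slider and the plot is implied by reading input$bins inside renderPlot().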
Comparing R to Other Languages: A Programmer’s Perspective
For larger projects, I still prefer Python, especially for machine learning or neural networks. However, R's modern tools, particularly with the Tidyverse, have made it a viable and pleasant environment for data analysis.
A programming language is a tool for transferring thoughts to a computer. While I write Python faster due to familiarity, R's improvements have made it nearly as efficient for data-centric tasks. The paradigms offered by R and Pandas are quite similar, focusing on data frames and vectorized operations.
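The similarity is easy to see in a small sketch. The R code below uses made-up example data, and the comments show what I would write for the same step in pandas.

```r
# Hypothetical example data, invented for illustration.
readings <- data.frame(
  city    = c("Oslo", "Rome", "Cairo"),
  celsius = c(4, 18, 31)
)

# Vectorized arithmetic on a whole column, no explicit loop.
# pandas: readings["fahrenheit"] = readings["celsius"] * 9 / 5 + 32
readings$fahrenheit <- readings$celsius * 9 / 5 + 32

# Logical (boolean) indexing to select rows.
# pandas: warm = readings[readings["celsius"] > 15]
warm <- readings[readings$celsius > 15, ]

print(warm)
```

In both environments the unit of thought is the column, not the individual value, which is why moving between the two feels natural.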
Overall, modern R, enhanced with Tidyverse, has proven to be a pleasant and efficient environment even for a programmer. Tools like RStudio and support in Visual Studio Code have improved, making the R ecosystem more robust and user-friendly. The improvements in package management and virtual environments have elevated R from just a scripting language to a legitimate software development environment.