This will be the first of hopefully a long series of posts in which I learn the R programming language by working through the examples in “An Introduction to Statistical Learning”, which is free in PDF form. Before we get too deep into code and math, keep reading if you want a high-level overview of Python (which I am mostly comfortable with) and R (which I know very little of).
Most people who are still reading probably want to know my opinions on R vs Python. I can’t really help you there yet, seeing as how I haven’t really learned much R. I can tell you a little of what I know about the language itself though.
I can say that I have several reasons for taking on this project. First and foremost, I hope to learn a new programming language that is relevant to my field. I don’t really consider either language to have a distinct advantage over the other. Performance, features, and support are near enough to each other that saying one is better than the other is a good way to get into fights with random people on the internet.
Second, while I feel comfortable applying all of these concepts in Python, the thought process is not complete without articulation, and mastery of a concept is best demonstrated by teaching others. So solidifying and presenting my knowledge of statistics and statistical learning is kind of the final step of fully understanding them for myself, and the next step towards full mastery.
Now for a brief, woefully incomplete, and hopefully not too technical breakdown of R and Python. Both R and Python are interpreted languages, as opposed to compiled languages. This means that your code is run by an interpreter at the moment you execute it, rather than being translated into a machine-code executable ahead of time. It is also what lets you type a line into a console and see the result right away. Interpreted languages are typically slower to execute than compiled languages, although this is mitigated by packages that contain precompiled code.
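To make the point about precompiled code a little more concrete, here is a minimal sketch in Python using only the standard library. The function name `loop_sum` is my own and purely illustrative. The built-in `sum` is implemented in C in the standard CPython interpreter, so here it stands in for a package that ships precompiled code. Timings depend on the machine, so I have not listed any.

```python
import timeit


def loop_sum(values):
    """Add the numbers one at a time in interpreted Python code."""
    total = 0
    for v in values:
        total += v
    return total


numbers = list(range(200_000))

# Both approaches produce the same answer.
assert loop_sum(numbers) == sum(numbers)

# Time five runs of each. In CPython the built-in sum runs in compiled C,
# while loop_sum is executed step by step by the interpreter.
loop_time = timeit.timeit(lambda: loop_sum(numbers), number=5)
builtin_time = timeit.timeit(lambda: sum(numbers), number=5)

print(f"interpreted loop: {loop_time:.4f} s")
print(f"built-in sum:     {builtin_time:.4f} s")
```

On most machines the built-in version should come out noticeably faster, which is the same reason numeric packages for both languages lean on compiled code under the hood.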
R is usually written in a procedural or functional style, whereas Python is built around objects (although it supports procedural and functional code too). Procedural code tends to require fewer lines to achieve the same result, although this is not always the case. Procedural code also tends to execute tasks somewhat more quickly, although once again, this is not always the case. Object-oriented code tends to be more flexible, but this is not often a consideration in pure data workflows. R does support object-oriented programming as well, through its S3, S4, and Reference Class systems, which is how generic functions such as print() and summary() behave differently depending on the class of their argument.
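The difference between the two styles is easier to see side by side. Here is a minimal sketch that computes the same average both ways in Python. The names `mean`, `Sample`, and `scores` are made up for illustration and are not taken from any library.

```python
# Procedural style: the data is passed to a standalone function.
def mean(values):
    return sum(values) / len(values)


scores = [2.0, 4.0, 6.0]
procedural_result = mean(scores)


# Object-oriented style: the data and the behaviour live together in a class.
class Sample:
    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return sum(self.values) / len(self.values)


oo_result = Sample(scores).mean()

print(procedural_result, oo_result)  # 4.0 4.0
```

The procedural version is shorter, which matches the general tendency described above. The class version takes more lines, but it gives you a natural place to hang more behaviour later.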
Both R and Python are dynamically typed. This means that a type belongs to the value rather than to the variable name: it is worked out at run time, and the same name can later be pointed at a value of a different type. R has an advantage here, in that R natively supports arrays and data frames. Python requires the NumPy and pandas packages, respectively, to handle these data types. Dynamically typed languages tend to perform worse than statically typed languages, but also allow for more flexibility in programming. This is particularly important in data science, especially for data cleaning and exploratory data analysis (EDA). However, data cleaning and EDA are increasingly being performed in graphical software such as Tableau and Microsoft Power BI.
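Here is a small sketch of both ideas in plain Python, with no packages installed. The variable names are arbitrary. The first half shows a name changing type, and the second half shows why Python’s built-in list is not a substitute for a numeric array: multiplying a list repeats it rather than scaling each element. In R, by comparison, `c(1, 2, 3) * 2` gives `2 4 6` out of the box.

```python
# Dynamic typing: the name x takes on the type of whatever value it holds.
x = 42
print(type(x).__name__)  # int
x = "forty-two"
print(type(x).__name__)  # str

# A built-in list is a general container, not a numeric array,
# so multiplying it repeats the contents.
values = [1, 2, 3]
repeated = values * 2
print(repeated)  # [1, 2, 3, 1, 2, 3]

# Element-wise arithmetic in plain Python needs an explicit comprehension.
doubled = [v * 2 for v in values]
print(doubled)  # [2, 4, 6]
```

This gap is what NumPy arrays and pandas data frames fill on the Python side.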
There are some things that you just can’t easily do in Python, like certain types of time series analysis, that are fairly trivial in R. Similarly, there are certain things you can do with Python that are challenging in R. Specific to data science, deep learning packages tend to be more fully featured and more mature for Python than for R. Beyond that, both languages enjoy widespread user bases and community support.
Ultimately, which you choose depends first and foremost on whether or not you need features that one offers but not the other. These tend to be niche, and increasingly rare as both communities continue to develop features and packages for their respective languages. In other words, it’s highly unlikely that one platform will not include the features you need, and if you do need something, you probably won’t have to wait too long for a package that does it.
Performance between the two isn’t compellingly different, especially if you use the Tidyverse set of packages for R. These extremely popular packages make your code a lot more organized, at the cost of performance in some cases. If you really need more optimization or fast, on-the-fly manipulation of large datasets, then you’re probably gonna end up doing a proof of concept in R or Python, and then handing it off to a dev to rewrite it in Scala or Julia or Fortran.
This means that for the vast majority of people, the decision between R and Python is mostly one of aesthetics. Python’s entire design philosophy is based on readable, or “pythonic,” code. If you write decent Python, you can expect others to be able to easily see what it is you are doing. R seems to be a little more pithy, and sometimes it can be a little more challenging to follow along. I’ll be able to expound on this further as I write more of these articles. As mentioned at the beginning of this post, I know very little about actually using R. Hopefully, I’ll be able to form some firmer opinions about the actual merits of each as I learn more.
Next up: An Introduction to “An Introduction to Statistical Learning.”