When it comes to the ‘lingua franca’ of data science, there seems to be a face-off between R and Python. R has long been established as the language of researchers and statisticians but Python has come up quickly as a bona-fide challenger, helping embed analytics as a necessity for businesses and other organizations in 2017.
If a tech war does exist between the two languages, it’s a battle fought not so much on technical features but instead on the wider changes within modern business and technology. R is a language purpose-built for statistics, for performing accurate and intensive analysis. So, the fact that R is being challenged by Python — a language that is flexible, fast, and relatively easy to learn — suggests we are seeing a change in who’s actually doing data science, where they’re doing it, and what they’re trying to achieve.
Python versus R — A Closer Look
Let’s make a quick comparison of the two languages on aspects important to those working with data and see what we can learn about the two worlds where R and Python operate.
Python is the easier language to learn. While R certainly isn’t impenetrable, Python’s syntax marks it as a great language to learn even if you’re completely new to programming. The fact that such an easy language would come to rival R within data science indicates the pace at which the field is expanding. More and more people are taking on data-related roles, possibly without a great deal of programming knowledge — Python makes the barrier to entry much lower than R. That said, once you get to grips with the basics of R, it becomes relatively easier to learn the more advanced stuff. This is why statisticians and experienced programmers find R easier to use.
Packages and libraries
Many R packages are in-built. Python, meanwhile, depends upon a range of external packages. This obviously makes R much more efficient as a statistical tool — it means that if you’re using Python you need to know exactly what you’re trying to do and what external support you’re going to need.
R is well-known for its excellent graphical capabilities. This makes it easy to present and communicate data in varied forms. For statisticians and researchers, the importance of that is obvious. It means you can perform your analysis and present your work in a way that is relatively seamless. The ggplot2 package in R, for example, allows you to create complex and elegant plots with ease and as a result, its popularity in the R community has increased over the years.
Python also offers a wide range of libraries which can be used for effective data storytelling. The breadth of external packages available with Python means the scope of what’s possible is always expanding. Matplotlib has been a mainstay of Python data visualization. It’s also worth remarking on upcoming libraries like Seaborn. Seaborn is a neat little library that sits on top of Matplotlib, wrapping its functionality and giving you a neater API for specific applications.
So, to sum up, you have sufficient options to perform your data visualization tasks effectively — using either R or Python!
Analytics and Machine Learning
Thanks to libraries like scikit-learn, Python helps you build machine learning systems with relative ease. This takes us back to the point about barrier to entry. If machine learning is upending how we use and understand data, it makes sense that more people want a piece of the action without having to put in too much effort. But Python also has another advantage; it’s great for creating web services where data can be uploaded by different people. In a world where accessibility and data empowerment have never been more important (i.e., where everyone takes an interest in data, not just the data team), this could prove crucial.
With packages such as caret, MICE, and e1071, R too gives you the power to perform effective machine learning in order to get crucial insights out of your data. However, R falls short in comparison to Python, thanks to the latter’s superior libraries and more diverse use-cases.
Both R and Python have libraries for deep learning. It’s much easier and more efficient with Python though — most likely because the Python world changes much more quickly, new libraries and tools springing up as quickly as the data science world hooks on to a new buzzword. Theano, and most recently Keras and TensorFlow have all made a huge impact on making it relatively easy to build incredibly complex and sophisticated deep learning systems. If you’re clued-up and experienced with R it shouldn’t be too hard to do the same, using libraries such as MXNetR, deepr, and H2O — that said, if you want to switch models, you may need to switch tools, which could be a bit of a headache.
With Python, you can write efficient MapReduce applications with ease, or scale your R program on Hadoop to work with petabytes of data. Both R and Python are equally good when it comes to working with Big Data, as they can be seamlessly integrated with Big Data tools such as Apache Spark and Apache Hadoop, among many others. It’s likely that it’s in this field that we’re going to see R moving more and more into industry as businesses look for a concise way to handle large datasets. This is true in industries such as bioinformatics which have a close connection with the academic world and necessarily depend upon a combination of size and accuracy when it comes to working with data.
So, where does this comparison leave us?
Ultimately, what we see are two different languages offering great solutions to very different problems in data science.
In Python, we have a flexible and adaptable language with a vibrant community of developers working on a huge range of problems and tasks, each one trying to find more effective and more intelligent ways of doing things. In R, we have a purely statistical language with a large repository of over 8000 packages for data analysis and visualization. While Python is production-ready and is better suited for organizations looking to harness technical innovation to its advantage, R’s analytical and data visualization capabilities can make your life as a statistician or data analyst easier.
Recent surveys indicate that Python commands a higher salary than R — that is because it’s a language that can be used across domains; a problem-solving language. That’s not to say that R isn’t a valuable language; rather, Python is the language that just seems to fit the times at the moment.
In the end, it all boils down to your background, and the kind of data problems you want to solve. If you come from a statistics or research background and your problems only revolve around statistical analysis and visualization, then R would best fit your bill. However, if you’re a Computer Science graduate looking to build a general-purpose, enterprise-wide data model which can integrate seamlessly with the other business workflows, you will find Python easier to use.
R and Python are two different animals. Instead of comparing the two, maybe it’s time we understood where and how each can be best used and then harnessed their power to the fullest to solve our data problems.
One thing is for sure, though — neither is going away anytime soon. Both R and Python occupy a large chunk of the data science market-share today, and it will take a major disruption to take either one of them out of the equation completely.