I know what you’re thinking: “not ANOTHER post telling me which programming language to use!”
Don’t worry, I actually plan to do the opposite, i.e. explain why you shouldn’t be so concerned about which ‘tool’ you’re using. In addition, I’ll offer three pieces of essential advice.
There’s a great deal of obsession throughout the Data Science community about which language is The Best, and how all other languages are inferior.
Such viewpoints are simply misguided…
Firstly, all the languages are great, and each excels in certain ways, so learn as many as possible. Yes, I know, some are better than others when it comes to hard-core statistical analysis, or Deep Learning, or …, but realistically, there are quite a few languages to choose from these days that are basically on par for most of what you will do most of the time in your role, i.e. ingest data, interrogate data, clean data, plot data, run some descriptive statistics and produce some models.
Secondly, you sometimes simply don’t have a choice as to which language you can use for a particular role/project – so learn more than one!
1. Know at least one language from each programming paradigm
Simply knowing R & Python is insufficient, as they both stem from the same class of computation, as shown below:
The above image shows the categorisation of some of the most popular programming languages used in Data Science. A more exhaustive formalisation can be found here.
As the image shows, some popular languages sit across more than one paradigm (such as R and Python), so understanding the fundamental principles of each paradigm will greatly help you improve as a developer.
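To make the idea of paradigms concrete, here is a minimal sketch of one small task (averaging a column that contains missing values) written in three styles in Python. The data and all function and class names are made up for illustration, and this is only one of many ways to draw the lines between paradigms.

```python
from functools import reduce

# Made-up example data: a numeric column with missing values.
raw = [3.0, None, 4.5, None, 1.5]


# Imperative style: spell out each step and update state as you go.
def mean_imperative(values):
    total = 0.0
    count = 0
    for v in values:
        if v is not None:
            total += v
            count += 1
    return total / count


# Functional style: compose filter and reduce without mutating anything.
def mean_functional(values):
    present = tuple(filter(lambda v: v is not None, values))
    return reduce(lambda acc, v: acc + v, present, 0.0) / len(present)


# Object-oriented style: bundle the data with the behaviour that acts on it.
class Column:
    def __init__(self, values):
        self.values = [v for v in values if v is not None]

    def mean(self):
        return sum(self.values) / len(self.values)
```

All three give the same answer for `raw`. What differs is how the work is organised, and that way of thinking is the part that carries over when you pick up a new language.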
I’ve been fortunate to learn a number of languages both formally (through university studies) and informally (by teaching myself for various roles/projects), and to apply most of them in production environments. This has helped me become a much stronger developer and, consequently, makes it far easier to learn new languages and make the most of them.
Here’s roughly how my own journey looks for the core coding languages. This list is not exhaustive: it excludes SQL and its parallelised variants; Fortran, which I used for part of my PhD; Scala (Spark), which I’ve used for production level code; and other older languages and scripting tools, such as Haskell, Prolog and Perl:
2. Learn good software engineering practices
Beyond knowing how to code across different paradigms, it’s imperative to adopt strong software engineering practices, such as:
- Code reuse principles
- Adequate commenting of code
- Unit testing
- Source control
- Good code design and readability
This is all especially important when working on production level systems, and as part of a team.
This has become a key point of differentiation between candidates when hiring Data Scientists, so if you post your code to a public repo, make sure you highlight how you follow these principles!
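As a small illustration of several of these practices together (reuse, commenting, unit testing and readable design), here is a hedged sketch in Python. The helper names `drop_missing` and `mean` are hypothetical, chosen only for this example, and the tests use the standard library’s `unittest` module.

```python
import unittest


def drop_missing(values):
    """Return a new list without None entries; the input is left unchanged."""
    return [v for v in values if v is not None]


def mean(values):
    """Return the arithmetic mean of the non-missing values.

    Raises ValueError when no usable values remain, rather than
    failing later with a less helpful ZeroDivisionError.
    """
    present = drop_missing(values)  # reuse the helper instead of re-filtering
    if not present:
        raise ValueError("mean() needs at least one non-missing value")
    return sum(present) / len(present)


class MeanTests(unittest.TestCase):
    def test_ignores_missing_values(self):
        self.assertEqual(mean([1.0, None, 3.0]), 2.0)

    def test_rejects_input_with_no_usable_values(self):
        with self.assertRaises(ValueError):
            mean([None])
```

Saved as a module, the tests can be run with `python -m unittest`. Committing code and tests together under source control is what lets a reviewer see these habits at a glance.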
3. Understand the fundamentals
When I’m building a new team, or looking to expand an existing one, I focus much more on the fundamental skills that an applicant possesses, because I know they can easily pick up what they lack.
For instance, if I’m recruiting for a position that will develop production level code in Python, a candidate with good working knowledge of just R and/or Python won’t be viewed as favourably as someone who has production level experience in, say, C/C++ but only limited knowledge of R/Python.
So rather than just learning how to use a particular language (after all, anyone can do a Python tutorial or two and then profess to be an expert), make sure you know how to actually code, including practising the aforementioned software engineering principles, and understand the fundamental principles of the language: what its limitations are, how to extract the most value out of it, and how it compares to others.
In addition, the same applies when I’m looking at their background in Machine Learning/Deep Learning – someone with a strong fundamental understanding of maths/stats is in a much better position to learn a new algorithm, compared to a person who may have already used the algorithm, but lacks the depth of knowledge of how it actually works.
Knowing the fundamentals is the essence of being a successful Data Scientist. It’s what will make you stand out from other candidates when applying for your next role/promotion, and it will make you a much better and more confident Data Scientist. Technical fields such as Data Science and IT are all about continual learning, and to do this effectively, you need to be smart about how you learn by leveraging the latest research.
One final piece of advice – gain experience in distributed data processing. Many Data Science practices, in both the government and private sectors, are moving towards distributed processing, so skill up! It’s often what helps me distinguish between candidates when hiring.
Btw, if you’d like to test your coding expertise, match the below code to the language used (in yellow) 🙂