Learn Any Programming Language as a Data Scientist

Introduction

It's important that a programmer not only know multiple languages, but also know how to learn languages. Learning languages is a good skill to have for two reasons: first, learning new programming languages is a useful job credential; and second, learning new programming languages improves your understanding of the languages you already know, and therefore your ability to program in them.

Imagine you're a Python data scientist and everything is going great. You're working on social media streaming analysis and your pipeline is flawless. But over time, the data starts coming in faster and faster. Python can only do so much. At some point you're going to need to switch to Scala. If you know how to learn languages, learning Scala and switching your code over is a much easier task.

Convinced? Great. So how do we go about learning new programming languages as a data scientist? Easy. By doing data science.

Learn Any Programming Language: The Process

My process for learning a new programming language is straightforward: I build the core algorithms for doing data science in the new language and assemble a simple pipeline around them. This typically takes the following steps:

  1. Write a KNN algorithm
  2. Write a Naive Bayes algorithm
  3. Outline a pipeline in the new language
  4. Try ingesting data in key formats (e.g., CSV, XML, JSON, web data)
  5. Try outputting results in the same formats

I start by writing two algorithms: KNN and Naive Bayes. I use these algorithms for two reasons. For one, they're pretty simple: neither takes that much code, so they're digestible first projects in a new language. For another, they introduce us to a lot of the fundamental data types. KNN, for example, is going to get us working with arrays, doing some arithmetic, finding a max value, etc. Naive Bayes, similarly, is a simple algorithm that is just complex enough to get our brain thinking about the new language.
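
As a rough idea of the size of the second project, here is a minimal sketch of Naive Bayes in Python. It assumes the Gaussian variant for numeric features, which is my choice for illustration; the function name `train_gaussian_nb` and the `eps` smoothing term are also assumptions, not part of any library.

```python
from collections import defaultdict
from math import log, pi
from statistics import mean, pvariance

def train_gaussian_nb(xs, ys, eps=1e-9):
    """Fit per-class priors, means and variances; return a classifier."""
    # Group the training points by their class label
    by_class = defaultdict(list)
    for x, y in zip(xs, ys):
        by_class[y].append(x)

    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))  # transpose rows into per-feature columns
        model[label] = (
            log(len(rows) / len(xs)),                   # log prior
            [mean(col) for col in columns],             # per-feature mean
            [pvariance(col) + eps for col in columns],  # variance, kept above zero
        )

    def log_posterior(z, label):
        # Log prior plus the sum of per-feature Gaussian log densities
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * log(2 * pi * var) - (value - mu) ** 2 / (2 * var)
            for value, mu, var in zip(z, means, variances)
        )
        return log_prior + log_likelihood

    def classify(zs):
        # Pick the label with the highest log posterior for each point
        return [max(model, key=lambda label: log_posterior(z, label)) for z in zs]

    return classify
```

Like KNN, this touches dictionaries, lists, arithmetic and a max over a collection, which is most of what you want from a first project in a new language.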

Second, I outline a simple pipeline. This involves data ingestion, data transformation, classification, and printing the output. This gets me used to writing a larger sequence in the new language. Where before you were writing isolated pieces, here you really start thinking about the hand-offs between sections of your code and the best way to structure your code.
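
To make the hand-offs concrete, here is one way such an outline might look in Python. Every function name here (`ingest`, `transform`, `train_nearest`, `report`, `run_pipeline`) is hypothetical, and the classifier is a deliberately tiny nearest-neighbor stand-in so the sketch runs on its own.

```python
import csv
import io

def ingest(source):
    """Read rows of (feature..., label) from a CSV-formatted text stream."""
    return [row for row in csv.reader(source) if row]

def transform(rows):
    """Split raw string rows into numeric feature tuples and labels."""
    xs = [tuple(float(v) for v in row[:-1]) for row in rows]
    ys = [row[-1] for row in rows]
    return xs, ys

def train_nearest(xs, ys):
    """Stand-in classifier: label of the single closest training point."""
    def classify(z):
        dists = (sum((a - b) ** 2 for a, b in zip(x, z)) for x in xs)
        return min(zip(dists, ys))[1]
    return classify

def report(zs, labels):
    """Format one 'point -> label' line per classified observation."""
    return [f"{z} -> {label}" for z, label in zip(zs, labels)]

def run_pipeline(source, zs):
    # Each stage hands its result to the next one
    xs, ys = transform(ingest(source))
    classify = train_nearest(xs, ys)
    return report(zs, [classify(z) for z in zs])

training_csv = io.StringIO("1,1,small\n2,2,small\n10,10,large\n15,15,large\n")
for line in run_pipeline(training_csv, [(1.5, 1.5), (12.0, 12.0)]):
    print(line)
```

Keeping each stage as its own function makes it easy to swap the stand-in classifier for KNN or Naive Bayes later.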

Third, I iterate through several data input and data output formats, and apply the classification algorithms I developed to the data. I'll typically start with CSV, because it's the simplest, and then move to JSON, then XML, and finally to scraping my own web data. This gets me familiar with how to handle external files, how to communicate with the system and with the web, and, usually, serves as an introduction to the package management system. I prefer to use the best available packages for CSV, XML, JSON and web scraping here, rather than writing my own versions (although sometimes I will do that for CSV data). It's an important skill as a developer to be able to find and integrate other people's software into your code and to understand the software ecosystem for a given language.
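
In Python, the first three formats can be tried without leaving the standard library. The sketch below reads the same two observations from CSV, JSON and XML and writes them back out as JSON. The sample data, the field names (`x`, `y`, `label`) and the helper names are all made up for illustration, and web scraping is left out because it needs a network connection.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same two observations in three formats
CSV_TEXT = "x,y,label\n1,1,a\n10,10,b\n"
JSON_TEXT = '[{"x": 1, "y": 1, "label": "a"}, {"x": 10, "y": 10, "label": "b"}]'
XML_TEXT = (
    "<rows>"
    "<row><x>1</x><y>1</y><label>a</label></row>"
    "<row><x>10</x><y>10</y><label>b</label></row>"
    "</rows>"
)

def to_record(x, y, label):
    """Normalize one observation to ((x, y), label) with float features."""
    return (float(x), float(y)), label

def from_csv(text):
    reader = csv.DictReader(io.StringIO(text))
    return [to_record(r["x"], r["y"], r["label"]) for r in reader]

def from_json(text):
    return [to_record(r["x"], r["y"], r["label"]) for r in json.loads(text)]

def from_xml(text):
    root = ET.fromstring(text)
    return [
        to_record(row.findtext("x"), row.findtext("y"), row.findtext("label"))
        for row in root.iter("row")
    ]

def to_json(records):
    """Write records back out as a JSON string."""
    return json.dumps([{"x": x, "y": y, "label": label} for (x, y), label in records])

records = from_csv(CSV_TEXT)
# All three readers should agree once the values are normalized
assert records == from_json(JSON_TEXT) == from_xml(XML_TEXT)
print(to_json(records))
```

Normalizing every format to the same record shape means the classifier never has to know where its data came from.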

By the time you're done with this process, you should feel pretty comfortable with the new language. You've cut your teeth on the fundamentals, worked in some external libraries, and built real classification pipelines.

So what languages should you learn?

The next obvious question, once you've convinced yourself to learn new languages and have some idea about how to do that, is which programming languages you should learn. For data science, there is a pretty clear hierarchy.

  1. Python
  2. Scala*
  3. R*
  4. Everything else

Almost all data scientists learn Python, so assuming you know that already, learn Scala and then R, or whichever of the two you do not already know. From there, it really depends. Frankly, you should learn whichever languages you are interested in. Interest matters, and most languages are theoretically motivated in some fashion, so there is usually an idea worth learning behind each one.

Some languages to consider, in no specific order:

C and C++ are important languages for working close to the hardware. There is not too much data science happening at this level, but learning these languages will give you a better understanding of the nuts and bolts of all code you write for the rest of eternity. Certainly worth the time, but it will take time.

Java is one of the most popular languages and it's worth learning simply for that reason. Scala also compiles to bytecode for the same virtual machine (the JVM) and can call Java libraries directly, so learning Java is going to help out your Scala projects. You'll also learn some of the philosophy behind the JVM: write once, run anywhere. It's a pretty powerful idea that I'm strongly persuaded by. Also: if you haven't done real object-oriented programming before, the philosophy of OOP is worth understanding.

Clojure, Scheme or Common Lisp are great because they are a totally different style of programming. They're all Lisp dialects, so they use a completely different syntax than you're probably used to. They also have some powerful ideas (e.g., macros) that are missing or much weaker in most other programming languages. Like C and C++, these fall into the "expand your mind" category because you most likely won't be coding in them while doing data science in practice.

Haskell, OCaml and Erlang are functional languages and, if you've not programmed in a functional way before, I would definitely take the time to learn one. Functional programming is a natural fit with data science, as I've expressed here before. Haskell is considered the "most pure", but you really can't go wrong with any of the three. Two other functional languages to consider would be Coconut and Eta (https://eta-lang.org/). Coconut is a functional language that compiles to Python code and Eta is a Haskell dialect that runs on the JVM. These would be for pure practice, since neither language is industry tested, but if you've fallen in love with Python or the JVM, they may offer a shorter learning curve.

Rust and Go are newer programming languages, but they're both making noise. Go was created at Google, Rust was sponsored by Mozilla. Both have their roots in C/C++ and are designed for concurrency. If you like to live on the cutting edge, these are certainly worth paying attention to.

Show me the code

Really don't want to code the algorithms from a blank page? That's fine. Here is a minimalist implementation of KNN in Python. Note that I make heavy use of the Python standard library in this. That is not by accident. An important part of knowing a programming language is knowing its standard library and the software ecosystem for that language.

KNN in Python

from heapq import nsmallest
from math import sqrt
from operator import itemgetter
from collections import Counter
from functools import partial

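def euc_dist(x,y):
  # Euclidean distance between two points of equal length
  return sqrt(sum([(x[i]-y[i])**2 for i in range(len(x))]))

def knn_classify(z,xs,ys,k,distance=euc_dist):
  # Fix the query point z so _dist only needs a training point
  _dist = partial(distance,y=z)
  # Pair each distance with its label, keep the k closest pairs,
  # and return the most common label among them
  return Counter([x[1] for x in
           nsmallest(k,zip(map(_dist,xs),ys),
                     key=itemgetter(0))]).most_common()[0][0]

def train_KNN(xs,ys,k):
  # Bind the training data and k, then return a function
  # that classifies a list of new points
  _classify = partial(knn_classify,xs=xs,ys=ys,k=k)
  def classify(zs):
    return list(map(_classify, zs))
  return classify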

I would typically test this first by initializing some data by hand and classifying it, before moving on to classifying data from a .csv file. If you're into test-driven development, this is a great chance to start getting used to the language's approach to testing.

X = [(1,1,1),(2,2,2),(10,10,10),(15,15,15)]
Y = [0,0,1,1]
Z = [(1.5,1.5,1.5),(12,12,12)]
knn = train_KNN(X,Y,k=3)
knn(Z)
# result: [0, 1]
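
If you want to practice the language's testing tools at the same time, the hand check above translates into a couple of unit tests. This is only a sketch using Python's built-in unittest module: `train_knn` is a condensed stand-in for the `train_KNN` function above so the block runs on its own, and the test names are my own.

```python
import unittest
from collections import Counter
from math import dist  # available in Python 3.8+

def train_knn(xs, ys, k):
    """Condensed stand-in for the train_KNN function above."""
    def classify_one(z):
        # Sort training pairs by distance to z and keep the k closest
        nearest = sorted(zip(xs, ys), key=lambda pair: dist(pair[0], z))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]
    return lambda zs: [classify_one(z) for z in zs]

class TestKNN(unittest.TestCase):
    def setUp(self):
        # Same hand-made data as the example above
        self.xs = [(1, 1, 1), (2, 2, 2), (10, 10, 10), (15, 15, 15)]
        self.ys = [0, 0, 1, 1]

    def test_classifies_points_near_each_cluster(self):
        knn = train_knn(self.xs, self.ys, k=3)
        self.assertEqual(knn([(1.5, 1.5, 1.5), (12, 12, 12)]), [0, 1])

    def test_k_equal_one_returns_label_of_training_point(self):
        # With k=1, every training point is its own nearest neighbor
        knn = train_knn(self.xs, self.ys, k=1)
        self.assertEqual(knn(self.xs), self.ys)
```

Saved to a file, these can be run with `python -m unittest` followed by the file name.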

Conclusion

Knowing programming languages and knowing how to learn programming languages are important skills for any developer. I suggest data scientists learn new languages by building a simple data science pipeline in the language they're interested in learning. Data scientists should focus on learning Python, Scala and R first (the core data science languages), but then they should branch out to develop a holistic understanding of programming. New-language learners should also focus on learning the standard library of the language they're interested in as well as the software ecosystem, including the associated package management systems.

More

  1. There's a list of programming languages by type on Wikipedia. If you're looking for new languages to learn, this is a good place to check out.
  2. GitHut visualizes the most popular languages on GitHub. Language popularity is certainly something you should keep in mind when thinking about languages to learn.
  3. Project Euler is a long list of math problems that require computational solutions. A good way to test your math and programming skills at the same time.
  4. ThinkPython is an intro to Python book, but it has exercises at the end of every chapter. You can try completing these in the language you're learning as extra practice.

Footnotes

*Some will argue that R is a better data science language than Scala; however, I'd counter that R is really a data analyst's language, not a data scientist's language. R has serious shortcomings in terms of performance, even more so than Python. Working with even hundreds of thousands of observations in R can become prohibitively time consuming, to say nothing of millions, billions or trillions. R is excellent when N is close to the sample sizes typical of classical statistics.



