Programming for data science is not wholly unlike other types of programming, but it is certainly more focused on one thing: data. Data science exists to manipulate data (especially large amounts of data), so it's important that the way we approach programming coincides with this goal. Here, I argue that functional programming is the paradigm best suited for data science, and that map and reduce specifically should be the entry point for novice programmers.
Functional programming is a programming paradigm that eschews side effects (state changes and mutable data) and treats the art of programming like mathematics. That is, the things we do with code should be mathematical functions. There is a philosophy behind this style that I find compelling (especially for big-data computing, distributed systems, etc.); however, none of that is necessary for a beginning functional programmer / functional data scientist.
For the beginner, what is important is that functional programming is based around writing functions that take data in and return other data out---just like in math. And this is handy because in data science we tend to do a lot of rudimentary math.
It is also important to know that in functional programming, functions are "first-class" data. That is, we can put functions anywhere we want. Why can we do this? Because a function will eventually evaluate to whatever its output data type is, we can write code as if the function were its output type. (Again, I'm trying not to go too deep here; some books are listed at the end for those interested in learning more functional ideas.)
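To make "functions as first-class data" concrete, here's a minimal Python sketch (the function names are illustrative, not from any library): a function can be stored in a variable and passed as an argument just like any other value.

```python
# Functions are values: they can be named, stored, and passed around.
def double(x):
    return 2 * x

# Stored in a variable, just like a number or a string...
transform = double

# ...and passed as an argument to another function.
def apply_twice(f, x):
    return f(f(x))

print(transform(5))            # 10
print(apply_twice(double, 5))  # 20
```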
map and reduce, then, become our way of taking simple functions and applying them to large amounts of data. And because map and reduce are themselves functions which take data in and return data back, they carry some perks for parallel computing which big-data data scientists should find even more compelling.
I'd argue that map should be one of the first ideas taught to young data scientists. Why? Because data science is very often about transforming data, and map is the quintessential transformer.
Imagine a regression problem: we have a new observation and we want to predict the target value based on our regression model. A simple way to do that could be:
```python
predicted_value = predict(new_observation)
```

```clojure
(def predicted-value (predict new-observation))
```
Here we're assuming that predict is our prediction function, new_observation is a collection of data in some desirable form, and predicted_value is going to hold our result.
What if we want to predict a whole bunch of new data, and not just one new observation? We're SOL: our function only works on one observation. Here's where map comes in handy. map can transform a whole bunch of data all at once.
```python
# Predict one
v = predict((1, .54, 10.13))

# Predict many using map
vs = map(predict, [(1, .54, 10.13), (2, .88, 9.74), (1, .91, 11.25)])
```
```clojure
; Predict one
(def v (predict [1 .54 10.13]))

; Predict many
(def vs (map predict [[1 .54 10.13] [2 .88 9.74] [1 .91 11.25]]))
```
map makes it trivial to take concepts we've built around a single value or input and apply them to a large number of inputs. Additionally, because map expects pure functions which take input and return output (not changing state along the way), we can run these operations in parallel very, very easily.
```python
# Predict many using map -- same as above
vs = map(predict, [(1, .54, 10.13), (2, .88, 9.74), (1, .91, 11.25)])

# Predict many using parallel map
import multiprocessing

with multiprocessing.Pool(4) as p:
    vs2 = p.map(predict, [(1, .54, 10.13), (2, .88, 9.74), (1, .91, 11.25)])
```
```clojure
; Predict many -- same as above
(def vs (map predict [[1 .54 10.13] [2 .88 9.74] [1 .91 11.25]]))

; Predict many using parallel map
(def vs2 (pmap predict [[1 .54 10.13] [2 .88 9.74] [1 .91 11.25]]))
```
Indeed, a lot of what we do in data science, especially when we're applying classification or regression algorithms, can be done in parallel. Having a parallel map handy for this can be a big time saver.
While map returns as many outputs as it received inputs, reduce returns one output no matter how many inputs it receives. For this reason, we can think of reduce as an aggregator, an accumulator, or a reducer (i.e., it reduces a list of values to a single value).
Most people are already familiar with the idea of reduce because they are familiar with things like max, min, and sum. Imagine a summation function: it takes an array of numbers and returns their sum. Most programming languages have this built in. This is a reduce operation.
NB: I'll use only Python for the examples here because in Clojure it's unnecessarily difficult to do this in a non-functional way.
```python
# Standard sum
s1 = sum(range(10))

# Reduce sum
from functools import reduce
from operator import add

s2 = reduce(add, range(10))

s1 == s2  # returns True
```
The reduce version here is a little clunkier than the built-in; however, as in the previous example using map, we have trivial access to parallelization without really changing our code (see the fold function from the toolz library for more).
Ultimately, the reason I argue reduce should be taught early on to young data scientists is this: teaching reduce starts data scientists thinking about their operations in terms of data types. map, we know, takes a sequence and returns a sequence. reduce, we know, takes a sequence and returns a single value. Having these two tricks up our sleeve allows us to do a whole host of interesting things.
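As a small illustration of that type-driven thinking (a sketch of my own, not one of the article's examples): since map takes a sequence to a sequence and reduce takes a sequence to a single value, the two compose naturally, here to compute a sum of squares.

```python
from functools import reduce
from operator import add

xs = [1, 2, 3, 4]

# map: sequence -> sequence
squares = map(lambda x: x * x, xs)  # 1, 4, 9, 16

# reduce: sequence -> single value
total = reduce(add, squares)        # 30
print(total)
```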
Further, as one introduces more complex data types, map and reduce (but especially reduce) gain more power. reduce combined with maps, dicts, or other associative data structures allows for easy implementations of "group by" and "count by" methods.
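Here's a hedged sketch of how "group by" and "count by" can fall out of reduce with a dict as the accumulator (the helper names group_by and count_by are mine, not standard library functions):

```python
from functools import reduce

def group_by(key_fn, seq):
    """Reduce a sequence into a dict of key -> list of items."""
    def step(acc, item):
        acc.setdefault(key_fn(item), []).append(item)
        return acc
    return reduce(step, seq, {})

def count_by(key_fn, seq):
    """Reduce a sequence into a dict of key -> count."""
    def step(acc, item):
        k = key_fn(item)
        acc[k] = acc.get(k, 0) + 1
        return acc
    return reduce(step, seq, {})

words = ["apple", "avocado", "banana", "blueberry", "cherry"]
print(group_by(lambda w: w[0], words))
# {'a': ['apple', 'avocado'], 'b': ['banana', 'blueberry'], 'c': ['cherry']}
print(count_by(lambda w: w[0], words))
# {'a': 2, 'b': 2, 'c': 1}
```

Both helpers are the same shape: a two-argument step function folded over the sequence, with an empty dict as the starting accumulator.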
And as we structure our code around core functions (map and reduce), our code becomes increasingly readable. We design functions for the singular case and apply them across the plural or sequential case (Clojure's core library works this way by default; Python's does not: compare max in Clojure to max in Python).
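To see the contrast in Python (a sketch; Clojure's max is variadic over individual arguments, while Python's max consumes an iterable): reduce lets us recover the sequential version from a function written for just two values.

```python
from functools import reduce

# A function designed for the singular (two-argument) case...
def bigger(a, b):
    return a if a >= b else b

# ...applied across the sequential case with reduce.
xs = [3, 1, 4, 1, 5]
print(reduce(bigger, xs))  # 5
print(max(xs))             # 5 -- Python's built-in handles iterables directly
```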
Anytime we can emphasize to young data scientists that it's important to think about the data they have, we should. Data science begins with data, and so should the programming paradigm data scientists use.
Books on Functional Programming or Functional Data Science

May 10, 2018