Map and Reduce in Data Science Programming

Introduction

Programming for data science is not wholly unlike other types of programming, but it is certainly more focused on one thing: data. Data science exists only to manipulate data (especially large amounts of data) -- so it's important that the way we approach programming coincides with this goal. Here, I argue that functional programming is the paradigm best suited for data science and that map / reduce specifically should be the entry point for novice programmers.

Functional programming

Functional programming is a programming paradigm that eschews side-effects---state-changes and mutable data---and imagines the art of programming like mathematics. That is, the things we do with code should be mathematical functions. There is a philosophy behind this style that I find compelling (especially for big-data computing, distributed systems, etc.); however, none of that is necessary for a beginning functional programmer / functional data scientist.

For the beginner, what is important is that functional programming is based around writing functions that take data in and return other data out---just like in math. And this is handy because in data science we tend to do a lot of rudimentary math.

It is also important to know that in functional programming functions are "first class" data. That is, we can put functions anywhere we want. Why can we do this? Because the function will eventually evaluate into whatever the output data type is, we can write as if the function is its output type. (Again, I'm trying not to go too deep here---some books are listed at the end for those interested in learning more functional ideas.)
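To make "functions as data" concrete, here is a small Python sketch (all the names here are invented for illustration):

```python
def double(x):
    return 2 * x

# Functions can be stored in variables...
f = double

# ...kept inside data structures...
transforms = {"double": double, "negate": lambda x: -x}

# ...and passed as arguments to other functions.
def apply_twice(fn, x):
    return fn(fn(x))

print(f(5))                     # 10
print(transforms["negate"](5))  # -5
print(apply_twice(double, 5))   # 20
```

Everywhere a value of the output type could go, a function producing that type can go too.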

map and reduce, then, become our way of taking simple functions and applying them to large amounts of data. And because map and reduce are themselves functions that take data in and return data back, they bring some perks for parallel computing, which big-data data scientists should find even more compelling.

map

I'd argue that map should be one of the first ideas taught to young data scientists. Why? Because data science is very often about transforming data, and map is the quintessential transformer.

Imagine a regression problem: we have a new observation and we want to predict the target value based on our regression model. A simple way to do that could be:

In Python

predicted_value = predict(new_observation)

In Clojure

(def predicted-value (predict new-observation))

Here we're assuming that predict is our prediction function, new_observation is a collection of data in some desirable form, and predicted_value is going to hold our result.

What if we want to predict a whole bunch of new data, and not just one new observation? We're SOL---our function only works on one observation... Here's where map comes in handy. map can transform a whole bunch of data all at once.

In Python

# Predict one
v = predict((1, .54, 10.13))

# Predict many using map
# (In Python 3, map returns a lazy iterator; wrap it in list() to materialize the results)
vs = map(predict, [(1, .54, 10.13),
                   (2, .88, 9.74),
                   (1, .91, 11.25)])

In Clojure

;Predict one
(def v (predict [1 .54 10.13]))

;Predict many
(def vs (map predict
             [[1 .54 10.13] [2 .88 9.74] [1 .91 11.25]]))

map makes it trivial to take concepts we've built around a single value or input and apply them to a large number of inputs. Additionally, because map expects pure functions which take input and return output (not changing state along the way), we can run these operations in parallel very, very easily.

In Python

# Predict many using map
# Same as above
vs = map(predict, [(1, .54, 10.13),
                   (2, .88, 9.74),
                   (1, .91, 11.25)])

# Predict many using parallel map
import multiprocessing
with multiprocessing.Pool(4) as p:
    vs2 = p.map(predict, [(1, .54, 10.13),
                          (2, .88, 9.74),
                          (1, .91, 11.25)])

In Clojure

;Predict many -- same as above
(def vs (map predict
             [[1 .54 10.13] [2 .88 9.74] [1 .91 11.25]]))

;Predict many using parallel map
(def vs2 (pmap predict
               [[1 .54 10.13] [2 .88 9.74] [1 .91 11.25]]))

Indeed, a lot of what we do in data science, especially when we're applying classification or regression algorithms, can be done in parallel. Having a parallel map handy for this can be a big time saver.

reduce (or fold)

While map returns as many outputs as it received inputs, reduce returns one output no matter how many inputs it receives. For this reason, we can think of reduce as an aggregator, an accumulator, or a reducer (i.e., it reduces a list of values to a single value).

Most people are already familiar with the idea of reduce because they are familiar with things like max, min, and sum. Imagine a summation function: it takes an array of numbers and returns all of them added together. Most programming languages will have this built in. This is a reduce operation.

NB: I'll use only Python for the examples here because with Clojure, it's unnecessarily difficult to do this in a non-functional way.

# Standard sum
s1 = sum(range(10))

# Reduce sum
from functools import reduce
from operator import add
s2 = reduce(add, range(10))

s1 == s2  # returns True

The reduce operation here is a little clunkier than the built-in; however, like in the previous example using map, we have trivial access to parallelization without really changing our code (see the fold function from the toolz library for more).
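The idea behind such a parallel fold is to reduce independent chunks of the data and then combine the partial results. Here is a stdlib-only sketch of that pattern (the helper names and chunking scheme are my own, not toolz's API):

```python
from functools import reduce
from operator import add

def reduce_chunk(chunk):
    # Reduce one chunk of the data to a partial sum.
    return reduce(add, chunk, 0)

def chunked_sum(xs, n_chunks=4, map_fn=map):
    """Sum xs by reducing chunks independently, then combining.

    Because each chunk is reduced on its own, map_fn can be swapped
    for a parallel map (e.g. multiprocessing.Pool.map) without
    changing the rest of the code. This relies on + being associative.
    """
    size = max(1, len(xs) // n_chunks)
    chunks = [xs[i:i + size] for i in range(0, len(xs), size)]
    partials = map_fn(reduce_chunk, chunks)
    return reduce(add, partials, 0)

print(chunked_sum(list(range(10))))  # 45, same as sum(range(10))
```

The only requirement is that the combining operation be associative, so that it doesn't matter how the data gets split into chunks.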

"Worldview" of functional data science

Ultimately, the reason that I argue map and reduce should be taught early on to young data scientists is this: teaching map and reduce starts data scientists thinking about their operations in terms of data types. map, we know, takes a sequence and returns a sequence. reduce, we know, takes a sequence and returns a single value. Having these two tricks up our sleeve allows us to do a whole host of interesting things.

Further, as one introduces more complex data types, map and reduce (but especially reduce) gain more power. reduce combined with maps, dicts, or other associative data structures allows for easy implementations of "group by" and "count by" operations.
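For instance, a "count by" falls out of reduce by using a dict as the accumulator (the helper name count_by is my own):

```python
from functools import reduce

def count_by(key_fn, xs):
    # The accumulator is a dict mapping key -> count;
    # each step bumps the count for the current item's key.
    def step(acc, x):
        k = key_fn(x)
        acc[k] = acc.get(k, 0) + 1
        return acc
    return reduce(step, xs, {})

words = ["apple", "avocado", "banana", "cherry"]
print(count_by(lambda w: w[0], words))  # {'a': 2, 'b': 1, 'c': 1}
```

A "group by" is the same pattern with lists of items as the dict's values instead of counts.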

And as we structure our code around core functions (map and reduce), our code becomes increasingly readable. We design functions for the singular case and apply them across the plural or sequential case (Clojure's core library works this way by default; Python's does not: compare max in Clojure to max in Python).
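To illustrate the singular-to-plural pattern in Python: we write a function that compares exactly two values, and reduce extends it to any sequence (the name larger is my own):

```python
from functools import reduce

def larger(a, b):
    # The "singular" case: pick the larger of exactly two values.
    return a if a >= b else b

xs = [3, 41, 5, 9, 26]
print(reduce(larger, xs))  # 41, same as max(xs)
```

The logic lives entirely in the two-argument function; reduce supplies the sequential behavior for free.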

Anytime we can get young data scientists to think about the data they have, we should. Data science begins with data, and so should the programming paradigm data scientists use.

More

Books on Functional Programming or Functional Data Science

Mastering Large Datasets

My new book, Mastering Large Datasets, is in early release now. Head over to Manning.com and buy a copy today.