Naive Bayes in Clojure with Smile

Introduction

One of the things that keeps data scientists from using the same programming languages as traditional developers (e.g., C++ and Java) is the lack of data science tools on those platforms. Sure, they exist, but none offer the simplicity of scikit-learn. When you're doing data science, you want to focus on doing data science, not on fussing with your compiler. Today we're going to take a look at how easy it is to implement a Naive Bayes classifier in Smile, a fast, high-level machine learning library for the JVM.

What is Smile?

I believe that data scientists are giving up a lot by restricting themselves to interpreted or otherwise non-compiled languages. Specifically, I'm a big fan of doing data science with Clojure on the JVM. Haifeng Li (Github, blog) and the folks behind Smile agree.

Smile is a fast and comprehensive machine learning library for the JVM. It includes popular algorithms for classification (e.g., SVM, Naive Bayes, Gradient Boosting, Random Forests), regression (e.g., Ridge and LASSO, regression trees/forests), and clustering (K-Means, minimum entropy clustering, hierarchical clustering), as well as a host of tools for data preprocessing. It even has some basic NLP and data preparation tools baked in. What's more, it has simple APIs for Java and Scala.

And while there is no official Clojure support, I'm going to show you just how easy it is to call Smile from Clojure by demonstrating a quick Naive Bayes analysis.

Getting started

The first thing we'll want to do is add the Smile dependency to our project dependencies. Smile is on Maven, so this is as easy as looking up the latest version and adding it to our project.clj file. If you plan to use Smile beyond this tutorial, you may also want to check out smile-nlp and smile-netlib.
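
For reference, the entry looks something like the following. The exact version here is illustrative; check Maven Central for the current release.

;; in project.clj, under :dependencies
;; (version shown is illustrative; check Maven Central for the latest)
[com.github.haifengl/smile-core "1.5.3"]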

Once we've got Smile added to our project, we can import it into our Clojure source like any other Java library using Clojure's Java interop features.

(:import (smile.classification NaiveBayes$Trainer
                               NaiveBayes$Model))

Note here how we're using the $ to import the nested classes Trainer and Model. Definitions of the NaiveBayes class---and the nested Trainer and Model classes---can be found over at Smile's Javadocs. For now, it's enough to know that we'll use the Model nested class to define what type of Naive Bayes model we want (e.g., Bernoulli or multinomial) and we'll use the Trainer nested class to train our model.
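
Once imported, the model types are available as static enum fields. The two we care about in this tutorial are BERNOULLI and MULTINOMIAL (the enum has other entries as well; see the Javadocs):

;; the two event models discussed in this tutorial
NaiveBayes$Model/BERNOULLI    ;; binary (0/1) features
NaiveBayes$Model/MULTINOMIAL  ;; count-valued features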

Training

Toy Data

Now that we've got the classes imported, the next thing we'll need to do is set up some toy data.

;; training features
(def X (into-array (map double-array [[0 0 1]
                                      [0 1 1]
                                      [1 1 0]])))
;; training labels
(def y (int-array [0 0 1]))
;; testing data
(def X2 (into-array [(double-array [1 0 0])]))
(def X3 (into-array [(double-array [1 1 1])]))
(def X4 (into-array [(double-array [0 0 1])]))

Here I set up two variables to hold my training data, X and y. X is holding the "features" and y is holding the "labels". X2, X3, and X4 are holding test observations that I'll classify later.

Pay attention to how I use double-array and into-array above. Because Smile is built in Java, it expects Java data types, so we need to tell Clojure to convert the Clojure data types---persistent vectors of doubles---into something Java recognizes as double[][]. Similarly, Smile expects the training labels as an int[], so we need to coerce those as well.

If you expect to be doing this often, I'd invest in some little wrapper functions to make the conversion invisible, or even a slightly bigger wrapper to handle all of the training. A minimal sketch of the former follows.
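
Here's what those conversion helpers might look like. The names ->features and ->labels are my own, hypothetical choices:

(defn ->features
  "Coerce a vector of feature vectors into a Java double[][]."
  [rows]
  (into-array (map double-array rows)))

(defn ->labels
  "Coerce a vector of integer labels into a Java int[]."
  [labels]
  (int-array labels))

;; usage: equivalent to the defs above
(def X (->features [[0 0 1] [0 1 1] [1 1 0]]))
(def y (->labels [0 0 1]))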

Training the model

With the data set up, we're ready to train the model.

(def model
  (let [trainer (NaiveBayes$Trainer. NaiveBayes$Model/BERNOULLI 2 3)]
    (.train trainer X y)))

The first thing I do here is instantiate the Trainer nested class. You'll note that I do that using the Java interop syntax for instantiating a class, i.e., (NaiveBayes$Trainer.). I then pass along the three parameters: the model type---which in our case is Bernoulli---along with the number of possible labels (2) and the number of features (3). Note that the model type is passed in as a NaiveBayes$Model field.

To finish off the training, I just call the .train method of the Trainer. I pass my training features and labels into the train method, just as one would expect, and a trained model is returned.

I use a let statement for this so I don't have my Trainer floating around once the model is trained. Naive Bayes classifiers are considered online learners, and Smile allows us to update the final model after the initial training, so the Trainer is obsolete at this point. A sketch of such an update follows.
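
Since the model is an online learner, we can keep training it as new labeled data arrives. Here's a minimal sketch, assuming Smile 1.x's online-learning API, where NaiveBayes exposes a .learn method that updates the model in place:

;; hedged sketch: update the trained model with one new labeled
;; observation (assumes Smile 1.x's .learn online-update method)
(.learn model (double-array [1 0 1]) (int 1))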

Classifying new data

It took three lines of code to train the model; classifying new data takes even less.

(doseq [x [X2 X3 X4]]
  (println (seq (.predict model x))))

If you've been following along, you should be able to run this and get the following output: (1) (0) (0)

These numbers correspond to the predicted labels of the test observations.
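
If you want class probabilities rather than just labels, here's a sketch assuming Smile 1.x's SoftClassifier API, where a two-argument predict fills a provided array with the posteriors:

;; hedged sketch: assumes the two-argument predict of Smile 1.x's
;; SoftClassifier, which fills posteriori with class probabilities
(let [posteriori (double-array 2)
      label (.predict model (double-array [1 0 0]) posteriori)]
  (println label (seq posteriori)))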

We can also classify several observations at once, like so.

(println (seq (.predict model X)))

Here we're classifying the original training data, and we get back (0 0 1). Conveniently, these were our original training labels as well.

It is worth noting here how in both cases I used seq to convert the returned object (an array of integers) to a Clojure-native data type.

Closing comments

In this tutorial, we performed a quick and dirty classification with a Naive Bayes model in Clojure by way of the Smile library. The training step took three lines of code and the prediction step took one or two. This goes to show that Clojure can be competitive with Python when easy access to machine learning models is necessary.

Clojure makes Java interop easy, and that's awesome because it allows Clojure developers and data scientists to take advantage of libraries like Smile: a fast, high-level, and comprehensive machine learning library. I strongly recommend checking it out if you're interested in data science on the JVM.

The ability to pull models off the shelf and train them has been an advantage for Python over languages that are (probably) better suited for data science, e.g., Clojure. Libraries like Smile, along with the growth of Scala's big-data ecosystem, may see data science moving toward the JVM in the near future.

More

The full code example for this tutorial is over at the WonderfulCoyote repo with the rest of my tutorials.

If you're familiar with the Python-sklearn-Jupyter workflow, head over to gorilla-repl.org and check out Gorilla, the notebook-style REPL for Clojure. Combining it with Smile, you can easily replace the Python stack with a Clojure stack.


Mastering Large Datasets

My new book, Mastering Large Datasets, is in early release now. Head over to Manning.com and buy a copy today.
