Loading Data for Data Science in Clojure

Introduction

This post is a tutorial that covers working with three popular data formats in Clojure: CSV, JSON, and XML. CSV is the ubiquitous Comma-Separated Values format, where observations are placed on separate lines and the variables on each line are separated by commas; JSON is JavaScript Object Notation, a popular and simple data-interchange format; and XML (eXtensible Markup Language) is a tree-based document markup language, increasingly the data format of choice for the life, health, and biological sciences.

The goal of this tutorial is to introduce you to three libraries for working with CSV, JSON, and XML in Clojure. Loading data is the first step in any data cleaning or analysis effort. Once finished with this tutorial, you should be set to start data analysis of your own.

CSV

CSV is one of the most popular data formats but, at the same time, one of the most difficult to work with. Because CSV does not have a universally followed "standard" the way JSON and XML do, we can come across several files, all with .csv extensions, but all with different de facto formats. For example, one CSV file could have quotes around all the fields, another may quote only text, and a third may not use quotes at all. Wikipedia does a good job of pointing out several possible variations of CSV files.
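To make the quoting problem concrete, here is a minimal sketch of what goes wrong when a record is split on every comma. The sample line is hypothetical and is not taken from any dataset in this post.

```clojure
(require '[clojure.string :as str])

;; Hypothetical one-line CSV record: three fields, the second one quoted
;; because it contains a comma.
(def quoted-line "1,\"Smith, John\",3")

;; Naive approach: split on every comma and ignore the quotes.
(def naive-fields (str/split quoted-line #","))

(count naive-fields) ; => 4, although the record only has 3 fields
```

The quoted field is cut in two, which is exactly the kind of case a dedicated CSV library has to handle for us.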

Luckily for us, Clojure has ultra-csv, a library for reading CSV data that handles these discrepancies for us. ultra-csv lists among its key features:

  • heuristics to reason about delimiters, quotes, and headers
  • handling of various encodings and quote escapes
  • numerical data recognition and advanced type coercion
  • line-by-line read option for big data

For this tutorial, we're going to focus on the basics: reading data from a CSV file. For this example, we'll read in the Iris dataset---which I suspect that many are familiar with---and demonstrate some ways to work with that data in Clojure.

Using ultra-csv

To work with ultra-csv in our Clojure code we have to require it. Here, I'm going to require it as csv for easy reference.

(:require [ultra-csv.core :as csv])

From this point on, we'll have all of ultra-csv's functions loaded in under csv.

Loading CSV data

Next we need to load the data from the Iris file. The file is located here: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

To load the file, we can use ultra-csv's read-csv function like this:

(def iris-data (csv/read-csv "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"))

In future code, we can now call iris-data to work with our data.

Working with CSV data

We can use first to take a look at our data.

(first iris-data)
=> [5.1 3.5 1.4 0.2 "Iris-setosa"]

As we can see, each of our rows of data is now a native Clojure vector. Additionally, all of the variables were correctly coerced into the ideal data type. The first four variables are doubles. The last variable is a string.

Additionally, our observations (each a vector) sit inside a native Clojure sequence, so we can use our standard sequence-processing techniques on them. For example, if we want the class name of every observation where sepal width (the second variable) is greater than 3.5, we can get it like so:

(map last (filter #(> (get % 1) 3.5) iris-data))

Here, we're using map and filter on our data, just like we would on any standard Clojure sequence.
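The same sequence functions cover simple summaries too. The sketch below uses a few hand-typed rows in the same shape as (first iris-data) so that it runs without downloading anything; sample-rows and mean are names made up for this illustration, and in a real session you would pass iris-data instead.

```clojure
;; Hand-typed sample rows shaped like the rows of iris-data (an assumption
;; for illustration; swap in iris-data in a real session).
(def sample-rows
  [[5.1 3.5 1.4 0.2 "Iris-setosa"]
   [5.4 3.9 1.7 0.4 "Iris-setosa"]
   [7.0 3.2 4.7 1.4 "Iris-versicolor"]
   [6.3 3.3 6.0 2.5 "Iris-virginica"]])

;; Same filter as above: classes of rows with sepal width > 3.5.
(def wide-sepal-classes
  (map last (filter #(> (get % 1) 3.5) sample-rows)))
;; => ("Iris-setosa")

;; Count observations per class (the last column).
(def class-counts (frequencies (map last sample-rows)))
;; => {"Iris-setosa" 2, "Iris-versicolor" 1, "Iris-virginica" 1}

;; Hypothetical helper: arithmetic mean of a non-empty collection.
(defn mean [xs] (/ (reduce + xs) (count xs)))

;; Average sepal length (the first column) for each class.
(def mean-sepal-length
  (into {}
        (for [[species rows] (group-by last sample-rows)]
          [species (mean (map first rows))])))
```

group-by and frequencies return ordinary maps, so the results can be fed straight into further map, filter, or reduce calls.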

JSON

Now let's turn our attention to JSON. JSON is an important format for data scientists to be familiar with, especially those who plan to work with web data, social data, or "big" data. JSON is a popular format for browsers and servers to exchange data because browsers are natively fluent in JavaScript. Additionally, JSON is the inspiration for MongoDB's BSON (Binary JSON) format, and MongoDB is a big-data-oriented NoSQL database used by some of the world's largest companies, including telecom giant Comcast. This makes JSON a popular format for big data work.

To work with JSON in Clojure, I recommend using clj-json. clj-json is a no-nonsense encoding and decoding library for JSON. It takes JSON in as a string and turns it into the appropriate Clojure datatypes.

To use it, I recommend requiring it :as json for easy access.

(:require [clj-json.core :as json])

Loading JSON data

Once we have clj-json ready, all we need to do is read in our file(s) as a string and then use the parse-string function from clj-json to turn them into Clojure objects. We can do that like so:

(def papers-data (json/parse-string (slurp "http://archive.ics.uci.edu/ml/machine-learning-databases/00410/reviews.json")))

Here, we've read the "paper reviews" JSON data, hosted by the UCI Machine Learning Repository. The dataset contains metadata about papers and reviews of those papers. slurp reads the JSON file in as a string and then json/parse-string parses the string into Clojure objects.

We can take a look at the first paper in the dataset to get a sense of what our data looks like.

;; Show first paper
(-> (first papers-data)
    (get 1)  ;; Get the papers
    (get 0)) ;; Get the first paper

Notice we're treating papers-data like any other Clojure collection. That's because it is one: parse-string has already turned the JSON into plain Clojure data for us.

And the output will be something like this:

{"id" 1, "preliminary_decision" "accept", "review" [{"confidence" "4", "evaluation" "1", "id" 1, "lan" "es", "orientation" "0", "remarks" "", "text" "- El artículo aborda un problema contingente y muy relevante, e incluye tanto un diagnóstico nacional de uso de buenas prácticas como una solución (buenas prácticas concretas). - El lenguaje es adecuado.  - El artículo se siente como la concatenación de tres artículos diferentes: (1) resultados de una encuesta, (2) buenas prácticas de seguridad, (3) incorporación de buenas prácticas. - El orden de las secciones sería mejor si refleja este orden (la versión revisada es #2, #1, #3). - El artículo no tiene validación de ningún tipo, ni siquiera por evaluación de expertos.", "timespan" "2010-07-05"} {"confidence" "4", "evaluation" "1", "id" 2, "lan" "es", "orientation" "1", "remarks" "", "text" "El artículo presenta recomendaciones prácticas para el desarrollo de software seguro. Se describen las mejores prácticas recomendadas para desarrollar software que sea proactivo ante los ataques, y se realiza un análisis de costos de estas prácticas en desarrollo de software. Todo basado en una revisión de prácticas propuestas en la bibliografía y su contraste con datos obtenidos de una encuesta en empresas. Finalmente se recomienda una guía.  Sería ideal aplicar la guía propuesta a empresas no involucradas en la encuesta que sirvió para originarla de modo de poder evaluar su efectividad en forma independiente.", "timespan" "2010-07-05"} {"confidence" "5", "evaluation" "1", "id" 3, "lan" "es", "orientation" "1", "remarks" "", "text" "- El tema es muy interesante y puede ser de mucha ayuda una guía para incorporar prácticas de seguridad. - La presentación (descripción, etapa y uso) de las 9 prácticas para el desarrollo de software seguro.  - El “estado real del desarrollo de software en Chile” (como lo indica en su paper) no se puede lograr con solamente 22 encuestas de un total de 50. 
- Presenta nueve tablas que corresponden a las prácticas para el desarrollo de software seguro, pero la guía presenta 10 prácticas. ¿explica por qué? - Sugiero mejorar la guía, el mayor aporte está en la secuencia de incorporación que propone.  Además, no debería explicar la práctica en Observaciones ni diferenciarla con otras prácticas en esa columna, sino que debería dar sugerencias de cómo aplicarla. - En el texto indica “Más adelante, se presentan además tres prácticas extras…” ¿cuáles son o no leí correctamente? - De acuerdo a formato, poner como mínimo 5 palabras clave. - Sugiero mencionar las prácticas antes de mostrar cada tabla. - Algunas referencias están incompletas, por ejemplo, falta año en referencia 17, falta año y tipo de evento en referencia 11, falta editorial en referencia 19 (¿es un libro?) - Algunos títulos llevan una coma dentro de las comillas, ejemplo, referencia 1", "timespan" "2010-07-05"}]}

And, notice again, our output is a Clojure object, not a JSON object, just like we want. Our data is now Clojure all the way through.
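Since the result is plain Clojure data, nested lookups can also be written with get-in. The sketch below works on a hand-written stand-in for one parsed paper, trimmed to a few of the keys visible in the output above; sample-paper and confidences are names invented for this illustration.

```clojure
;; Hand-written stand-in for one parsed paper (review text omitted).
;; The keys and values are copied from the output shown above.
(def sample-paper
  {"id" 1
   "preliminary_decision" "accept"
   "review" [{"id" 1 "confidence" "4" "evaluation" "1" "lan" "es"}
             {"id" 2 "confidence" "4" "evaluation" "1" "lan" "es"}
             {"id" 3 "confidence" "5" "evaluation" "1" "lan" "es"}]})

;; get-in walks a path of map keys and vector indexes in one call.
(get-in sample-paper ["review" 0 "confidence"]) ; => "4"

;; The scores arrive as strings, so parse them before doing arithmetic.
(def confidences
  (mapv (fn [review] (Long/parseLong (str (get review "confidence"))))
        (get sample-paper "review")))
;; => [4 4 5]
```

Note that the numeric-looking fields in this dataset are strings, so any analysis needs an explicit parsing step like the one above.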

Before we leave JSON behind, let's do one last thing. Let's count how many papers were accepted, rejected, or otherwise. To do that, we can simply reduce the papers to the sums of their preliminary_decision counts like so:

(reduce #(update %1 (get %2 "preliminary_decision") (fnil inc 0))
        {}
        (get (first papers-data) 1))

=> {"accept" 115, "probably reject" 7,
     "reject" 48, "no decision" 2}

Here, we're updating a map of decisions and counts for every decision we encounter. We get that decision using (get %2 "preliminary_decision"). For example, every time we encounter an "accept", we'll increment the count stored under "accept". (fnil inc 0) creates a version of inc that treats a missing count (nil) as 0: if the decision is already in the map, its count is incremented; otherwise the count starts at 0 and is incremented to 1.
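To see fnil on its own, here is a small sketch that can be pasted into a REPL. The three sample papers are hand-written stand-ins that carry only the key we count on, and inc-or-start is a name made up for this example.

```clojure
;; (fnil inc 0) behaves like inc, but treats a nil argument as 0.
(def inc-or-start (fnil inc 0))

(inc-or-start nil) ; => 1
(inc-or-start 4)   ; => 5

;; Hand-written stand-ins for parsed papers.
(def sample-papers
  [{"preliminary_decision" "accept"}
   {"preliminary_decision" "reject"}
   {"preliminary_decision" "accept"}])

;; Same reduce as above, written with named arguments.
(def decision-counts
  (reduce (fn [counts paper]
            (update counts (get paper "preliminary_decision") inc-or-start))
          {}
          sample-papers))
;; => {"accept" 2, "reject" 1}

;; clojure.core/frequencies produces the same tally.
(= decision-counts
   (frequencies (map #(get % "preliminary_decision") sample-papers)))
;; => true
```

The reduce version is worth knowing because the same pattern extends to sums or other accumulations, while frequencies is the shorter choice when all you need is a count.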

XML

Lastly, that brings us to XML. If you're spending a lot of time working with XML, I'm sorry. It is a technology that made big promises. XML promised to be the foundation of the semantic web, where all data could be linked together, across domains. And while there are efforts underway to do this in the life sciences (and indeed, in some subdomains it has been relatively successful), for the most part, working with XML just means working with nasty, idiosyncratic trees of data.

The best way to work with XML data is through XPath, a regex-like system for expressing the parts of an XML tree that you would like to retrieve. Luckily for us, there's a convenient library for using XPath through Clojure, aptly named clj-xpath. Additionally, this library, like those we've already looked at, allows us to read an XML string into Clojure.

Again, the first thing that we'll want to do is require the library. I like to require it :as xpath like so:

(:require [clj-xpath.core :as xpath])

From there, again, we'll slurp in our data and convert it to an XML document with clj-xpath's xml->doc function.

(def build-xml (xpath/xml->doc (slurp "https://raw.githubusercontent.com/clojure/clojure/master/build.xml")))

Here, we're using the Clojure master build file as our data source. Hopefully you'd use something more interesting for your own XML adventures.

Now that we've got our file all loaded up, we can use XPath to search it. For example, if we wanted the text of all the name attributes of all the third-level elements, we could get them like so:

(xpath/$x:text* "./*/*/@name" build-xml)

xpath/$x:text* says "get all the text from the following XPath and the following XML document". We could also use $x:tag* if we wanted all the tags. But that wouldn't be very interesting---they would all just be name.

Our output should look like this:

("src" "test" "jsrc" "jtestsrc" "cljsrc" "cljscript" "test-script" "test-generative-script" "compile-script" "target" "build" "test-classes" "dist" "clojure.version.label" "version.properties" "clojure_jar" "clojure_noversion_jar" "directlinking" "init" "compile-java" "compile-clojure" "compile-tests" "test-example" "test-generative" "test" "build" "jar" "javadoc" "all" "clean" "local")

Feel free to compare this to the Clojure build file for validation.
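If you want to try the same idea without any extra dependency, the sketch below swaps in clojure.xml, which ships with Clojure, in place of clj-xpath. It is not the clj-xpath API: it parses a hand-written stand-in for a small build file (not the real build.xml) and collects the name attributes of the root element's children by walking the parsed tree.

```clojure
(require '[clojure.xml :as xml])

;; Hand-written stand-in for a small Ant build file.
(def sample-xml
  "<project name=\"demo\" default=\"all\">
     <property name=\"src\" location=\"src\"/>
     <property name=\"build\" location=\"classes\"/>
     <target name=\"clean\"><delete dir=\"classes\"/></target>
   </project>")

;; clojure.xml/parse wants a stream (or a file/URI), not the XML text itself.
(def root
  (xml/parse (java.io.ByteArrayInputStream. (.getBytes sample-xml "UTF-8"))))

;; Each element is a map with :tag, :attrs, and :content.
;; Take the root's child elements and keep their name attributes.
(def child-names
  (keep #(get-in % [:attrs :name])
        (filter map? (:content root))))
;; => ("src" "build" "clean")
```

Walking maps by hand works for shallow queries like this one; for anything deeper, an XPath expression stays far more compact.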

Summary

In this tutorial, we went over three libraries:

  1. ultra-csv
  2. clj-json
  3. clj-xpath

All of these libraries share a common core function: they take a string representation of data and convert it into useful Clojure objects of that same data. As data scientists, this is really useful for us because we often have to work with data in a lot of disparate formats. Getting data out of their native format and into a format where we can manipulate them is a task we really don't want to be spending a lot of time on.

More

As always, the full code example for this tutorial is over at the WonderfulCoyote repo with the rest of my tutorials. The Leiningen project file for this tutorial is in a comment at the head.

If you plan on using a lot of XML, I strongly recommend picking up a book on XPath.

Once you've loaded your data, check out my tutorial on Smile in Clojure to run some machine learning algorithms on it!



