Stanford CoreNLP in Clojure

Introduction

StanfordCoreNLP is widely regarded as a gold standard in language processing, and it is a good enough reason for anyone serious about natural language processing, computational linguistics, or text mining to consider a JVM language. This tutorial shows how to get a minimal StanfordCoreNLP parse function up and running in Clojure, the functional Lisp for the JVM.

Setup

Because StanfordCoreNLP is not a single .jar file but a whole set of them, I recommend using lein-resource-expand for this project. It lets you point to the directory in which all the StanfordCoreNLP .jar files are located and use glob notation (e.g., mypath/CoreNLP/*.jar) to select them all. Otherwise, the first step is to add the StanfordCoreNLP files to your resource path.

In Leiningen, that looks something like this:

 ;; with lein-resource-expand, a glob such as "mypath/CoreNLP/*.jar" can be used here instead
 :resource-paths ["mypath/myjars/stanford-corenlp-full-2018"]
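
If you'd rather not manage local .jar files at all, CoreNLP is also published on Maven Central, so a dependency-based setup might look like the sketch below. This is an alternative to the resource-path approach above, not the setup this tutorial was written against. The project name and the version numbers are assumptions, so swap in whichever release you actually use.

```clojure
;; project.clj (sketch; project name and versions are assumptions)
(defproject corenlp-demo "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.10.0"]
                 ;; the CoreNLP code itself
                 [edu.stanford.nlp/stanford-corenlp "3.9.2"]
                 ;; the trained models ship as a separate "models" classifier jar
                 [edu.stanford.nlp/stanford-corenlp "3.9.2" :classifier "models"]
                 ;; a JSON library such as Cheshire, for parsing CoreNLP's JSON output
                 [cheshire "5.8.1"]])
```

With the models jar on the classpath, the annotators should be able to find their default model files without any extra path configuration.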

First steps

Once you have the project set up and the .jars on your resource path, it's time to get CoreNLP up and running.

The CoreNLP approach is to set up analysis pipelines, and then apply those pipelines to annotations. So our job is to get those Java classes into Clojure.

To do that, we just need to instantiate the StanfordCoreNLP class with some properties telling it which annotators to apply.

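;; Properties object that tells CoreNLP which annotators to run
(def myProps (java.util.Properties.))
(.setProperty myProps "annotators" "tokenize, ssplit, pos")
;; Building the pipeline loads the models, so do it once and reuse it
(def nlPipe
  (edu.stanford.nlp.pipeline.StanfordCoreNLP. myProps))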

Here, we've created a properties object, myProps, set it to handle tokenization, sentence splitting, and part-of-speech tagging (a full list of annotators is available in the CoreNLP documentation), and used it to instantiate nlPipe, our NLP pipeline.
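
If you end up setting more than one property, mutating a Properties object line by line gets tedious. One possible convenience is a small helper that builds the Properties object from a Clojure map. The name make-props is hypothetical, something introduced here for illustration, not part of CoreNLP.

```clojure
;; Hypothetical helper (not part of CoreNLP): build a java.util.Properties
;; object from a Clojure map of keyword or string keys to values.
(defn make-props [m]
  (let [props (java.util.Properties.)]
    (doseq [[k v] m]
      ;; name turns :annotators into "annotators"; str coerces the value
      (.setProperty props (name k) (str v)))
    props))

(def props (make-props {:annotators "tokenize, ssplit, pos"}))

(.getProperty props "annotators")
;; => "tokenize, ssplit, pos"
```

The result can be passed to the StanfordCoreNLP constructor in exactly the same way as myProps.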

Then, to analyze text, we just create an annotation object and annotate it with our pipeline, nlPipe.

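;; Wrap the raw text in an Annotation object
(def sentences
  (edu.stanford.nlp.pipeline.Annotation.
    "I'll meet you by the Sample Gates. Do you know where that is?"))

;; Run the annotators over it in place, then print the result as JSON
(.annotate nlPipe sentences)
(.jsonPrint nlPipe sentences *out*)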

If you've gotten this far, you should see a bunch of JSON printed in your terminal. We've told the pipeline to annotate our annotation object and print the results as JSON. If you don't want your data in JSON, check the documentation for other ways to return it.
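
As one example of skipping JSON entirely, the annotated object can be read directly through CoreNLP's annotation keys. The sketch below assumes the CoreNLP jars and models are on the classpath and that the pipeline includes the tokenize, ssplit, and pos annotators. The function name word-tags is made up for this example.

```clojure
(import '(edu.stanford.nlp.pipeline StanfordCoreNLP Annotation)
        '(edu.stanford.nlp.ling CoreAnnotations$SentencesAnnotation
                                CoreAnnotations$TokensAnnotation))

;; Hypothetical helper: annotate text and return, for each sentence,
;; a vector of [word tag] pairs read straight from the Java objects.
(defn word-tags [pipeline ^String text]
  (let [doc (Annotation. text)]
    (.annotate pipeline doc)
    (vec
      (for [sentence (.get doc CoreAnnotations$SentencesAnnotation)]
        (mapv (fn [token] [(.word token) (.tag token)])
              (.get sentence CoreAnnotations$TokensAnnotation))))))
```

Calling (word-tags nlPipe "Do you know where that is?") should return one inner vector per sentence, with each token paired with its part-of-speech tag.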

Getting more functional

Now, this might have been a quick and easy way of getting some text processed, but it wasn't a functional way. Plus, we don't want our data in the terminal; we want it in a format we can pass around inside Clojure. Let's create a wrapper to do all of this.

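;; Assumes Cheshire is on the classpath; json/parse-string is Cheshire's API
(require '[cheshire.core :as json])

(defn pipe2JSON [pipeline t]
  (let [txt    (.process pipeline t)      ; annotate the raw string
        buffer (java.io.StringWriter.)]   ; collects the JSON output
    (.jsonPrint pipeline txt buffer)
    (json/parse-string (.toString buffer))))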

And then let's show it off by printing the part of speech tags for the first sentence:

(let [a (pipe2JSON nlPipe "I'll meet you by the Sample Gates. Do you know where that is?")]
  (println (map (fn [x] (get x "pos"))
                (get-in a ["sentences" 0 "tokens"]))))

Here, our pipe2JSON function takes a pipeline and applies it to a string, JSON-prints the resulting annotation into a StringWriter buffer, and parses that buffer into a familiar Clojure map.
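
Once the data is an ordinary Clojure map, pulling things out of it is plain data manipulation. As a rough sketch, here is a helper that pairs each word with its tag for a given sentence. The name pos-tags is hypothetical, and the sample map is a hand-written stand-in that only mimics the "sentences", "tokens", "word", and "pos" keys. It is not real CoreNLP output.

```clojure
;; Hypothetical helper: [word pos] pairs for sentence number n (0-based)
;; from a map shaped like the parsed CoreNLP JSON.
(defn pos-tags [parsed n]
  (mapv (juxt #(get % "word") #(get % "pos"))
        (get-in parsed ["sentences" n "tokens"])))

;; Hand-written stand-in for parsed output (illustrative values only)
(def sample
  {"sentences" [{"tokens" [{"word" "Do"  "pos" "VBP"}
                           {"word" "you" "pos" "PRP"}]}]})

(pos-tags sample 0)
;; => [["Do" "VBP"] ["you" "PRP"]]
```

With a real pipeline, the same function could be applied to the result of pipe2JSON. Asking for a sentence index that doesn't exist simply returns an empty vector.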

More...

StanfordCoreNLP has a lot more features than covered here. If you're parsing modest amounts of data, this should be plenty; if you want to do advanced NLP or NLP on a larger scale, read the documentation. A good place to start is understanding the StanfordCoreNLP class.

The code for this tutorial is available on my GitHub.



