Twitter Scraping with Clojure


Twitter is a popular resource for social media analysis and natural language processing. If we want to do these things in Clojure, it makes sense that we'll want a way to access Twitter. This is a simple example of how to handle Twitter data.


The first step to working with Twitter is setting up a developer account. I'm not going to go into detail on this. Twitter has great developer documentation that you should definitely check out, and it's pretty straightforward to create a developer account (hint: go here).

I like twitter-api as my Clojure Twitter library of choice, and that's what we'll be using here. Add the latest version to your Leiningen project file or download the .jar. Whatever floats your boat. That's all you'll need for this demo.

First steps

Once you have the project set up with twitter-api, you're ready to write some code. The first thing we'll want to do is bring in the libraries and authenticate with your credentials:

(ns WonderfulCoyote-twitterDemo
  (:use [twitter.oauth]
        [twitter.api.restful])
  (:import [twitter.callbacks.protocols SyncSingleCallback]))

(def my-creds (make-oauth-creds "cnsmr-key"
                                "cnsmr-secret"
                                "access-token"
                                "access-token-secret"))

Obviously, use your actual credentials instead of the dummy credentials.

[NB: If you're running this in the REPL, use the REPL versions of use and import, e.g., (use '(twitter.oauth))]

Our first Twitter request

For our first request, let's look at the POTUS' Twitter account:

(users-show :oauth-creds my-creds :params {:screen-name "POTUS"})

Pretty easy, huh? The twitter-api library gives us convenient functions for most of the endpoints. (Want to know everything that's covered? Read the docs!)

Here, we're just passing our credentials as a named argument to the users-show function and specifying the username we want as a parameter. You'll want to read the Twitter docs to figure out all the parameters you can use to specify your requests.
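For instance, if we only care about a few fields, we can pull them out of the response map once it comes back. Here's a minimal sketch; the `user-summary` name and the particular keys chosen are mine, but the keys follow Twitter's user-object field names as they appear in the response body:

```clojure
;; Hypothetical helper: pull a few fields of interest out of a
;; users-show response. The :body of the response is a map keyed
;; by Twitter's user-object field names.
(defn user-summary [response]
  (let [body (:body response)]
    {:name      (:screen_name body)
     :followers (:followers_count body)
     :tweets    (:statuses_count body)}))

;; Usage:
;; (user-summary (users-show :oauth-creds my-creds
;;                           :params {:screen-name "POTUS"}))
```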

A little bit more...

Often, we'll only be interested in some of the metadata. Here, I demonstrate how we can get the text of the tweet, the hashtags used in that tweet, and the number of times the tweet has been retweeted and favorited (along with the tweet ID). I always recommend saving the ID so you can go back and get the full data in case you need it.

  ;; Notice that get/get-in can be used to easily and clearly access nested pieces of the nested map/JSON
  (map (fn [x] [(x :text)
                (map #(get % :text) (get-in x [:entities :hashtags]))
                (x :retweet_count)
                (x :favorite_count)
                (x :id)])
       ;; We use get-in with [:body :statuses] to get the tweets (Twitter calls these "statuses") in the "body" of the HTTP response
       (get-in (search-tweets :oauth-creds my-creds :params {:q "#friday" :count 100}) [:body :statuses]))

Hashtag count distributions

Maybe we're not interested in individual Tweets but in the distribution of a behavior, like hashtag use. Here, we group the Tweets by the number of hashtags they use. In this case, we're using the search term "friday" (FYI, Twitter search is case insensitive, so this matches "friday" but also "Friday", "FRiday", "friDAY", etc.)

;; Start with let, just to take the search business out of the way
(let [tweets (get-in (search-tweets :oauth-creds my-creds :params {:q "friday" :count 100})
                     [:body :statuses])]
  ;; Return the hashtag count as the first element and the number of tweets as the second
  (map (fn [x] [(get x 0) (count (get x 1))])
       ;; Group tweets by the number of hashtags they use
       (group-by (fn [x] (count (get-in x [:entities :hashtags]))) tweets)))

This will return a neat little sequence of integer pairs. The first element of each pair is the number of hashtags, and the second is the number of tweets from our pull that had that many hashtags. My results were:

([0 10] [7 5] [1 18] [4 9] [6 7] [3 12] [12 2] [2 26] [11 1] [5 5] [8 2])

This seems like something we may be able to model with a Poisson distribution...
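To put a number on that intuition, we can estimate the Poisson rate λ as the sample mean of the hashtag counts, weighting each count by how many tweets had it. A quick sketch using the results above (the `mean-hashtags` name is mine):

```clojure
;; Each pair is [hashtag-count tweet-count], copied from the results above.
(def hashtag-dist
  [[0 10] [7 5] [1 18] [4 9] [6 7] [3 12] [12 2] [2 26] [11 1] [5 5] [8 2]])

;; Estimate λ as the weighted mean: total hashtags / total tweets.
(defn mean-hashtags [pairs]
  (let [tweets   (reduce + (map second pairs))
        hashtags (reduce + (map (fn [[k n]] (* k n)) pairs))]
    (double (/ hashtags tweets))))

(mean-hashtags hashtag-dist)
;; => roughly 3.04 for this pull (295 hashtags over 97 tweets)
```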

Repeated sampling

If we're interested in comparing the hashtag frequency of different topics, we're going to want to do this grouping again and again. That calls for a function. We can build one like so:

(defn countByHashtagSearch
  "Takes a search string and returns the hashtag-use distribution over n sampled tweets (default 10)"
  ;; Notice that we've overloaded the function definition here
  ;; You *can* specify the number of tweets you'd like -- otherwise it defaults to 10
  ([my-string] (countByHashtagSearch my-string 10))
  ([my-string n]
   (let [tweets (get-in (search-tweets :oauth-creds my-creds :params {:q my-string :count n})
                        [:body :statuses])]
     (map (fn [grp] [(get grp 0) (count (get grp 1))])
          (group-by (fn [twt] (count (get-in twt [:entities :hashtags]))) tweets)))))

Note the function overloading: we're using it here simply to allow an optional parameter.
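Arity overloading in general works like this: each parameter list gets its own body, and a lower arity can delegate to a higher one to supply a default. A toy example (the `greet` function is just for illustration):

```clojure
(defn greet
  ;; Zero- and one-argument versions; the zero-argument
  ;; version delegates to the one-argument version with a default.
  ([] (greet "world"))
  ([name] (str "Hello, " name "!")))

(greet)        ;; => "Hello, world!"
(greet "Rich") ;; => "Hello, Rich!"
```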

Now let's put our function to use:

(countByHashtagSearch "clojure" 100)
(countByHashtagSearch "scala" 100)
(countByHashtagSearch "haskell" 100)
(countByHashtagSearch "lisp" 100)

And we get back:

;; clojure
([0 49] [1 33] [2 9] [3 5] [5 1] [9 2] [12 1])
;; scala
([0 82] [1 8] [3 2] [2 7] [4 1])
;; haskell
([0 85] [1 11] [10 1] [2 2] [3 1])
;; lisp
([0 88] [1 7] [2 3] [4 2])

Looks to me like Clojurians favor hashtags more than the users of some similar languages (although, we probably want to be careful -- "lisp" may not mean what we want it to mean here.) That said, though, our Poisson distribution thought would definitely fall apart here. 12 is pretty far out in the tail for a distribution with λ ≅ 1.
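To see just how far out, we can compute the Poisson probability mass P(X = k) = e^(-λ) λ^k / k! directly. A small sketch (the `poisson-pmf` name is mine):

```clojure
;; Poisson pmf: P(X = k) = e^-λ * λ^k / k!
(defn poisson-pmf [lambda k]
  (/ (* (Math/exp (- lambda)) (Math/pow lambda k))
     (reduce *' (range 1 (inc k)))))

;; Under λ = 1, a tweet with twelve hashtags is vanishingly unlikely --
;; well under one in a billion:
(poisson-pmf 1 12)
```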


For collecting lots of Twitter data, I recommend saving the data in full as .json files or .txt files where each line is a tweet. I like clj-json for JSON handling in Clojure.
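A minimal sketch of the one-tweet-per-line approach, assuming clj-json is on your classpath (its generate-string function serializes a Clojure map to a JSON string; the `save-tweets!` name is mine):

```clojure
(require '[clj-json.core :as json]
         '[clojure.java.io :as io])

;; Write each tweet map to the file as one JSON object per line.
(defn save-tweets! [path tweets]
  (with-open [w (io/writer path)]
    (doseq [tweet tweets]
      (.write w (json/generate-string tweet))
      (.write w "\n"))))

;; Usage:
;; (save-tweets! "friday-tweets.json"
;;               (get-in (search-tweets :oauth-creds my-creds
;;                                      :params {:q "friday" :count 100})
;;                       [:body :statuses]))
```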

The code for this tutorial is available on my GitHub.

Mastering Large Datasets

My new book, Mastering Large Datasets, is in early release now. Head over and buy a copy today.