Social Media Gender Prediction


Often, when we're working with social media data, we want to know more about the users than the text explicitly tells us. That is, we want to infer demographic variables from the text. One of the most commonly sought demographic variables is gender. Given decent-sized portions of text, social media researchers have been able to discriminate successfully between male and female social media users. Today, we're going to take a look at a simple implementation of one of those algorithms.

Developing Age and Gender Predictive Lexica over Social Media

Sap et al. (2014) released a lexicon for just this purpose: gender prediction on general-purpose social media text. The algorithm they trained achieved better than 80 percent accuracy in a variety of social media contexts (blogs, Facebook, Twitter).

In addition to their work on gender prediction, they also released a lexicon for age prediction.

Both lexicons and algorithms work similarly. The gender lexicon scores each word as either male-likely or female-likely, and the algorithm combines the scores. The algorithm for gender prediction is:

  1. Tokenize the words
  2. Assign scores to each token
  3. Sum the scores
  4. Add an intercept value
  5. Take the sine

Both lexicons can be found at the Penn World Well-Being Project website.


For such a desirable task, implementing the algorithm is rather simple. I'm going to implement it here in Clojure, but a Python implementation is linked below.

The first thing we have to do is bring in the libraries we'll need. I'm going to use ultra-csv to bring in the lexicon and, since we're doing text-processing, clojure.string.

(ns wonderfulcoyote.gndrprdct
  (:require [ultra-csv.core :as ucsv]
            [clojure.string :as str]))

Next, we want to load the lexicon. I'm going to store it as a global variable, SapWeights, in appreciation of the original authors.

(def SapWeights
  ;; Fold the CSV rows into a single map of term -> weight
  (reduce
    (fn [a b] (assoc a (b :term) (b :weight)))
    {}
    (ucsv/read-csv "./resources/gender_lex.csv")))

Note that the data comes in as a sequence of maps with :term and :weight keys. We reduce that to a single map in which each term is a key and its weight is the value, so we can look up a weight by its term.
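
To make that reduction concrete without needing the CSV file, here is a minimal sketch that applies the same reduce/assoc pattern to a few hand-written rows. The names toy-rows and toy-lexicon are hypothetical, and the terms and weights are invented for illustration; they are not values from the released lexicon.

```clojure
;; Hand-written stand-in for the rows the CSV reader returns:
;; a sequence of maps with :term and :weight keys.
;; NOTE: these terms and weights are made up, not from the real lexicon.
(def toy-rows
  [{:term "pizza" :weight 0.5}
   {:term "rain"  :weight -0.25}
   {:term "chess" :weight -1.5}])

;; Fold the rows into one map keyed by the term string.
(def toy-lexicon
  (reduce (fn [acc row] (assoc acc (:term row) (:weight row)))
          {}
          toy-rows))

(println toy-lexicon)
(println (get toy-lexicon "rain"))      ; weight for a known term
(println (get toy-lexicon "unknown" 0)) ; default for a missing term
```

In this sketch, starting the reduction from an empty map keeps the :term and :weight keys of the first row out of the result; without an initial value, reduce would use the first row itself as the accumulator.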

Next, we can set up our gender prediction function.

(defn predictgender
  "Predict a tweet author's gender from text. 0 is male, 1 is female."
  [t]
  (let [g
        ;; Take the sine (step 5)
        (Math/sin
          ;; Subtract the intercept (step 4)
          (-
            ;; Sum them (step 3)
            (reduce +
              ;; Lookup values for each word (step 2)
              (map #(get SapWeights % 0)
                ;; Tokenization (step 1)
                (str/split (str/lower-case t) #"\s+")))
            0.067242152))]
    ;; Assign appropriate gender
    (if (<= g 0) 0 1)))

I've commented each step to correspond with the algorithm outline above. We'll go through the steps one by one below.

Step 1 - Tokenization

;; Tokenization (step 1)
(str/split (str/lower-case t) #"\s+")

Here, we're simply lower-casing the string and splitting on whitespace. This is tokenization, but only in the most rudimentary fashion, and it should probably be improved. Feel free to run your favorite social-media-aware tokenizer beforehand and pass a string of whitespace-separated tokens into the function.
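
As one possible improvement, the sketch below uses a regular expression to keep @mentions, #hashtags, and contractions intact while dropping surrounding punctuation, then joins the tokens with spaces so the result can be handed to the function. Both simple-tokenize and tokens->string are hypothetical helpers, and the pattern is only an assumption about what should count as a token. It is far cruder than a purpose-built social media tokenizer, and it only matches ASCII letters, digits, and underscores.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, not part of the original implementation.
;; Token pattern (an assumption): an optional leading @ or #, then a run of
;; letters/digits/underscores, optionally followed by apostrophe pieces
;; (so "i'm" stays a single token).
(defn simple-tokenize
  [text]
  (re-seq #"[@#]?\w+(?:'\w+)*" (str/lower-case text)))

;; Join tokens with single spaces so a whitespace split recovers them.
(defn tokens->string
  [tokens]
  (str/join " " tokens))

(println (simple-tokenize "Hey, @realDonaldTrump, I'm yet another tweet!"))
(println (tokens->string (simple-tokenize "This is a tweet.")))
```

Whatever tokenizer you choose, a token only contributes a score if it matches a lexicon entry exactly, so it is worth checking a few of its outputs against the lexicon's terms.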

Step 2 - Lookup values

;; Lookup values for each word (step 2)
(map #(get SapWeights % 0) xs)

Here, we're simply using get to look up every token as a key in the SapWeights map. We're specifying 0 as the default in case the token is not in our lexicon. The resulting data structure will be a sequence of numbers.
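
A small sketch of why the default matters, using a made-up two-entry map (demo-weights is a hypothetical name, and its values are not from the lexicon): without a default, get returns nil for unknown tokens, and + cannot add nil.

```clojure
;; Made-up weights for illustration only.
(def demo-weights {"pizza" 0.5 "rain" -0.25})

(def demo-tokens ["pizza" "in" "the" "rain"])

;; With a default of 0, unknown tokens contribute nothing to the sum.
(def demo-scores (map #(get demo-weights % 0) demo-tokens))
(println demo-scores)
(println (reduce + demo-scores))

;; Without a default, unknown tokens come back as nil ...
(println (map #(get demo-weights %) demo-tokens))

;; ... and summing them throws a NullPointerException.
(println (try
           (reduce + (map #(get demo-weights %) demo-tokens))
           (catch NullPointerException _ :cannot-add-nil)))
```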

xs is a placeholder for the tokens returned by Step 1.

Steps 3, 4, and 5 - Sum the values and subtract the intercept, then take the sine

(Math/sin (- (reduce + xs) 0.067242152))

Nothing fancy going on here, just arithmetic. Per the algorithm's instructions, we take the sum, add the intercept (which amounts to a subtraction here), and take the sine. Note that we're using Math/sin from Java here.

Again, xs is a placeholder, this time for the sequence of weights returned by Step 2.
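
To see the steps fit together without the lexicon file, here is a rough end-to-end sketch on a toy lexicon. Everything named here (toy-weights, toy-intercept, score-text, label) is hypothetical, and the weights and intercept are invented, so the outputs say nothing about real text. It follows the five steps as listed in the outline at the top of this post, adding a negative intercept in step 4.

```clojure
(require '[clojure.string :as str])

;; Invented values for illustration; NOT the released lexicon or intercept.
(def toy-weights {"pizza" 0.5 "rain" -0.25 "chess" -1.5})
(def toy-intercept -0.05)

(defn score-text
  "Steps 1-5 from the outline, run against the toy lexicon."
  [text]
  (let [tokens  (str/split (str/lower-case text) #"\s+") ; step 1: tokenize
        weights (map #(get toy-weights % 0) tokens)      ; step 2: score tokens
        total   (reduce + weights)                       ; step 3: sum
        shifted (+ total toy-intercept)]                 ; step 4: add intercept
    (Math/sin shifted)))                                 ; step 5: take the sine

(defn label
  "Same cut-off as predictgender: 0 if the score is <= 0, otherwise 1."
  [text]
  (if (<= (score-text text) 0) 0 1))

(println (score-text "Pizza pizza tonight"))
(println (label "Pizza pizza tonight"))
(println (label "Chess in the rain"))
```

One thing the sketch makes easy to check: Math/sin keeps the sign of its argument only while that argument stays between -π and π, so for long texts with large sums the cut-off at 0 can flip.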

Python version and .jar executable

You should feel free to embed this code into whatever Clojure projects you are working on; however, I also have a Python implementation available for the Pythonistas out there.

Using it is simple (and is covered on the GitHub page).

# Step 1. Import the class
from SapGenderPrediction import GndrPrdct

# Step 2. Instantiate the class and set up some data
Classifier = GndrPrdct()
tweets = ["This is a tweet.",
          "I'm another tweet!",
          "Hey, @realDonaldTrump, I'm yet another tweet!"]

# Step 3. Classify
Classifier.predict_gender(" ".join(tweets))

You can easily apply this to lots of data like so:

# Bind the method to a local name to save repeated attribute lookups
predict_gender = Classifier.predict_gender
# Map it over all our tweets (one string of tweets per user)
genders = map(predict_gender, my_users_tweets)

One advantage of the Python implementation is that it uses Christopher Potts' happyfuntokenizing, just like the authors' original implementation. No prior tokenization is required.

For those who don't want to code, I also have a standalone .jar file. It expects an input file with one user's tweets per line, and you can run it with the following command:

java -jar ./TwitterGenderPrediction.jar /path/to/your/tweets.txt


As always, the code for this tutorial is available on my GitHub.
