Stop using NLTK, start using spaCy

NLTK is the most popular natural-language processing library in Python among beginners, and for good reason: it's lightweight and easy to learn. It has also fallen behind modern NLP approaches.

The better choice is spaCy, an excellent NLP library that can compete with state-of-the-art NLP tools. Its larger pretrained pipelines also ship with built-in word vectors, which are crucial for most modern NLP.

Here's a taste of spaCy, direct from the spaCy documentation.

# pip install spacy
# python -m spacy download en_core_web_sm

import spacy

# Load English tokenizer, tagger, parser and NER
# (the small model has no static word vectors; en_core_web_md and
# en_core_web_lg include them)
nlp = spacy.load("en_core_web_sm")

# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)
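
To make the word-vector point concrete, here is a minimal sketch of spaCy's vector API. It is not from the spaCy documentation. Since en_core_web_sm does not include static word vectors (the larger en_core_web_md and en_core_web_lg packages do), the snippet assumes a blank English pipeline and registers three made-up 3-dimensional vectors by hand, so it runs without any model download. The words and numbers are purely illustrative; real vectors typically have hundreds of dimensions.

```python
import numpy as np
import spacy

# Blank English pipeline: tokenizer only, no pretrained model required
nlp = spacy.blank("en")

# Toy vectors invented for this illustration (assumption, not real data)
toy_vectors = {
    "car": np.array([1.0, 0.0, 0.0], dtype="float32"),
    "truck": np.array([0.9, 0.1, 0.0], dtype="float32"),
    "banana": np.array([0.0, 0.0, 1.0], dtype="float32"),
}
for word, vector in toy_vectors.items():
    nlp.vocab.set_vector(word, vector)

doc = nlp("car truck banana")
car, truck, banana = doc

# Each token exposes its vector and a cosine-similarity helper
print("Has vector:", car.has_vector, "shape:", car.vector.shape)
print("car ~ truck:", car.similarity(truck))
print("car ~ banana:", car.similarity(banana))
```

With a real model you would drop the set_vector loop, load en_core_web_md instead, and call similarity the same way.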

Open-source lovers rejoice: spaCy is released under the MIT license!



