j.t. wolohan

big data, natural language processing, social identity, web mining, text analytics
jwolohan@indiana.edu

Use more dicts.

data science python data structures

Introduction

Something that rarely comes up in discussions about data science (and something that should come up a lot more) is what data structures are being used. Data scientists spend a lot of time thinking about algorithms. Data scientists spend comparatively little time thinking about data. This is a mistake.

Using the right data structures makes our lives much easier as data scientists. In an increasing number of situations, the right data structure is a dictionary.

What are dictionaries anyway?

Dictionaries (dict objects), as you probably know, are key-value data structures. In run-of-the-mill Python (CPython) these are implemented in C as hash tables.

Hash tables are arrays that use hashes (integers computed from each key by a hash function) to decide where each entry is stored. Your key can be any hashable value, such as a string, a number, or a tuple of those, and Python doesn't have to scan the whole table to find it; it jumps to the slot the hash points to, with a little extra work when two keys collide. Long story short: this makes dicts fast. Lookups, insertions, and deletions are O(1) on average.
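To make the hashing idea concrete, here is a minimal sketch using the built-in hash() function. The tuple key and the orders dict are made up for illustration; the point is only that a key must be hashable and that lookup goes through its hash.

```python
# hash() maps a hashable object to an integer; the dict uses that integer
# to decide where in its internal array the entry lives.
key = ("user", 120314)          # tuples of hashable items are hashable
key_hash = hash(key)            # an int; the exact value is not important

orders = {key: 22.49}           # store a value under the tuple key
found = orders[key]             # average-case O(1) lookup

# Mutable built-ins such as lists are not hashable, so they cannot be keys.
try:
    orders[["user", 120314]] = 0.0
    list_key_allowed = True
except TypeError:
    list_key_allowed = False
```

If you need a list-like key, converting it to a tuple first is the usual workaround.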

Besides their speed, dicts are great for two primary reasons:

  1. They make our code readable
  2. They are always the right size

Dicts make your code readable

When working with other developers, whether in person on a dedicated dev team or remotely on an open-source project, we will often have to take time to make sense of the data structures we're being passed by preceding operations. Even when we know a tuple is going to contain three elements (for example: user id, order id, and order price), we generally have to confirm these three are in the right order. With a dictionary, we can just call them by name.

d = not_our_code()
d['userid'], d['orderid'], d['price']
# (120314, 144, 22.49)

Dicts also don't require us to assume that other developers' code is working right, which brings us to point two...

Dicts are always the right size

If we're missing the price in the above scenario, our tuple-based code is going to break. However, if we use a dict instead, we can work around that. Dictionaries have a method called get which allows us to provide a default value in the event a key is missing. For example, we might want a missing price to default to False or even 0.

d = not_our_code()
# d = {"userid": 120314, "orderid": 144}
d.get('userid')
# 120314
d.get('price', False)
# False
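For contrast, here is a small sketch of the same missing-price situation with a tuple next to a dict. The records are hypothetical and simply mirror the example values used in this post.

```python
# Hypothetical records that arrive without a price.
order_tuple = (120314, 144)
order_dict = {"userid": 120314, "orderid": 144}

# Unpacking three names from a two-element tuple raises ValueError.
try:
    userid, orderid, price = order_tuple
    tuple_unpacked = True
except ValueError:
    tuple_unpacked = False

# The dict version keeps working and falls back to a default of 0.
price = order_dict.get("price", 0)
```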

Another thing we may want to do is use get to look up the value if it's something we can calculate or find. For example, we may be able to look up the price value in a database.

d.get('price', lookup_price(d.get('orderid')))

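Here we're using a function, lookup_price(), that finds the price if we're missing it. One caveat: Python evaluates the default argument before it calls get, so lookup_price() runs every time, even when the price is already in the dict. If the lookup is expensive, check for the key first.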
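One detail worth seeing in code is when lookup_price() actually runs. In the sketch below, lookup_price() is a hypothetical stand-in for a database call that records each time it is invoked; the dict already has a price, so ideally the lookup would not run at all.

```python
calls = []  # records every order id passed to lookup_price

def lookup_price(orderid):
    """Hypothetical stand-in for a database lookup."""
    calls.append(orderid)
    return 22.49

d = {"userid": 120314, "orderid": 144, "price": 19.99}

# The default argument is evaluated before get() runs, so lookup_price()
# is called even though "price" is already in the dict.
eager_price = d.get("price", lookup_price(d.get("orderid")))
calls_after_eager = len(calls)

# Checking membership first only runs the lookup when the key is missing.
lazy_price = d["price"] if "price" in d else lookup_price(d.get("orderid"))
calls_after_lazy = len(calls)
```

Both versions return the stored price, but only the first one pays for a lookup it didn't need. For a cheap default like 0 or False, plain get is fine.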

Where else are things done this way?

Lots of places. One example is JSON, or JavaScript Object Notation. Part of the reason the JSON format is so popular is that it uses a dict-like syntax that makes it both highly human-readable and language agnostic. In fact, dicts and JSON are so similar that Python dicts turn into JSON objects when we write Python data to JSON. Similarly, JSON objects are read in as dicts by default.
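A quick round trip with the standard library's json module shows the correspondence. The order record is the same illustrative one used earlier in this post.

```python
import json

order = {"userid": 120314, "orderid": 144, "price": 22.49}

text = json.dumps(order)        # dict -> JSON object, serialized as a str
restored = json.loads(text)     # JSON object -> dict (the default behavior)
```

One thing to keep in mind: JSON object keys are always strings, so a dict with non-string keys (integers, say) will come back with string keys after a round trip.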

Why are they better than pandas/numpy?

Well, they're not... not always. But there are a few situations where you're probably better off working with basic objects than with numpy arrays or pandas DataFrames.

  1. You're comfortable writing your own functions
  2. You don't need to do lots of floating point arithmetic
  3. You might not want to bring all your data into memory

The bottom line is that pandas and numpy are great for numerical analysis but tend to fall short elsewhere. If your data doesn't come in a nice, neat tabular format, or you want to do something besides lots of multiplication (oversimplification warning!), you're probably better off using lists and dictionaries for your work.
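As a rough sketch of the "not everything in memory" case, the standard library's csv.DictReader hands us one dict per row, so we can aggregate as we go. The in-memory StringIO and its rows are stand-ins for a large file on disk, and the column names are assumptions borrowed from the earlier examples.

```python
import csv
import io

# Stand-in for a large file on disk; in practice this would be open(path).
raw = io.StringIO(
    "userid,orderid,price\n"
    "120314,144,22.49\n"
    "120314,145,\n"          # price missing for this order
    "98765,146,5.00\n"
)

totals = {}  # userid -> running total of known prices

# DictReader yields one dict per row, so only one row is held at a time.
for row in csv.DictReader(raw):
    price = float(row["price"]) if row["price"] else 0.0
    totals[row["userid"]] = totals.get(row["userid"], 0.0) + price
```

Only the running totals stay in memory, and the row with the missing price is handled with an ordinary conditional rather than a special missing-value type.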

That versatility and widespread usefulness are why lists and dicts are in the language as core data types.

Conclusion

Data scientists spend a lot of time focusing on algorithms and not nearly enough time talking about data types. Much of the time, data scientists default to the data structures favored by their tools, such as numpy arrays and pandas DataFrames. This can be a mistake. Data scientists should use dicts more often. Dicts are fast, robust, scalable and can model a variety of complex scenarios.

More

  1. Python language reference: dictionaries
  2. JSON - JavaScript Object Notation
  3. Princeton course on hash tables
  4. numpy - your go-to resource for numerical Python