Big data processing systems

Big-data systems series - Part 1

Setting up big data systems is a knowledge gap for most developers. They know how to write the code that analyzes the data--maybe how to use Spark, Hadoop, Hive, or Pig to process it--but setting up a system so that they can do that is a bridge too far.

In this series, we'll review three common big data architectures in order from least complex to most complex.

  • Base analytics system
  • Web analytics
  • Integrated recommendation system

In the simplest type of big data system, all we want to do is process a large dataset. This dataset can be static, or it could be updated by another system -- but for our purposes, that's someone else's problem. We can focus on the processing and analytics.

For this system we only need two pieces:

  • Storage -- to hold our data and serve the results
  • Processing machines -- to process the data

For the storage system, we'll typically want to use an object store such as Amazon S3 or Azure Blob Storage. These services let us store a practically unlimited amount of data, categorized any way we'd like using metadata. After our analysis, we can also use them to serve our output reports as a static website.
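To make the idea of categorizing data with metadata a little more concrete, here is a minimal sketch that builds the arguments for an upload. The argument names (`Bucket`, `Key`, `Body`, `Metadata`, `ContentType`) follow boto3's S3 `put_object` call. The helper `build_upload_args`, the bucket name, the date-based key layout, and the metadata fields are all hypothetical, chosen only for illustration.

```python
from datetime import date


def build_upload_args(bucket, dataset, day, filename, body, tags):
    """Build keyword arguments for an S3-style put_object call.

    The key layout (dataset/YYYY/MM/DD/filename) and the metadata fields are
    hypothetical conventions, not something the storage service requires.
    """
    key = f"{dataset}/{day:%Y/%m/%d}/{filename}"
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # User-defined metadata values are stored as strings.
        "Metadata": {name: str(value) for name, value in tags.items()},
        "ContentType": "text/csv",
    }


if __name__ == "__main__":
    args = build_upload_args(
        bucket="example-analytics-bucket",  # hypothetical bucket name
        dataset="sales",
        day=date(2020, 1, 15),
        filename="orders.csv",
        body=b"order_id,amount\n1,9.99\n",
        tags={"source": "pos-export", "rows": 1},
    )
    print(args["Key"])
    # With boto3 installed and credentials configured, the upload would be:
    #   boto3.client("s3").put_object(**args)
```

Keeping the naming logic in a plain function like this makes it easy to check the layout before any data is actually uploaded.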

For processing, we'll want to use either a managed cluster service--such as AWS EMR, Google Cloud Dataproc, or Azure HDInsight--or just some cloud compute resources. I prefer the managed clusters, because they take care of installing the distributed frameworks we'll want.

From here, we use the compute resources to process the data in object storage--such as S3--and send the results back to a static object. Others in our organization, or even the public, can access this object to see the results of our analysis. We can do this with any type of data processing: ETL, analytics, even machine learning.
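The read-process-write loop described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the `get_object` and `put_object` calls mirror the boto3 S3 client, while `InMemoryObjectStore`, `summarize_orders`, the bucket and key names, and the CSV columns are hypothetical stand-ins so the example runs without a cloud account. On a real cluster the processing step would more likely be a Spark or Hadoop job.

```python
import csv
import io
import json
from collections import defaultdict


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store client.

    It mimics only the two boto3 S3 client calls used in this sketch.
    """

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body, ContentType="binary/octet-stream"):
        self._objects[(Bucket, Key)] = Body
        return {}

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


def summarize_orders(client, bucket, input_key, output_key):
    """Read a CSV object, total the amounts per region, write a JSON report."""
    raw = client.get_object(Bucket=bucket, Key=input_key)["Body"].read()
    rows = csv.DictReader(io.StringIO(raw.decode("utf-8")))

    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += float(row["amount"])

    # Send the results back to object storage as a static JSON object.
    report = json.dumps(dict(totals), sort_keys=True).encode("utf-8")
    client.put_object(
        Bucket=bucket, Key=output_key, Body=report, ContentType="application/json"
    )
    return dict(totals)


if __name__ == "__main__":
    store = InMemoryObjectStore()
    store.put_object(
        Bucket="example-bucket",  # hypothetical bucket and keys
        Key="raw/orders.csv",
        Body=b"region,amount\neast,10.0\nwest,5.5\neast,2.5\n",
    )
    print(summarize_orders(store, "example-bucket", "raw/orders.csv", "reports/totals.json"))
```

Because `summarize_orders` only relies on those two calls, passing `boto3.client("s3")` in place of the in-memory store should work against a real bucket, assuming boto3 is installed and credentials are configured.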

In my book, Mastering Large Datasets with Python, I detail how you can set up a base analytics system in chapters 11 and 12. You can buy that book at
