Integrated recommender systems

Big-data systems series - Part 3

Setting up big data systems is a knowledge gap for most developers. They know how to write the code that analyzes the data--maybe how to use Spark, Hadoop, Hive or Pig to process it--but setting up a system so that they can do that is a bridge too far.

In this series, we'll review three common big data architectures in order from least complex to most complex.

  • Base analytics system -- link
  • Web analytics -- link
  • Integrated recommendation system -- link

In this post, we're going to look at a common, complex analytics setup: the integrated recommendation system. This type of system is found on the largest e-commerce platforms and most social-media platforms, such as Amazon, Walmart, Target, Facebook, Yelp, and Netflix. The goal of these systems is to analyze customer behavior and serve customers the content--products or other media--that is most likely to engage them.

The reference architecture provided by AWS imagines an e-commerce site that sends marketing emails to its customers based on their behavior on the site. If you take a look at the reference architecture, the first thing you'll notice is that there are three web applications:

  • An e-commerce application where users can purchase products or interact with content
  • A marketing email system that sends users updates on products they are most interested in
  • A recommendation engine that provides dynamic, personalized content to the e-commerce application

The lynchpin of our data analytics process--as has been the case with both of our simpler setups--is going to be a managed cluster, like Elastic MapReduce or HDInsight. This cluster will analyze three sources: web logs, which we can use to look at how users are behaving on the site; a database of actual orders, which we can use to determine which products users will likely buy; and user profiles, which we use to form recommendations and inform content personalization.
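
To make the cluster's job concrete, here is a minimal sketch in plain Python of how those three sources might be combined into a condensed profile per user. Everything in it is an assumption made for illustration: the record fields, the sample data, the weights, and the function names are hypothetical, and a real cluster would run this kind of logic as a distributed job over the data lake rather than over in-memory lists.

```python
from collections import Counter, defaultdict

# Hypothetical sample records; a real job would read these from the data lake.
WEB_LOGS = [
    {"user_id": "u1", "product_id": "p1", "event": "view"},
    {"user_id": "u1", "product_id": "p2", "event": "view"},
    {"user_id": "u1", "product_id": "p1", "event": "view"},
    {"user_id": "u2", "product_id": "p3", "event": "view"},
]
ORDERS = [
    {"user_id": "u1", "product_id": "p2"},
    {"user_id": "u2", "product_id": "p3"},
]
USER_PROFILES = {
    "u1": {"email": "u1@example.com"},
    "u2": {"email": "u2@example.com"},
}

# Assumed weights: an order is treated as a stronger signal than a page view.
VIEW_WEIGHT = 1
ORDER_WEIGHT = 5


def score_interest(web_logs, orders):
    """Return {user_id: Counter({product_id: score})} from logs and orders."""
    scores = defaultdict(Counter)
    for record in web_logs:
        if record["event"] == "view":
            scores[record["user_id"]][record["product_id"]] += VIEW_WEIGHT
    for order in orders:
        scores[order["user_id"]][order["product_id"]] += ORDER_WEIGHT
    return scores


def build_condensed_profiles(web_logs, orders, profiles, top_n=2):
    """Attach each user's top-scoring products to their profile."""
    scores = score_interest(web_logs, orders)
    condensed = {}
    for user_id, profile in profiles.items():
        top = [pid for pid, _ in scores[user_id].most_common(top_n)]
        condensed[user_id] = {**profile, "top_products": top}
    return condensed


condensed_profiles = build_condensed_profiles(WEB_LOGS, ORDERS, USER_PROFILES)
```

The weighting here is deliberately naive. The point is only that behavior, orders, and profiles flow in, and a small per-user summary flows out.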

You'll also notice that none of the three systems interacts directly with our cluster. The cluster consumes the data output by these systems--mainly the e-commerce platform--and then produces data that these systems can use, especially the marketing email application and the recommendation application. When we loosely couple our designs in this way, we can modify our systems and remain confident that we won't break our analytics system. Similarly, we don't have to embed all of our data in any of our applications. That data can stay in a data lake that our managed cluster works on, and the marketing and recommendation applications can deal with the condensed user profiles.
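
The loose coupling described here can be sketched with nothing more than files. In this hypothetical example, two temporary directories stand in for the data lake and the cluster's output location, and the three functions stand in for the e-commerce application, the cluster job, and the marketing application. The file names, record fields, and function names are all assumptions for illustration, not part of the AWS reference architecture.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def ecommerce_app_write_events(lake_dir, events):
    """E-commerce side: append raw events to the data lake as JSON lines."""
    path = Path(lake_dir) / "web_logs.jsonl"
    with path.open("a", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")


def analytics_job(lake_dir, output_dir):
    """Cluster side: read raw events, write one condensed profile per user."""
    views = {}
    with (Path(lake_dir) / "web_logs.jsonl").open(encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            views.setdefault(event["user_id"], Counter())[event["product_id"]] += 1
    profiles = {
        user_id: {"favorite_product": counts.most_common(1)[0][0]}
        for user_id, counts in views.items()
    }
    out_path = Path(output_dir) / "profiles.json"
    out_path.write_text(json.dumps(profiles), encoding="utf-8")


def marketing_app_pick_product(output_dir, user_id):
    """Marketing side: reads only the condensed profiles, never the raw logs."""
    text = (Path(output_dir) / "profiles.json").read_text(encoding="utf-8")
    return json.loads(text)[user_id]["favorite_product"]


def run_demo():
    """Wire the three pieces together through the two directories."""
    with tempfile.TemporaryDirectory() as lake, tempfile.TemporaryDirectory() as out:
        ecommerce_app_write_events(lake, [
            {"user_id": "u1", "product_id": "p1"},
            {"user_id": "u1", "product_id": "p1"},
            {"user_id": "u1", "product_id": "p2"},
        ])
        analytics_job(lake, out)
        return marketing_app_pick_product(out, "u1")
```

None of the application functions calls the analytics job directly. They only agree on where files live and what shape they have, so any one piece can be rewritten without touching the others.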

Again, while a lot is going on around our analytics system, at its core the analytics system is still a data lake and a managed cluster. In my book, Mastering Large Datasets with Python, I detail how you can set up a base analytics system in chapters 11 and 12. You can buy that book at