It's in early release now. Head over to Manning.com and buy a copy today.
Big data is all the rage. And everyone has it. Except you know what the saying is: when something is everything, it's nothing. And that's what big data is: nothing. In this post, I'm going to talk about a better way to think about data---one that goes beyond the "four Vs" of big data and gets to the core of what really causes us problems when we're talking about big data: its size.
The traditional way of categorizing big data is to describe it along four different axes: velocity, variety, veracity and volume. There are a few problems here---my personal favorite being that only one of them (volume) has anything to do with size.
Of course, this method of categorizing big data was developed so that everyone could consider their data big and everyone could join the "big data" party. Your data is messy? It's big data. Your data comes in a variety of formats? Big data. Users are visiting your site at regular intervals? Big data.
This is useful as a marketing gimmick, but it's not great as a tool for thinking about development problems. After all, the solution to a problem involving small amounts of data stored in twenty different formats (Variety) is much different than the solution to a problem involving a petabyte of data all in one format (Volume).
That's why when I think about the challenges imposed by "big data", I think about the challenges that come with dealing with datasets of differing sizes. I consider there to be three broad sizes of datasets:
1. Problems where the data can be both stored and processed on a personal computer
2. Problems where the solution can be executed from a personal computer but the data cannot be stored on one
3. Problems where the solution cannot be executed on a personal computer and the data cannot be stored on one either
Because there are three sizes, it takes a lot of restraint not to give them silly names (small, medium, large datasets; tall, grande, venti datasets; etc.). The point here, though, is that there is a continuum along which problems exist, and that the solutions to problems close together on that continuum will involve solving similar sub-problems, because they are challenged by the size of the data in addition to whatever challenges they would normally face.
This continuum is not static. It is constantly being stretched out as more and more data is produced. Additionally, the already dense size 1 and size 2 part of the continuum becomes ever denser as computing power increases.
However, given these three categories, most developers should be able to identify where their problem fits and then select an appropriate set of tools for it. The same cannot be said of the four Vs of big data.
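To make the three categories concrete, here is a minimal Python sketch of how one might place a problem on the continuum. The names `DatasetSize` and `classify_problem`, and the two yes/no questions it asks, are hypothetical and chosen only for illustration. The categories above do not cover data that fits locally but cannot be processed locally, so lumping that case in with size 3 is an assumption of this sketch.

```python
from enum import Enum


class DatasetSize(Enum):
    """The three broad sizes of dataset problems (labels are illustrative)."""
    SIZE_1 = 1  # data is stored and processed on a personal computer
    SIZE_2 = 2  # solution runs from a personal computer, data lives elsewhere
    SIZE_3 = 3  # neither storage nor execution fits on a personal computer


def classify_problem(data_fits_locally: bool, runs_locally: bool) -> DatasetSize:
    """Place a problem on the continuum from two yes/no questions."""
    if data_fits_locally and runs_locally:
        return DatasetSize.SIZE_1
    if runs_locally:
        # The solution can be driven from a personal computer,
        # but the data has to be stored somewhere else.
        return DatasetSize.SIZE_2
    # Assumption: anything that cannot be executed locally is treated as size 3.
    return DatasetSize.SIZE_3
```

A problem whose data cannot be stored locally but whose solution can still be run from a laptop would come back from `classify_problem(False, True)` as `DatasetSize.SIZE_2`.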
The observation that the size of data itself poses a problem to many developers led directly to my upcoming book: Mastering Large Datasets. The book is currently in press with Manning Publications, and it teaches the skills developers need to progress along the continuum---starting with size 1 dataset problems and moving all the way up to size 3. The idea is simple: the size of data doesn't have to stand in the way of implementing a solution if you have the right toolkit.
The book is also heavily influenced by one of my earliest blog posts on map and reduce in data science programming. Go check that out if you've never read it.