The data science and data engineering worlds are fast moving. Developers release new technologies at a rapid cadence and each new tool (or algorithm, or framework) comes with a new promise: with the new tool things will be easier, or faster, or just better. But in reality, there is no magic pill.
New tools are designed with the solution to specific problems in mind. The best of them succeed in their narrow endeavor. A small fraction of those will later become foundational tools. In this post we're going to take a look at some magic pill software, the supposedly outdated software that it was designed to replace, and circumstances where that outdated software still bests its replacement.
The first tools we'll tackle are Apache Hadoop and its supposed successor: Apache Spark. Spark and Hadoop are two big data, distributed computing tools that are both widely used. Spark offers substantial performance (speed) improvements over Hadoop, and has libraries for machine learning. However, this hardly means that Hadoop is an out of date tool.
The benefit of Spark is that it processes data in memory as opposed to writing intermediate results to disk between steps (as Hadoop's MapReduce does). This speeds up the compute process at the cost of a higher memory overhead. The result is that if you have a cluster where the computers have a reasonable amount of memory, you can achieve much higher performance (faster distributed compute) with Spark than with Hadoop.
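To make the trade-off concrete, here is a toy, single-machine sketch in plain Python. It does not use the Spark or Hadoop APIs; the function names and the word-count task are assumptions chosen purely for illustration. One version writes its intermediate (word, 1) pairs to a temporary file before reducing them, the other hands them directly to the next stage in memory.

```python
import json
import os
import tempfile
from collections import Counter


def map_words(lines):
    """Map stage: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_counts(pairs):
    """Reduce stage: sum the counts for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def count_via_disk(lines):
    """Disk-backed style: write intermediate pairs to a file, then read them back."""
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    try:
        with os.fdopen(fd, "w") as handle:
            for pair in map_words(lines):
                handle.write(json.dumps(pair) + "\n")
        with open(path) as handle:
            pairs = [tuple(json.loads(row)) for row in handle]
        return reduce_counts(pairs)
    finally:
        os.remove(path)


def count_in_memory(lines):
    """In-memory style: hand intermediate pairs straight to the next stage."""
    return reduce_counts(map_words(lines))


lines = ["spark keeps data in memory", "hadoop writes data to disk"]
# Both styles produce the same answer; they differ in where the
# intermediate data lives between the two stages.
assert count_via_disk(lines) == count_in_memory(lines)
```

Both functions return the same counts. The in-memory version skips the file round trip, which is where the speedup comes from, but it has to hold every intermediate pair in RAM at once.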
All that said, Hadoop, the technology Spark supposedly replaces, is still widely used - especially for truly big data use cases. The largest Spark job I've heard of is under 100TB. There are countless examples of Hadoop at that scale, including many that are petabyte-scale (1000s of terabytes).
That Hadoop is used at the largest of scales also means that Hadoop is trusted by the largest of players in the technology space (because only the largest players have the largest scales). Facebook, Yahoo, and Microsoft are all believed to have dozens of petabyte-scale Hadoop clusters. Indeed, Microsoft is one of the largest contributors to some of the latest versions of the Hadoop software.
Why do these companies use Hadoop? It's a trusted, tried-and-true method for big data jobs. Hadoop is robust: it's had its tires kicked, and its flaws have been exposed and improved upon over the past decade.
Where Spark offers speed and is tailored towards modern medium-data uses, Hadoop offers reliability and is tailored for massive-data uses.
The next two technologies we'll look at are in the database space. Let's compare the new kid on the block, document databases, with the old standby: relational databases.
Document databases are great for a lot of reasons. As a developer, you don't have to anticipate what your schema will be: most of these databases are schema-less. Additionally, lots of document databases offer horizontal scaling baked in. That is: you can add more machines as your data gets bigger. With relational databases, one typically needs to move to a bigger machine instead. Together, these advantages have made document databases the favored child of the modern web.
All that being said, there are still a lot of reasons to love relational databases.
First, relational databases can be accessed using SQL - which almost everyone knows. From a talent and hiring standpoint, that's a huge advantage.
Second, relational databases are easier to normalize. If we have rules and structure we want to impose upon the data, and this structure is important, we may be better off with a relational database.
Third, joins. If we want to do lots of joins, well, most NoSQL (document database) query languages don't support joins. SQL does. And join operations are usually reasonably well optimized in SQL.
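As a small illustration of the second and third points, here is a sketch using Python's built-in sqlite3 module. The authors and posts tables and their contents are hypothetical, invented for this example. The schema enforces a rule (every post must belong to a real author), and a single join query combines the two tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys off by default
conn.execute(
    "CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE posts ("
    " id INTEGER PRIMARY KEY,"
    " author_id INTEGER NOT NULL REFERENCES authors(id),"
    " title TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO authors (id, name) VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)
conn.executemany(
    "INSERT INTO posts (author_id, title) VALUES (?, ?)",
    [(1, "On Engines"), (1, "Notes"), (2, "Compilers")],
)

# The schema rejects a post that points at an author who does not exist.
try:
    conn.execute(
        "INSERT INTO posts (author_id, title) VALUES (?, ?)", (99, "Orphan")
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# A join combines the two tables in a single query.
rows = conn.execute(
    "SELECT a.name, COUNT(p.id) FROM authors AS a"
    " JOIN posts AS p ON p.author_id = a.id"
    " GROUP BY a.name ORDER BY a.name"
).fetchall()
print(orphan_rejected, rows)
```

The database itself refuses the orphaned post, so the rule does not have to be re-implemented in every application that writes to it.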
Last, let's take a look at two families of machine learning algorithms: regression and deep learning.
Deep learning is hot right now. Developers are applying convolutional neural networks to unprecedented success in image classification, including detecting faces, people, and objects in live video. Similarly, many of the advances in machine translation and speech recognition are due to recurrent neural networks. And because these networks can be reduced down to lots and lots and lots of arithmetic, they can be computed rapidly on graphics processing units.
At the same time, in many domains, regression performs just as well (sometimes better) and regression has the added benefit of being interpretable. That means we can take our regression model, which for most applications will be within a few percentage points of our deep learning model's accuracy, and actually understand how it's working.
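To show what interpretable means in practice, here is a minimal sketch of one-feature ordinary least squares in plain Python. The fit_line helper and the experience-versus-salary numbers are made up for illustration; a real project would more likely reach for a statistics library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: years of experience vs. salary in thousands.
years = [1, 2, 3, 4, 5]
salary = [45, 50, 55, 60, 65]
slope, intercept = fit_line(years, salary)
print(f"each extra year adds about {slope:.1f}k; baseline is {intercept:.1f}k")
```

The entire model is two numbers, and each one has a plain-language reading that a subject matter expert can sanity-check. A deep network's weights offer no comparable summary.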
We can run these models by our subject matter experts to see if they make sense. Are our models picking up things that plausibly matter?
We can also easily prune these models to avoid overfitting. Machine learning algorithms are all prone to learning too much about the data we provide. We can combat this by chopping away at features that provide little or no value to our models.
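One simple way to chop away at features is a correlation filter: drop any feature that barely moves with the target. The sketch below is only one of several possible pruning approaches, and the prune_features helper, the 0.2 threshold, and the toy data are all assumptions made for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def prune_features(features, target, threshold=0.2):
    """Keep only features whose absolute correlation with the target meets the threshold."""
    return {
        name: values
        for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    }


# Hypothetical toy data: "signal" tracks the target, "noise" does not.
target = [1, 2, 3, 4, 5, 6]
features = {
    "signal": [2, 4, 6, 8, 10, 12],
    "noise": [3, 1, 2, 2, 1, 3],
}
kept = prune_features(features, target)
print(list(kept))
```

Here the noise feature is dropped before the model ever sees it, leaving fewer coefficients to fit and fewer chances to memorize quirks of the training data.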
New technologies are great, but they don't always make the old technologies obsolete. There is still a place in the world for supposedly superseded technologies, like Hadoop, relational databases, and regression models.
In the programming language world, this is taken for granted. C has some issues and we don't use it for everything, but there is a time and a place where it truly shines. Java has drawbacks, but anyone who dismisses Java out of hand has to reconcile that with Java having perhaps the largest, most influential open source software community.
December 01, 2018