In the simplest type of big data system, all we want to do is process a large dataset. This dataset can be static, or it could be updated by another system -- but for our purposes, that's someone else's problem. We can focus on the processing and analytics.
For this system we only need two pieces: a storage layer for the data and a processing layer for the compute.
For the storage layer, we'll typically want to use an object store like S3 or Azure Blob Storage. These services let us store an effectively unlimited amount of data, organized however we'd like using prefixes and metadata. Once the analysis is done, we can even use the same service to serve static websites containing our output reports.
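As a minimal sketch of that storage layer, here's what getting raw data into S3 might look like with boto3. The bucket name, object key, and metadata values are hypothetical, and it assumes AWS credentials are already configured in the environment.

```python
import boto3

s3 = boto3.client("s3")

# Upload a raw dataset, tagging it with metadata we can filter on later.
with open("events.csv", "rb") as f:
    s3.put_object(
        Bucket="example-analytics-data",   # hypothetical bucket
        Key="raw/events/2019-07-13.csv",
        Body=f,
        Metadata={"source": "web-logs", "stage": "raw"},
    )

# List everything under the "raw/" prefix to confirm what's stored.
response = s3.list_objects_v2(Bucket="example-analytics-data", Prefix="raw/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```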
For the processing layer, we'll want either a managed cluster service--such as AWS EMR, Google Cloud Dataproc, or Azure HDInsight--or plain cloud compute resources. I prefer the managed clusters, because they take care of installing the distributed frameworks we'll want.
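To make the managed-cluster option concrete, here's a rough sketch of spinning up an EMR cluster with Spark pre-installed, using boto3's EMR client. The cluster name, instance types, log location, and IAM role names are assumptions (the roles shown are the EMR defaults, which your account may or may not have provisioned).

```python
import boto3

emr = boto3.client("emr")

# Provision a small, transient cluster with Spark installed.
cluster = emr.run_job_flow(
    Name="simple-analytics-cluster",              # hypothetical name
    ReleaseLabel="emr-5.25.0",                     # pick a current release
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,      # shut down when idle
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://example-analytics-data/logs/",    # hypothetical bucket
)
print("Started cluster:", cluster["JobFlowId"])
```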
From here, we use the compute resources to process the data in object storage--such as S3--and write the results back as a static object. Others in our organization, or even the public, can access that object to see the results of our analysis. We can do this with any type of data processing: ETL, analytics, even machine learning.
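Here's a sketch of that end-to-end flow as a PySpark job, the kind of distributed framework a managed cluster pre-installs. The bucket, paths, and the "event_date" column are hypothetical; the job reads raw data from object storage, runs a trivial aggregation, and writes the report back to another prefix so others can read it. (On EMR the "s3://" scheme works directly; outside EMR you'd typically use "s3a://".)

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("simple-analytics").getOrCreate()

# Read the raw dataset straight out of object storage.
events = spark.read.csv(
    "s3://example-analytics-data/raw/events/", header=True, inferSchema=True
)

# A trivial analysis: count events per day.
report = events.groupBy("event_date").agg(F.count("*").alias("event_count"))

# Write the report back as a static object that others can access.
report.coalesce(1).write.mode("overwrite").csv(
    "s3://example-analytics-data/reports/daily-counts/", header=True
)

spark.stop()
```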