I'm trying to get to grips with Big Data, and mainly with how Big Data is managed.
I'm familiar with the traditional form of data management and data life cycle; e.g.:
However, in the case of Big Data, I'm confused about the equivalent of points 2 and 3. Mainly, I'm unsure whether every Big Data "solution" always involves a NoSQL database to handle and store unstructured data, and I don't know what the Big Data equivalent of a Data Warehouse is.
From what I've seen, in some cases NoSQL isn't used at all and can be omitted entirely - is this true?
To me, the Big Data life cycle goes something along these lines:
But I have a feeling that this isn't always the case, and point 3 may be wrong altogether. Can anyone shed some light on this?
When we talk about Big Data, we usually talk about huge amounts of data that are written continuously. The data can also be highly varied. Think of a typical Big Data source as a machine on a production line that constantly produces sensor readings for temperature, humidity, etc. - not the typical kind of data you would find in your DWH.
What would happen if you transformed all this data to fit into a relational database? If you have worked with ETL a lot, you know that extracting from the source, transforming the data to fit a schema, and then storing it takes time and is a bottleneck. Enforcing a schema on write is simply too slow for this kind of data. In most cases this solution is also too costly, because you need expensive appliances to run your DWH - you would not want to fill them with sensor data.
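To make the contrast concrete, here is a toy sketch in Python; the table layout and the sensor fields are made up for illustration. Schema-on-write (the DWH style) forces every record through a fixed schema at load time, while schema-on-read (the Big Data style) just appends the raw records and interprets them later:

```python
import json
import sqlite3

# Hypothetical sensor readings; the field names are invented for this example.
readings = [
    {"machine": "press-1", "temp": 71.3, "humidity": 0.42},
    {"machine": "press-2", "temp": 69.8},  # fields can be missing
]

# Schema-on-write (DWH style): define the schema up front, transform on load.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sensor (machine TEXT, temp REAL, humidity REAL)")
for r in readings:
    db.execute("INSERT INTO sensor VALUES (?, ?, ?)",
               (r["machine"], r.get("temp"), r.get("humidity")))

# Schema-on-read (Big Data style): append the raw records as-is,
# and only impose structure when you read them back for analysis.
raw_log = "\n".join(json.dumps(r) for r in readings)
parsed = [json.loads(line) for line in raw_log.splitlines()]
```

The append-only path does no transformation at write time, which is what makes the writes cheap; the price is that every reader has to deal with irregular records.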
You need fast writes on cheap hardware. With Big Data you first store the data schemaless (often referred to as unstructured data) on a distributed file system. The file system splits large files into blocks (typically around 128 MB) and distributes them across the cluster nodes. Because the blocks are replicated, individual nodes can go down without losing data.
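A minimal sketch of the splitting and placement idea, with sizes scaled down (HDFS uses ~128 MB blocks and a default replication factor of 3; the node names here are made up):

```python
# Toy sketch of how an HDFS-like file system splits a file into fixed-size
# blocks and replicates each block across nodes. Sizes are scaled down.
BLOCK_SIZE = 8       # HDFS default is ~128 MB; 8 bytes keeps the demo small
REPLICATION = 3

def split_into_blocks(data, block_size=BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"temperature and humidity sensor data stream")
placement = place_blocks(blocks, nodes=["node-a", "node-b", "node-c", "node-d"])
```

Because every block lives on three of the four nodes, any single node can fail and each block is still readable from its replicas - that is what lets the cluster run on cheap, unreliable hardware.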
If you are coming from the traditional DWH world, you are used to technologies that work well with data that is well prepared and structured. Hadoop and co. are good at digging for insights - the proverbial needle in the haystack. You gain the power to generate insights by parallelising data processing, which lets you work through huge amounts of data.
Imagine you have collected terabytes of data and you want to run some analysis on it (e.g. a clustering). If you had to run it on a single machine, it would take hours. The key idea of Big Data systems is to parallelise execution in a shared-nothing architecture. If you want to increase performance, you can add hardware to scale out horizontally. That speeds up processing even with huge amounts of data.
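The shared-nothing pattern can be sketched in a few lines of Python. Say we want the mean of some sensor readings: each partition computes a partial result independently, with no shared state, and a cheap merge step combines the partials. In a real cluster the per-partition step runs in parallel on separate nodes; here a plain loop stands in for that, and the readings are invented:

```python
# Toy sketch of shared-nothing aggregation: partition the data, compute a
# partial result per partition, then merge. In a real cluster each partition
# lives on a different node and the partials are computed in parallel.

def partition(data, n_parts):
    return [data[i::n_parts] for i in range(n_parts)]

def partial_stats(part):
    # Runs independently per partition: no shared state with other partitions.
    return (sum(part), len(part))

def merge(partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

readings = [20.1, 20.4, 19.8, 21.0, 20.7, 20.2]   # e.g. temperature samples
partials = [partial_stats(p) for p in partition(readings, n_parts=3)]
mean = merge(partials)
```

Because the partial results are tiny compared to the raw partitions, only a few numbers ever cross the network; adding nodes means adding partitions, which is exactly what "scale out horizontally" buys you.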
Looking at a modern Big Data stack, at the bottom you have data storage - for example Hadoop with a distributed file system such as HDFS. On top of that sits a resource manager (such as YARN) that manages access to the cluster. On top of that again, you have a data processing engine such as Apache Spark that orchestrates execution on the storage layer.
On top of the core data processing engine, you have applications and frameworks, such as machine learning APIs, that allow you to find patterns in your data. You can run unsupervised learning algorithms to detect structure (e.g. a clustering algorithm), or supervised machine learning algorithms to give meaning to patterns in the data and to predict outcomes (e.g. linear regression or random forests).
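As a toy illustration of unsupervised structure detection, here is a minimal 1-D k-means sketch in plain Python. Real workloads would use a distributed library (e.g. Spark's MLlib), and the two temperature "regimes" in the data below are invented for the example:

```python
import random

# Minimal 1-D k-means: alternately assign points to the nearest center
# and move each center to the mean of its assigned points.
def kmeans_1d(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Sensor readings drawn from two regimes: ~20 (normal) and ~80 (overheating).
data = [19.8, 20.1, 20.5, 21.0, 79.5, 80.2, 80.9, 81.3]
centers = kmeans_1d(data, k=2)
```

The algorithm is never told that two regimes exist; it discovers the structure from the data alone, which is what "detecting structure" means in the unsupervised case.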
That's my Big Data in a nutshell for people who are experienced with traditional database systems.