bigdata data-lake

Are Data Lake and Big Data the same?


I am trying to understand whether there is a real difference between a data lake and big data. Looking at the concepts, both seem to be a big repository that stores information until it is needed. So when can we say that we are using big data, and when a data lake?


Solution

  • I can't say I've come across the term 'big repository' before, but to answer the original question: no, data lake and big data are not the same. In fairness, both terms are thrown around a lot and the definitions vary depending on who you ask, but I'll give it a shot:


    Big Data

    The term is used to describe both the technology ecosystem around, and to some extent the industry that deals with, data that is in some way too big or too complex to be conveniently stored and/or processed by traditional means.

    Sometimes this is a matter of sheer data volume: once you get into the hundreds of terabytes or petabytes, a good old-fashioned RDBMS tends to throw in the towel, and we are forced to spread our data across many disks rather than one large one. At those volumes we'll also want to parallelize our workloads, leading to things like MPP databases, the Hadoop ecosystem, and DAG-based processing.
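    To make the idea of spreading data across many disks and parallelizing the work concrete, here is a minimal Python sketch. It is only an illustration: the function names (partition, count_words, parallel_word_count) and the sample records are made up for this example, and threads on one machine stand in for the separate nodes that a real MPP database or Hadoop cluster would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_parts):
    """Split records into n_parts chunks (hypothetical stand-ins for disks/nodes)."""
    return [records[i::n_parts] for i in range(n_parts)]


def count_words(chunk):
    """'Map' step: each worker counts words in its own partition only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def parallel_word_count(records, n_parts=4):
    """Process every partition concurrently, then merge the partial results."""
    chunks = partition(records, n_parts)
    # Assumption: threads stand in for cluster nodes; a real system would
    # run the map step on separate machines next to the data.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_words, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # 'reduce' step: add the partial counts together
    return total


records = ["big data is big", "a data lake stores data", "big lake"]
result = parallel_word_count(records, n_parts=2)
```

    The point of the design is that no single worker ever needs to see the whole dataset: each one handles its own partition, and only the small partial results are combined at the end.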

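    However, volume alone does not tell the whole story. A popular definition of Big Data is described by the so-called '4 Vs': Volume, Variety, Velocity, and Veracity. In a nutshell:

      • Volume: the sheer amount of data, as mentioned above
      • Variety: the range of data types involved, from structured tables to semi-structured and unstructured data such as text, audio, and video
      • Velocity: the speed at which data arrives and needs to be processed
      • Veracity: the varying quality and trustworthiness of the data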

    In this definition, 'big data' is data which, due to the particular challenges associated with the 4 Vs, is unfit for processing with traditional database technologies, while 'big data tools' are tools specifically designed to deal with those challenges.


    Data Lake

    In contrast, 'data lake' is generally used as a term for a type of file or blob storage layer that can hold practically unlimited amounts of structured and unstructured data, as needed in a big data architecture.

    Some companies use the term 'data lake' to mean not just the storage layer but also all the associated tools: ingestion, ETL, wrangling, machine learning, and analytics, all the way to data warehouse stacks and possibly even BI and visualization tools. As a big data architect, however, I find that use of the term confusing and prefer to talk about the data lake and the tooling around it as separate components with separate capabilities and responsibilities. As such, the responsibility of the data lake is to be the central, high-durability store for any type of data that you might want to store at rest.

    By most accounts, the term 'data lake' was coined by James Dixon, Founder and CTO of Pentaho, who describes it thus:

    “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”

    Amazon Web Services defines it on their page 'What Is A Data Lake':

    A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

    From Wikipedia:

    A data lake is a system or repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning.

    And finally Gartner:

    A data lake is a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format. The purpose of a data lake is to present an unrefined view of data to only the most highly skilled analysts, to help them explore their data refinement and analysis techniques independent of any of the system-of-record compromises that may exist in a traditional analytic data store (such as a data mart or data warehouse).

    On on-premises clusters, the data lake usually refers to the cluster's main storage in its distributed file system. This is usually HDFS, though other file systems exist, such as GFS at Google or the MapR File System on MapR clusters.

    In the cloud, data lakes are generally not stored on clusters, since it is not cost-effective to keep a cluster running at all times. Instead they live on durable cloud storage such as Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage. Compute clusters can then be launched on demand and connect to the cloud storage to run transformations, machine learning, analytical jobs, etc.
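    As a rough illustration of that split between storage and compute, the Python sketch below uses a temporary local directory as a stand-in for object storage such as S3. Everything in it is an assumption made for the example: the raw/<source>/<file> layout, the function names ingest_raw and read_events, and the sample orders. Data is written as-is, and structure is only applied later by a short-lived job that reads it.

```python
import csv
import json
import tempfile
from pathlib import Path


def ingest_raw(lake_root, source, filename, payload):
    """Store bytes exactly as received; no schema is enforced at write time."""
    target = Path(lake_root) / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def read_events(lake_root):
    """On-demand job: apply structure only when reading (schema-on-read)."""
    rows = []
    for path in sorted((Path(lake_root) / "raw").rglob("*")):
        if path.suffix == ".json":  # newline-delimited JSON
            for line in path.read_text().splitlines():
                if line.strip():
                    rec = json.loads(line)
                    rows.append({"user": rec["user"], "amount": float(rec["amount"])})
        elif path.suffix == ".csv":
            with path.open(newline="") as fh:
                for rec in csv.DictReader(fh):
                    rows.append({"user": rec["user"], "amount": float(rec["amount"])})
        # Anything else (images, logs, ...) stays in the lake untouched.
    return rows


# The temporary directory is a hypothetical stand-in for durable cloud storage.
with tempfile.TemporaryDirectory() as lake:
    ingest_raw(lake, "web", "orders.json",
               b'{"user": "ann", "amount": 10}\n{"user": "bob", "amount": 2.5}\n')
    ingest_raw(lake, "pos", "orders.csv", b"user,amount\nann,4\n")
    ingest_raw(lake, "cam", "frame.jpg", b"\xff\xd8\xff")  # unstructured blob

    # The "compute" part runs only now, long after the data was stored.
    events = read_events(lake)
    total_by_user = {}
    for event in events:
        total_by_user[event["user"]] = total_by_user.get(event["user"], 0.0) + event["amount"]
```

    Note that the storage side knows nothing about orders or users. That knowledge lives entirely in the job that reads the files, which is why the lake and the tooling around it can be treated as separate components.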