The term Data Lake is defined flexibly. At its core, it is a data storage and processing repository into which all of an organization’s data can be placed, so that data from every internal and external system, partner, and collaborator flows into it and insights spring out.
A Data Lake capability called Data as a Service (DaaS) could be a solution.
The following list summarizes what a Data Lake is:
- A Data Lake is a huge repository that holds every kind of data in its raw format until someone in the organization needs it for analysis.
- A Data Lake is not Hadoop. It uses a range of tools, and Hadoop implements only a subset of its functionality.
- A Data Lake is not a database in the traditional sense of the word. A typical implementation uses various NoSQL and in-memory databases that can coexist with their relational counterparts.
- A Data Lake cannot be implemented in isolation. It has to be implemented alongside a data warehouse (DW), as it complements various functions of the DW.
- It stores large volumes of both unstructured and structured data. It also stores fast-moving streamed data from machine sensors and logs.
- It advocates a Store-All approach to huge volumes of data.
- It is optimized for data crunching in a high-latency batch mode and is not geared toward transaction processing.
- It helps in creating data models that are flexible and can be revised without database redesign.
- It can quickly perform data enrichment, which supports enhancement, augmentation, classification, and standardization of the data.
- All of the data stored in the Data Lake can be utilized to get an all-inclusive view. This enables near-real-time, more precise predictive models that go beyond sampling and aid in generating multi-dimensional models too.
- It is a data scientist’s favorite hunting ground. Data scientists can access the data in its raw form at its most granular level, run any ad-hoc query, and build advanced models iteratively at any time. The classic data warehouse approach does not offer this ability to shorten the time between data intake and insight generation.
- It allows the data to be modeled in more than the traditional relational way. The real value of the data can come from modeling it in the following ways:
  - As a graph to find the interactions between elements; for example, Neo4j
  - As a document store to cluster similar text; for example, MongoDB
  - As a columnar store for fast updates and search; for example, HBase
  - As a key-value store for lightning-fast search; for example, Riak
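The four modeling styles listed above can be illustrated with a minimal sketch that keeps one set of raw records and derives each view from it. This uses plain Python structures only, not Neo4j, MongoDB, HBase, or Riak; the sample events, field names, and helper functions are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict

# Hypothetical raw events, kept exactly as they arrived in the lake.
raw_events = [
    {"id": "e1", "user": "alice", "item": "laptop", "note": "fast delivery"},
    {"id": "e2", "user": "bob", "item": "laptop", "note": "screen scratched"},
    {"id": "e3", "user": "alice", "item": "mouse", "note": "fast delivery again"},
]

# Key-value view: a single lookup by primary key.
key_value = {event["id"]: event for event in raw_events}

# Document view: whole records, searched by text inside a field.
def find_documents(term):
    return [event["id"] for event in raw_events if term in event["note"]]

# Columnar view: one list of values per field, handy for scanning a column.
columns = defaultdict(list)
for event in raw_events:
    for field_name, value in event.items():
        columns[field_name].append(value)

# Graph view: undirected user <-> item edges stored as adjacency sets.
graph = defaultdict(set)
for event in raw_events:
    graph[event["user"]].add(event["item"])
    graph[event["item"]].add(event["user"])

def users_sharing_item(user):
    """Return the other users connected to `user` through a common item."""
    neighbours = {other for item in graph[user] for other in graph[item]}
    return sorted(neighbours - {user})
```

Because the raw records are never altered, any of these views can be rebuilt or replaced later without touching the stored data, which is the flexibility the list describes.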
A well-built metadata layer will allow organizations to harness the potential of the Data Lake and deliver the following mechanisms to the end users to access data and perform analytics:
- Self-Service BI (SSBI)
- Data as a Service (DaaS)
- Machine Learning as a Service (MLaaS)
- Data Provisioning (DP)
- Analytics Sandbox Provisioning (ASP)
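As a rough illustration of how a metadata layer could back mechanisms such as DaaS or Data Provisioning, the sketch below models a tiny in-memory catalog. The class names, fields, paths, and tags are all assumptions made for this example; a real metadata layer would carry far more (lineage, ownership, access policies) and is not described in this form by the text.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetEntry:
    """Hypothetical metadata record for one dataset in the lake."""
    name: str
    location: str
    fmt: str
    tags: set = field(default_factory=set)

class MetadataCatalog:
    """Hypothetical in-memory catalog; not a real product API."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Later registrations under the same name overwrite earlier ones.
        self._entries[entry.name] = entry

    def search(self, tag):
        # Self-service discovery: find dataset names carrying a tag.
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def provision(self, name):
        # DaaS-style lookup: return what a consumer needs to read the data.
        entry = self._entries[name]
        return {"location": entry.location, "format": entry.fmt}

# Example usage with made-up datasets and paths.
catalog = MetadataCatalog()
catalog.register(DatasetEntry("sensor_logs", "/lake/raw/sensor_logs", "json", {"iot", "raw"}))
catalog.register(DatasetEntry("orders", "/lake/raw/orders", "csv", {"sales", "raw"}))
```

Here `catalog.search("raw")` lets an end user discover datasets without knowing where they live, and `catalog.provision("orders")` hands back only the details needed to read one.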