Data Lake

From the book:

Data Lake Development with Big Data

A Data Lake has flexible definitions. At its core, it is a data storage and processing repository into which all of an organization's data can be placed, so that data from every internal and external system, partner, and collaborator flows in and insights flow out.

A Data Lake capability called Data as a Service (DaaS) could be a solution for delivering this data to its consumers.

The following list summarizes, in a nutshell, what a Data Lake is:

      • A Data Lake is a huge repository that holds every kind of data in its raw format until someone in the organization needs it for analysis.
      • A Data Lake is not Hadoop. Hadoop implements only a subset of its functionality; a Data Lake typically combines many different tools.
      • A Data Lake is not a database in the traditional sense of the word. A typical implementation uses various NoSQL and in-memory databases that can co-exist with their relational counterparts.
      • A Data Lake cannot be implemented in isolation. It has to be implemented alongside a data warehouse, whose functionality it complements.
      • It stores large volumes of both unstructured and structured data. It also stores fast-moving streamed data from machine sensors and logs.
      • It advocates a Store-All approach to huge volumes of data.
      • It is optimized for high-latency, batch-mode data crunching; it is not geared for transaction processing.
      • It helps in creating data models that are flexible and could be revised without database redesign.
      • It can quickly perform data enrichment that helps in achieving data enhancement, augmentation, classification, and standardization of the data.
      • All of the data stored in the Data Lake can be utilized to get an all-inclusive view. This enables near-real-time, more precise predictive models that go beyond sampling and aid in generating multi-dimensional models too.
      • It is a data scientist’s favorite hunting ground. They get access to the data in all its raw glory at its most granular level, so they can run any ad-hoc query and build advanced models iteratively at any time. The classic data warehouse approach does not support this ability to condense the time between data intake and insight generation.
      • It enables modeling the data not only in the traditional relational way; the real value can emanate from modeling it in the following ways (see the sketch after this list):
        • As a graph, to find the interactions between elements; for example, Neo4j
        • As a document store, to cluster similar text; for example, MongoDB
        • As a columnar store, for fast updates and search; for example, HBase
        • As a key-value store, for lightning-fast search; for example, Riak
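
To make the four modeling styles concrete, here is a minimal, purely illustrative Python sketch that represents the same customer record in each shape. The structures are plain-Python stand-ins for what Neo4j, MongoDB, HBase, and Riak would hold; none of the names come from the book or those products' client APIs.

```python
# One customer record, modeled four ways (plain Python stand-ins for
# Neo4j, MongoDB, HBase, and Riak; all names are illustrative only).

# Graph: nodes plus explicit relationships, good for interaction queries.
graph = {
    "nodes": [{"id": "cust:42", "label": "Customer", "name": "Ada"},
              {"id": "prod:7", "label": "Product", "name": "Sensor kit"}],
    "edges": [{"from": "cust:42", "to": "prod:7", "type": "PURCHASED"}],
}

# Document: the whole record nested in one document, good for clustering text.
document = {
    "_id": "cust:42",
    "name": "Ada",
    "purchases": [{"product": "Sensor kit", "qty": 2}],
}

# Columnar: values grouped by column family, good for fast scans and updates.
columnar = {
    ("cust:42", "profile:name"): "Ada",
    ("cust:42", "orders:prod:7"): "2",
}

# Key-value: one opaque value per key, good for lightning-fast lookups.
key_value = {"cust:42": '{"name": "Ada", "purchases": ["prod:7"]}'}

print(graph["edges"][0]["type"], document["name"], key_value["cust:42"])
```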
[Figure: Data Lake end-state architecture. An Intake Tier (Source System Zone, Transient Zone, Raw Zone) feeds a Management Tier (Integration Zone, Enrichment Zone, Data Hub Zone) that serves a Consumption Tier (Data Discovery, Data Provisioning, internal and external access); the Information Lifecycle Management, Metadata, and Security and Governance layers cut across all tiers.]

[Figure: The Data Governance and Security layer. Monitor and audit: detect cyber attacks and prevent fraud, monitor privileged users, detailed audit reports, a centralized audit repository across databases and Hadoop, built-in compliance workflow, SIEM integration. Protect and enforce: prevent cyber attacks, block unauthorized access, enforce change control, quarantine suspicious users, mask sensitive data, control firewall/IDS. Assess and harden: static and behavioral vulnerability assessment, monitor entitlements and credentials, alert on configuration changes. Find and classify: discover data sources, classify sensitive data, automate security policies and compliance reports.]

The Metadata Layer

      • A well-built metadata layer allows organizations to harness the potential of the Data Lake and deliver the following mechanisms for end users to access data and perform analytics (a minimal catalog sketch follows this list):
        • Self-Service BI (SSBI)
        • Data as a Service (DaaS)
        • Machine Learning as a Service (MLaaS)
        • Data Provisioning (DP)
        • Analytics Sandbox Provisioning (ASP)
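
As a rough illustration of how metadata underpins these services, this sketch shows a hypothetical catalog entry that a self-service BI or DaaS layer might search against; every field name here is invented.

```python
from dataclasses import dataclass, field

# Hypothetical catalog entry; the fields are illustrative, not from the book.
@dataclass
class CatalogEntry:
    dataset: str
    zone: str                        # e.g. "raw", "data-hub"
    owner: str
    tags: list[str] = field(default_factory=list)
    schema: dict[str, str] = field(default_factory=dict)

catalog = [
    CatalogEntry("clickstream_2016", "raw", "web-team", ["logs", "pii-free"]),
    CatalogEntry("customer_360", "data-hub", "crm-team", ["pii", "curated"],
                 {"customer_id": "string", "lifetime_value": "double"}),
]

# A minimal "self-service" lookup: find curated datasets by tag.
def find(tag: str) -> list[CatalogEntry]:
    return [e for e in catalog if tag in e.tags]

print([e.dataset for e in find("curated")])
```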

The Intake tier

The zones in the Intake tier are as follows:

  • Source System Zone
  • Transient landing zone
  • Raw Zone

[Figure: The timeliness of data. Real-time streams, micro-batches, and batches all flow into the Intake Tier.]

The Source System Zone

The processing services that are needed to connect to external systems are encapsulated in the Source System Zone. This zone primarily deals with connectivity and acquiring data from the external source systems.
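
A minimal sketch of such a connector, assuming a simple pull-based acquisition over HTTP; the class, URL, and directory layout are illustrative only.

```python
import urllib.request
from pathlib import Path

# Hypothetical pull-based connector; names and paths are invented.
class SourceSystemConnector:
    def __init__(self, name: str, url: str):
        self.name = name
        self.url = url

    def acquire(self) -> bytes:
        # Connectivity concerns (auth, retries, throttling) belong here.
        with urllib.request.urlopen(self.url, timeout=10) as resp:
            return resp.read()

    def land(self, payload: bytes) -> Path:
        # Hand the acquired data to the secured Transient Zone.
        target = Path("transient") / self.name
        target.mkdir(parents=True, exist_ok=True)
        out = target / "batch.json"
        out.write_bytes(payload)
        return out

conn = SourceSystemConnector("crm", "https://example.com/api/customers")
print(conn.land(b'{"customer_id": 42}'))  # acquire() skipped: placeholder URL
```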

The Transient Zone

A Transient landing zone is a predefined, secured intermediate location where the data from various source systems is stored before moving it into the Raw Zone.
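
One common reason for this intermediate hop is verification before promotion. Below is a minimal sketch assuming each delivery arrives with a known checksum; the layout and function names are invented.

```python
import hashlib
from pathlib import Path

def verify_and_promote(transient_file: Path, expected_md5: str, raw_dir: Path) -> bool:
    """Promote a landed file to the Raw Zone only if its checksum matches."""
    actual = hashlib.md5(transient_file.read_bytes()).hexdigest()
    if actual != expected_md5:
        return False  # leave it quarantined in the Transient Zone
    raw_dir.mkdir(parents=True, exist_ok=True)
    transient_file.rename(raw_dir / transient_file.name)
    return True

# Usage with a toy file:
tmp = Path("transient"); tmp.mkdir(exist_ok=True)
f = tmp / "orders.csv"; f.write_text("id,amount\n1,9.99\n")
ok = verify_and_promote(f, hashlib.md5(f.read_bytes()).hexdigest(), Path("raw/orders"))
print("promoted:", ok)
```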

The Raw Zone

The Raw Zone is the place where data lands from the Transient Zone. It is typically implemented as file-based storage (Hadoop HDFS). It includes a “Raw Data Storage” area to retain source data for active use and archival. This is the zone where we have to consider storage options based on the timeliness of the data.
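
To illustrate the "retain everything, partitioned by source and time" idea, here is a sketch that uses the local filesystem as a stand-in for HDFS; a real lake would write through an HDFS client, and the directory layout shown is an assumption.

```python
from datetime import date
from pathlib import Path

def raw_zone_path(source: str, dataset: str, d: date) -> Path:
    # An HDFS-style layout: raw/<source>/<dataset>/year=/month=/day=
    return (Path("raw") / source / dataset
            / f"year={d.year}" / f"month={d.month:02d}" / f"day={d.day:02d}")

target = raw_zone_path("crm", "customers", date(2016, 3, 14))
target.mkdir(parents=True, exist_ok=True)
(target / "part-0000.json").write_text('{"customer_id": 42}\n')
print(target)
```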

[Figure: Raw Zone capabilities. Slow-moving data lands as batches in raw storage on Hadoop; fast-moving data lands in real-time raw storage in in-memory databases; both feed the Management Tier and Data Provisioning.]

The Data Management tier

The Management tier has three zones: data flows sequentially from the Raw Zone through the Integration Zone and the Enrichment Zone, and once all processing is complete, the final data is stored in a ready-to-use format in the Data Hub, which is a combination of relational and NoSQL databases. The zones in the Management tier are as follows:

  • The Integration Zone
  • The Enrichment Zone
  • The Data Hub Zone

As the data moves into the Management tier, metadata is added and attached to each file. Metadata is a kind of watermark that tracks all changes made to each individual record.
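
A minimal sketch of that watermark idea, appending a lineage entry to each record as it crosses a zone boundary; the `_lineage` field and the zone names are invented for illustration.

```python
from datetime import datetime, timezone

def stamp(record: dict, zone: str, operation: str) -> dict:
    """Append a lineage entry so every change to the record is traceable."""
    entry = {
        "zone": zone,
        "operation": operation,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    record.setdefault("_lineage", []).append(entry)
    return record

rec = {"customer_id": 42, "name": " ada "}
stamp(rec, "raw", "ingested")
stamp(rec, "integration", "standardized")
print(rec["_lineage"])
```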

The Integration Zone

The Integration Zone’s main functionality is to integrate various data sources and apply common transformations that turn the raw data into a standardized, cleansed structure that is optimized for data consumers.
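
As an illustration of the kinds of transformations involved, this sketch validates and standardizes one raw record; the quality rules and field names are invented.

```python
def integrate(raw: dict) -> dict | None:
    """Validate and standardize a raw record; reject it if checks fail."""
    # Quality check: required fields must be present.
    if "customer_id" not in raw or "name" not in raw:
        return None  # would be routed to a rejects area with a metadata log
    return {
        "customer_id": int(raw["customer_id"]),
        "name": str(raw["name"]).strip().title(),  # common cleansing rule
        "country": str(raw.get("country", "unknown")).upper(),
    }

print(integrate({"customer_id": "42", "name": " ada lovelace ", "country": "uk"}))
print(integrate({"name": "no id"}))  # fails the quality check -> None
```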

[Figure: Integration Zone capabilities. Data from the Raw Zone passes through validation, quality checks, and integrity checks, with a metadata log, on its way to the Enrichment Zone and the Data Hub Zone.]

The Enrichment Zone

The Enrichment Zone provides processes for data enhancement, augmentation, classification, and standardization.
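
The following sketch runs one record through all four steps; the classification rule and the reference data are invented for illustration.

```python
REFERENCE_SEGMENTS = {"UK": "EMEA", "US": "AMER"}  # toy reference data

def enrich(record: dict) -> dict:
    out = dict(record)
    # Enhancement: derive a new attribute from existing ones.
    out["name_length"] = len(out["name"])
    # Augmentation: join in external reference data.
    out["region"] = REFERENCE_SEGMENTS.get(out["country"].upper(), "OTHER")
    # Classification: bucket the record by a business rule.
    out["segment"] = "vip" if out.get("lifetime_value", 0) > 1000 else "standard"
    # Standardization: enforce one canonical representation.
    out["country"] = out["country"].upper()
    return out

print(enrich({"name": "Ada Lovelace", "country": "uk", "lifetime_value": 1500}))
```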

[Figure: The Enrichment Zone’s capabilities. Between the Integration Zone and the Data Hub Zone, the Enrichment Zone applies enhancement, augmentation, classification, and standardization to data that originated in the Raw Zone.]

The Data Hub Zone

The Data Hub Zone is the final storage location for cleaned and processed data. After the data is transformed and enriched in the preceding zones, it is finally pushed into the Data Hub for consumption.

The Data Hub is governed by a discovery process that is internally implemented as search, locate, and retrieve functionality through tools such as Elasticsearch or Solr/Lucene. A discovery is made possible by the extensive metadata that has been collected in the previous zones.
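
As one way that discovery could look, here is a sketch that indexes and searches catalog metadata with the official `elasticsearch` Python client, assuming an 8.x client and a cluster on localhost; the index name and document fields are invented.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumes a local cluster

# Index a metadata document describing a Data Hub dataset (fields invented).
es.index(index="datahub-catalog", id="customer_360", document={
    "dataset": "customer_360",
    "zone": "data-hub",
    "tags": ["curated", "pii"],
    "description": "Cleansed, enriched customer master data",
})

# Discovery: locate datasets by free-text search over their metadata.
hits = es.search(index="datahub-catalog",
                 query={"match": {"description": "customer"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["dataset"])
```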

[Figure: Data Hub Zone capabilities. Relational and non-relational data arrive through enrichment and integration into the Data Hub Zone, which serves consumption.]

The Data Consumption tier

The Data Discovery Zone

The Data Discovery Zone is the primary gateway into the Data Lake for external users. The key to implementing a functional Consumption tier is the amount and quality of metadata collected in the preceding zones, and the intelligent way in which this metadata is exposed for search and data retrieval.

The Data Provisioning Zone

Data Provisioning allows data consumers to source and consume the data that is available in the Data Lake. This tier is designed to let you use metadata that specifies the “publications” that need to be created, the “subscription”-specific customization requirements, and the end delivery of the requested data to the “data consumer.”
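
A minimal sketch of that metadata-driven publication/subscription flow; all of the structures and names are invented for illustration.

```python
# Metadata-driven provisioning: a publication describes what can be served,
# a subscription captures one consumer's customization. Names are illustrative.
publication = {
    "name": "customer_360_daily",
    "source": "data-hub/customer_360",
    "formats": ["csv", "parquet"],
    "refresh": "daily",
}

subscription = {
    "consumer": "marketing-app",
    "publication": "customer_360_daily",
    "format": "csv",
    "columns": ["customer_id", "segment", "region"],  # consumer-specific cut
}

def provision(pub: dict, sub: dict) -> str:
    assert sub["publication"] == pub["name"]
    assert sub["format"] in pub["formats"]
    # The delivery target path is a placeholder.
    return f"/deliveries/{sub['consumer']}/{pub['name']}.{sub['format']}"

print(provision(publication, subscription))
```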

[Figure: Consumption Zone capabilities. Under the Security and Governance layer, metadata-driven services expose the Data Hub Zone and the Raw Zone to external users and internal applications through the Consumption Tier.]
