This tutorial explains what a data lake is, covering its need, definition, architecture, and benefits, and the differences between a data lake and a data warehouse:
The term ‘Data Lake’ is used quite often in today’s IT world. Have you ever wondered what it is and where exactly the term comes from?
In the information technology age, where data is growing day and night in numerous forms, the concept of a data lake becomes important and useful.
Let’s explore in detail what a data lake is and what its benefits and uses are.
What You Will Learn:
- What Is A Data Lake And How Does It Work?
- Analogy Of Data Lake
- Data Lake Market – Growth, Trends & Predictions
- Why Is Data Lake Required?
- Difference Between Data Warehouse Vs Data Lake
- Data Lake Architecture
- Key Characteristics Of Data Lake
- Challenges And Risks
- Data Lake Vendors
What Is A Data Lake And How Does It Work?
A data lake is a system or centralized repository of data that lets you store all your structured, semi-structured, unstructured, and binary data in its natural/native/raw format.
Structured data may include tables from RDBMSs; semi-structured data may include CSV files, XML files, logs, JSON, etc.; unstructured data may include PDFs, Word documents, text files, emails, etc.; and binary data may include audio, video, and image files.
It follows a flat architecture for storing data. Generally, data is stored in the form of object blobs or files.
With a data lake, you can store all your enterprise data as it is in a single place, with no need to structure the data first. You can run various types of analytics directly on it, including machine learning, real-time analytics, dashboards, and visualizations, and it also supports on-premises and real-time data movement.
It keeps all the data in its original form and presumes that the analysis will happen later, on demand.
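To make the idea of storing data in its native format concrete, here is a minimal Python sketch. It assumes a local temporary directory standing in for object storage; the `ingest` helper, the key names, and the sample payloads are all hypothetical and only illustrate that nothing is parsed or restructured on the way in.

```python
import json
import tempfile
from pathlib import Path


def ingest(lake_root: Path, key: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte under a flat object key (no transformation)."""
    # Flat namespace: the key becomes a single file name, not nested folders.
    target = lake_root / key.replace("/", "__")
    target.write_bytes(payload)
    return target


# A temporary directory plays the role of the lake's object store (an assumption).
lake = Path(tempfile.mkdtemp(prefix="lake_"))

# Three differently shaped sources land in the same store, unchanged.
ingest(lake, "crm/customers.csv", b"id,name\n1,Asha\n2,Ben\n")
ingest(lake, "web/clicks.json", json.dumps({"user": 1, "page": "/home"}).encode())
ingest(lake, "support/call-001.wav", bytes([0x52, 0x49, 0x46, 0x46]))  # binary stub

stored = sorted(p.name for p in lake.iterdir())
print(stored)
```

The structured, semi-structured, and binary files sit side by side in one flat namespace, and any interpretation of their contents is deferred until someone reads them.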
Analogy Of Data Lake
The term Data Lake was coined by James Dixon, the then CTO at Pentaho. He defines a data mart (a subset of a data warehouse) as similar to a water bottle filled with cleansed, distilled water, packaged and structured for direct and easy use.
A data lake, on the other hand, is analogous to a body of water in its natural form. Data flows from the streams (various business functions/source systems) into the lake. Consumers of the data lake, i.e. its users, have access to the lake in order to analyze, examine, collect samples, and dive in.
Just as the water in a lake serves different needs such as fishing, boating, and providing drinking water, the data lake architecture serves multiple purposes.
A data scientist can use it to explore the data and form hypotheses. It offers data analysts an opportunity to analyze data and discover patterns. It gives business users and stakeholders a way to explore data.
It also offers reporting analysts an opportunity to design reports and present them to the business. In contrast, the data warehouse holds packaged data for well-defined purposes, just like a bottle of Bisleri (packaged drinking water) that can be used only for drinking.
Data Lake Market – Growth, Trends & Predictions
The data lake market is segmented by product (solution or service), deployment (on-premises or cloud), client industry (retail, banking, utilities, insurance, IT, healthcare, telecom, publishing, manufacturing), and geographic region.
As per the report published by Mordor Intelligence, below is the market snapshot for the data lake:
#1) Market Summary
The data lake market was valued at USD 3.74 billion in 2019 and is expected to reach USD 17.60 billion by 2025, at a CAGR (Compound Annual Growth Rate) of 29.9% over the forecast period 2020–2025.
Data lakes are increasingly becoming an economical alternative to data warehouses for many organizations. Unlike a data lake, a data warehouse requires additional processing of data before it enters the warehouse.
Managing a data lake costs less than managing a data warehouse because a lot of processing and space is required to create the database for a warehouse.
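The growth rate quoted in the market summary can be sanity-checked with the standard CAGR formula, (end / start)^(1 / years) − 1. The sketch below assumes six years of compounding from the 2019 value to the 2025 value; the report’s own base value for its 2020–2025 forecast period is not given here, so the result is only expected to land near the quoted 29.9%, not to match it exactly.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1 / years) - 1


# Assumption: compound from the 2019 valuation to the 2025 projection (6 years).
growth = cagr(3.74, 17.60, years=6)
print(f"{growth:.1%}")  # roughly 29.5%
```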
#2) Major Players
It is predicted that the data lake market will be a consolidated market dominated by five key players.
#3) Key Trends
- Data lake usage is expected to grow considerably in the banking sector. Banks are adopting data lakes to deliver on-the-go analytics, and data lakes are also helping to dissolve many silos in the sector.
- With the huge increase in digital payments and mobile wallet use across the globe, the scope for big data analytics, and with it the opportunity for data lakes, is increasing.
- It is anticipated that North America will have high adoption of data lakes. A study by Capgemini says that over 60% of financial organizations in the U.S. think that big data analytics acts as a differentiator for businesses and gives them a competitive edge. Over 90% of organizations feel that investing in big data projects increases the chances of success in the future.
- Data lakes are needed for smart meter applications, and in the U.S. around 90 million smart meters are expected to be installed in 2021. Hence, high demand for data lakes is predicted.
Why Is Data Lake Required?
The purpose of a data lake is to give an unprocessed view of data (data in its purest form).
Nowadays, many big companies, including Google, Amazon, Cloudera, Oracle, Microsoft, and a few others, have data lake offerings.
Many organizations use cloud storage services like Azure Data Lake or Amazon S3. Companies also use distributed file systems such as the one in Apache Hadoop (HDFS). The concept of a personal data lake, which lets you manage and share your own big data, has also evolved.
In terms of industry uses, a data lake is a very suitable fit for the healthcare domain. Because much healthcare data is unstructured (for example, physician notes, clinical data, and patient disease history) and real-time insights are required, a data lake is a better option than a data warehouse.
It offers flexible solutions in the education sector as well, where the data is vast and very raw.
In the transportation sector, mainly in supply chain management and logistics, it aids in making predictions and realizing cost savings.
The aviation and electrical power industries are also using data lakes.
An example of its implementation is GE Predix (developed by General Electric), an industrial data lake platform that offers strong data governance capabilities to create, deploy, and govern industrial applications that link to industrial assets, gather and analyze data, and provide real-time insights for improving industrial infrastructure and processes.
Difference Between Data Warehouse Vs Data Lake
People often find it difficult to understand how a data lake differs from a data warehouse, and some argue that the two are the same. This is not the case.
The only thing the data lake and the data warehouse have in common is that both are data storage repositories. Beyond that, they differ in their use cases and purposes.
The differences are summarized below:
|Parameter|Data Lake|Data Warehouse|
|---|---|---|
|Data|A data lake keeps all the raw data, whether structured, unstructured, or semi-structured. Some of the data in the data lake may never be used.|A data warehouse holds only data that has been processed and refined, i.e. structured data that is required for reporting and solving specific business problems.|
|Users|Generally, the users of a data lake are data scientists and data developers.|Generally, the users of the data warehouse are business professionals, operational users, and business analysts.|
|Accessibility|The data lake is highly accessible and quick and easy to update because it has no predefined structure.|In the data warehouse, updating the data is a more complicated and costly operation because data warehouses are structured by design.|
|Schema|Schema-on-read. The schema is applied at the time of analysis.|Schema-on-write. The schema is designed before the DW implementation.|
|Architecture|Flat architecture|Hierarchical architecture|
|Purpose|The purpose of the raw data stored in a data lake is not fixed or is undetermined. At times, data flows into a data lake with some specific future use in mind, or just to have the data handy. The data lake has less organized and less filtered data.|The processed data stored in the data warehouse has a specific and definite purpose. A DW has organized and filtered data. Hence, it requires less storage space than the data lake.|
|Analytics|A data lake can be used for machine learning, data discovery, data profiling, and predictive analysis.|A data warehouse can be used for business intelligence, visualizations, and batch reporting.|
|Storage|Designed for low-cost storage. The hardware of the data lake is very different from that of the data warehouse: it uses off-the-shelf servers combined with cheap storage. This makes the data lake fairly economical and highly scalable to terabytes and petabytes. This is done so that all the data can be kept in the data lake, letting you go back to any point in time to do analysis.|Expensive for large data volumes. The data warehouse uses expensive disk storage to make it highly performant. Therefore, to conserve space, the data model is simplified and only the data that is really required to make business decisions is kept in the data warehouse.|
|Support for data types|A data lake handles non-traditional data types very well, such as server logs, sensor data, social network activity, text, images, and multimedia. All the data is kept irrespective of its source and structure.|Generally, a data warehouse consists of data fetched from transactional systems. It does not handle non-traditional data types well; storing and consuming such data can be expensive and difficult with the data warehouse.|
|Security|Security of data lakes is at the ‘maturing’ stage, since this is a newer concept than the data warehouse.|The security of data warehouses is at the ‘matured’ stage.|
|Agility|Highly agile; configure and reconfigure as required.|Less agile; fixed configuration.|
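The schema difference is the easiest one to see in code: a data lake typically applies schema-on-read, while a data warehouse enforces schema-on-write. The following Python sketch is illustrative only; the sample rows, column names, and both helper functions are hypothetical stand-ins and do not represent any particular product’s API.

```python
import csv
import io

# Raw text exactly as a (hypothetical) source system produced it.
RAW_EVENTS = "2021-03-01,42,19.99\n2021-03-02,43,not-a-number\n"


def write_to_warehouse(raw: str) -> list[dict]:
    """Schema-on-write: every row is validated and typed before it is stored."""
    rows = []
    for day, order_id, amount in csv.reader(io.StringIO(raw)):
        rows.append({"day": day, "order_id": int(order_id), "amount": float(amount)})
    return rows


def read_from_lake(raw: str, columns: list[str]) -> list[dict]:
    """Schema-on-read: stored text is untouched; column names are applied at query time."""
    return [dict(zip(columns, fields)) for fields in csv.reader(io.StringIO(raw))]


try:
    write_to_warehouse(RAW_EVENTS)
    rejected = False
except ValueError:
    # The malformed amount fails the warehouse schema at load time.
    rejected = True

lake_rows = read_from_lake(RAW_EVENTS, ["day", "order_id", "amount"])
print(rejected, len(lake_rows))
```

The warehouse-style loader rejects the batch because one value does not fit the predefined schema, whereas the lake-style reader keeps both rows and leaves it to the analyst to decide how to treat the odd value.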
Data Lake Architecture
Data from the various source systems is combined into a raw data store that holds the data in its raw form, i.e. without any transformations. This is low-cost, permanent, and scalable storage.
Next, there are analytical sandboxes that can be used for data discovery, exploratory data analysis, and predictive modeling. These are mainly used by data scientists to explore data, build new hypotheses, and define use cases.
Then there is a batch processing engine that processes the raw data into a consumer-usable form, i.e. a structured format that can be used for reporting to end users.
Finally, there is a real-time processing engine that takes in streaming data and transforms it.
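The relationship between the raw store, the batch engine, and the real-time engine can be sketched in a few lines of Python. This is a toy model under stated assumptions: the in-memory list stands in for the raw data store, and `batch_process`, `stream_process`, the record fields, and the `large_order` threshold are all hypothetical.

```python
import json
from collections import defaultdict

# Raw store: records kept exactly as the (hypothetical) sources emitted them.
raw_store = [
    '{"region": "north", "amount": 120.0}',
    '{"region": "south", "amount": 80.5}',
    '{"region": "north", "amount": 30.0}',
]


def batch_process(raw_records):
    """Batch engine: parse the whole raw store and build a structured summary."""
    totals = defaultdict(float)
    for line in raw_records:
        record = json.loads(line)
        totals[record["region"]] += record["amount"]
    return dict(totals)


def stream_process(events):
    """Real-time engine: transform each event as it arrives, one at a time."""
    for line in events:
        record = json.loads(line)
        record["large_order"] = record["amount"] >= 100  # illustrative threshold
        yield record


report = batch_process(raw_store)            # structured output for reporting
live = list(stream_process(iter(raw_store)))  # per-event enrichment
print(report, live[0])
```

Both engines read the same untransformed records; the batch path produces a report-ready aggregate, while the streaming path enriches events individually without waiting for the full data set.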
Key Characteristics Of Data Lake
To be classified as Data Lake, a big data repository should possess the following three attributes:
#1) A single common repository of data usually housed within a Distributed File System (DFS).
Hadoop data lakes preserve data in its native form and capture changes to data and related semantics throughout the data lifecycle. This approach is particularly beneficial for compliance checks and internal audits.
This is an improvement over the conventional Enterprise Data Warehouse, in which, once data has gone through transformations, aggregations, and modifications, it is difficult to piece the data back together when required, and companies struggle to trace the source/origin of the data.
#2) Incorporates planning and job scheduling capabilities (For Example, through any scheduler tool like YARN, etc.).
Workload execution is an essential need for enterprise Hadoop. YARN offers resource management and a central platform to provide consistent operations, security, and data governance tools across Hadoop clusters, making sure that analytic workflows have the required level of data access and computing power.
#3) Comprises the set of utilities and functions required to consume, process, or work with the data.
Easy and quick accessibility for users is one of the key traits of a data lake, which is why organizations store the data in its native or pure form.
Whatever form the data is in, i.e. structured, unstructured, or semi-structured, it is inserted into the data lake as it is. This allows data owners to combine customer, supplier, and operations data by removing technical or political barriers to sharing data.
- Versatile: Competent enough to store all kinds of structured/unstructured data ranging from CRM data to social network activities.
- More Flexibility of Schema: Does not need planning or prior knowledge of data analysis. It stores all the data as it is in original form and presumes that the analysis will happen later, on-demand. This is very useful for OLAP. For Example, the Hadoop data lake permits you to be schema-free wherein you can decouple schema from data.
- Real-time Decision Analysis: Data lakes draw on a huge amount of consistent data and deep learning algorithms to reach real-time decision analytics. They are capable of obtaining value from unlimited data types.
- Scalable: They are far more scalable than traditional data warehouses, and they are also less costly.
- Advanced Analytics / Compatibility with SQL and Other Languages: With data lakes, there are numerous ways to query the data. Unlike traditional data warehouses, which support only SQL for simple analytics, data lakes give you many other options and language support for analyzing data. They are also compatible with machine learning tools like Spark MLlib.
- Democratize Data: Democratized access to data through a single, integrated view of data throughout the whole organization while utilizing an effective data management platform. This ensures the all-around availability of data.
- Better quality of Data: Overall you get better quality of data with data lakes through technological benefits such as data storage in native format, scalability, versatility, schema flexibility, SQL and other languages support, and advanced analytics.
Challenges And Risks
Data lakes offer a lot of advantages. However, there are also a few challenges and risks associated with them that an organization needs to address carefully.
- If not properly designed, they can turn into data swamps. Sometimes, organizations simply keep dumping unlimited data into these lakes without any strategy or purpose in mind.
- At times, the analysts who want to use the data do not know how to do so, as mining data lakes is quite challenging. As a result, the lakes lose relevance and momentum after some time. Organizations need to work on removing this barrier for analysts.
- Because data lakes hold a lot of disorganized data, the data is often not fresh or current enough to be used in production. Hence, the data in these lakes remains in pilot mode and is never put into production.
- Unstructured data may lead to unusable data.
- Sometimes, organizations find that the data lake is not making a significant impact on the business relative to the investments made. This requires a change in mindset. For impact to occur, companies need to encourage managers and leaders to make decisions based on the analytics derived from these data reservoirs.
- Security and access control are also a risk when working with data lakes. Some data that is subject to privacy and regulatory requirements gets placed in data lakes without any oversight.
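One common way to reduce the swamp and oversight risks listed above is to record minimal metadata whenever a dataset enters the lake. The sketch below is an assumption for illustration, not a method described in this tutorial or a real tool’s API: the `Catalog` class, the `DatasetEntry` fields, and the rule that owner and purpose are mandatory are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    name: str
    owner: str
    purpose: str
    contains_personal_data: bool


class Catalog:
    """Hypothetical registry that refuses datasets with no owner or stated purpose."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        # Guard against "dumping" data with no strategy or accountability.
        if not entry.owner or not entry.purpose:
            raise ValueError(f"{entry.name}: owner and purpose are required")
        self._entries[entry.name] = entry

    def needs_review(self) -> list[str]:
        """Datasets flagged as holding personal data, for privacy oversight."""
        return sorted(e.name for e in self._entries.values() if e.contains_personal_data)


catalog = Catalog()
catalog.register(DatasetEntry("clickstream", "web-team", "funnel analysis", False))
catalog.register(DatasetEntry("patient_notes", "clinical-it", "care quality study", True))

try:
    catalog.register(DatasetEntry("misc_dump", "", "", False))
    accepted_dump = True
except ValueError:
    accepted_dump = False

print(catalog.needs_review(), accepted_dump)
```

Even a lightweight check like this gives analysts a starting point for finding data and gives the organization a list of datasets that need privacy review.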
In an enterprise, it is sensible to implement the data lake in an agile manner.
That is, first implement a data lake MVP, get it tested by users with respect to quality, ease of access, storage, and analytical capabilities, receive feedback, and then add the more complex requirements and features that add value to the lake.
Generally, an organization goes through the below four basic stages of implementation:
The Basic Data Lake: At this stage, the team settles down on the basic architecture, technology (cloud-based or legacy) and security & governing practices for the data lake. It is made capable of storing all the raw data coming from various enterprise sources and combining the internal & external data to deliver enriched information.
The Sandbox: Analytical Ability Enhancement: At this stage, the data scientists access the data reservoir to execute preliminary experiments for utilizing raw data and design analytical models to meet business needs.
Data Warehouses and Data Lake Collaboration: At this stage, the organization starts using a data lake in synergy with the existing data warehouses. Low-priority data is sent to the data lake so that the storage limit of the data warehouses is not exceeded.
This makes it possible to produce insights from cold data or to query it to discover information that is not indexed by conventional databases.
End-to-End Adoption of Data Lake: This is the last stage, where maturity is reached. The data lake becomes a key element of the organization’s data architecture and effectively drives its operations. By this time, the data lake has replaced the EDW and has become the sole source of all enterprise data.
An organization can do the following through the data lake:
- Create complex data modeling and analytics solutions for different business needs.
- Design interactive dashboards that consolidate insights from the data lake plus various applications and data sources.
- Implement advanced analytics or robotics programs, since the data lake handles the computational operations.
By this point, the data lake also has strong security and governance measures in place.
Data Lake Vendors
There are different vendors providing data lake tools in the industry.
If we look at the big companies:
- Informatica provides an intelligent data lake tool. At the time of writing, BDM (Big Data Management) 10.2.2 is the latest version available.
- Looker is another vendor that provides a data lake tool.
- Talend, which is popular for its ETL tools, also provides a data lake tool.
- Kylo is an open-source tool from Teradata, developed by Teradata’s ‘Think Big’ team.
- Cask Data Inc. also provides these services.
- Microsoft offers Azure Data Lake.
- HVR Software also provides data lake consolidation solutions.
- Podium Data, a Qlik company, provides products such as data lake pipelines and multi-zone data lakes.
- Snowflake also has a data lake product.
- Zaloni is a data lake company that handles large data volumes using big data technologies.
These are the popular service providers and vendors for such tools.
If you want to practice and build your knowledge of data lakes, you can go for Informatica or Kylo. If you are looking for a cloud-based service, you can opt for Looker, Informatica, or Talend; these three vendors provide AWS cloud data lakes. You can also get a 1-month free trial from Kylo.
In this tutorial, we discussed the concept of the data lake in detail. We went through the basic idea behind a data lake, its architecture, key characteristics, and benefits, along with examples and use cases.
We also saw how a data lake is different from the data warehouse. We also covered the top vendors providing related services.