Hadoop Vs Data Lake
D

7

18

I heard a new term, Data Lake. I googled it and found this:

A data lake is a large-scale storage repository and processing engine. A data lake provides "massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs"

The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization's data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop's cluster nodes of commodity computers.

Hadoop does the same thing: we have HDFS for storage and MapReduce for computation. I am a little confused about Hadoop and data lakes. What is the difference between the two? If they are the same, why did this term arise? Otherwise, how should a data lake be defined?

Draconic answered 14/3, 2016 at 12:24 Comment(2)
A more select-and-use framework for business analytics? Hadoop needs more understanding of how to integrate external analytics algorithms into MapReduce, if I'm not mistaken. - Bunnybunow
Poor me, I only heard about it today. LOL - Prolongation
W
18

A data lake is an abstract "idea". Hadoop is a specific technology/software. You can implement a data lake using Hadoop or using a different tool.
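To make the idea-versus-technology distinction concrete, here is a minimal Python sketch. Every name in it (`DataLakeStore`, `InMemoryStore`, `LocalDirectoryStore`, `ingest_raw_event`) is hypothetical and made up for illustration; the local directory backend is only a stand-in for a real backend such as HDFS or a cloud object store, not an actual Hadoop client.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class DataLakeStore(ABC):
    """Hypothetical interface: the data lake 'idea', independent of technology."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store a raw blob under a key."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Return the raw blob stored under a key."""


class InMemoryStore(DataLakeStore):
    """Toy backend that keeps blobs in a dict."""

    def __init__(self) -> None:
        self._blobs = {}  # maps key (str) to raw bytes

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class LocalDirectoryStore(DataLakeStore):
    """Stand-in for a real backend such as HDFS or an object store (assumption)."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def put(self, key: str, data: bytes) -> None:
        path = self._root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self._root / key).read_bytes()


def ingest_raw_event(store: DataLakeStore, key: str, payload: bytes) -> None:
    """Code written against the interface does not care which backend is used."""
    store.put(key, payload)
```

The point of the sketch is that `ingest_raw_event` works unchanged with either backend, in the same way a data lake design can be realised with Hadoop or with a different tool.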

Wisecrack answered 14/3, 2016 at 12:42 Comment(5)
It means HDFS and a data lake may be the same. - Draconic
@KishoreKumarSuthar HDFS is just a filesystem. So no. - Gavra
@Gavra According to Wikipedia, yes: "One example of a data lake is the distributed file system used in Apache Hadoop." - Adalbertoadalheid
Wikipedia can be edited by anyone. FAT32 can be used to store stuff too. - Gavra
So you're saying FAT32 is a data lake? - Blasphemous
Q
8

A data lake is a method of storing data within a system that facilitates the collation of data in varying schemas and structural forms, usually as object blobs or files.
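As a rough illustration of "varying schemas stored as blobs or files", the Python sketch below keeps raw bytes untouched and only interprets them when they are read (often called schema-on-read, which is my term here, not the answer's). The `lake` dict, its keys and the `read_records` helper are all hypothetical.

```python
import csv
import io
import json

# Hypothetical "lake": raw blobs stored as-is, keyed by path.
# Nothing checks or enforces a schema when the data is written.
lake = {
    "sales/2016-03.csv": b"order_id,amount\n1,9.50\n2,20.00\n",
    "clicks/2016-03.json": b'[{"user": "a", "page": "/home"}]',
}


def read_records(key):
    """Interpret a raw blob only at read time, based on its file extension."""
    raw = lake[key].decode("utf-8")
    if key.endswith(".csv"):
        # Each CSV row becomes a dict of column name to string value.
        return list(csv.DictReader(io.StringIO(raw)))
    if key.endswith(".json"):
        return json.loads(raw)
    raise ValueError(f"no reader registered for {key}")
```

Both files live side by side in the same store even though their structures differ; the structure only matters to whoever reads them later.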

The concept of a data lake is closely tied to Apache Hadoop and its ecosystem of open source projects. All discussions of the data lake quickly lead to a description of how to build a data lake using the power of the Apache Hadoop ecosystem. It’s become popular because it provides a cost-effective and technologically feasible way to meet big data challenges. Organizations are discovering the data lake as an evolution from their existing data architecture.

The following whitepaper serves as an excellent example of building a data lake with Hadoop.

Quondam answered 14/3, 2016 at 12:25 Comment(0)
H
3

The easiest way to think of a data lake is as a large container that, like a real lake, has rivers flowing into it; you never know where the rivers are coming from (or what "type" of river they are).

A data lake can store masses of different types of data (structured data, unstructured data, log files, real-time data, images, etc.) and blend them together to correlate many different data types. The key thing here is that we are moving from traditional approaches to modern tools (like Hadoop, Cassandra, NoSQL databases, etc.).

There's a whole bunch of data being created that we might get some value out of if we could analyze it. We can use the cloud to take that data, gather it in one store, and analyze it. In Azure, that store is Azure Data Lake Store: a cloud-based file service or file system that is pretty much unlimited in size, where we can put all of that data.

We can run services on top of the data in that store. You could use Hadoop or Spark in an HDInsight cluster, or you could use the Azure Data Lake Analytics service, which complements Azure Data Lake Store. That service lets you run jobs that effectively query the data you have stored in Azure Data Lake Store and generate output results.

Azure Data Lake Store is where we can store all the data that we want to analyze. Azure Data Lake Analytics is a service where we can run jobs that query that data to generate some sort of output for analysis. Hadoop is a specific technology (an open-source distributed data processing cluster technology). You can implement a data lake using Hadoop or using a different tool.

Honduras answered 9/5, 2018 at 15:10 Comment(0)
J
2

You've confused the concept (data lake) with a framework that can be used to implement it (Hadoop), but that's understandable because these terms are so closely associated with one another.

Hadoop is often associated with data lakes because some of the first data lakes were built using on-premises Hadoop. However, a data lake is just an architectural design pattern - data lakes can be built outside of Hadoop using any kind of scalable object storage (like Azure Data Lake or AWS S3 for example).

This site does a pretty good job of giving an overview of data lakes, including a history of data lakes that discusses Hadoop alongside other implementations. Here's another article that addresses how these terms get tied up together as well.

Jezebel answered 20/2, 2020 at 19:35 Comment(0)
A
1

I'd say that question is too much like "Oracle vs Database".

A data lake is a method of storing data within a system or repository. Hadoop refers to the technology: it is an open-source software framework for storing data. So one example of a data lake is the distributed file system used in Hadoop.

Adalbertoadalheid answered 10/7, 2017 at 13:21 Comment(1)
I'd say a Data Lake is one of the things you can do with Hadoop or another technology, but not all Hadoop applications are a Data Lake. - Leucotomy
D
0

To handle a data lake, we can use any technology that supports different kinds of data and can cope with our data volume. Apache Hadoop has these features, so we can use Hadoop to implement a data lake. But Hadoop never means a data lake, because a data lake is a broad concept with many implementations. In development jargon, we say that "a data lake is a specification with many implementations, like Hadoop, Microsoft Azure, AWS, etc."

Dichotomous answered 17/11, 2020 at 9:39 Comment(0)
I
0

Actually, when you ask this question, you are assuming that Hadoop and the data lake fall into the same category of technologies, but that is not the case.

Hadoop is just one technology that can be used to build a data lake. A data lake is an architecture, while Hadoop is one component in that architecture. It can be used as storage; in other words, Hadoop can be a storage platform for the data lake. So the relationship is complementary, not competitive, and in the future both data lakes and Hadoop can continue to grow.

But again, the data lake is not restricted to Hadoop. A data lake can use, let's say, Hadoop or any other technology for economical storage of large files, or it can use Apache Kafka to manage real-time data. It might use a NoSQL database for transaction-oriented workloads, or some kind of modern data warehouse like Apache Kudu, for example, which makes sense for other types of large-scale analytic workloads. So basically, Hadoop is just one technology that can be used as part of the overall data lake structure.

Ita answered 24/8, 2021 at 19:18 Comment(0)
