How does the CAP Theorem apply on HDFS?

I just started reading about Hadoop and came across the CAP Theorem. Can you please throw some light on which two components of CAP would be applicable to a HDFS system?

Priedieu answered 11/11, 2019 at 5:55 Comment(0)

Argument for Consistency

The documentation clearly states: "The consistency model of a Hadoop filesystem is one-copy-update-semantics; that is generally that of a traditional local POSIX filesystem."

(One-copy-update semantics means that all processes accessing or updating a given file see its contents as if only a single copy of the file existed.)

Moving forward, the document says:

  • "Create. Once the close() operation on an output stream writing a newly created file has completed, in-cluster operations querying the file metadata and contents MUST immediately see the file and its data."
  • "Update. Once the close() operation on an output stream writing a newly created file has completed, in-cluster operations querying the file metadata and contents MUST immediately see the new data.
  • "Delete. once a delete() operation on a path other than “/” has completed successfully, it MUST NOT be visible or accessible. Specifically, listStatus(), open() ,rename() and append() operations MUST fail."

The characteristics above point towards the presence of "Consistency" in HDFS.

Source: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/filesystem/introduction.html
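
To make this concrete, here is a minimal Java sketch (my own illustration, not from the documentation; the NameNode URI and path are made up) that exercises the Create and Delete guarantees through the Hadoop FileSystem API: once close() returns, the data must be readable, and once delete() returns, the path must be gone.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.nio.charset.StandardCharsets;

    public class HdfsVisibilityCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/tmp/cap-demo.txt");

            // Create: write a new file; close() completes when the try block exits.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello".getBytes(StandardCharsets.UTF_8));
            }

            // Per the spec, the file and its data MUST now be visible to in-cluster readers.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[5];
                in.readFully(buf);
                System.out.println(new String(buf, StandardCharsets.UTF_8)); // prints "hello"
            }

            // Delete: once delete() has completed, the path MUST no longer be visible.
            fs.delete(file, false);
            System.out.println("exists after delete? " + fs.exists(file)); // prints "false"
        }
    }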

Argument for Partition Tolerance

HDFS is designed to keep operating when parts of the cluster fail: NameNode High Availability provides an active/standby pair of NameNodes, and block replication across DataNodes keeps data reachable when individual DataNodes are lost.

Source: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
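
As a rough sketch of what High Availability means for a client (assuming a nameservice called mycluster with two NameNodes on made-up hosts; in a real deployment these properties normally live in hdfs-site.xml rather than in code), the failover configuration lets the client keep talking to whichever NameNode is currently active:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class HaClientConfig {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // A logical nameservice instead of a single NameNode host.
            conf.set("fs.defaultFS", "hdfs://mycluster");
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

            // Proxy provider that fails over to whichever NameNode is active.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            FileSystem fs = FileSystem.get(conf);
            System.out.println("Connected to " + fs.getUri());
        }
    }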

Argument for Lack of Availability

It is clearly mentioned in the documentation (under the section "Operations and failures"):

"The time to complete an operation is undefined and may depend on the implementation and on the state of the system."

This indicates that "Availability" in the context of CAP is not guaranteed by HDFS.

Source: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/filesystem/introduction.html
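
To see the consequence from a client's point of view, here is a hedged sketch (hypothetical cluster and path): because the spec puts no upper bound on how long an operation may take, a caller that needs a bounded response time has to impose it externally, and must accept that the request may go unanswered:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class BoundedHdfsCall {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster
            FileSystem fs = FileSystem.get(conf);

            ExecutorService pool = Executors.newSingleThreadExecutor();
            Future<Boolean> exists = pool.submit(() -> fs.exists(new Path("/tmp/cap-demo.txt")));
            try {
                // HDFS itself gives no deadline, so the client has to set one.
                System.out.println("exists = " + exists.get(5, TimeUnit.SECONDS));
            } catch (TimeoutException e) {
                // No answer within our deadline: from the caller's view the system is unavailable.
                exists.cancel(true);
                System.out.println("NameNode did not respond in time");
            } finally {
                pool.shutdownNow();
            }
        }
    }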

Given the arguments above, I believe HDFS supports "Consistency and Partition Tolerance" but not "Availability" in the context of the CAP theorem.

Rafter answered 15/12, 2020 at 17:16 Comment(0)
  • C – Consistency (all nodes see the same data, i.e. every node has the same view of the data at any instant in time)
  • A – Availability (a guarantee that every request receives a response, whether it succeeded or failed)
  • P – Partition Tolerance (the system continues to operate even if messages are lost or part of the system fails)

Talking about Hadoop, it supports the Availability and Partition Tolerance properties. The Consistency property is not supported because only the NameNode has the information of where the replicas are placed; this information is not available on every node of the cluster.
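
As an illustrative sketch of that point (hypothetical cluster and path), the only way a client learns where a file's block replicas live is by asking the NameNode, for example via getFileBlockLocations():

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WhereAreTheReplicas {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical cluster
            FileSystem fs = FileSystem.get(conf);

            FileStatus status = fs.getFileStatus(new Path("/tmp/cap-demo.txt"));
            // Only the NameNode holds the block-to-DataNode mapping that answers this call.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                for (String host : block.getHosts()) {
                    System.out.println("replica on " + host);
                }
            }
        }
    }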

Aged answered 28/11, 2019 at 10:25 Comment(1)
I would say that since there is only one active NameNode, and it is the single source of truth about which data is where, there is consistency: all services querying the data talk to the NameNode and therefore get the same answer. In case of a failover, the standby NameNode is supposed to be synchronized and to give the same answer as the former active one. – Frenum
