How to Best Run Hadoop on a Single Machine?

I have access to a computer running Linux with 20 cores, 92 GB of RAM, and 100 GB of storage on an HDD. I would like to use Hadoop for a task involving a large amount of data (over 1M words, over 1B word combinations). Would pseudo-distributed mode or fully distributed mode be the best way to leverage the power of Hadoop on a single computer?

For my intended use of Hadoop, experiencing data loss and having to re-run the job because of node failure are not large issues.

This project involving Linux Containers uses fully distributed mode. This article describes pseudo-distributed mode; more detail can be found here.

Productive asked 30/7, 2015 at 20:21 Comment(8)
GPU-level parallelization (e.g., using CUDA) might offer a better alternative for a single machine.Tade
@Tade I'm not familiar with GPU-level parallelization. What sources would you recommend looking at to become more familiar with it?Productive
I am not a CUDA expert myself, but I think this is a good starting point: developer.nvidia.com/how-to-cuda-c-cpp. I just provided it in case you find it useful. Have a look.Tade
@Tade Okay thanks! I've got to consider the focus I want to take: learning Hadoop by using it or choosing whatever strategy will make my software run faster.Productive
@Tade I've decided to stick with Hadoop. So the scope of this question does not extend beyond Hadoop.Productive
Please share the memory details also. I can update my answer with memory allocationGitagitel
@AmalGJose I updated my question with memory details.Productive
If you ask me, Hadoop is the wrong tool for the given case. I think Spark would offer you better efficiency and flexibility for the given configuration (and probably from a scaling-out perspective, too).Chicken

As per my understanding, you have a single machine with 20 cores. In this case there is no need to virtualize it, because the VMs you create would consume part of the total resources. The best option is to install a Linux OS on the machine, install Hadoop in pseudo-distributed mode, and configure the available resources for container allocation.

You need both CPU cores and memory to get good performance, so 20 cores alone will not help you; you also need a good amount of physical memory. You can refer to this document for guidance on allocating memory.

The fundamental idea behind Hadoop is distributed computing and storage, for processing large data in a cost-effective manner. So if you try to carve multiple small machines out of the same parent machine using virtualization, it won't help you, because a lot of resources will be consumed by the OS of each individual VM. Instead, if you install Hadoop directly on the machine and configure its resources properly, jobs will execute in multiple containers (depending on availability and requirements) and parallel processing will still happen. That way you can get the maximum performance out of the existing machine.

So the best option is to set up a pseudo-distributed cluster and allocate resources properly. Pseudo-distributed mode is the mode in which all of the Hadoop daemons run on a single machine.

With the hardware configuration that you shared, you can use the configuration below for your Hadoop setup (an XML rendering of the same settings follows the list). It can handle enough load.

(yarn-site.xml)    yarn.nodemanager.resource.memory-mb  = 81920
(yarn-site.xml)    yarn.scheduler.minimum-allocation-mb = 1024
(yarn-site.xml)    yarn.scheduler.maximum-allocation-mb = 81920
(yarn-site.xml)    yarn.nodemanager.resource.cpu-vcores = 16
(yarn-site.xml)    yarn.scheduler.minimum-allocation-vcores = 1
(yarn-site.xml)    yarn.scheduler.increment-allocation-vcores = 1
(yarn-site.xml)    yarn.scheduler.maximum-allocation-vcores = 16
(mapred-site.xml)  mapreduce.map.memory.mb  = 4096
(mapred-site.xml)  mapreduce.reduce.memory.mb   = 8192
(mapred-site.xml)  mapreduce.map.java.opts  = 3072
(mapred-site.xml)  mapreduce.reduce.java.opts   = 6144
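
For reference, here is roughly how those values land in the actual *-site.xml files. This is a sketch derived from the list above, not a complete configuration; note that the java.opts properties take a JVM flag such as -Xmx3072m rather than a bare number.

yarn-site.xml (sketch):

  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>81920</value>      <!-- total memory the NodeManager offers to containers -->
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>         <!-- total vcores offered to containers -->
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>       <!-- smallest container the scheduler will hand out -->
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>81920</value>      <!-- largest single container request allowed -->
  </property>
  <!-- the remaining vcore settings from the list above follow the same pattern -->

mapred-site.xml (sketch):

  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>       <!-- container size per map task -->
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx3072m</value>  <!-- JVM heap kept below the container size -->
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>8192</value>       <!-- container size per reduce task -->
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx6144m</value>
  </property>
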
Gitagitel answered 5/8, 2015 at 9:24 Comment(15)
Please update your answer now that I included memory and storage specs.Productive
How did you come to that configuration? Did you run python yarn-util.py <options> to do so?Productive
No, it is calculated from the values you gave. You have 92 GB of RAM. The NameNode, ResourceManager, DataNode, NodeManager, and JobHistory Server each need a minimum of about 1 GB of heap. I reserved 4 GB for the OS and left some memory free for other services, so in total I allocated 80 GB for processing. That 80 GB can be used for container allocation, so a job can request at most 80 GB from the node. 4 GB and 8 GB are fine for the map and reduce container sizes (they can handle enough load). That is how I calculated it.Gitagitel
Doesn't the storage space also play a role in configuring correctly?Productive
The configuration depends entirely on the kind of jobs you are running and the size of the data. The configs I shared will work with large data. Simply storing data on the disk doesn't matter much; what matters is the amount you process at a time.Gitagitel
Okay that makes sense. Why 16 max VCores?Productive
Let us continue this discussion in chat.Gitagitel
Please check the chat.Productive
Could you address data replication and node failure with respect to fully vs pseudo distributed mode?Productive
In a pseudo-distributed cluster you have only one DataNode and the replication factor will be one, so if that DataNode fails, the data is lost. This is the problem with a single-machine cluster. Fully distributed mode works with multiple machines, and therefore multiple DataNodes, so you can set a higher replication factor and the chance of data loss on a node failure is lower.Gitagitel
In light of that, do you still think that pseudo distributed mode is the best option for my situation?Productive
In your case you have only one machine, so you cannot run multiple DataNodes unless you create VMs inside that machine. Even if you create multiple VMs, they all run on the same parent machine, so if the parent machine fails, everything fails; this doesn't make sense. The best choice for you is a pseudo-distributed cluster. If you are very worried about the data you are processing, you can specify multiple locations in the DataNode directory configuration and keep one on a mounted drive, but this may reduce performance.Gitagitel
Okay thanks for the info! Do you recommend setting yarn.app.mapreduce.am.resource.mb and yarn.app.mapreduce.am.command-opts according to the doc you linked?Productive
You can run it with the default value. It will work. If you face any issues, increase the valueGitagitel
Running a test MapReduce job after applying the config from your answer, the total time spent by all map tasks went down from 3399 to 3275, but the total time spent by all maps in occupied slots went up from 3399 to 13100. Why do you think this is?Productive

You lose all of the benefits of Hadoop when you are on a single machine. Yes, you could use containers or VMs, but there is no need. A single standalone node instance of MapReduce with 20 mapper/reducer slots will perform better than a fully distributed cluster running on a single machine.

UPDATE: Using pseudo-distributed mode may be better at using all of the cores during an M/R job. Apparently, standalone mode runs everything in a single Java process, which probably isn't ideal for your use case.
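
To make the distinction concrete, the switch between the two modes comes down to a handful of properties. A minimal sketch following the stock Apache single-node setup guide (the localhost port is illustrative); standalone mode simply leaves these at their defaults (local filesystem and the local job runner, all in one JVM):

core-site.xml (pseudo-distributed):

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>  <!-- standalone mode defaults to the local filesystem -->
  </property>

mapred-site.xml (pseudo-distributed):

  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>                   <!-- standalone mode defaults to the "local" job runner -->
  </property>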

Theadora answered 30/7, 2015 at 20:27 Comment(7)
By 'single node instance' do you mean pseudo-distributed mode?Productive
No, I mean the "Standalone Operation" mode which is the default.Theadora
Though pseudo-distributed may be better for using all 20 cores.Theadora
Oh I see. Is there a reference you could link to where I could learn more about configuring standalone operation to have 20 slots? The Apache tutorial for setting up a single-node cluster does not mention an option for specifying the number of slots.Productive
I have looked at that tutorial but it doesn't say anything about specifying the number of mapper/reducer slots... Any other links?Productive
I think I found another SO question about it: #8357796Productive
Could you elaborate on why there is no need to use containers or VMs?Productive

The best way to utilize all the cores is one of the following:
Method 1: Use virtualization if the hardware supports it (install ESXi or another bare-metal hypervisor) and create Linux VM instances, or install an OpenStack cloud and create VMs, so that you can fully utilize the hardware.
Method 2: An easier approach is to install a host OS on the machine and run VMware or VirtualBox on top of it, but since there are then two layers between the hardware and Hadoop, performance is slightly reduced compared to Method 1.

After this you can install whichever Hadoop distribution you like.

It is always better to use fully distributed mode, because in pseudo-distributed mode there is a chance of data loss if the system crashes, since the replication factor is 1, whereas in distributed mode the replication factor is 3 by default. Also, since in pseudo-distributed mode each daemon runs as a single Java process, the loss of a single process may cause the entire MR job to be run again.
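
(For reference, the replication factor mentioned above is just a property in hdfs-site.xml; a minimal sketch of how it is typically pinned on a single-machine cluster:)

  <property>
    <name>dfs.replication</name>
    <value>1</value>  <!-- a single DataNode can hold only one copy; higher values just leave blocks under-replicated -->
  </property>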

Edit: Looking at your configuration of 100 GB of storage and almost as much memory (assuming the data is less than 100 GB), you can go ahead with a pseudo-distributed cluster, and you can leverage in-memory processing by using Spark, since Spark supports MapReduce-equivalent operations along with SQL, DataFrames, MLlib, and GraphX (Python also comes natively), or if you know R you can use R directly with Spark 1.4 (as Spark is faster than Hadoop).

Killiecrankie answered 4/8, 2015 at 12:17 Comment(3)
My question asks specifically about whether fully distributed or pseudo-distributed mode would be better. Could you address that?Productive
For my application of Hadoop, data loss and having to re-run the job are not an issue. In light of that, could you provide another argument for pseudo vs fully distributed mode?Productive
Thanks for updating your answer. Can I also leverage the advantage of in-memory processing using Hadoop?Productive

I don't think you can leverage the real benefits of Hadoop considering you have only one machine in your cluster. In my opinion, an easier and better alternative would be to:

  1. Go ahead with pseudo-distributed mode and store your data in Hadoop (HDFS).

  2. Use an in-memory query engine (Impala, Presto, or Spark) on top of the data stored in Hadoop.

  3. Impala's syntax is essentially the same as Hive's, so you won't have to make any extra changes to your data for querying. Alternatively, you can use Spark MLlib for machine-learning tasks.

Assailant answered 7/8, 2015 at 7:34 Comment(1)
This answer does not address the question. Please revise it.Productive

Go for a fully distributed Hadoop cluster on the VMware ESXi platform, if your hardware supports it. It seems to be the best way to exploit your resources.

Crowley answered 8/8, 2015 at 18:59 Comment(1)
Can you add details/explanation for why that would be the best way to use my resources?Productive

"fully distributed mode" is a perfect choice where one can take full advantage Hadoop framework.

Rarebit answered 8/8, 2015 at 15:1 Comment(9)
Could you make this answer more specific?Productive
With respect to your scenario, which involves processing a large amount of data, there may be container/VM failures during computation. Since you'll have a fully distributed environment, where the various roles in the ecosystem are spread across different containers/VMs, the probability of a job succeeding is high, because the framework restarts failed tasks.Rarebit
Specifically, why would processing a large amount of data cause container/vm failures?Productive
It is not that there will be failures all the time, but consider the general case in which failures may occur: maybe there is some noise in the dataset, or a misconfiguration in the cluster, etc.Rarebit
Okay. Couldn't I still keep a high job-completion probability with pseudo-distributed mode?Productive
If we ask why processing a large amount of data causes failures, we could ignore data replication on the grounds that this is a brand-new machine, it won't fail, and replication isn't needed. But that is only the ideal situation. Failures can occur, and we have to be ready to face them in order to prevent loss of information.Rarebit
Understood. But is there not a replication factor in pseudo-distributed mode also?Productive
In pseudo-distributed mode you'll be running Hadoop on a single machine, where data is not replicated. It also limits parallel-processing capability.Rarebit
Pseudo-distributed mode means running the Hadoop daemons as separate processes on a single node/machine. Since you have only one node, data is not physically replicated. Even though Hadoop runs, the NameNode will warn that there are under-replicated blocks.Rarebit
