Migrating 50TB of data from a local Hadoop cluster to Google Cloud Storage

I am trying to migrate existing data (JSON) in my Hadoop cluster to Google Cloud Storage.

I have explored gsutil, which seems to be the recommended option for moving big data sets to GCS, and it appears to handle huge datasets well. However, it seems that gsutil can only move data from a local machine to GCS (or between S3 and GCS); it cannot move data from a local Hadoop cluster.

  1. What is the recommended way to move data from a local Hadoop cluster to GCS?

  2. In the case of gsutil, can it move data directly from the local Hadoop cluster (HDFS) to GCS, or do I first need to copy the files onto the machine running gsutil and then transfer them to GCS?

  3. What are the pros and cons of using the Google client-side (Java API) libraries vs gsutil?

Thanks a lot,

Bysshe asked 13/8, 2014 at 16:25

Question 1: The recommended way of moving data from a local Hadoop cluster to GCS is to use the Google Cloud Storage connector for Hadoop. The instructions on that site are mostly for running Hadoop on Google Compute Engine VMs, but you can also download the GCS connector directly, either gcs-connector-1.2.8-hadoop1.jar if you're using Hadoop 1.x or Hadoop 0.20.x, or gcs-connector-1.2.8-hadoop2.jar for Hadoop 2.x or Hadoop 0.23.x.

Simply copy the jarfile into your hadoop/lib dir or $HADOOP_COMMON_LIB_JARS_DIR in the case of Hadoop 2:

cp ~/Downloads/gcs-connector-1.2.8-hadoop1.jar /your/hadoop/dir/lib/

You may also need to add the following to your hadoop/conf/hadoop-env.sh file if you're running 0.20.x:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/your/hadoop/dir/lib/gcs-connector-1.2.8-hadoop1.jar

Then you'll likely want to use service-account "keyfile" authentication, since you're on an on-premise Hadoop cluster. Visit cloud.google.com/console, find "APIs & auth" on the left-hand side, and click "Credentials". If you don't already have one, click "Create new Client ID" and select "Service account" before clicking "Create client id". For now the connector requires a ".p12" type of keypair, so click "Generate new P12 key" and keep track of the .p12 file that gets downloaded. It may be convenient to rename it before placing it in a directory more easily accessible from Hadoop, e.g.:

cp ~/Downloads/*.p12 /path/to/hadoop/conf/gcskey.p12
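
If you have the gcloud CLI available, newer releases can also create the service account and generate the P12 key from the command line instead of the console flow above; a rough sketch (the exact flags can vary by gcloud version, and the account name here is only an illustration):

gcloud iam service-accounts create hadoop-gcs --display-name "Hadoop GCS connector"
gcloud iam service-accounts keys create /path/to/hadoop/conf/gcskey.p12 --iam-account [email protected] --key-file-type p12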

Add the following entries to your core-site.xml file in your Hadoop conf dir:

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>your-ascii-google-project-id</value>
</property>
<property>
  <name>fs.gs.system.bucket</name>
  <value>some-bucket-your-project-owns</value>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
</property>
<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>[email protected]</value>
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>/path/to/hadoop/conf/gcskey.p12</value>
</property>

The fs.gs.system.bucket generally won't be used except in some cases for mapred temp files; you may want to just create a new one-off bucket for that purpose. With those settings on your master node, you should already be able to test with hadoop fs -ls gs://the-bucket-you-want-to-list. At this point, you can already try to funnel all the data out of the master node with a simple hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket.
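
To make that concrete, the sequence might look like the following (the bucket names, project id, and namenode host/port are the placeholders used above, and gsutil mb is just one way to create the one-off system bucket):

gsutil mb -p your-ascii-google-project-id gs://some-bucket-your-project-owns   # one-off bucket for fs.gs.system.bucket
hadoop fs -ls gs://the-bucket-you-want-to-list                                 # sanity-check the connector
hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket/           # single-node copy out of HDFS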

If you want to speed it up using Hadoop's distcp, sync the lib/gcs-connector-1.2.8-hadoop1.jar and conf/core-site.xml to all your Hadoop nodes, and it should all work as expected. Note that there's no need to restart datanodes or namenodes.
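
A minimal distcp invocation might then look like this (host, port, paths, and the mapper count are placeholders; -m simply caps the number of parallel copy tasks):

hadoop distcp -m 50 hdfs://yourhost:yourport/allyourdata gs://your-bucket/allyourdata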

Question 2: While the GCS connector for Hadoop can copy directly from HDFS without ever needing an extra disk buffer, gsutil cannot, since it has no way of interpreting the HDFS protocol; it only knows how to deal with actual local filesystem files or, as you said, GCS/S3 objects.
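
For comparison, a gsutil-only route would require staging the data on the local disk of the machine running gsutil first, roughly like the sketch below (paths are placeholders, and -m enables gsutil's parallel transfers); this extra hop is exactly what the connector avoids:

hadoop fs -copyToLocal hdfs://yourhost:yourport/allyourdata /mnt/staging/
gsutil -m cp -r /mnt/staging/allyourdata gs://your-bucket/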

Question 3: The benefit of using the Java API is flexibility; you can choose how to handle errors, retries, buffer sizes, etc., but it takes more work and planning. Using gsutil is good for quick use cases, and you inherit a lot of error-handling and testing from the Google teams. The GCS connector for Hadoop is actually built directly on top of the Java API, and since it's all open-source, you can see what kinds of things it takes to make it work smoothly in its source code on GitHub: https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java

Vanzant answered 16/8, 2014 at 18:13
Thanks a lot Dennis for your detailed response. I am also considering gsutil for transferring my 50TB of data to GCS. My criterion for selecting a solution (either the Hadoop connector or gsutil) is the total amount of time it will take to upload the data to GCS. Do you think the Hadoop solution will be faster than gsutil (gsutil has an option for utilizing multiple cores)? Secondly, will I be able to transfer my 6GB HDFS files without any loss/change in data from HDFS to GCS using the Hadoop connector (since each file is composed of 128MB HDFS blocks)? Thanks again. – Bysshe
I just realized that the original answer only posted the first part of my response; I had tried to post it over a flaky data plan. Hopefully my answers to Question 2 and Question 3 help clarify the distinction between using the GCS connector vs gsutil. Generally, gsutil's multithreading will help for uploading files from just a single local machine, but for data inside HDFS, gsutil can't really read that data directly and you'll want to use hadoop distcp instead, which will be able to utilize all the cores across your cluster. Using distcp, it should be as fast as your network allows. – Vanzant
To more directly answer your follow-up questions: yes, using hadoop distcp hdfs://host:port/your-current-file-dir gs://your-bucket/new-file-dir should definitely be faster than trying to use gsutil itself. Your 6GB HDFS files should arrive in GCS unchanged/intact, even though once on GCS the actual underlying block sizes will be abstracted away (you can really just think of them as monolithic 6GB files in GCS). I'd recommend first testing with a smaller subset or maybe just single files to ensure your distcp works as intended before trying the full 50TB. – Vanzant
Thanks a lot Dennis, I really appreciate your help. I am trying your approach (the GCS Hadoop connector), however for some reason I cannot see a Service account option under Credentials whenever I hit 'Create client id' to generate a key. Does it have something to do with the enabled APIs? I have 4 APIs enabled, i.e. GCS, GCS JSON API, Google Cloud SQL, and the BigQuery API. I see the following options: Installed application (runs on a desktop computer or handheld device, like Android or iPhone), Android, Chrome Application, iOS, and Other. – Bysshe
Since service accounts are often used with Google Compute Engine VMs, it's quite possibly related to enabling Google Compute Engine; it shouldn't be necessary to actually create any VMs. Try enabling it and see if that opens up the service-account options. – Vanzant
Just double-checked, and I don't think service accounts require enabling any particular API. Most likely, on the first time creating a service account, the GUI is taking you through some extra initial steps that I may have forgotten about, having filled them out long ago in my own projects. This section in the oauth2 docs has screenshots showing what to look for, and in particular, this section is the keypair flow I mentioned. Look for the "Application type" radio buttons. – Vanzant
Thanks Dennis, I solved the problem by just logging in with an owner account. I guess I didn't have permission to generate the key file; my role is Writer. Thanks. – Bysshe
By the way, I was just wondering how to sync the GCS connector jar file to all Hadoop nodes. In the case of hadoop distcp, -libjars is not working for me. I set HADOOP_CLASSPATH and also provided -libjars to distcp, but I keep getting this exception: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found. I'm trying to think of a way to use DistributedCache from the command line to somehow sync the jar file to all nodes. It would be great if you could share your approach. – Bysshe
Personally, I typically use the GCS connector as a core library, so at cluster setup time I already distribute the jarfile to all the lib/ directories on my workers, either with scp or by running an ssh command to download it from the public GCS URL. In your case, I'm not entirely familiar with why -libjars isn't working for you, but if you've at least been able to verify that the gcs-connector setup works on the master, you can use rsync and the slaves file. – Vanzant
For example, if you installed Hadoop as username hadoop and it's under /home/hadoop/hadoop-install/ on every machine, you'd do: for HOST in `cat conf/slaves`; do rsync lib/gcs-connector-1.2.8-hadoop1.jar hadoop@$HOST:/home/hadoop/hadoop-install/lib/gcs-connector-1.2.8-hadoop1.jar; done and then the same for the core-site.xml and the .p12 file (see the expanded sketch after these comments). – Vanzant
@Dennis/@GoodDok Do we need to update the core-site.xml for all the nodes in the cluster? – Babylon
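
Expanding the rsync one-liner from the comment above into a more readable sketch (the username and install path follow that example; per the distcp note in the answer, the connector jar, core-site.xml, and the .p12 keyfile all need to reach every node):

for HOST in `cat conf/slaves`; do
  rsync lib/gcs-connector-1.2.8-hadoop1.jar hadoop@$HOST:/home/hadoop/hadoop-install/lib/
  rsync conf/core-site.xml hadoop@$HOST:/home/hadoop/hadoop-install/conf/
  rsync conf/gcskey.p12 hadoop@$HOST:/home/hadoop/hadoop-install/conf/
done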

It looks like a few property names have changed in recent versions:

String serviceAccount = "[email protected]";
String keyfile = "/path/to/local/keyfile.p12";

hadoopConfiguration.setBoolean("google.cloud.auth.service.account.enable", true);
hadoopConfiguration.set("google.cloud.auth.service.account.email", serviceAccount);
hadoopConfiguration.set("google.cloud.auth.service.account.keyfile", keyfile);
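
If you're driving distcp from the shell rather than setting these programmatically, the same property names should also work as generic -D options (a rough sketch; host, port, and paths are placeholders):

hadoop distcp \
  -D google.cloud.auth.service.account.enable=true \
  -D [email protected] \
  -D google.cloud.auth.service.account.keyfile=/path/to/local/keyfile.p12 \
  hdfs://yourhost:yourport/allyourdata gs://your-bucket/allyourdata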

Hord answered 4/1, 2017 at 6:53
