Amazon Elastic MapReduce for analyzing S3 logs


I am using EMR to analyze nginx web logs. I need to process the logs into rows and columns so that they are easy to query, so I created two tables, rawlog and processedlog, as follows:

create table rawlog(line string)
row format delimited fields terminated by '\t' lines terminated by '\n'
LOCATION 's3://istreamanalytics/logs/';

CREATE EXTERNAL TABLE processedlog (
day string,
hour int,
playSessionId string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

I then added a Ruby script to Hive to do the transformation. The script is as follows:

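#!/usr/bin/env ruby

# Map nginx month abbreviations to two-digit month numbers.
mon = {"Jan" => '01', "Feb" => '02', "Mar" => '03', "Apr" => '04', "May" => '05', "Jun" => '06', "Jul" => '07', "Aug" => '08', "Sep" => '09', "Oct" => '10', "Nov" => '11', "Dec" => '12'}

STDIN.each_line do |line|
  # Capture day, month, year, hour and the playSessionId value (up to the next '&').
  if line =~ /(\d+)\/(\w+)\/(\d+):(\d+):\d+:\d+ \+\d+\] "GET \/api\?playSessionId=([^&]*)/
    d = "#{$3}-#{mon[$2]}-#{$1}"
    h = $4
    pid = $5
    # Emit one tab-separated row: day, hour, playSessionId.
    puts "#{d}\t#{h}\t#{pid}"
  end
end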
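To make the intended transformation concrete, here is a minimal sketch that applies what the script appears to intend to a single log line. It assumes the session id capture is meant to be `[^&]*` and the month lookup `mon[$2]`. The constant names, the `transform_line` helper, and the sample line are made up for illustration and are not part of the original script.

```ruby
# Month abbreviation => two-digit month number, mirroring the script's lookup table.
MONTHS = {
  "Jan" => "01", "Feb" => "02", "Mar" => "03", "Apr" => "04",
  "May" => "05", "Jun" => "06", "Jul" => "07", "Aug" => "08",
  "Sep" => "09", "Oct" => "10", "Nov" => "11", "Dec" => "12"
}.freeze

# Assumed intent of the pattern: day/month/year:hour from the nginx timestamp,
# then the playSessionId query value up to the next '&'.
LOG_PATTERN = %r{(\d+)/(\w+)/(\d+):(\d+):\d+:\d+ \+\d+\] "GET /api\?playSessionId=([^&]*)}

# Returns "day\thour\tplaySessionId", or nil when the line does not match.
def transform_line(line)
  m = LOG_PATTERN.match(line)
  return nil unless m
  "#{m[3]}-#{MONTHS[m[2]]}-#{m[1]}\t#{m[4]}\t#{m[5]}"
end

# Hypothetical sample line; real log lines may be formatted differently.
sample = '10.0.0.1 - - [08/Jun/2012:09:47:49 +0000] "GET /api?playSessionId=abc123&foo=bar HTTP/1.1" 200 12 "-" "curl/7.0"'
puts transform_line(sample)
```

Note that `[^&]*` stops only at an `&`, so a request whose playSessionId is the last query parameter would capture the rest of the line as well. Whether that matters depends on how the real requests look.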

Now when I run the job using the following command in Hive:

from rawlog insert overwrite table processedlog select transform (line) using 'ruby /mnt/var/lib/hive_081/downloaded_resources/hive_transformer.rb' as (day String, hour INT, playSessionId String);

I am getting the following error:

Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201206061145_0015, Tracking URL = http://domU-12-31-39-0F-86-07.compute-1.internal:9100/jobdetails.jsp?jobid=job_201206061145_0015
Kill Command = /home/hadoop/.versions/0.20.205/libexec/../bin/hadoop job -Dmapred.job.tracker=10.193.133.241:9001 -kill job_201206061145_0015
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2012-06-08 09:47:49,644 Stage-1 map = 0%, reduce = 0%
2012-06-08 09:48:50,267 Stage-1 map = 0%, reduce = 0%
2012-06-08 09:48:52,278 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201206061145_0015 with errors
Error during job, obtaining debugging information...
 Examining task ID: task_201206061145_0015_m_000002 (and more) from job job_201206061145_0015

Exception in thread "Thread-41" java.lang.RuntimeException: Error while reading from task log url
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:130)
at org.apache.hadoop.hive.ql.exec.JobDebugger.showJobFailDebugInfo(JobDebugger.java:211)
at org.apache.hadoop.hive.ql.exec.JobDebugger.run(JobDebugger.java:81)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Server returned HTTP response code: 400 for URL: http://10.254.139.143:9103/tasklogtaskid=attempt_201206061145_0015_m_000000_2&start=-8193
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1436)
at java.net.URL.openStream(URL.java:1010)
at org.apache.hadoop.hive.ql.exec.errors.TaskLogProcessor.getErrors(TaskLogProcessor.java:120)
... 3 more
Counters:
 FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
 MapReduce Jobs Launched:
 Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
 Total MapReduce CPU Time Spent: 0 msec

Can someone tell me what's wrong?

Beefsteak answered 8/6, 2012 at 10:21 Comment(2)

Probably it has something to do with the private IP address '10.254.139.143'; shouldn't it be accessing logs via a public IP? – Bramlett 19/6, 2012 at 20:43

I have come across this before in Hive, and it has generally been resolved by upping the EMR instance size to an m2-tier instance. I don't have a good explanation for why that works, but it seems to for the most part. – Houseleek 8/3, 2013 at 2:48


EMR is a very generic tool to deal with logs.

Why not use a more tailored technology?

E.g.:

At least with Sumo you could make that kind of processing much easier.

Disposure answered 27/11, 2012 at 23:43 Comment(0)


The only suggestion I would make is to make sure the script works properly before running it on EMR. Using EMR to test the script should be the very last step in the process. Beyond that, it is usually a basic config problem.

Some basic googling found:

http://entxtech.blogspot.com/2010/10/how-to-unit-test-apache-hive-scripts.html
http://jairam.me/2011/09/08/hive-on-amazon-emr/
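One way to check a transform script locally, before involving EMR, is to pipe a few sample lines into it on stdin and inspect the tab-separated rows it prints, since that is how Hive's TRANSFORM talks to the script. The following is only a sketch: the `run_transform` helper and the `check_transformer.rb` file name are hypothetical, and it assumes the script reads stdin and writes one row per line.

```ruby
require "open3"
require "rbconfig"

# Pipes `input` to the script on stdin and returns the script's stdout
# split into rows of tab-separated fields.
def run_transform(script_path, input)
  stdout, status = Open3.capture2(RbConfig.ruby, script_path, stdin_data: input)
  raise "script exited with status #{status.exitstatus}" unless status.success?
  stdout.lines.map { |row| row.chomp.split("\t") }
end

# Usage (hypothetical file name):
#   ruby check_transformer.rb path/to/hive_transformer.rb < sample.log
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  rows = run_transform(ARGV[0], $stdin.read)
  # processedlog has three columns, so flag any row with a different field count.
  rows.each { |fields| warn "bad row: #{fields.inspect}" unless fields.size == 3 }
  puts "#{rows.size} rows emitted"
end
```

A syntax error in the script makes it exit with a non-zero status, which this check reports right away. On the cluster the same failure would only show up as a failed map task.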

Enphytotic answered 17/1, 2013 at 3:41 Comment(0)


More details on the error can be found in the log files. In your case, see the details here: http://10.254.139.143:9103/tasklogtaskid=attempt_201206061145_0015_m_000000_2&start=-8193

Hurty answered 28/3, 2013 at 4:7 Comment(0)
