Concat Avro files using avro-tools

2

11

I'm trying to merge Avro files into one big file. The problem is that the concat command does not accept a wildcard:

hadoop jar avro-tools.jar concat /input/part* /output/bigfile.avro

I get:

Exception in thread "main" java.io.FileNotFoundException: File does not exist: /input/part*

I tried wrapping the path in "" and '', but no luck.

Avon answered 18/1, 2016 at 14:15 Comment(3)
Where are your input files? - Corset
@54l3d: I think the question was: are they stored on the local file system or on HDFS? - Blowup
@ClémentMATHIEU Maybe. They are on HDFS. - Avon
15

I quickly checked Avro's source code (1.7.7) and it seems that concat does not support glob patterns (basically, they call FileSystem.open() on each argument except the last one).

It means that you have to explicitly provide all the filenames as arguments. It is cumbersome, but the following command should do what you want:

IN=$(hadoop fs -ls /input/part* | awk '{printf "%s ", $NF}')
hadoop jar avro-tools.jar concat ${IN} /output/bigfile.avro

Glob pattern support would be a nice addition to this command.
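
If your Hadoop version prints a "Found N items" header line in the listing, a slightly more defensive variant could look like the sketch below. Note that list_paths is a hypothetical helper name (not part of Hadoop or avro-tools), and the sketch assumes that the paths contain no whitespace.

```shell
# Hypothetical helper: read `hadoop fs -ls` output on stdin and print the
# last column (the path) of every line, skipping a "Found N items" header.
# Assumption: paths contain no whitespace.
list_paths() {
  awk '!/^Found [0-9]+ items$/ {printf "%s ", $NF}'
}

# Intended usage on a cluster (not executed here). The glob is quoted so that
# the local shell leaves it alone and `hadoop fs -ls` expands it on HDFS:
#   IN=$(hadoop fs -ls '/input/part*' | list_paths)
#   hadoop jar avro-tools.jar concat ${IN} /output/bigfile.avro
```

${IN} is deliberately left unquoted in the concat call so that the shell splits it back into one argument per file.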

Maggot answered 20/1, 2016 at 11:58 Comment(2)
Make sure to filter out "Found xxx items" from the hadoop fs listing. - Cupid
@EdiBice Updated the example. Thanks for the tip! - Cooking
4

Instead of hadoop jar avro-tools.jar, one can run java -jar avro-tools.jar, since you don't need Hadoop for this operation.
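
A minimal sketch of that, assuming the Avro files are on the local file system: there the local shell expands the wildcard before avro-tools ever sees it, so concat receives an explicit list of files. concat_local and AVRO_TOOLS_JAR are names made up for this example, and the paths are placeholders.

```shell
# Hypothetical helper: concatenate Avro files with plain `java -jar`.
# All arguments are passed through to `concat`; the last one is the output file.
# AVRO_TOOLS_JAR is an assumed variable pointing at the avro-tools jar.
concat_local() {
  java -jar "${AVRO_TOOLS_JAR:-avro-tools.jar}" concat "$@"
}

# Example call (placeholder paths, not executed here). The unquoted glob is
# expanded by the local shell into one argument per matching file:
#   concat_local /local/input/part* /local/output/bigfile.avro
```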

Cleodell answered 11/9, 2019 at 19:22 Comment(0)
