Custom SerDe not supported by Impala, what's the best way to query files in CSV w/double quotes?
Asked Answered
F

3

6

I have a CSV data with each field surronded with double quotes. When I created Hive table used serde 'com.bizo.hive.serde.csv.CSVSerde' When above table is queried in Impala I am getting error SerDe not found.

I added the CSV Serde JAR file in /usr/lib/impala/lib folder.

Later studied in Impala documentation that Impala does not support custom SERDE. In such case how I can overcome this issue such that my CSV data with quotes is taken care. I want to use CSV Serde because it takes of commas in values which is a legitimate field vavlue.

Thanks a lot

Flambeau answered 3/9, 2014 at 10:56 Comment(0)
R
5

Can you use Hive? If so, here is an approach that might work. CREATE your table as an EXTERNAL TABLE in Hive and use your SERDE in the right place of the CREATE Statement (I think you need something like ROW FORMAT SERDE your_serde_here at the end of the CREATE TABLE statement). Before this you might need to do:

ADD JAR 'hdfs:///path/to/your_serde.jar' 

Note that the jar should be somewhere in hdfs and triple /// needed for it to work...

Then, still in Hive, duplicate the table into another table that is stored in a format with which Impala can easily work, such as PARQUET. Something like the following does this copying:

CREATE TABLE copy_of_table 
   STORED AS PARQUET AS
   SELECT * FROM your_original_table

Now in Impala use INVALIDATE METADATA to mark the metadata as stale:

INVALIDATE METADATA copy_of_table

You should be all set to happily work with copy_of_table in Impala now.

Let me know whether this works, as I might have do to something like this in the near future.

Regenerative answered 16/12, 2014 at 18:52 Comment(1)
The correct query is CREATE TABLE copy_of_table STORED AS PARQUET AS SELECT * FROM your_original_table otherwise Hive return a syntax exception.Monnet
E
1

Within Hive

CREATE TABLE mydb.my_serde_table_impala AS SELECT FROM mydb.my_serde_table

Within Impala

INVALIDATE METADATA mydb.my_serde_table_impala

Add these steps to include dropping the _impala table first with whatever populates or ingests files for the serde table.

Impala bypasses MapReduce, unlike Hive. So Impala can't/doesn't use the SerDe the way MapReduce does.

Elvinelvina answered 29/3, 2017 at 21:13 Comment(0)
M
0

Now the default SerDe class is org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe and it is supported by Impala. Unfortunately it has much less features, for exmaple, it does not support escape chars.

Maiocco answered 15/1, 2020 at 9:55 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.