Hive: When insert into partitioned table, in most of the rows, hive double url-encode the partition key column
Asked Answered
S

1

8

I created a partitioned table:

create table  t1 ( amount double) partitioned by ( events_partition_key string) stored as paquet;

added some data to tmp_table, where column 'events_partition_key' contains timestamp (string type) in the following format: "2018-02-25 00:00:00"

Then I insert some data into the partitioned table.

insert into table t1 partition (events_partition_key)
select amount, events_partition_key
from tmp_table

when selecting from the new partitioned table t1, in some cases events_partition_key column is presented regularly the same it appeared in tmp_table i.e. "2018-02-25 00:00:00", but in most cases, it appears URL-encoded, i.e. "2018-02-25 00%3A00%3A00"

this is how it looks like when selecting distinct events_partition_key

In any case, the original column partition_key has no difference in cases where it was URL-encoded or not,

when showing partitions for the new table:

show partitions t1;

I got everything URL-encoded twice (i.e. "2018-02-25 00%253A00%253A00"), but some cases it was only once (i.e. "2018-02-25 00%3A00%3A00")

enter image description here

there is nothing wrong in the original values, only after it has become a partition key.

Steatopygia answered 25/2, 2018 at 12:52 Comment(2)
I also tried it with 'textfile' instead of 'parquet' and had the same problem. tried it with a small dataset, it worked fine, even with partition keys that previously had double encodingSteatopygia
tested on hive version: hive-1.1.0-cdh5.9.1Steatopygia
S
1

If possible partition by date instead of timestamp.

If not possible:

This Jira HIVE-3679 adds hive.decode.partition.name property to the HiveConf.java in release 0.10.0

Try to set it true.

set hive.decode.partition.name=true;
Splash answered 14/5, 2018 at 17:47 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.