Hive: Best way to do incremental updates on a main table

So I have a main table in Hive; it will store all my data.

I want to be able to load an incremental data update about every month, with a large amount of data (a couple billion rows). There will be new data as well as updated entries.

What is the best way to approach this? I know Hive was recently upgraded and supports update/insert/delete.

What I've been thinking is to somehow find the entries that will be updated, remove them from the main table, and then just insert the new incremental update. However, after trying this, the inserts are very fast, but the deletes are very slow.

The other way is to use the UPDATE statement to match the key values between the main table and the incremental update and update their fields. I haven't tried this yet. This also sounds painfully slow, since Hive would have to update each entry one by one.

Anyone got any ideas on how to do this most efficiently and effectively? I'm pretty new to Hive and databases in general.

Crispation answered 8/6, 2016 at 17:44 Comment(0)

If merge in ACID mode is not applicable, then it's possible to update using a FULL OUTER JOIN or using UNION ALL + row_number(). To find all entries that will be updated, you can join the increment data with the old data:

insert overwrite table target_data [partition() if applicable]
SELECT
  --select the new value if it exists in the increment, the old value if it does not
  case when i.PK is not null then i.PK   else t.PK   end as PK,
  case when i.PK is not null then i.COL1 else t.COL1 end as COL1,
  ...
  case when i.PK is not null then i.COL_n else t.COL_n end as COL_n
FROM
    target_data t --restrict partitions if applicable
    FULL OUTER JOIN increment_data i on (t.PK=i.PK);
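
For completeness, when merge in ACID mode is applicable (i.e. the target is a transactional table), the same upsert can be a single statement. A rough sketch, assuming for brevity that the table has only the columns PK, COL1 and COL_n:

merge into target_data as t
using increment_data as i
on t.PK = i.PK
when matched then update set COL1 = i.COL1, COL_n = i.COL_n
when not matched then insert values (i.PK, i.COL1, i.COL_n);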

The FULL JOIN query can be optimized by restricting the partitions in target_data that will be overwritten and joined, using WHERE partition_col in (select distinct partition_col from increment_data), or by passing the partition list as a parameter, if possible, and using it in the WHERE clause; that will work even faster.
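
A rough sketch of the parameter variant, assuming a hypothetical partition column partition_col, only PK and COL1 as data columns for brevity, and that rows do not migrate between partitions across loads (the comment thread below covers the migrating case):

--dynamic partitioning so only the partitions that receive rows are rewritten
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--hypothetical parameter: the partitions touched by the increment
set hivevar:affected_parts='2016-05-01','2016-06-01';

insert overwrite table target_data partition (partition_col)
select
  case when i.PK is not null then i.PK   else t.PK   end as PK,
  case when i.PK is not null then i.COL1 else t.COL1 end as COL1,
  coalesce(t.partition_col, i.partition_col) as partition_col --dynamic partition column goes last
from
  (select * from target_data
    where partition_col in (${hivevar:affected_parts})) t     --scan only the affected partitions
  full outer join increment_data i on (t.PK = i.PK);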

Also, if you want to update all columns with new data, you can apply this solution with UNION ALL + row_number(); it works faster than a full join: https://mcmap.net/q/1020540/-how-to-update-table-in-hive-0-13
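
A minimal sketch of that UNION ALL + row_number() variant, assuming for brevity that both tables contain only PK and COL1 and that a row coming from the increment should always win:

insert overwrite table target_data
select PK, COL1
from
(
  select s.*,
         row_number() over (partition by PK order by priority) as rn
  from
  (
    select 1 as priority, i.PK, i.COL1 from increment_data i --increment rows rank first
    union all
    select 2 as priority, t.PK, t.COL1 from target_data t    --old rows survive only if there is no new version
  ) s
) ranked
where rn = 1;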

Putnem answered 10/6, 2016 at 8:58 Comment(8)
Thank you! This is a great idea! - Crispation
@leftjoin.. could you please elaborate more on how to restrict partitions during the update? - Tenorrhaphy
@vikrantrana By adding a partition filter like this: WHERE partition_col in (select distinct partition_col from increment_data), or pass the partition list as a parameter and use it in the WHERE clause. - Putnem
@leftjoin - can we also specify partitions explicitly, like WHERE partition_col in ('2018-01-24', '2018-03-15'), if the partition column is a date column? - Tenorrhaphy
@vikrantrana Yep, sure, you can even pass the list as a parameter. Everything is easy if the partition key is part of the join key. If not, and incremental updates cause rows to migrate from one partition to another, then you need to rewrite all affected partitions (those present in the target and in the increment for the incremental set of join keys). - Putnem
@Putnem Do you have any good pointer or reference for implementing SCD2 in Hive or Spark? I have been searching for this for so long but was not able to find anything meaningful. - Tenorrhaphy
@vikrantrana Classic SCD2 is not popular in Hive because reprocessing historical data requires non-equi joins using >=start_date and <end_date or end_date is null, and non-equi joins are a pain. It's much easier to have all dates in the dimension and join with an equi-join using date=date. Isn't it big data we are working with? Synthetic keys are also not popular in Hive because their generation is not fully parallel; natural keys or GUIDs (not sortable) are used. - Putnem
@leftjoin Yeah, thanks for your valuable inputs. I was trying the join by generating a surrogate key with row_number, but the joins were running slow even with a small set of data. There's little info available for SCD2. - Tenorrhaphy

Here is my solution/workaround if you are using an old Hive version. It works better when you have a large amount of data in the target table that can't be dropped and recreated with the full data every time.

Create one more table, say a delete_keys table. It will hold every key from the main table that is logically deleted, along with its surrogate key.

While loading incremental data into the main table, do a left join with the main table. For all the matching records, we ideally should update the main table. Instead, we take the keys (along with the surrogate key) from the main table for all matching records and insert them into the delete_keys table. Now we can insert all delta records into the main table as-is, irrespective of whether they are to be updated or inserted.

Create a view on the main table using the delete_keys table so that keys matching the delete_keys table are not fetched. This view will be the final target table: it will not show the records from the main table that have been replaced by the latest records.
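
A minimal sketch of this workaround, assuming hypothetical names main_table (with natural key PK and surrogate key column surr_key), delta for the incremental load, and delete_keys for the superseded keys:

--1) before loading, capture the keys of main-table rows that the delta supersedes
insert into table delete_keys
select m.PK, m.surr_key
from delta d
join main_table m on (d.PK = m.PK);

--2) append the delta rows as-is (new and updated alike, each carrying a fresh surr_key)
insert into table main_table
select * from delta;

--3) expose only the live rows; this view is the final target
create view if not exists main_table_view as
select m.*
from main_table m
left join delete_keys k on (m.surr_key = k.surr_key)
where k.surr_key is null;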

Indecorum answered 3/11, 2020 at 8:23 Comment(0)
