Generating star schema in hive

I am from the SQL data warehouse world, where I generate dimension and fact tables from a flat feed. In general data warehouse projects we divide the feed into facts and dimensions. Ex:

[image: example of splitting a flat feed into fact and dimension tables]

I am completely new to Hadoop, and I came to know that I can build a data warehouse in Hive. Now, I am familiar with using GUIDs, which I think are applicable as primary keys in Hive. So, is the strategy below the right way to load facts and dimensions in Hive?

  1. Load the source data into a Hive table; let's say Sales_Data_Warehouse
  2. Generate the dimensions from Sales_Data_Warehouse; ex:

    SELECT New_Guid(), Customer_Name, Customer_Address From Sales_Data_Warehouse

  3. When all dimensions are done, load the fact table like this (see the Hive sketch after this list):

    SELECT New_Guid() AS 'Fact_Key', Customer.Customer_Key, Store.Store_Key...
    FROM Sales_Data_Warehouse AS 'source'
    JOIN Customer_Dimension Customer
      ON source.Customer_Name = Customer.Customer_Name
     AND source.Customer_Address = Customer.Customer_Address
    JOIN Store_Dimension AS 'Store'
      ON Store.Store_Name = source.Store_Name
    JOIN Product_Dimension AS 'Product' ON .....
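Since New_Guid() is SQL Server syntax, I am guessing that in Hive I would have to generate the surrogate keys with something like reflect('java.util.UUID', 'randomUUID') (newer Hive versions also have a built-in uuid()). A rough sketch of what I mean for steps 2 and 3 (Store_Dimension and the Sales_Amount measure are just placeholders from my example above):

    -- Rough HiveQL sketch of step 2: build a dimension with a generated surrogate key.
    CREATE TABLE Customer_Dimension AS
    SELECT reflect('java.util.UUID', 'randomUUID') AS Customer_Key,
           Customer_Name,
           Customer_Address
    FROM (SELECT DISTINCT Customer_Name, Customer_Address
          FROM Sales_Data_Warehouse) c;

    -- Rough HiveQL sketch of step 3: join back on the natural keys to pick up the surrogate keys.
    CREATE TABLE Sales_Fact AS
    SELECT reflect('java.util.UUID', 'randomUUID') AS Fact_Key,
           cust.Customer_Key,
           st.Store_Key,
           src.Sales_Amount
    FROM Sales_Data_Warehouse src
    JOIN Customer_Dimension cust
      ON src.Customer_Name = cust.Customer_Name
     AND src.Customer_Address = cust.Customer_Address
    JOIN Store_Dimension st
      ON src.Store_Name = st.Store_Name;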

Is this the way I should load my fact and dimension tables in Hive?

Also, in general warehouse projects we need to update dimension attributes (ex: Customer_Address is changed to something else) or have to update a fact table foreign key (rarely, but it does happen). So, how can I do an INSERT-UPDATE load in Hive (like we do a Lookup in SSIS or a MERGE statement in T-SQL)?

Orrin asked 28/3, 2017 at 12:59 Comment(7)
The way you are doing it is correct. Hive supports UPDATE queries from version 0.14. – Fleischman
There are no key concepts in Hive. – Fleischman
If the customer name is changed, then you need to update both tables: the "customer" table and the derived table. – Fleischman
Oh! I thought Hive wouldn't perform row-level updates. – Orrin
Hive doesn't have the UPDATE operation in versions < 0.14. – Fleischman
Thanks for understanding it clearly. – Fleischman
Also, you can merge using a full join for Hive versions < 0.14. See this: https://mcmap.net/q/1012753/-hive-best-way-to-do-incremetal-updates-on-a-main-table – Bedew

We still get the benefits of dimensional models on Hadoop and Hive. However, some features of Hadoop require us to slightly adapt the standard approach to dimensional modelling.

The Hadoop file system (HDFS) is immutable: we can only add data, not update it in place. As a result we can only append records to dimension tables (while Hive has added an update feature and transactions, these seem to be rather buggy). Slowly Changing Dimensions on Hadoop therefore become the default behaviour. In order to get the latest and most up-to-date record in a dimension table we have three options. First, we can create a view that retrieves the latest record using windowing functions. Second, we can have a compaction service running in the background that recreates the latest state. Third, we can store our dimension tables in mutable storage, e.g. HBase, and federate queries across the two types of storage.
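For the first option, a minimal sketch of such a view (assuming the dimension table keeps every version of a record together with a load timestamp; the table and column names below are purely illustrative) could look like this:

    -- Expose only the most recent version of each customer record.
    CREATE VIEW customer_dim_current AS
    SELECT customer_key, customer_name, customer_address, load_ts
    FROM (
      SELECT customer_key, customer_name, customer_address, load_ts,
             ROW_NUMBER() OVER (PARTITION BY customer_name
                                ORDER BY load_ts DESC) AS rn
      FROM customer_dim
    ) versions
    WHERE rn = 1;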

The way data is distributed across HDFS makes it expensive to join data. In a distributed relational database (MPP) we can co-locate records with the same primary and foreign keys on the same node in a cluster. This makes it relatively cheap to join very large tables: no data needs to travel across the network to perform the join. This is very different on Hadoop and HDFS. On HDFS tables are split into big chunks and distributed across the nodes of our cluster. We don't have any control over how individual records and their keys are spread across the cluster. As a result, joins on Hadoop between two very large tables are quite expensive, as data has to travel across the network. We should avoid joins where possible. For a large fact table and a dimension table we can de-normalise the dimension table directly into the fact table. For two very large transaction tables we can nest the records of the child table inside the parent table and flatten out the data at run time. We can use SQL extensions such as array_agg in BigQuery/Postgres etc. to handle multiple grains in a fact table.
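In Hive, the rough equivalent of array_agg is collect_list. As an illustration only (the orders/order_items tables and their columns are made up, and collect_list over structs needs a reasonably recent Hive version), nesting a child transaction table inside its parent and flattening it back out at query time could look like this:

    -- Nest the child records (order line items) inside the parent (orders).
    CREATE TABLE orders_nested AS
    SELECT o.order_id,
           o.customer_name,
           collect_list(named_struct('product_name', i.product_name,
                                     'quantity',     i.quantity,
                                     'amount',       i.amount)) AS items
    FROM orders o
    JOIN order_items i ON o.order_id = i.order_id
    GROUP BY o.order_id, o.customer_name;

    -- Flatten the nested records back out at run time.
    SELECT n.order_id, n.customer_name,
           item.product_name, item.quantity, item.amount
    FROM orders_nested n
    LATERAL VIEW explode(n.items) exploded AS item;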

I would also question the usefulness of surrogate keys. Why not use the natural key? Performance for complex compound keys may be an issue, but otherwise surrogate keys are not really useful, and I never use them.

Conni answered 1/7, 2017 at 7:24 Comment(2)
"First, we can create a View that retrieves the latest record using windowing functions" – how would you do that using windowing? You need to get the latest version of all the rows from the beginning of time. The only way I can think of achieving this is using GROUP BY with max_by(my_field, timestamp). – Intromit
Look here: cwiki.apache.org/confluence/display/Hive/… – Conni
