How to tackle a BIG DATA Data Mart / Fact Table? (240 million rows)

We have a BI customer that generates about 40 million rows every month in their sales database tables, coming from their sales transactions. They want to build a Sales Data Mart with their historic data from the last 5 years, meaning that this fact table will potentially have about 240 million rows (40 x 12 months x 5 years).

This is well structured data.

This is the first time I have been confronted with this amount of data, which led me to look at column-oriented ("vertical") database tools like Infobright and others. But even with this kind of software, a simple query could take a very, very long time to run.

That led me to take a look at Hadoop, but after reading some articles I concluded that Hadoop is not the best option (even with Hive) for building a fact table, since, as I understand it, it is meant for unstructured data.

So, my question is: what would be the best way to tackle this challenge? Am I looking at the wrong technology? What query response times could I realistically get on such a big fact table? Or am I hitting a real wall here, with the only option being to build aggregated tables?

Squinch answered 7/6, 2012 at 17:56 Comment(8)
What are your requirements? What do you want to do with the data (in detail!)?Dustcloth
We want to do OLAP-style analysis. For instance: what are the top 10 best-selling products over these 5 years? The top 10 brands? And of course more structured queries with more variables, such as: what are the top 5 brands sold over the 5 years among customers aged 20-30 in the USA?Squinch
Thanks, that was helpful. How big is the data on disk in GB? I guess this is a standard star schema? And what query duration requirements do exist (seconds, minutes, hours)?Dustcloth
We are targeting 1 minute per simple query, 5 minutes max for the more complex ones.Squinch
And yes, this is a standard star schema.Squinch
How big is the data on disk in GB? What hardware do you have (disk and CPU)?Dustcloth
A large Oracle Database on a strong server with a good data model should be able to handle this quantity of transactions without issues. If the retailer has 500 stores, we're talking about 2,500 transactions / store / day to get to 40 mil / month. That's not that big. Look into partitioning and indexing.Tedie
For very large DW, you need to start examining Teradata, Netezza, or other MPP solutions.Tedie

Have you checked out Google BigQuery (a paid premium service)? It may well suit your needs. The workflow is as simple as:

  1. Load the data as CSV (newline-delimited records, with a configurable field delimiter). The file can be gzip-compressed. You can also append to an existing table.

  2. Start querying with SQL statements (a limited SQL dialect, though); results come back in seconds even over multi-million-row tables (see the sketch after this list).

  3. Extract the results into a CSV file or into another table (similar to an aggregation layer).
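
As a rough illustration of step 2, this is the kind of query you would run, assuming a hypothetical denormalized sales table loaded into a dataset called sales_mart (the dataset, table, and column names are made up for the example):

    -- Top 10 best-selling products over the 5-year window
    -- (hypothetical dataset/table/column names)
    SELECT product_name,
           SUM(amount) AS total_sales
    FROM sales_mart.sales_fact
    WHERE sale_date >= '2007-01-01'
      AND sale_date <  '2012-01-01'
    GROUP BY product_name
    ORDER BY total_sales DESC
    LIMIT 10;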

Check it out here: https://developers.google.com/bigquery/

The first 100 GB of data processing is free, so you can get started now. It also integrates with Google Spreadsheets, which lets you create visualizations such as charts and graphs for management. You can export the Google spreadsheet as Microsoft Excel / PDF.

Google states it can scale to multiple terabytes and provides real-time querying (a few seconds' response).

Cummins answered 8/6, 2012 at 1:24 Comment(1)
Agreed - a great use case for BigQueryBrussels

First up, I'll assume it's 240m, not 2,400m.

Firstly take a look at ssd.analytical-labs.com

The FCC demo has a 150m-record fact table running on Infobright; I would suspect it would be even faster on VW.

The key is keeping it simple; there will be queries that make it slow down, but largely it's pretty responsive.

I would suggest you think about aggregates, the way you are querying, and, importantly, what you are querying.

For example, split it down into marts for performance: by product, by brand, by year, etc. If the user just wants to query less than one year's worth of data (which is more often the case than most people would like to think), they can then use a much smaller fact table.
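
A minimal sketch of that idea, assuming a hypothetical sales_fact table with a sale_date column (some column-store engines don't support CREATE TABLE AS SELECT, in which case you would create the table first and bulk-load it instead):

    -- One-off build of a much smaller, single-year mart
    -- (hypothetical table/column names)
    CREATE TABLE sales_fact_2011 AS
    SELECT *
    FROM sales_fact
    WHERE sale_date >= '2011-01-01'
      AND sale_date <  '2012-01-01';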

Storage is cheap, so it doesn't particularly matter if you duplicate data, as long as it keeps things responsive.

Of course, if you are doing OLAP you can also make use of inline aggregate tables to make sure most of the queries run at a far more acceptable speed, assuming they've been rolled up.

Hardware is also very important: make sure you have fast disks. That is nearly always the bottleneck; the faster you can get the data off the disks, the faster it will generally be displayed to the end user.

Schema design is also important: modern column-store databases much prefer a denormalised table with zero joins where possible. I have found in the past that having one denormalised table for 90% of queries, plus a few joining tables (a date dimension, for example) for special cases, covers most use cases.
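
A minimal sketch of that layout, with made-up column names standing in for the customer's real dimensions:

    -- One wide, denormalised table that answers ~90% of queries with zero joins
    CREATE TABLE sales_denorm (
      sale_date     DATE,
      product_name  VARCHAR(200),
      brand         VARCHAR(100),
      customer_age  INT,
      country       VARCHAR(100),
      quantity      INT,
      amount        DECIMAL(12,2)
    );

    -- A small date dimension kept around for the special cases
    -- (fiscal calendars, week-of-year rollups, and so on)
    CREATE TABLE date_dim (
      sale_date         DATE,
      calendar_year     INT,
      calendar_quarter  INT,
      calendar_month    INT,
      week_of_year      INT
    );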

Anyway, that's my 2 cents. Ping me on Twitter if you want a Skype call about it or something.

Tom

Edit:

Also, here's a non-scientific benchmark to back up what JVD was saying:

  • ssd on physical box: 175.67 MB/sec
  • sata on physical box: 113.52 MB/sec
  • ec2: 75.65 MB/sec
  • ec2 ebs raid: 89.36 MB/sec

As you can see there is a large difference in read speed.

Laurenlaurena answered 7/6, 2012 at 19:5 Comment(3)
Is this Saiku running on a star schema or a denormalised table?Squinch
Denormalised table. I took the star schema they supplied and munged it up when I imported it.Laurenlaurena
Trout speaks the truth. Stay well away from Hadoop and NoSQL for this kind of use case. Start with a free column-store DB (Infobright, InfiniDB, LucidDB) and investigate paid versions only as needed.Andrews

I think there are a couple of approaches here:

1) You could try aggregate tables in Mondrian. The downside of aggregate tables is that you need to know the most recurrent query use cases beforehand; if you don't, they are not easy to tune, and you will end up with long response times for the queries you didn't build an aggregate table for.
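
A minimal sketch of such an aggregate table, assuming a hypothetical sales_fact / date_dim / product_dim star schema. Mondrian also needs the aggregate either declared explicitly in the schema file or matched by its default recognition rules (which, among other things, expect a fact_count column), so check the Mondrian aggregate-table docs for the exact requirements:

    -- Pre-aggregated rollup of the fact table to (year, brand) grain
    -- (hypothetical star-schema names)
    CREATE TABLE agg_brand_year_sales_fact AS
    SELECT d.calendar_year   AS calendar_year,
           p.brand           AS brand,
           SUM(f.amount)     AS amount,
           COUNT(*)          AS fact_count
    FROM sales_fact f
    JOIN date_dim d    ON d.date_id    = f.date_id
    JOIN product_dim p ON p.product_id = f.product_id
    GROUP BY d.calendar_year, p.brand;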

2) Another option is to partition the fact table data, maybe by year, create a different schema for every year, and a virtual cube for the whole history. If you have the right software you could also create a materialized view (if you have Oracle) or an indexed view (if you have MS SQL Server).
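
For the Oracle route, a minimal sketch under assumed table and column names (the partition grain and refresh strategy would depend on your ETL):

    -- Range-partition the fact table by year
    CREATE TABLE sales_fact (
      sale_date   DATE,
      product_id  NUMBER,
      customer_id NUMBER,
      amount      NUMBER(12,2)
    )
    PARTITION BY RANGE (sale_date) (
      PARTITION p2008 VALUES LESS THAN (TO_DATE('2009-01-01', 'YYYY-MM-DD')),
      PARTITION p2009 VALUES LESS THAN (TO_DATE('2010-01-01', 'YYYY-MM-DD')),
      PARTITION p2010 VALUES LESS THAN (TO_DATE('2011-01-01', 'YYYY-MM-DD')),
      PARTITION p2011 VALUES LESS THAN (TO_DATE('2012-01-01', 'YYYY-MM-DD')),
      PARTITION pmax  VALUES LESS THAN (MAXVALUE)
    );

    -- Materialized view the optimizer can rewrite matching queries against
    CREATE MATERIALIZED VIEW mv_sales_by_product_year
      BUILD IMMEDIATE
      REFRESH COMPLETE ON DEMAND
      ENABLE QUERY REWRITE
    AS
    SELECT p.product_name,
           EXTRACT(YEAR FROM f.sale_date) AS sale_year,
           SUM(f.amount) AS total_amount,
           COUNT(*)      AS row_count
    FROM   sales_fact f
    JOIN   product_dim p ON p.product_id = f.product_id
    GROUP  BY p.product_name, EXTRACT(YEAR FROM f.sale_date);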

The latter approach has worked very well for me, with noticeable improvements in query times. Besides, my ETL process wasn't affected (with option 1 you need to create an extra process to build and maintain the aggregate tables), since the RDBMS takes care of updating the data in every partition.

Enugu answered 7/6, 2012 at 22:34 Comment(1)
From an RDBMS perspective, this is a good answer. 240 million rows is not really "big data" from a data warehouse perspective - we're currently dealing with about 250 million rows of transaction line data per year in our Oracle warehouse.Tedie

You might consider a packaged NoSQL/Analysis solution such as DataStax Enterprise, which uses Apache Cassandra paired with Hadoop and other useful analysis tools. You're right that Hadoop's "default" HDFS file system is well-suited for unstructured data, but integrating it with a NoSQL data store (like Cassandra or HBase) will allow you to more readily analyze your structured data using MapReduce.

Adroit answered 7/6, 2012 at 18:50 Comment(0)

Another combination of technologies that I used successfully for a very large data warehouse is Hadoop + Hive. Data was manipulated using MapReduce jobs and presented to Hive as external tables. Updates were performed by swapping partitions between the staging and data warehouse areas.
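
A minimal sketch of that pattern in HiveQL, with hypothetical paths, columns, and partition scheme:

    -- External table over the files produced by the MapReduce jobs
    CREATE EXTERNAL TABLE sales_fact (
      sale_date    STRING,
      product_id   BIGINT,
      customer_id  BIGINT,
      amount       DOUBLE
    )
    PARTITIONED BY (sale_month STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/warehouse/sales_fact';

    -- "Swap" a freshly built month from the staging area into the warehouse
    -- by pointing that partition at the new directory
    ALTER TABLE sales_fact
      ADD IF NOT EXISTS PARTITION (sale_month = '2012-05')
      LOCATION '/stage/sales_fact/2012-05';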

The big advantage of this approach was that you could run (almost) normal SQL queries on the data. The disadvantage: you can't hook a Hive back end up to an interactive UI front end. But if all you are doing is running daily reports and data mining, this should work.

Wingfooted answered 7/6, 2012 at 19:31 Comment(0)

Hadoop is absolutely suitable for big data like this. You can use it with HBase, which allows you to scale to billions of rows and millions of columns and provides great horizontal scalability as well; it is suited to real-time random read/write access. Hive, on the other hand, is good for batch processing, so you can run Hive jobs in the background for other tasks. We should not mistake Hadoop for an alternative to traditional RDBMSs, but it is really helpful in dealing with huge data sets. You can also use another Apache project, Sqoop, which allows you to import your data from existing databases into a Hadoop cluster without much pain.

Mask answered 7/6, 2012 at 19:5 Comment(0)
