Vertica and joins

Asked 23/11, 2012 at 14:42 Answered 26/12, 2014 at 21:38

I'm adapting a web analysis tool to use Vertica as the DB. I'm having real problems optimizing joins. I tried creating pre-join projections for some of my queries, and while it did make the queries blazing fast, it slowed data loading into the fact table to a crawl.

A simple INSERT INTO ... SELECT * FROM which we use to load data into the fact table from a staging table goes from taking ~5 seconds to taking 20+ minutes.

Because of this I dropped all pre-join projections and tried using the Database Designer to design query specific projections but it's not enough. Even with those projections a simple join is taking ~14 seconds, something that takes ~1 second with a pre-join projection.

My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?

We're running Vertica on a 5 node cluster, each node having 2 x quad core CPU and 32 GB of memory. The tables in my example query have 188,843,085 and 25,712,878 rows respectively.

The EXPLAIN output looks like this:

EXPLAIN SELECT referer_via_.url as referralPageUrl, COUNT(DISTINCT sessio
n.id) as visits FROM owa_session as session JOIN owa_referer AS referer_vi
a_ ON session.referer_id = referer_via_.id WHERE session.yyyymmdd BETWEEN 
'20121123' AND '20121123' AND session.site_id = '49' GROUP BY referer_via_
.url  ORDER BY visits DESC LIMIT 250;

Access Path:
+-SELECT  LIMIT 250 [Cost: 1M, Rows: 250 (STALE STATISTICS)] (PATH ID: 0)
|  Output Only: 250 tuples
|  Execute on: Query Initiator
| +---> SORT [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 1)
| |      Order: count(DISTINCT "session".id) DESC
| |      Output Only: 250 tuples
| |      Execute on: All Nodes
| | +---> GROUPBY PIPELINED (RESEGMENT GROUPS) [Cost: 1M, Rows: 1 (STALE 
STATISTICS)] (PATH ID: 2)
| | |      Aggregates: count(DISTINCT "session".id)
| | |      Group By: referer_via_.url
| | |      Execute on: All Nodes
| | | +---> GROUPBY HASH (SORT OUTPUT) (RESEGMENT GROUPS) [Cost: 1M, Rows
: 1 (STALE STATISTICS)] (PATH ID: 3)
| | | |      Group By: referer_via_.url, "session".id
| | | |      Execute on: All Nodes
| | | | +---> JOIN HASH [Cost: 1M, Rows: 1 (STALE STATISTICS)] (PATH ID: 
4) Outer (RESEGMENT)
| | | | |      Join Cond: ("session".referer_id = referer_via_.id)
| | | | |      Execute on: All Nodes
| | | | | +-- Outer -> STORAGE ACCESS for session [Cost: 463, Rows: 1 (ST
ALE STATISTICS)] (PUSHED GROUPING) (PATH ID: 5)
| | | | | |      Projection: public.owa_session_projection
| | | | | |      Materialize: "session".id, "session".referer_id
| | | | | |      Filter: ("session".site_id = '49')
| | | | | |      Filter: (("session".yyyymmdd >= 20121123) AND ("session"
.yyyymmdd <= 20121123))
| | | | | |      Execute on: All Nodes
| | | | | +-- Inner -> STORAGE ACCESS for referer_via_ [Cost: 293K, Rows:
26M] (PATH ID: 6)
| | | | | |      Projection: public.owa_referer_DBD_1_seg_Potency_2012112
2_Potency_20121122
| | | | | |      Materialize: referer_via_.id, referer_via_.url
| | | | | |      Execute on: All Nodes

Neubauer answered 23/11, 2012 at 14:42 Comment(6)

I know nothing about Vertica, but a hint like (STALE STATISTICS) would worry me very much. Did you try to update the statistics? – Sorely 23/11, 2012 at 18:47

Biggest flag to me is the lack of filters on public.owa_referer_DBD_1_seg_Potency_2012112. My first approach to improving query performance is ensure all projections have the maximum number of filters applied. – Instability 24/11, 2012 at 16:23

STALE STATISTICS can get reported sometimes even when stats are relatively fresh. I don't expect running ANALYZE_STATISTICS to suddenly speed things up. Worth trying, certainly. – Flagellate 25/11, 2012 at 19:35

Can you profile the query and upload the profile data somewhere and I will try to tell you whats going on? Questions What is the encoding like on the two projections? Are the tables partitioned? What is the planned concurrency set to? – Counterblast 28/11, 2012 at 17:24

We are having a similar problem, but I think we came a tiny bit further optimizing the projections. It's still too slow, though. See: vertica-forums.com/… Have you made any progress since you posted this question? – Amaurosis 30/1, 2013 at 13:1

If you posted your actual query in addition to that this would make a great candidate for SE Code Review. codereview.stackexchange.com – Heptagonal 29/5, 2014 at 17:8

To speedup join:

Design session table as being partitioned on column "yyyymmdd". This will enable partition pruning
Add condition on column "yyyymmdd" to _referer_via_ and partition on it, if it is possible (most likely not)
have column site_id as possible close to the beginning of order by list in used (super)projection of session
- have both tables segmented on referer_id and id correspondingly.

And having more nodes in cluster do help.

Laporte answered 26/12, 2014 at 21:38 Comment(0)

My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins?

I guess the amount affected would vary depending on data sets and structures you are working with. But, since this is the variable you changed, I believe it is safe to say the pre-join projection is causing the slowness. You are gaining query time at the expense of insertion time.

Someone please correct me if any of the following is wrong. I'm going by memory and by information picked up with conversations with others.

You can speed up your joins without a pre-join projection a few ways. In this case, the referrer ID. I believe if you segment your projections for both tables with the join predicate that would help. Anything you can do to filter the data.

Looking at your explain plan, you are doing a hash join instead of a merge join, which you probably want to look at.

Lastly, I would like to know via the explain plan or through system tables if your query is actually using the projections Database Designer has recommended. If not, explicitly specify them in your query and see if that helps.

Singapore answered 26/11, 2012 at 16:54 Comment(0)

You seem to have a lot of STALE STATISTICS. Responding to STALE statistics is important. Because that is the reason why your queries are slow. Without statistics about the underlying data, Vertica's query optimizer cannot choose the best execution plan. And responding to STALE statistics only improves SELECT performance not update performance.

If you update your tables regularly do remember there are additional things you have to consider in VERTICA. Please check the answer that I posted to this question. I hope that should help improve your update speed.

Explore the AHM settings as explained in that answer. If you don't need to be able to select deleted rows in a table later, it is often a good idea to not keep them around. There are ways to keep only the latest epoch version of the data. Or manually purge deleted data. Let me know how it goes.

Largeminded answered 20/6, 2013 at 22:6 Comment(0)

I think your query could use some more of being explicit. Also don't use that Devil BETWEEN Try this:

EXPLAIN SELECT 
    referer_via_.url as referralPageUrl, 
    COUNT(DISTINCT session.id) as visits 
FROM owa_session as session 
JOIN owa_referer AS referer_via_ 
    ON session.referer_id = referer_via_.id
WHERE session.yyyymmdd <= '20121123' 
AND session.yyyymmdd > '20121123' 
AND session.site_id = '49' 
GROUP BY referer_via_.url
-- this `visits` column needs a table name
ORDER BY visits DESC LIMIT 250;

I'll say I'm really perplexed as to why you would use the same DATE with BETWEEN may want to look into that.

Heptagonal answered 28/6, 2014 at 6:57 Comment(0)

this is my view coming from an academic background working with column databases, including Vertica (recent PhD graduate in database systems).

Blockquote My question is this: Is it normal for a pre-join projection to slow data insertion this much and if not, what could be the culprit? If it is normal, then it's a show stopper for us and are there other techniques we could use to speed up the joins? Blockquote

Yes, updating projections is very slow and you should ideally do it only in large batches to amortize the update cost. The fundamental reason is that each projection represents another copy of the data (of each table column that is part of the projection).

A single row insert requires adding one value (one attribute) to each column in the projection. For example, a single row insert in a table with 20 attributes requires at least 20 column updates. To make things worse, each column is sorted and compressed. This means that inserting the new value in a column requires multiple operations on large chunks of data: read data / decompress / update / sort / compress data / write data back. Vertica has several optimization for updates but cannot hide completely the cost.

Projections can be thought of as the equivalent of multi-column indexes in a traditional row store (MySQL, PostgreSQL, Oracle, etc.). The upside of projections versus traditional B-Tree indexes is that reading them (using them to answer a query) is much faster than using traditional indexes. The reasons are multiple: no need to access head data as for non-clustered indexes, smaller size due to compression, etc. The flipside is that they are way more difficult to update. Tradeoffs...

Madriene answered 1/10, 2014 at 16:34 Comment(0)

Recommended topics

Hot tags