Are Mondrian / OLAP the wrong tool for joining large dimensions/sets?
Asked Answered
P

4

6

Summary: Most of the examples I've seen of MDX joins have involved joining relatively small sets, say with tens or hundreds of items each. But I find myself also wanting to try joining (in particular "non-empty joining") sets that have thousands or tens of thousands of items each, and it's not working well so far. I'm wondering if this could be made to work, or if I perhaps need to consider using something other than Mondrian/OLAP.

To be concrete, I have a cube that records interactions between Firms (n=7000) and Clients (n=27000). Currently both Firm and Client are completely flat hierarchies; there's the All level and the individual-company level, with no other levels in between. There is a central fact table, and separate dimension tables for Firms and for Clients.

My users at least appear to want to get summary reports along these lines, aggregating all the non-empty interactions between Firms and Clients:

select
  [Measures].[Amount] on columns,
  NonEmptyCrossJoin([Firm].Children,
                      [Client].Children) on rows
from MyCube

But this query and variations on it don't work in my test Mondrian setup. Either I get an OutOfMemoryException (on a 2GB Java heap), or Java seems to spend impossibly long time in mondrian.rolap.RolapResult$AxisMember.mergeTuple(TupleCursor). (I can provide a more complete stack trace if it would help.) By "impossibly long" I mean Java will stay slaving away at the query for hours and hours before I give up.

I initially expected the above query to perform ok, because conceptually it could be done somewhat efficiently by just doing a SQL query along these lines:

select Firm, Client, Sum(Amount) as n
from fact, firm, client
where fact.firmid = firm.firmid and fact.clientid = client.clientid
group by Firm, Client

(In fact, if I execute something like this directly in MySql it doesn't take more than 15sec to execute.)

But from the debug logs Mondrian doesn't seem to attempt this optimization. Instead it appears to be doing the join internally, and in a way that ends up being particularly slow. I've set mondrian.native.crossjoin.enable=true in my mondrian.properties, but this doesn't seem like one of the join types that Mondrian is able to "make native". (If I turn on mondrian.native.unsupported.alert=ERROR then I get the corresponding exception.)

I'm left wondering whether I need to prevent my users from attempting joins on such large dimensions/sets, or whether Mondrian is maybe not the tool I'm looking for here. But maybe I'm just doing something wrong.

Prevenient answered 19/11, 2011 at 0:44 Comment(0)
P
2

To follow up, I tried setting up an analogous cube in Sql Server Analysis Services (Sql Server 2008), and it seems that icCube has a point about different OLAP tools performing differently:

Even before I learned much about SSAS best practices, performance on this type of MDX was much improved. A query along these lines

select
  [Measures].[Amount] on columns,
  NON EMPTY
  crossjoin([Firms].[Firm Name].Children,
            [Clients].[Client Name].Children)
  on rows
from MyCube

went from being non-viable with Mondrian to taking around ten seconds under Sql Server. Conceivably this has to do with MS' Business Intelligence Development Studio guiding me into creating a MOLAP cube by default, or perhaps SSAS has a smarter query planner.

In any case, perhaps this is fast enough for me. If not, I'm not yet sure how much more optimized SSAS can get in this case. (One disappointing thing is that, even when I re-run the query a second time, it still takes about 10 sec; I was hoping caching might have a more dramatic effect.)

Tangentially, you may notice in the just-quoted MDX I've replaced my original NonEmptyCrossJoin with a normal crossjoin combined with NON EMPTY. This is because, at least in the Sql Server world, NonEmptyCrossJoin is apparently regarded as a deprecated bad practice. (This is noted in Microsoft's MDX Language Reference. Mosha, one of the former SSAS devs, describes the situation in an article called MDX: NonEmpty, Exists and evil NonEmptyCrossJoin. Short version is that NonEmptyCrossJoin has confusing semantics and limited application, and since Sql Server 2005 or so, the query optimizer has been smart enough to make your query fast without NonEmptyCrossJoin.) So I've substituted a more modern approved equivalent in the above MDX. (It does still work with NonEmptyCrossJoin as well, though NonEmptyCrossJoin does not speed things up at all.)

Prevenient answered 30/11, 2011 at 3:25 Comment(0)
T
2

I'm not 100% sure, but have you tried setting:

mondrian.native.nonempty.enable = true

This optimisation seems to push some operations like that down to the sql level - Sounds like it could help.

Tonetic answered 2/4, 2012 at 14:29 Comment(5)
mondrian.native.nonempty.enable is indeed probably worth checking in addition to mondrian.native.crossjoin.enable, although the docs claim that both already default to true. Unfortunately I don't have my mondrian setup readily available to see whether I already tried your suggestion, or see if it would make a difference in my case.Prevenient
Ok was worth mentioning. I have a pentaho 3.8 server build and its set false there was wondering why! Curiously it didnt help me either.Tonetic
For anyone else who comes across this, i also recommend reviewing your approxRowCount and highCardinality settings.Tonetic
mondrian.native.nonempty.enable=true helped me. I used non empty years and quarters (about 50) as columns and non empty users (about 75k) as rows, it was taking forever before setting this parameter to true, now less than second.Audiometer
yeah dont understand why this isnt the default!Tonetic
K
1

I'll answer the part of OLAP. There three big families of OLAP tools. ROLAP, MOLAP and HOLAP.

ROLAP, Relational, are build on a relation database. MDX request, if cache is missed, are performed in a relational database using a SQL statement. They have an advantage of scalability, by delagation, but depened for their performance on the underlying data base. QoS might be tricky as it's the db QoS.

MOLAP, InMemory, copy the data into internal structures (memory). Here the QoS, answer times, are more stable and faster as all processing is done in the same server. The issue with MOLAP is scalability as you might get out-of memory (>100mio).

HOLAP those are a mixed of ROLAP and MOLAP. I've no direct experience but in theory they can bring the best of both worlds.

Looking at the numbers you should not get any problem with MOLAP tools, it's really a small cube.

So, before leaving the OLAP world, give a chance to the MOLAP servers. For a list of OLAP servers you can check wikipedia

Kelle answered 20/11, 2011 at 15:4 Comment(0)
B
0

Mondrian OLAP dont have suport a large databases.

Well, im developing The Bitmap Join Index OLAP Tool (BJIn OLAP), that is a OLAP tool based in Java open source. This use a SQL variance syntax, not MDX.

Documentation here

Trial version here

Bred answered 13/12, 2011 at 23:34 Comment(1)
Sir, instead of a link to the Trial version, could you provide us with a link to the source code of this open source software? I couldn't find it on the project page. Thanks.Dumpy

© 2022 - 2024 — McMap. All rights reserved.