Chris's profileChris Webb's BI BlogBlogLists Tools Help

Chris Webb

Sandbox

Loading...
9/2/2010

One Cube vs Multiple Cubes

One of the questions discussed in the book that Marco, Alberto and I wrote last year, “Expert Cube Development with SSAS 2008” (available in all good bookshops, folks!) was whether, if you have multiple fact tables, you should create one big cube with multiple measure groups or multiple cubes each with a single measure group. While I still stand by what we wrote then, I recently took part in an interesting debate on this subject in the MSDN Forum with Akshai Mirchandani from the dev team about the pros and cons of each approach where some interesting new details came to light:

http://social.technet.microsoft.com/Forums/en-US/sqlanalysisservices/thread/681e59bd-93ca-4a91-9f26-8ed96e825553

Here are the main points you need to consider when deciding whether to use the single cube approach or the multiple cube approach. In favour of the multiple cube approach:

  • Having multiple, smaller cubes may result in faster query performance than one large cube in some cases, especially if your fact tables have very different dimensionality. This was true in SSAS 2005, and while it’s less obvious in 2008 it’s apparently still there. This was what I’d previously not been sure about: I’d heard rumours about this, and seen it happen in some cases myself with 2005 - although in other cases when I’d tested this out I’d seen no difference in performance – and I wasn’t sure what the situation was with 2008. You’d need to test the two approaches yourself on your cubes and queries to be sure. Here’s what Akshai says on the matter:

    if you keep adding lots of dimensions to a cube, then the virtual space of the cube grows -- it does not add to the storage cost, but it does hurt formula engine performance in some scenarios because the cell coordinates are based on the number of attributes in the cube space. Increasing the number of attributes in the cube space will start costing performance in lots of small ways and result in performance regressions. Adding lots of unrelated measure groups would result in you adding lots unrelated dimensions to the cube space and cause a performance slowdown -- if you had 10 measure groups and they all shared lots of common dimensions, then one cube makes the most sense…

    …As I already explained… it affects the sizes of data structures inside the formula engine that are based on the number of attributes in the cube space. When those data structures get larger, there is an incremental cost that can add up (depending on your calculations and query patterns).

    For example, you see the Query Subcube Verbose events in Profiler -- they show you the subcubes that are used for accessing measure groups. There are similar subcubes that are used for calculating formulas and cell coordinates -- all those subcubes get wider and wider as you start adding more attributes into the cube. The cost of accessing and indexing those data structures is what we're talking about here. If adding new measure groups doesn't require adding new attributes/dimensions, then there is no problem...

    We had measured the difference before 2005 shipped for some real customer cubes and found there there was a noticeable performance improvement to split up into multiple cubes…

  • While it is possible to apply dimension security to the Measures dimension, it is much easier to allow or deny access to a cube with the multiple cube approach than it is to apply security to all the measures in a measure group using the single cube approach.
  • Having multiple, simpler cubes can be much more user friendly than one monster cube with loads of dimensions and measure groups. If you have Enterprise Edition you can of course use Perspectives to counter this, but if you are using Standard Edition then Perspectives aren’t available.
  • Maintenance can be easier and less disruptive with multiple cubes: if you need to make changes to a cube while users are querying it, you might end up invalidating users’ connections and dropping caches. With one cube the chances of this disruption affecting more users increases.
  • It’s easier to scale out with multiple cubes: if you find your server is maxing out, you can simply buy another server and distribute your cubes equally between the two. With a single cube approach you end up having to look at (admittedly not that much) more complex scale-out scenarios like network load balancing.

On the other side, here are the arguments in favour of the single cube approach:

  • If you ever need to work with data from two fact tables in the same query or calculation, or if you think you might ever need to in the future, you should go with the single cube approach. The two options for cross-cube querying, linked measure groups and the LookUpCube MDX function, should be avoided. Linked measure groups are a pain to manage, carry a slight query performance overhead, and can result in the same MDX calculations being duplicated across the original cube and the cube containing the linked measure group (which means maintenance becomes more difficult). The LookUpCube function is probably the worst MDX function to use in a calculation from a performance point of view and should be avoided at all costs. So a single cube is the only feasible option.
  • Even if your users tell you they will not ever need to analyse data from two fact tables in the same query, be prepared for them to change their minds. In my experience, SSAS projects have a tendency to grow in complexity over time, and cubes that start out simple in a first release often grow lots of new functionality as time goes on - and the more successful the project, the quicker things get complicated. As soon as your users see what's possible with SSAS they will start to have new, more ambitious ideas about the kind of analysis they want to do with their data, and it's very likely that they will realise they do need to do cross-measure-group queries and calculations. If you started out on the multiple cube approach and then this happens you will have no choice but to use linked measure groups, and as I said this can make maintenance difficult; using the single-cube approach from the start means you won't have this problem.

My personal preference is to use the single cube approach by default, and then move to multiple cubes if there are pressing reasons to do so, for example if query performance is a problem. This might seem a bit strange given the number of reasons I’ve given for the multiple cube approach, but frankly the need to support cross-measure-group querying and calculations trumps them all. As I said, if you need to do it (and 99% of the time you will), or you even half suspect you might need to do it sometime in the future, you have to go with the single cube approach. That said, I know other people are more inclined to the multiple cube approach than I am and to a certain extent it’s a matter of taste.

8/29/2010

SQLBits Sessions

OK, I’m back from holiday now and catching up with my blogging backlog. First of all, I need to mention that the SQLBits agenda has now been published and I’ll be doing two sessions: one on DAX and one on common SSAS design mistakes. You can see the agenda and more about my sessions here:

http://www.sqlbits.com/information/Agenda.aspx

It’s a truly stellar lineup this time; as Simon says, we have 20 MVPs and 4 former MVPs speaking. It’s also a much more international conference too – looking at the registration stats we’ve got people coming from Denmark, France, Germany, India, Ireland, Namibia, Poland, Portugal, Sweden, Switzerland, the Netherlands and the US as well as the UK.

Don’t forget also I’m doing a full day seminar on performance tuning SSAS on the SQLBits training day too:

http://www.sqlbits.com/information/TrainingDay.aspx

8/21/2010

Teradata does MDX

Another major BI vendor supports MDX: the Teradata OLAP Connector is an OLEDB for OLAP Provider for Teradata’s new ROLAP tool (for more information search for the Teradata Business Intelligence Optimizer). Here’s the press release:
http://www.prnewswire.com/news-releases/teradata-olap-connector-enriches-direct-excel-connectivity-enhancing-enterprise-and-real-time-analytics-101075004.html

It’s a boost for MDX as a cross-platform language; I wouldn’t be surprised if other vendors did something similar.

UPDATE: I thought I detected the hand of Simba at work here:
http://www.simba.com/news/Simba-Builds-Custom-ROLAP-BI-Solution-for-Teradata.htm




8/19/2010

Dryad goes commercial?

I’m in the middle of my summer holiday at the moment, hence the relative silence here, but I just noticed this article by Mary Jo Foley suggesting that Microsoft might be turning Dryad into a commercial product next year and couldn’t resist posting it:
http://www.zdnet.com/blog/microsoft/microsoft-research-parallel-programming-project-set-to-go-commercial-in-2011/7161

There was a flurry of interest in it a few years ago but I thought all had gone quiet; maybe this is this the source of the rumours that Microsoft/Hadoop rumours too? I also notice that SQL Server/SSIS has disappeared from the ‘software stack’ diagram; is there going to be any tie-in with the rest of Microsoft’s BI efforts?

8/3/2010

Order of Nested SCOPE Statements

Funny how you can work with a product for years and years and still discover new things, isn’t it? I was writing some scoped assignments on a cube the other day and found that when you’re using nested SCOPE statements, the order that you put those SCOPE statements in the MDX Script is significant, contrary to what I had believed.

Consider a simple date dimension with the following attributes and attribute relationships:

image

Now, if we add the following MDX to the script, to scope on every member (including the All Member) on the date dimension:

CREATE MEMBER CURRENTCUBE.MEASURES.DEMO AS 1;

SCOPE(MEASURES.DEMO);
    SCOPE([Order Date].[Date].MEMBERS);
        THIS = 2;
    END SCOPE;
END SCOPE;

You’ll see that it has changed the values of the DEMO measure for the whole of the date dimension from 1 to 2:

image

Now, if we add a second nested SCOPE on all the members of the Month level of the Month attribute (ie so not including the All Member on Month) as follows:

CREATE MEMBER CURRENTCUBE.MEASURES.DEMO AS 1;
SCOPE(MEASURES.DEMO);
    SCOPE([Order Date].[Date].MEMBERS);
        SCOPE([Order Date].[Month].[Month].MEMBERS);
            THIS = 2;
        END SCOPE;
    END SCOPE;
END SCOPE;

You’ll see that it now only changes month values, and nothing else:

image

This set of assignments gives the same result as the following, where there is no assignment on Date at all:

CREATE MEMBER CURRENTCUBE.MEASURES.DEMO AS 1;
SCOPE(MEASURES.DEMO);
    SCOPE([Order Date].[Month].[Month].MEMBERS);
        THIS = 2;
    END SCOPE;
END SCOPE;

However, if you reverse the order of the two SCOPEs you get a different result. So:

CREATE MEMBER CURRENTCUBE.MEASURES.DEMO AS 1;
SCOPE(MEASURES.DEMO);
    SCOPE([Order Date].[Month].[Month].MEMBERS);
        SCOPE([Order Date].[Date].MEMBERS);

            THIS = 2;
        END SCOPE;
    END SCOPE;
END SCOPE;

When you scope on all the members of Month except the All Member first, then all the members of Date including the All Member, the scope covers all dates and months:

image

Crossjoining these two sets in the same SCOPE has the same effect:

CREATE MEMBER CURRENTCUBE.MEASURES.DEMO AS 1;
SCOPE(MEASURES.DEMO);
    SCOPE([Order Date].[Month].[Month].MEMBERS,[Order Date].[Date].MEMBERS);
        THIS = 2;
    END SCOPE;
END SCOPE;

What’s happening is that where you have two nested SCOPE statements using sets of members from different attribute hierarchies on the same dimension, the attribute relationships that exist between those attributes become significant. Without any SCOPE statements then the ‘current’ scope context in the MDX Script is the entire cube, and then each successive SCOPE overwrites the previous context to create a new context and that process of overwriting context is not commutative. This is called attribute overwrite; I’ve wrestled with it before and frankly it makes my head hurt, but you can find out more about it here.

I suppose, therefore, that it would be a good idea when you’re scoping on sets containing members from multiple attributes from the same dimension to crossjoin those sets together and use a single SCOPE, to avoid any potential confusion. While I can just about rationalise this behaviour I can’t guarantee I’d be able to predict how attribute overwrite worked on a real dimension with multiple nested SCOPEs… Remember, of course if you have multiple nested SCOPEs using sets of members from hierarchies on different dimensions then the order of nesting doesn’t matter because there’s no attribute overwrite going on.

Thanks to Tomislav and Akshai for their help in understanding this problem.

7/28/2010

SQLBits 7 Registration Open

If you haven’t heard already (and I’m at least twelve hours late on this news) then the big news is that registration has opened for SQLBits 7. The conference will be taking place from Thursday September 30th to Saturday October 2nd in York, in the UK; to find out more about what’s going on, and to register, go to http://www.sqlbits.com/. At the time of writing 116 people have registered already, and we haven’t even published the agendas for the Friday or Saturday yet…!

I’m going to be running one of the training days that will be taking place on Thursday 30th too: a whole day on performance tuning Analysis Services 2008. I’ll be covering designing aggregations, partitioning, caching and cache warming, tuning queries and calculations, and hardware configuration; basically it will be a brain dump of everything I know on the subject and at a very modest price too…

7/23/2010

The Deployment Wizard, and when “Retain Partitions” doesn’t retain partitions

I learned something the hard way this week about how the Deployment Wizard actually works (thank goodness for daily backups) that I thought I’d share.

I’ve used the Deployment Wizard lots of times to deploy changes to a SSAS database in production. If you have a system where new partitions are generated dynamically, for example in an SSIS package that creates new partitions automatically when new data is loaded into the cube, then these partitions are not going to be present in your project in BIDS and you don’t want them to be overwritten when you deploy the new version of your database. The Deployment Wizard allows you to stop this happening, and also to not overwrite other setting that might be different in dev and prod, such as connection strings and security roles.

However, what I didn’t realise was that in order for the Deployment Wizard to work properly in all cases it needs to be able to see the server you want to deploy to, and in my current customer’s environment you can’t see prod from dev. Working on dev, what I did was to edit the .deploymenttargets input file manually to enter the name of the database in prod I wanted to deploy to, then ran the wizard with the /a argument (see here for more information on these settings) to set other options in the input file such as Retain Partitions, then ran the wizard again with the /o and /d argument to generate the xmla deployment script without needing to connect to the target server (which I couldn’t, of course, see). And then, when I ran the resulting script in prod, I saw all the partitions that weren’t in dev disappear even though I’d selected the Retain Partitions option.

What I had assumed was that by selecting the Retain Partitions option the wizard would somehow generate the XMLA Alter command it outputs so that it would just ignore the partitions in each measure group. However this is not correct: you can’t have an Alter command for a database that doesn’t list all the partitions in all the measure groups in the database. This is why the wizard needs to connect to the target server: it scripts out all the partitions that currently exist on the target and inserts them into the script it generates, instead of just the partitions that are in the .asdatabase file. If, though, you use the /d argument you’re telling the wizard to use only the information that’s in the input files and not to connect to the target server to see what’s there, so it can’t know what partitions are present on the target server and it has to use the partitions from the .asdatabase file instead. So, if you use the /d argument, even if you specify the Retain Partitions option you may see partitions deleted on the target when you run the wizard’s script.

7/15/2010

MDX Course in Dublin

Although I’ve mentioned it before, let me plug it once again: I’ll be teaching a public MDX course in Dublin on September 16th and 17th. More details and how to register can be found here:
http://www.prodata.ie/Events/MDX2010.aspx

7/11/2010

Building a Better Cache-Warmer, Part 2: The Formula Engine Cache

Sorry for the very long delay – at long last, here’s the companion piece to the post I wrote last September about cache-warming. That post dealt with the relatively straightforward topic of warming the Storage Engine cache; this time we’ll be looking at how to warm the Formula Engine cache. As I promised in the first post, the idea is that I’m going to take the ideas discussed here and use them to create a better cache warmer than the ones I’ve blogged about in the past.

The first thing to note when talking about caching the result of MDX calculations is that it does happen – a surprising number of people think that Analysis Services performs every MDX calculation needed by a query from scratch every time. This does indeed happen in worst-case scenarios - and it’s all too easy to do things which accidentally get you into these worst-case scenarios – but when SSAS can cache the result of calculations it can have a massive impact on query performance.

The second thing to note is that when I say that SSAS can cache the result of calculations, what I mean is that it can cache the values returned by cells in the cube which have been subject to some form of MDX calculation. Analysis Services can’t cache the result of MDX expressions as such, it can only do so when the result of those expressions is surfaced as a value returned by a cell somewhere in the cube. This means that if we create a calculated measure on the cube’s MDX Script, and then include this calculated measure in a query, we can expect that SSAS should be able to cache the values returned by this calculated measure. On the other hand if we have a complex MDX set expression on the Rows axis of our query, the set that expression returns cannot be cached – it will be re-evaluated every time the query is run.

The third, and possibly most important thing to remember is that SSAS can only cache the results of a calculation for as long as the calculation actually exists. As you may already know, there are three places you can create a calculated member in SSAS, and these equate to three different ‘lifetimes’. If you create a calculated measure on the MDX Script of the cube it will live until the next time the cube’s MDX Script is dropped and recreated, and this will happen if you explicitly clear the cache or you do any kind of processing. This is called ‘global’ scope. If, however, you create a calculated member in the WITH clause of a query then that calculated member will only exist for the lifetime of that query – so while you will benefit from caching while the query is executing, after the query has returned the calculated member will disappear and so will any related cache. This is called ‘query’ scope. The third scope, or lifetime, for creating a calculated member is ‘session’ scope: you can create calculated members that live for the lifetime of a session using the CREATE MEMBER statement, so values can only be cached for one user and one session; this is used only very rarely though.

The fourth to understand is that SSAS can’t cache the results of some calculations, or if it can cache them for one user it cannot share the contents of this cache with other users. For example this will happen if the MDX calculation might return a different result every time it runs (eg if it used the Now() function to return the system date and time – which is obviously going to be different each time it’s called) or if it might return different results for different users (eg when two different users run the same query through different security roles, because they might be able to see different parts of the cube the values returned might be different). The way in which a query is constructed can also force ‘query’ scoping for the formula engine cache too: the presence of a calculated member in the WITH clause can do this, as can the use of subselects. This also means that FE cache warming is pointless for certain popular client tools like SSRS or Excel because the MDX they generate prevents the use of the FE cache.

Finally, the Formula Engine has two different structures it uses to hold cached values. For calculations that execute in bulk mode then it can use the same structure that the Storage Engine cache uses, namely the data cache registry. For calculations that execute in cell-by-cell mode, however, it uses a different structure called the flat cache. Calculations that execute in bulk mode in most cases cannot make use of values stored in the flat cache, and calculations that execute in cell-by-cell mode cannot make use of values stored in the data cache registry. Furthermore, the size of the flat cache is restricted to 10% of the TotalMemoryLimit server property; if it grows bigger than that it will be completely emptied.

(Incidentally, if you’re looking for more detail on any of the above points, I suggest you watch the video of my session on “Cache Warming Strategies for SSAS 2008” here).

So clearly there are many potential pitfalls to watch out for when warming the FE cache, and indeed in many cases I’d say that it’s just easier to concentrate on warming the SE cache and tuning your calculations so they execute as fast as possible even on a cold FE cache!

If you do need to warm the FE cache though, you need to be very careful. Unfortunately the CREATE CACHE statement mentioned in the previous post only works for the SE cache, so the only way to warm the FE cache is to use MDX queries. These queries should be hand-written specifically for the purpose – if you simply record queries run in production using Profiler, it’s likely you’ll end up with queries that don’t warm the FE cache because they contain a WITH clause or a subselect – but it’s nonetheless a good idea to use production queries as a starting point and then modify them so they become FE-friendly. A smaller number of larger queries is going to be better than a large number of queries that return small amounts of data, to avoid cache fragmentation in the data cache registry, and it’s a good idea to keep them as simple (and as non-arbitrary-shaped) as possible.

Since it’s likely that the calculations you’re most interested in caching are going to be the ones that, despite your best efforts at tuning them, execute in cell-by-cell mode, then it’s a good idea to keep an eye on the overall size of the various flat caches in existence. The following DMV, I think, shows the size of all the flat caches across all the databases and for every session on an SSAS instance:

select * from
$system.discover_object_memory_usage
where object_id='FormulaCache'

…but I need to do a bit more research to make sure it shows what I think it shows and to make sure I can observe this 10% limit resulting in the flat cache being emptied.

Lastly, as with the SE cache, you don’t want to overfill it and leave no memory for the natural growth in cache that will occur as a result of normal querying; and you certainly don’t want to go over the LowMemoryLimit or TotalMemoryLimit while you’re warming the cache and end up with cache evictions taking place.

7/9/2010

Speaking at the PASS Summit 2010

Hurray! I’ve been selected to speak at this year’s PASS Summit in Seattle:
http://sqlpass.eventpoint.com/topic/details/BIA412

Looks like another strong lineup of speakers again; I hope they do a better job of scheduling the sessions though…

…and don’t forget, if you’re in the UK (or indeed Europe) and your budget doesn’t stretch to a flight to Seattle, there’s always SQLBits! Just look at the sessions we’ve had submitted so far:
http://www.sqlbits.com/information/PublicSessions.aspx