5/12/2007
An MDX Challenge: Debtor Days
As I said in my previous post, last night's event at the Experience Music Project was good fun. There was live music, free booze and of course - and this is probably evidence that I need to be taken away by the men in white coats - conversation naturally turned to the topic of tricky MDX problems. Richard Halliday (I hope I've got your name right) came up with an interesting calculation for me which he referred to as 'debtor days': if I understood correctly, he had a cube with a measure containing values of individual debts incurred by customers and was interested in finding out for any given customer and any given day, the minimum number of days it took from the current date backwards in time for the total of that customer's debts to reach a given value. After digesting it overnight I had a go at implementing it this morning and found that it was a really fascinating problem - although getting the right value is fairly tricky and worth discussing, there are some great opportunities for optimisation too which I wanted to blog about.
First let's translate the problem into Adventure Works: we'll use the Date dimension and the Customer dimension, and find out how many days you need to go back from the current date for the current member on the [Customer Geography] user hierarchy for the cumulative total of Internet Sales Amount to exceed 10000. Here's my sample query with my calculated member:
with
member measures.daysto10000 as
iif(
  count(
    nonempty(
      null:{{[Date].[Date].currentmember} as mycurrentdate}.item(0)
      , [Measures].[Internet Sales Amount]
    ) as mynonemptydates
  )=0,
  null,
  iif(
    isempty([Measures].[Internet Sales Amount])
    and (not isempty((measures.daysto10000, [Date].[Date].currentmember.prevmember)))
    , (measures.daysto10000, [Date].[Date].currentmember.prevmember)+1,
    iif(
      count({{} as myresults,
        generate(
          mynonemptydates
          , iif(
              count(myresults)=0,
              iif(
                sum(subset(mynonemptydates, count(mynonemptydates)-mynonemptydates.currentordinal), [Measures].[Internet Sales Amount]) > 10000
                , {mynonemptydates.item(count(mynonemptydates)-mynonemptydates.currentordinal)} as myresults
                , myresults)
              , myresults
            )
        )
      })=0,
      null,
      count(myresults.item(0): mycurrentdate.item(0))-1
    )
  )
)
select
descendants(
  [Date].[Calendar].[Calendar Year].&[2004]
  , [Date].[Calendar].[Date])
on 0,
non empty
descendants(
  [Customer].[Customer Geography].[State-Province].&[HH]&[DE]
  , [Customer].[Customer Geography].[Postal Code])
on 1
from [Adventure Works]
where(measures.daysto10000)
On a cold cache this executes in a touch under 15 seconds on my laptop. The select statement puts all of the days in the year 2004 on columns, all of the postal codes in Hamburg on rows, and slices on the calculated measure defined in the with clause. Here are some things to notice about the calculated measure:
- The outermost iif simply says that if the set of dates from the start of the Date level to the current date contains no values at all for Internet Sales Amount, then return null. If there are values then the set of dates with values is stored in the named set mynonemptydates, declared inline.
- The next level of iif represents a recursive calculation, and I found that this was one of the extra touches that made a big difference to performance. It says that if the current date has no value for Internet Sales Amount but the value of the calculated measure is not null for the previous day, then simply add one to the value of the calculated measure from the previous day - this avoids a lot of extra work later on.
- The next level of iif is where I do the main part of the calculation and this is going to need a lot of explaining... Put simply, I've already got from step 1 a set of members representing the dates from the start of time to the current date which have values, and what I want to do is loop through that set from the end backwards doing a cumulative sum, stop when that sum exceeds 10000, and then take the date I've stopped at and find the number of days from that date to the current date. Originally I attempted the problem like this:
iif(
  count(
    tail(
      filter(
        mynonemptydates
        , sum(subset(mynonemptydates, mynonemptydates.currentordinal-1), [Measures].[Internet Sales Amount]) > 10000
      )
    ,1) as myresults
  )=0, null,
  count(myresults.item(0): mycurrentdate.item(0))-1
)
Here I'm filtering the entire set to get the set of all dates where the sum from the current date to the end of the set is greater than 10000, then getting the last item in that set. This seemed inelegant though - if we had a large set then potentially we'd be doing the expensive sum a lot of times we didn't need to do it. It seemed better to loop through the set backwards and then somehow be able to stop the loop when I'd reached the first member which fulfilled my filter criteria. But how was this going to be possible in MDX? I didn't manage it completely, but I did work out a way of stopping doing the expensive calculation as soon as I'd found the member I was looking for. Let's take a look at the specific section from the main query above:
iif(
count({{} as myresults,
generate(
mynonemptydates
, iif(count(myresults)=0,
iif(
sum(subset(mynonemptydates, count(mynonemptydates)-mynonemptydates.currentordinal), [Measures].[Internet Sales Amount]) > 10000
, {mynonemptydates.item(count(mynonemptydates)-mynonemptydates.currentordinal)} as myresults
, myresults)
, myresults
)
)
})=0, null,
count(myresults.item(0): mycurrentdate.item(0))-1
)
What I do first is declare an empty set inline called myresults. I then use the generate function to loop through the set mynonemptydates. The first thing you'll see after the generate is an iif checking if the count of myresults is 0, and the first time we run this check it will be, so we need to do our cumulative sum. Because generate loops from the start of a set to the end, and I want to go in the other direction, I get the current ordinal of the iteration and then find the cumulative sum from the item that is that number of members away from the end of the set up to the end of the set. Once I've got the cumulative sum I can check if it is greater than 10000; if it is, then I return a set from the iif statement and at the same time overwrite the declaration of myresults with a set of the same name which now contains that one member. As a result, at all subsequent iterations the test count(myresults)=0 fails, because the count is now 1, and I don't try to do the cumulative sum again. I was quite pleased at finding I could do this - I hadn't realised it was possible. It only makes about 0.5 seconds difference to the overall query performance though.
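To make the difference between the two approaches concrete, here is a rough model of the logic in Python rather than MDX. It is only a sketch under stated assumptions: the function name debtor_days, the early_stop flag and the sample figures are all made up for illustration, the dictionary stands in for the non-empty cells of Internet Sales Amount for one customer, and the day count assumes the Date level contains every calendar day. With early_stop switched off it behaves like the filter/tail version (every suffix sum is evaluated); with it switched on it behaves like the generate version (the loop still visits every member, but the sum is skipped once a result has been stored).

```python
from datetime import date


def debtor_days(sales, today, threshold, early_stop):
    """Return (days_back, sums_evaluated) for one customer.

    sales maps a date to the amount recorded on it (non-empty dates only).
    This is a hypothetical model of the MDX logic, not the MDX itself.
    """
    # Plays the role of mynonemptydates: dates with values up to today.
    nonempty = sorted(d for d in sales if d <= today)
    n = len(nonempty)
    found = None          # plays the role of myresults
    sums_evaluated = 0

    # ordinal runs 1..n, like currentordinal inside generate
    for ordinal in range(1, n + 1):
        if early_stop and found is not None:
            continue      # still iterating, but the expensive sum is skipped
        start = n - ordinal   # walk from the end of the set backwards
        total = sum(sales[d] for d in nonempty[start:])
        sums_evaluated += 1
        if found is None and total > threshold:
            found = nonempty[start]

    # Equivalent to count(found:today)-1 when every calendar day is a member.
    days_back = (today - found).days if found is not None else None
    return days_back, sums_evaluated


# Made-up figures for a single customer.
sales = {
    date(2004, 1, 1): 6000,
    date(2004, 1, 5): 3000,
    date(2004, 1, 9): 8000,
    date(2004, 1, 10): 2500,
}
today = date(2004, 1, 12)

print(debtor_days(sales, today, 10000, early_stop=False))  # (3, 4)
print(debtor_days(sales, today, 10000, early_stop=True))   # (3, 2)
```

Both calls give the same answer of 3 days (2500 + 8000 takes us past 10000 on 9 January), but the early-stop version evaluates two sums instead of four. The saving grows with the number of dates that lie before the one we stop at.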
- Finally, on the last line of the calculated measure I can take the member I've got in the set myresults and using the range operator find the number of days between it and the current date, which I've also stored in a named set called mycurrentdate.
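The recursive shortcut in the second iif can be modelled outside MDX too. The following Python sketch is an illustration only: scan_back, daily_values and the sample figures are hypothetical names and data, the backward scan uses a simple running total instead of re-summing subsets as the MDX does, and the first day of the range always gets a full scan because the sketch has no earlier day to look at. It fills in the value for every day in a range, reusing the previous day's value plus one whenever the current day has no sales and the previous day's value is not null.

```python
from datetime import date, timedelta


def scan_back(sales, today, threshold):
    """Full calculation: walk back from today until the running total exceeds threshold."""
    total = 0
    for d in sorted((d for d in sales if d <= today), reverse=True):
        total += sales[d]
        if total > threshold:
            return (today - d).days
    return None  # never exceeds the threshold, or no sales at all


def daily_values(sales, first_day, last_day, threshold):
    """Return ({day: days_back}, full_scans) using the previous-day shortcut."""
    values = {}
    full_scans = 0
    prev = None
    day = first_day
    while day <= last_day:
        if day not in sales and prev is not None:
            # Nothing new today, so the answer is one day further back than yesterday's.
            values[day] = prev + 1
        else:
            values[day] = scan_back(sales, day, threshold)
            full_scans += 1
        prev = values[day]
        day += timedelta(days=1)
    return values, full_scans


# Made-up figures for a single customer.
sales = {
    date(2004, 1, 1): 6000,
    date(2004, 1, 5): 3000,
    date(2004, 1, 9): 8000,
    date(2004, 1, 10): 2500,
}
values, full_scans = daily_values(sales, date(2004, 1, 1), date(2004, 1, 12), 10000)

print(values[date(2004, 1, 10)], values[date(2004, 1, 11)], values[date(2004, 1, 12)])  # 1 2 3
print(full_scans)  # 10
```

In this tiny example only the last two days benefit, because the shortcut cannot fire while the previous day's value is still null. Once the threshold has been passed, though, every later day without sales costs a single addition instead of a full scan.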
Pretty fun, eh? No, please don't answer that question. But if you can think of an alternative, better-performing way of solving this problem I would love to hear it...
UPDATE: it turns out that Richard Tkachuk not only had a go at the same problem (although his interpretation of what it is is slightly different) but got just as excited about it as I did, and wrote up his findings here:
http://www.sqlserveranalysisservices.com/OLAPPapers/ReverseRunningSum.htm