Chris's profileChris Webb's BI BlogBlogLists Tools Help
    10/28/2008

    Reporting Services-generated MDX, Subselects and Formula Caching

    Analysis Services can, of course, cache the results of MDX calculations and share that cache with other users most of the time... but when it can't the result is much unnecessary load on your server as calculations are recomputed again and again, and much longer query times. Unfortunately the details on what caching AS can do when are not properly documented anywhere and I've only learned the bits and pieces that I know about this subject from my own experience and from talking to the likes of Mosha; I've avoided blogging about it because my knowledge is incomplete and this is a complex topic. Happily the upcoming AS2008 Performance Guide should rectify this lack of proper documentation, and I believe Mosha is also going to address this subject in his pre-conf seminar at PASS.

    However, there is one specific scenario that I thought I'd write about because I come across it regularly and it affects anyone using a client that generates MDX with subselects in, such as the Reporting Services query builder. Let's say you've got an expensive calculated member on your cube - for example, add the following (deliberately slow) calculated measure to the Adventure Works cube:

    CREATE MEMBER CURRENTCUBE.MEASURES.EXPENSIVECALC AS
    COUNT(bottomcount(
    {[Scenario].[Scenario].&[1], [Scenario].[Scenario].&[2]}
    *
    [Account].[Account].[Account].members
    *
    [Date].[Date].[Date].members
    *
    [Department].[Departments].[Department Level 02].members
    , 10,[Measures].[Amount])
    );

    In Reporting Services you might then use the query builder to create a query that looked like this to select just Calendar Year 2004 on rows and this new calculated measure on columns:

    SELECT
    NON EMPTY { [Measures].[EXPENSIVECALC] } ON COLUMNS,
    NON EMPTY { ([Date].[Calendar Year].[Calendar Year].ALLMEMBERS ) }
    DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME
    ON ROWS
    FROM (
    SELECT (
    { [Date].[Calendar Year].&[2004] } )
    ON COLUMNS
    FROM [Adventure Works])
    CELL PROPERTIES VALUE, BACK_COLOR, FORE_COLOR, FORMATTED_VALUE,
    FORMAT_STRING, FONT_NAME, FONT_SIZE, FONT_FLAGS

    On a cold cache, on my laptop, this query takes 33 seconds; on a warm cache it still takes 33 seconds. This is not good - practically all of the query time is taken up evaluating the calculation, and this calculation is clearly not being cached. The reason why is because the query contains a subselect, and when a query contains a subselect it means that the results of any calculations can only be cached for the lifetime of the query. You can easily rewrite the query as follows to avoid the subselect:

    SELECT
    { [Measures].[EXPENSIVECALC] } ON COLUMNS,
    NON EMPTY { [Date].[Calendar Year].&[2004] }
    DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME
    ON ROWS
    FROM [Adventure Works]
    CELL PROPERTIES VALUE, BACK_COLOR, FORE_COLOR, FORMATTED_VALUE,
    FORMAT_STRING, FONT_NAME, FONT_SIZE, FONT_FLAGS

    On a cold cache this query still takes 33 seconds, but on a warm cache it comes back in 1 second because the calculation can now be cached and this cache can be reused for subsequent queries. Clearly this is much better.

    I wonder how many SSRS implementations are suffering from needlessly poor performance just because of this? Subselects are a useful addition to MDX and make calculating 'visual totals' type values very easy, but they should be used with care and not just as a means of selecting individual members.

    Azure and Analysis Services in the cloud

    Hmm, I'm a bit late on this but it seems at PDC the other day buried in Bob Muglia's keynote was the announcement that we'll be getting a cloud-based version of Analysis Services as part of SQL Data Services. Here's Jamie Thomson's post on this:
    http://blogs.conchango.com/jamiethomson/archive/2008/10/27/ssds-developments.aspx

    Here's the official Azure site and white paper:
    http://www.microsoft.com/azure/default.mspx
    http://download.microsoft.com/download/e/4/3/e43bb484-3b52-4fa8-a9f9-ec60a32954bc/Azure_Services_Platform.docx

    UPDATE: see Mosha's comment below - there was no announcement of cloud-based AS. Oh well... but I've just read about the Office-in-the-cloud announcement. I wonder if we'll get any AS connectivity in the cloud-based Excel?

    10/27/2008

    Pentaho 2.0 and Pentaho Aggregate Designer

    Via Julian Hyde, I see Pentaho have released version 2.0 of their BI suite. As Julian points out in this blog entry here, one of the major new features in this release is the Pentaho Aggregate Designer, which makes it much easier to design aggregate tables for Mondrian. A quick look at the screenshot on Julian's blog suggests that Mondrian is continuing to flatter Analysis Services by imitation (you can even see that the aggregations are being built on the Foodmart database!) - not that this is a bad thing, in fact it's what makes Mondrian so easy to use if you're used to Analysis Services.

    So Mondrian is definitely catching up in terms of functionality... and in terms of performance, well, as Julian says "Mondrian is beholden to the RDBMS for performance" and those RDBMSes are getting ever faster. I suspect very soon more and more people are going to find that Mondrian plus an open source RDBMS (perhaps the column-store-based LucidDB, which I see already already has support for the new aggregate table designer) is going to be 'good enough' for many BI projects - after all, most AS implementations I see don't involve massive amounts of data, and if Mondrian can give good query response times on a fact table of 20-50 million rows then the decision on which platform to use will be more heavily influenced by price. And what with the current financial crisis, price is going to be an ever-more important factor for many customers. I wonder how long it will be before I see MS BI consultancies starting to offer open source BI, given the relative ease with which you can transfer your cube design and MDX skills between the two platforms?

    10/22/2008

    Book review: Applied Microsoft SQL Server 2008 Reporting Services, by Teo Lachev

    Here's the deal: if you liked Teo Lachev's book "Applied Microsoft Analysis Services 2005", which I did, then you'll like "Applied Microsoft SQL Server 2008 Reporting Services". Both books share the same format and approach and are exhaustive guides to their subjects.

    In case you've not seen one of Teo's books before, though, what should you expect? Well, "Applied Reporting Services 2008" is 750 pages long and covers just about every aspect of Reporting Services 2008 that's worth covering, from installation to report design to management to security to extensibility in great detail. This probably isn't a book you're going to sit down and read from cover to cover, but it's a great reference guide and Teo's prose is very clear so reading individual chapters as and when you need to is no chore. The best thing about this book, though, is the fact that on every page you can see Teo's vast real-world experience showing through. For some reason many books on Reporting Services fall into the trap of being Books Online rewritten, never telling you the stuff you really need to know; this book on the other hand is a goldmine of information on how things really work. Let me take the chapter I can really speak from authority on - the chapter on using Reporting Services with Analysis Services. Every other SSRS book I've seen has been written by someone with clearly no practical experience of using SSAS and SSRS together, and parrots the usual line about 'great integration' and 'easy-to-use MDX query designer' etc. Teo on the other hand has, for the first time anywhere, put together all the tips and tricks I've ever seen (plus a few I haven't) on this topic in one place - he lists the pros and cons of the built-in Analysis Services datasource and the OLEDB datasource and how to work around them, how to handle parent/child hierarchies, using extended properties, the lot.

    All in all, then, a very highly recommended book; it's ideal for both beginners and experienced developers and probably the only Reporting Services 2008 book you'll ever need. You can read some sample chapters and see video demos on the book website here:
    http://www.prologika.com/Books/0976635313/Book.aspx

    OLAP PivotTable Extensions new release

    For some reason I've not blogged about this before, but anyway the ever-industrious Greg Galloway has just released a new version of his OLAP PivotTable Extensions:
    http://www.codeplex.com/OlapPivotTableExtend

    It's an Excel addin that gives you useful new functionality in Excel 2007 pivot tables connected to Analysis Services, such as the ability to add private calculations, view the MDX behind the pivot table, and (in the new release) search for members and other things. Definitely worth a look, and useful too if you've ever wondered how to work with the Excel pivot table in code.

    10/20/2008

    Calculated members are better than assignments to real members (at least sometimes)

    Earlier this year I blogged about how, in my experience, using real members and overwriting them with MDX Script assignments sometimes seemed to perform worse than using calculated members. See here for the full posting:
    http://cwebbbi.spaces.live.com/blog/cns!7B84B0F2C239489A!1598.entry

    Recently I came across this problem again but was able to get together a proper repro and open a support case with Microsoft. What was happening was that I had a time utility dimension with three members - let's call them member A, member B and member C. A was the default member and had no calculation, it just showed the real values from the rest of the cube; B was a straightforward YTD calculation; C was a really nasty, untuned, same period previous year calculation with lots of special logic. I had a query that included A and B but not C and which was performing really badly, and I found that when I commented out the calculation for C the query returned instantly. So I guessed that for some reason the calculation for C was being evaluated when the query ran, even though C wasn't actually requested in the query and this was confirmed by Microsoft.

    Unfortunately I was also told this behaviour was 'by design', something to do with prefetching and sonar. Thankfully it doesn't happen for all queries or all cubes (I tried and failed to repro it on Adventure Works) but in my opinion there should never be a situation where the formula engine evaluates calculations that aren't needed - precisely because in some cases it will mean the query runs much longer than it ever should. Probably the worst thing about all this is that you have no idea when exactly it's happening with the tools we've got available at the moment. I suspect that many people out there have poorly-performing cubes because of this issue...

    One of the workarounds was to use calculated members instead of real members, as I recommended in my original post, but as I also said in that post there are some cases where calculated members are not a good alternative to real members. In my particular case I was able to tune the calculation associated with member C so either the problem didn't happen or C evaluated so quickly it didn't matter - I'm not sure which. Anyway, I stand by what I said in my original posting: use calculated members rather than real members overwritten by Script assignments wherever you can!

    10/15/2008

    Metadata, semantic web technologies and, yes, Gemini again

    I saw a very interesting article the other day in Intelligent Enterprise by Seth Grimes, about a newly-published report on the semantic web by David Provost. Grimes is rightly sceptical about how close we are to these ideas reaching fruition and notes that many of the companies mentioned in the report are concentrating on 'semantic data integration'. Frankly, to me the idea of being able to integrate data sourced from different parts of the web is still far-fetched, given the trials you have to go through to integrate data from different parts of the same company. But it did get me thinking: when users get Gemini, where will they get their data from? Yes, they'll be downloading data on 'industry trends' etc from the web in the way we saw in the Gemini demo, but it won't be that often. In well-run companies most of the time the data will come from four sources:

    • The data warehouse, either from the relational source or Analysis Services.
    • OLTP systems
    • Data that lives in someone's small, well-maintained, official, IT-department tolerated Excel spreadsheet. The kind of spreadsheet that couldn't and shouldn't be promoted to database form because it's too small, or short-lived, or needs to be maintained by non-technical people, or needs the complete flexibility that Excel gives you. In fact, exactly the kind of thing that Excel is meant to be used for.
    • Data that originally came from the data warehouse but was downloaded into someone's local Access database, or exported to Excel, or sent to the user in something like a SSRS report; so data that has come to the user second hand.

    How, then, can Gemini know (beyond its cleverness with column names and analysis of the data within those columns) what data can be integrated with what? Ideally it would have access to some form of metadata embedded in the source data that would help it make the correct decision all in all cases. Where this metadata comes from is a problem much larger than Gemini of course, and goes down to the fundamental question of how an enterprise can keep track of all of its data assets wherever they're stored and understand what the data in each data store actually means; Microsoft is regularly criticised for its lack of a metadata tool and I guess MS is working on something in this area. If, in the first and second scenarios above, Gemini could connect to SQL Server or Analysis Services and see a common layer of metadata that would help it out, allowing it to join data from SQL Server with data from Analysis Services for instance. In a way, Analysis Services is already a metadata layer on top of the data warehouse, containing information on how tables need to be joined and how measures should be aggregated; perhaps this side of it will become more important as the MOLAP engine is superceded?

    I also think some of the technologies of the semantic web, when applied to the enterprise, could be very useful here; I've only really just come across this stuff myself, but as an introduction I found the one-page explanation of RDF on Twine was very good, and the Microformats site was also full of interesting information. One of the companies mentioned in the semantic web report is Cambridge Semantics, whose product Anzo for Excel is aimed at imposing structure on all of those Excel spreadsheets I referred to in the third bullet point above. The demo on the web concentrates on using the tool for sharing data and collaboration, but as far as I can see the key to getting that to work is selecting data in the spreadsheet and linking it back to a common metadata layer. Of course the obvious criticism of a product like this is that you're again relying on your users to make the effort to mark up their data and do so properly, but if there's an incentive in the form of easier collaboration and - to put it bluntly - less of that boring cutting and pasting then maybe there's a chance they could be persuaded to do so. I could imagine MS offering something very similar to this as part of Excel, with all the central management being done through Sharepoint, and extended throughout the Office suite to Access, Word etc. Gemini would then be able to do a better job of understanding what and how data in an Excel spreadsheet, say, could be integrated with data in the data warehouse.

    So the bottom-up approach could be combined with the more traditional top-down metadata management approach, and crucially I'd want to see metadata automatically embedded in various the output formats of the server products. For example when you built a SSRS report or created an Analysis Services pivot table in Excel, in both cases I'd want the data by default to retain some record of what it was and where it had come from - the metadata would embedded in the document in the same way as if the user had marked up the document themselves. And this in turn would make it a lot more easily reusable and comprehensible by Gemini and other applications such as Enterprise Search.  Metadata would accompany data as it travelled from document to document, data store to data store, format to format. This would then cover the scenario in the third bullet point, allowing you to integrate data from an SSRS report with data from the data warehouse it originally came from, or data from a user-generated Excel spreadsheet that had been marked up manually.

    So much for structured data; ideally we'd want to be able to include unstructured data too. Some of the applications of natural language processing mentioned in the semantic web report looked very interesting, especially OpenCalais (which already has integration with MOSS 2007 I see). If you were looking at sales figures for a particular customer in Gemini, wouldn't you also want to be able to look for documents and web pages that discussed sales for that customer too? And I've often thought that while the super-simple type-a-search-term approach for Search has worked brilliantly over the last few years, there's a niche for a power-user interface for search too, kind of like a Proclarity Desktop for search where you could drag and drop combinations of terms from the central metadata repository and see what you found; could that be Gemini too? A tool for searching, integrating, aggregating and manipulating all enterprise data, not just numeric data?

    Over the last few days I've seen the Gemini team (notably Amir) engaging with bloggers like me about our concerns through comments on our postings. I appreciate that, but my position hasn't changed: I think the technology is cool but there's too great a risk that it will be misused as it stands at the moment. Relying on oversight from the IT department isn't enough; if MS had a convincing metadata story extending to all data types and data sources, metadata that Gemini could use, it would go a long way to addressing my concerns.

    10/14/2008

    PALO supports MDX and XMLA

    Thanks to Marco Groeneveld for sending me this link:
    The open source OLAP server PALO now supports XMLA and MDX. Another victory for MDX!
    10/12/2008

    Last thoughts on Gemini for the moment

    Despite the worldwide financial meltdown, all talk on the MS BI blogosphere has been about Project Gemini this week. Even though no-one knows much about it - and certainly no-one knows everything about it either, as we've been told there are more announcements to come - the reaction has been pretty clear: the back-end technology looks cool, but the strategy of putting Gemini in Excel on the desktop is an invitation to bypass the data warehouse and create spreadmart hell whatever MS says about Sharepoint-based management of Gemini models. I've already linked to just about everyone who's expressed this opinion apart from Mick Horne, whose three posts here, here and here are well worth reading as they put the case against Gemini in its currently proposed form very clearly. Choice quote: "Where is the single version of the truth in this architecture? I’ve just spent 4 years of my life trying to convince users to stop using Excel as a data store and here are Microsoft positively encouraging it. Hell will freeze over before this capability is used responsibly in most organisations".

    So what would I like to see MS actually do with Gemini? From the comments on Marco's blog, at the moment it sounds like the priority is to get the Gemini storage mode working for local cubes and Excel rather than put everything in place for the full server version of Analysis Services. I would prefer the emphasis to be reversed and have Gemini storage mode ready as soon as possible on the server side. It's clearly going to provide a massive performance boost when it does arrive (and let's not forget it's at least two years away), and with other COP databases and data warehouse appliances improving every day there's going to be significantly more competitive pressure on Analysis Services and SQL Server-based data warehouse projects in the future. I'd hate to see current MS BI shops start abandoning their Analysis Services implementations because they can get better query performance and scalability elsewhere.

    But whatever us bloggers say I'm sure we'll get Gemini in the form we've already been shown. The reason is that whatever the rights and wrongs of it from a BI consultant's point of view, the people who use Excel will definitely want this and so there will be an overwhelming commercial case for it. This kind of desktop, DIY BI is in a way similar to illegal drugs: there are always some people that want it, a certain number of them are always going to do it even though they know they shouldn't, so you've got two choices - either legalise it and then hope to control it, as with Gemini, or throw all your efforts into outlawing it. With the first option you run the risk of encouraging what you wanted to discourage and ending up with a worse problem, with the second option you run the risk of people resenting your rules so much they end up being widely ignored. What's the best option?

    UPDATE: one last link for you - Nigel Pendse of the OLAP Report gives an unreservedly positive review here:
    http://www.olapreport.com/Comment_Gemini.htm

    10/7/2008

    More thoughts on Project Gemini

    I've now had a chance to watch the demos and read all the first-hand accounts of what was announced yesterday (see Marco, Mosha, Tim Kent, Jeremy Kashel, Richard Tkachuk). Here are some unstructured thoughts and questions:

    • As per my comment yesterday about Qlikview, self-service BI is undoubtedly what many end users want - but as BI professionals we know only too well that it can be dangerous; in fact just about every blog entry I've read on Gemini has made this point. The question of whether it's a good idea to let your power users do what Gemini lets them do is likely to cause all kinds of heated religious dispute: I was involved in an argument about this at SQLBits recently, and over on Brent Ozar's blog (see here and here) you can see the same discussion being had. Although I completely understand the scenario that MS describes in its demos of users needing to work with data that isn't and will never be in a data warehouse, I lean slightly to the side of those who see self-service BI vendors as selling "snake oil". But being a BI consultant I would, wouldn't I? All this talk of Gemini representing the 'constellation of twins', power users and the IT department working together happily, is something of a fairy tale...
    • In a comment on my blog yesterday, Mosha stated that there was no cube wizard needed for Gemini. But looking at the demo there's certainly a step needed where you connect to data sources and choose the tables and fields you want to work with, so whether you call it a cube wizard in the strictest sense you need to have some understanding of your data before you can do anything with it. And whatever the demo says, the application you're using can only take you part of the way, there's no way a model can be 100% inferred. What happens if fields that mean the same thing have two different names in two different data sources, or if there are two fields which mean different things which have the same name? And, even for many power users, the question of what a table or a join or even a database actually will still need some explanation.
    • While we're at it - and I know this is a bit of a tangent - expecting power users to understand basic technical concepts is one thing but in many cases (as this excellent blog entry points out) "people have no way of knowing which questions are meaningful ones to ask, and which are meaningless". Not that I'm saying your average BI consultant/IT guy has a better idea either, far from it.
    • I was pleased to see mention of data cleaning functionality in the Gemini addin. Is this coming from Zoomix?
    • Certainly the Gemini pivot table demo was very impressive. Is this what pivot tables will look like in Office.Next? If so, are we going to see Excel finally grow up to being a full-featured AS client tool for power users in the same way Proclarity Desktop was?
    • Moving on, on one hand we've got Project Madison, which gives us in SQL Server the ability to query vast amounts of data very quickly. Since this is in SQL Server, I would expect to be able to use AS in ROLAP mode on top. On the other hand we have Project Gemini which will give us a super-fast in-memory storage mode for AS but for slightly smaller data volumes. Where do the two meet? Will we be able to create a HOLAP like solution where your raw data stays in SQL Server/Madison and you can create Gemini-mode aggregations? And can you persist the data in Gemini to disk easily, in case of hardware failure? How long does it take to load data into Gemini?
    • Apart from Qlikview, the other product being mentioned in the same breathe as Gemini is TM1, which is of course primarily used for financial apps. So what will the benefits of Gemini be for PerformancePoint and home-grown AS financial cubes? Not only faster storage engine queries, but also faster calculations (although I know only too well that sometimes you can have poor query performance due to calculations even on a warm storage engine cache, even in AS2008). And will you be able to do writeback on a Gemini partition? Now that would be a major performance benefit.
    • Having said that the need to be able to write MDX will keep people like me in a job, it's worth noting that it should indeed be possible to make it easy to write many MDX calculations in Excel. Indeed, one of the cool features of the Intelligencia Query MDX generator is precisely this: the ability to turn spreadsheet style formulas into MDX calculations. And yes, Andrew is in the process of getting this functionality patented.
    • I love the idea of Gemini being AS, but I can imagine that some more relationally orientated people would want the ability to query this new data store with SQL. Of course AS actually can be queried with SQL but it's a very limited subset; it would be great to see tighter integration between AS and the relational engine (along the lines of Oracle's new cube-based materialised views) so the performance gains that AS gives you can be made available to the relational engine.
    • Which thought in turn leads onto whether Madison style MPP can be applied to the Analysis Services engine itself (as I wondered here), either directly or if AS was more tightly integrated with the relational engine. So many permutations of these technologies are possible...
    • As with PerformancePoint and Excel Services, there seems to be yet another dependency on Sharepoint here for the management of Gemini models. Of course some central repository is necessary and it makes sense to use Sharepoint rather than reinvent the wheel, but as Microsoft Watch points out this cross-dependency helps MS sell more licenses. And as anyone who has tried to sell a MS BI solution will tell you, selling more server products can be a problem - it's not necessarily the licence cost but the perfectly valid "we don't use Sharepoint here, we use X and we don't want to have to support Sharepoint just for this" response that has to be overcome. I think this issue part-explains why I've seen so little use of Excel Services with Analysis Services in my work when it seems such a compelling proposition for almost all companies.
    • Lastly, given the current financial crisis, something tells me that when the first CTPs of all this appear next year consultants like me will have plenty of free time to test it out. I know pundits out there are saying that the BI industry will weather any recession because companies will want to compete on information, but I'm sceptical - in my experience most companies don't make rational decisions in circumstances like these (is that heresy coming from a BI consultant?), they just cut budgets and fire staff without thinking much. And IT consultants, perceived as a cost and of lesser importance to the health of the business than things like, say, the CEO's bonus, always feels the pain first. Hohum.
    10/6/2008

    Kilimanjaro, Project Gemini, Project Madison - even more new cool stuff

    Ah, October 6th at last - the date when I was promised All Would Be Revealed. I'd been hearing rumours of something very new and exciting in the world of Microsoft BI for a while but never had any details (they probably reasoned that telling an inveterate blogger like me something top secret would be asking for trouble, but honestly I can keep my mouth shut when I need to); Mosha and Marco both mentioned it recently but didn't give anything away either.

    Anyway, to coincide with the keynote at the BI Conference, more details have shown up on the web:
    http://www.intelligententerprise.com/channels/business_intelligence/showArticle.jhtml?articleID=210700171
    http://www.earthtimes.org/articles/show/microsoft-empowers-enterprises-to-think-bigger-about-business-intelligence,568317.shtml
    http://blogs.forrester.com/information_management/2008/10/bi-crystal-ball.html

    Here's what I gather:

    • Kilimanjaro is the code name for the next release of SQL Server, due 2010
    • Project Madison is the code name for what's being done with DATAllegro
    • Project Gemini is the new, exciting thing: an in-memory storage mode for Analysis Services. To quote Tom Casey in the Intelligent Enterprise article:
      "It's essentially another storage mode for Microsoft SQL Server Analysis Services with access via MDX, so existing applications will be able to take advantage of the performance enhancements."
      But it's clearly more than that - from the Forrester blog entry above:
      "Its Gemini tool (to be available for beta testing sometime in 2009 and general availability in 2010) will not only enable power users to build their own models and BI applications, but easily make them available to power users, almost completely taking IT out of the loop. In Gemini, the in-memory, on the fly modeling will be done via a familiar Excel interface. Once a new model and an application is built in Excel, a power user can then publish the application to Sharepoint, making it instantly available to casual users. Not only that, but the act of publishing the model to Sharepoint also creates a SQLServer Analysis Services cube, which can be instantaneously accessed by any other BI, even non Microsoft, tool"

    So, self-service cube design and in-memory capabilities. Sounds very, very reminiscent of Qlikview and other similar tools; and given that Qlikview is by all accounts growing rapidly, it's an obvious market for MS to get into. I guess what will happen is that end users will get a kind of turbo-charged version of the cube wizard where they choose some tables containing the data they want to work with, and it builds a cube that works in ROLAP-ish mode on top of this new in-memory data store. We'll also get even better query performance too (from COP? pointer-based? data structures).

    All in all, super-exciting and despite all the hype about end-user empowerment I'm sure there'll be even more opportunity for the likes of me to earn consultancy fees off this doing MDX work, tuning etc. But the point about end-user empowerment brings me back to Qlikview: I've never seen it, but it's interesting because I've heard some very positive reports about it and some very negative ones too. From what I can make out it is very fast and easy-to-use, and has some great visualisation capabilities, but I've also heard it's very limited in terms of the calculations you can do (at least compared to MDX); I've also heard that it's marketed on the basis that you don't need a data warehouse to use it - which perhaps explains some of its popularity, but also explains more of the negative comments that it's had, because of course if you don't build a data warehouse you're going to run into all kinds of data quality and data integration issues. Perhaps this last point explains why Qlikview does so appallingly in the BI Survey's rankings of how well products perform in a competitive evaluation. So something to be wary of if you're giving tools to end users...

    Anyway, if you're at the BI Conference and have any more details or thoughts on this, please leave a comment!

    10/1/2008

    Interesting stuff coming from Microsoft soon

    A couple of interesting (and possibly BI-related) technologies are coming soon from Microsoft:

    • I see today via Nick Carr that soon we'll be able to run Windows and SQL Server on Amazon Elastic Compute Cloud (EC2). I wonder if that includes Analysis Services too? If so, that would be handy.
    • Also announced today, the new MS cloud operating system, coming within the month
    • Windows High Performance Computing (HPC) server is due to launch at the beginning of November (see here for a news report on it, here and here for some details). Hmm, I see SQL Server is listed on the 'supported applications' page... surely there's got to be some kind of tie-in here with the whole MatrixDB/DATAllegro MPP stuff?

    Using AS Data Mining to Add Forecast Values to a Cube

    It's ridiculous, really, that OLAP and data mining functionality have co-existed inside Analysis Services for years now yet they never seem to be used together. I only ever work with the OLAP side of things and just play around with data mining - I've never done a serious project with it, unfortunately - and from what I can see the people out there working with Analysis Services data mining don't tend to use it with cubes; it's two separate worlds. Anyway, to get to the point, while I was preparing for SQLBits a few weeks ago I came up with a simple idea of how you could use AS data mining to generate forecast data which could then be loaded into a cube, an approach that I think could be easily adapted for use in most real production systems. Now, having spent an evening in a Munich hotel room testing this out, I present my findings.

    The data set I'm using here was compiled by Tony Rogerson and put into a dimensional model by Allan Mitchell, and it contains the details of postings to SQL Server newsgroups over the past 8 years. For the purposes of this example I aggregated the data so it contained a single fact table with one measure, the number of postings to all newsgroups per day, which joined to just one Time dimension. It's interesting data in fact. Looking at the high level at the number of postings per year from 2002 to 2007, you can see that newsgroup postings hit a peak in 2004 and then went into decline:

    ngpostingsperyear

    I guess this is probably a result of the introduction of the MSDN Forums and their growth at the expense of newsgroups. Looking at the day level, you can see an obvious pattern where the number of postings is much higher during the week than at weekends:

    ngpostingsperday

    The first thing I did was build a simple cube from this data, with a single sum measure representing the number of postings made and a simple Time dimension with Years, Quarters, Months and Dates. This allowed me to explore the data and create the above graphs in Excel. However I only had data up to April 29th 2008, and what I wanted to see was a forecast in my cube of postings for the rest of 2008 and 2009. Here's what I did to get that:

    1. I created a new dimension in my cube called Scenario. It was built from a table containing the following two rows:
      1 Actual
      2 Forecast

      This dimension then had a single attribute hierarchy with two members on it, Actual and Forecast.
    2. I then added a new foreign key column to my existing fact table so I could join the new Scenario dimension to it. This column always contained the value 1 because all the data in my fact table only represented Actual values.
    3. Next I created a new data mining model using the Microsoft Time Series algorithm based off my fact table. The Time dimension foreign key column was used as my Key Time column and the measure representing the number of postings was set to be an Input as well as Predictable.
      miningmodel
      Now I'm certainly not a data mining expert and I didn't spend any time tweaking the model, but it seemed to give reasonably good results at the day level picking up the weekday/weekend variation already noted above:
      predictionchart
      Obviously if you were doing this for real you'd want to spend a bit more time making sure your mining model was giving you good results.
    4. With my mining model processed, I was then able to use the following DMX query to give me a flattened resultset showing the predicted number of postings per day from April 30th 2008 until the end of 2009:
      SELECT FLATTENED
        PredictTimeSeries([Forecasts].[Postings], 611)
      From
        [Forecasts]
      One thing I found in order to get this to work nicely was that I had to change the 'meaningful' surrogate keys on the Time dimension that Allan had provided me with (eg the key for April 30th 2008 was 20080430, May 1st 2008 was 20080501) to meaningless integer surrogate keys where each new day had a key value one greater than the previous day (eg April 30th 2008 was key 3043, May 1st was key 3044 etc). This meant that when I ran the above DMX query it returned surrogate key values for each predicted value that I could use with my time dimension - before I did this I found the DMX query just added 1 to the date key for each predicted date, giving me incorrect key values like 20080430, 20080431, 20080432 and so on for April 30th, May 1st, May 2nd and so on.
    5. I then took the results of this query and used SSIS to load them into a second fact table in my relational data source. The SSIS data flow task looked like this:
      predictdataflow
      Here's what I did at each step:
      1. Used an OLEDB data source, connected up to Analysis Services, to run the initial DMX prediction query
      2. Added a derived column with the value 2, as the foreign key for my Scenario dimension, to show that this was all Forecast data.
      3. Used a data conversion transform to convert the two columns returned by the DMX query from bigints to ints.
      4. Used an OLEDB destination to load the data into a new Forecast fact table (with a structure identical to the existing fact table) in SQL Server.
    6. Finally I went back to my cube and added a new partition to my existing measure group, pointing at the new Forecasts fact table, and reprocessed. This meant I now had data for the Actual member on my Scenario dimension up to April 29th 2008, and data for the Forecast member from April 30th 2008 until the end of 2009. And voila - forecast data loaded into my cube at Day level, which is then of course easy to explore, see aggregated up to Month, Quarter and Year level, and so on. Here's what the data looked like in an Excel pivot chart:
      actualforecastpivotchart
      And of course if I didn't include the Scenario dimension in my query I could look at Actual and Forecast values aggregated up to say the Month level:
      monthview

    Some thoughts on how you'd implement this in a real production system:

    • You're not going to want to (or probably be able to) use the Time Series algorithm at the granularity of your fact table. It might make sense to use it at Day granularity, but you'd probably want to aggregate your other dimensions up to a much higher level (for example aggregate from Product to Product Category, Customer to Country and so on) and then generate the forecasts. This would mean in turn you couldn't just create a new partition in your existing measure group to hold the forecast data, you'd need to create a new measure group. However with some simple MDX you'd be able to integrate the two sets of values and get a similar result to the one I've achieved here.
    • You'd want to automate the training of the data mining model, and this can be done pretty easily in SSIS using the Analysis Services processing task.
    • You might want to store the set of forecasts you generate each day, so you could compare different forecasts made on different days or compare an old forecast with the actual data that subsequently came in. To do this you'd need to add a new dimension to your Forecast measure group which would be Forecast Date (ie the date the forecast data was generated), and maybe make your forecast measures semi-additive (eg last nonempty) with regard to this new Forecast Date dimension.
    • As I said, you'd want to spend a lot more time than I did making sure your data mining model gave you good predictions; the aim of this blog entry is to describe how you'd load the forecast data into AS, not make sure the forecasts were accurate! Trying to forecast 1.5 years into the future as I did is probably a bit optimistic; it might be better to limit your predictions to a few days or a month. I used AS2005 here but AS2008 has got a much improved Time Series algorithm you can find out about here. And if you want to learn more about AS data mining your starting point should be www.sqlserverdatamining.com; I see also that there's a 2008 version of the DM team's book out soon too - I liked the 2005 version and it was certainly a lot more helpful than BOL for understanding all this.

    So overall a fun exercise and one I might work up into a presentation for a user group meeting or something similar. I'd be very interested to hear from anyone who is doing this for real though, to find out how well it works in practice; if I get the chance to implement it with one of my customers I'll blog about it.

    And wouldn't it be cool if, in the next version of Analysis Services, you could automatically add a Forecast partition to your cube in BIDS and have all this work done for you through a wizard?