| Chris's profileChris Webb's BI BlogBlogLists | Help |
|
10/28/2008 Reporting Services-generated MDX, Subselects and Formula CachingAnalysis Services can, of course, cache the results of MDX calculations and share that cache with other users most of the time... but when it can't the result is much unnecessary load on your server as calculations are recomputed again and again, and much longer query times. Unfortunately the details on what caching AS can do when are not properly documented anywhere and I've only learned the bits and pieces that I know about this subject from my own experience and from talking to the likes of Mosha; I've avoided blogging about it because my knowledge is incomplete and this is a complex topic. Happily the upcoming AS2008 Performance Guide should rectify this lack of proper documentation, and I believe Mosha is also going to address this subject in his pre-conf seminar at PASS. However, there is one specific scenario that I thought I'd write about because I come across it regularly and it affects anyone using a client that generates MDX with subselects in, such as the Reporting Services query builder. Let's say you've got an expensive calculated member on your cube - for example, add the following (deliberately slow) calculated measure to the Adventure Works cube: CREATE MEMBER CURRENTCUBE.MEASURES.EXPENSIVECALC AS In Reporting Services you might then use the query builder to create a query that looked like this to select just Calendar Year 2004 on rows and this new calculated measure on columns: SELECT On a cold cache, on my laptop, this query takes 33 seconds; on a warm cache it still takes 33 seconds. This is not good - practically all of the query time is taken up evaluating the calculation, and this calculation is clearly not being cached. The reason why is because the query contains a subselect, and when a query contains a subselect it means that the results of any calculations can only be cached for the lifetime of the query. You can easily rewrite the query as follows to avoid the subselect: SELECT On a cold cache this query still takes 33 seconds, but on a warm cache it comes back in 1 second because the calculation can now be cached and this cache can be reused for subsequent queries. Clearly this is much better. I wonder how many SSRS implementations are suffering from needlessly poor performance just because of this? Subselects are a useful addition to MDX and make calculating 'visual totals' type values very easy, but they should be used with care and not just as a means of selecting individual members. Azure and Analysis Services in the cloudHmm, I'm a bit late on this but it seems at PDC the other day buried in Bob Muglia's keynote was the announcement that we'll be getting a cloud-based version of Analysis Services as part of SQL Data Services. Here's Jamie Thomson's post on this: Here's the official Azure site and white paper: UPDATE: see Mosha's comment below - there was no announcement of cloud-based AS. Oh well... but I've just read about the Office-in-the-cloud announcement. I wonder if we'll get any AS connectivity in the cloud-based Excel? 10/27/2008 Pentaho 2.0 and Pentaho Aggregate DesignerVia Julian Hyde, I see Pentaho have released version 2.0 of their BI suite. As Julian points out in this blog entry here, one of the major new features in this release is the Pentaho Aggregate Designer, which makes it much easier to design aggregate tables for Mondrian. A quick look at the screenshot on Julian's blog suggests that Mondrian is continuing to flatter Analysis Services by imitation (you can even see that the aggregations are being built on the Foodmart database!) - not that this is a bad thing, in fact it's what makes Mondrian so easy to use if you're used to Analysis Services. So Mondrian is definitely catching up in terms of functionality... and in terms of performance, well, as Julian says "Mondrian is beholden to the RDBMS for performance" and those RDBMSes are getting ever faster. I suspect very soon more and more people are going to find that Mondrian plus an open source RDBMS (perhaps the column-store-based LucidDB, which I see already already has support for the new aggregate table designer) is going to be 'good enough' for many BI projects - after all, most AS implementations I see don't involve massive amounts of data, and if Mondrian can give good query response times on a fact table of 20-50 million rows then the decision on which platform to use will be more heavily influenced by price. And what with the current financial crisis, price is going to be an ever-more important factor for many customers. I wonder how long it will be before I see MS BI consultancies starting to offer open source BI, given the relative ease with which you can transfer your cube design and MDX skills between the two platforms? 10/22/2008 Book review: Applied Microsoft SQL Server 2008 Reporting Services, by Teo LachevHere's the deal: if you liked Teo Lachev's book "Applied Microsoft Analysis Services 2005", which I did, then you'll like "Applied Microsoft SQL Server 2008 Reporting Services". Both books share the same format and approach and are exhaustive guides to their subjects. In case you've not seen one of Teo's books before, though, what should you expect? Well, "Applied Reporting Services 2008" is 750 pages long and covers just about every aspect of Reporting Services 2008 that's worth covering, from installation to report design to management to security to extensibility in great detail. This probably isn't a book you're going to sit down and read from cover to cover, but it's a great reference guide and Teo's prose is very clear so reading individual chapters as and when you need to is no chore. The best thing about this book, though, is the fact that on every page you can see Teo's vast real-world experience showing through. For some reason many books on Reporting Services fall into the trap of being Books Online rewritten, never telling you the stuff you really need to know; this book on the other hand is a goldmine of information on how things really work. Let me take the chapter I can really speak from authority on - the chapter on using Reporting Services with Analysis Services. Every other SSRS book I've seen has been written by someone with clearly no practical experience of using SSAS and SSRS together, and parrots the usual line about 'great integration' and 'easy-to-use MDX query designer' etc. Teo on the other hand has, for the first time anywhere, put together all the tips and tricks I've ever seen (plus a few I haven't) on this topic in one place - he lists the pros and cons of the built-in Analysis Services datasource and the OLEDB datasource and how to work around them, how to handle parent/child hierarchies, using extended properties, the lot. All in all, then, a very highly recommended book; it's ideal for both beginners and experienced developers and probably the only Reporting Services 2008 book you'll ever need. You can read some sample chapters and see video demos on the book website here: OLAP PivotTable Extensions new releaseFor some reason I've not blogged about this before, but anyway the ever-industrious Greg Galloway has just released a new version of his OLAP PivotTable Extensions: It's an Excel addin that gives you useful new functionality in Excel 2007 pivot tables connected to Analysis Services, such as the ability to add private calculations, view the MDX behind the pivot table, and (in the new release) search for members and other things. Definitely worth a look, and useful too if you've ever wondered how to work with the Excel pivot table in code. 10/20/2008 Calculated members are better than assignments to real members (at least sometimes)Earlier this year I blogged about how, in my experience, using real members and overwriting them with MDX Script assignments sometimes seemed to perform worse than using calculated members. See here for the full posting: Recently I came across this problem again but was able to get together a proper repro and open a support case with Microsoft. What was happening was that I had a time utility dimension with three members - let's call them member A, member B and member C. A was the default member and had no calculation, it just showed the real values from the rest of the cube; B was a straightforward YTD calculation; C was a really nasty, untuned, same period previous year calculation with lots of special logic. I had a query that included A and B but not C and which was performing really badly, and I found that when I commented out the calculation for C the query returned instantly. So I guessed that for some reason the calculation for C was being evaluated when the query ran, even though C wasn't actually requested in the query and this was confirmed by Microsoft. Unfortunately I was also told this behaviour was 'by design', something to do with prefetching and sonar. Thankfully it doesn't happen for all queries or all cubes (I tried and failed to repro it on Adventure Works) but in my opinion there should never be a situation where the formula engine evaluates calculations that aren't needed - precisely because in some cases it will mean the query runs much longer than it ever should. Probably the worst thing about all this is that you have no idea when exactly it's happening with the tools we've got available at the moment. I suspect that many people out there have poorly-performing cubes because of this issue... One of the workarounds was to use calculated members instead of real members, as I recommended in my original post, but as I also said in that post there are some cases where calculated members are not a good alternative to real members. In my particular case I was able to tune the calculation associated with member C so either the problem didn't happen or C evaluated so quickly it didn't matter - I'm not sure which. Anyway, I stand by what I said in my original posting: use calculated members rather than real members overwritten by Script assignments wherever you can! 10/15/2008 Metadata, semantic web technologies and, yes, Gemini againI saw a very interesting article the other day in Intelligent Enterprise by Seth Grimes, about a newly-published report on the semantic web by David Provost. Grimes is rightly sceptical about how close we are to these ideas reaching fruition and notes that many of the companies mentioned in the report are concentrating on 'semantic data integration'. Frankly, to me the idea of being able to integrate data sourced from different parts of the web is still far-fetched, given the trials you have to go through to integrate data from different parts of the same company. But it did get me thinking: when users get Gemini, where will they get their data from? Yes, they'll be downloading data on 'industry trends' etc from the web in the way we saw in the Gemini demo, but it won't be that often. In well-run companies most of the time the data will come from four sources:
How, then, can Gemini know (beyond its cleverness with column names and analysis of the data within those columns) what data can be integrated with what? Ideally it would have access to some form of metadata embedded in the source data that would help it make the correct decision all in all cases. Where this metadata comes from is a problem much larger than Gemini of course, and goes down to the fundamental question of how an enterprise can keep track of all of its data assets wherever they're stored and understand what the data in each data store actually means; Microsoft is regularly criticised for its lack of a metadata tool and I guess MS is working on something in this area. If, in the first and second scenarios above, Gemini could connect to SQL Server or Analysis Services and see a common layer of metadata that would help it out, allowing it to join data from SQL Server with data from Analysis Services for instance. In a way, Analysis Services is already a metadata layer on top of the data warehouse, containing information on how tables need to be joined and how measures should be aggregated; perhaps this side of it will become more important as the MOLAP engine is superceded? I also think some of the technologies of the semantic web, when applied to the enterprise, could be very useful here; I've only really just come across this stuff myself, but as an introduction I found the one-page explanation of RDF on Twine was very good, and the Microformats site was also full of interesting information. One of the companies mentioned in the semantic web report is Cambridge Semantics, whose product Anzo for Excel is aimed at imposing structure on all of those Excel spreadsheets I referred to in the third bullet point above. The demo on the web concentrates on using the tool for sharing data and collaboration, but as far as I can see the key to getting that to work is selecting data in the spreadsheet and linking it back to a common metadata layer. Of course the obvious criticism of a product like this is that you're again relying on your users to make the effort to mark up their data and do so properly, but if there's an incentive in the form of easier collaboration and - to put it bluntly - less of that boring cutting and pasting then maybe there's a chance they could be persuaded to do so. I could imagine MS offering something very similar to this as part of Excel, with all the central management being done through Sharepoint, and extended throughout the Office suite to Access, Word etc. Gemini would then be able to do a better job of understanding what and how data in an Excel spreadsheet, say, could be integrated with data in the data warehouse. So the bottom-up approach could be combined with the more traditional top-down metadata management approach, and crucially I'd want to see metadata automatically embedded in various the output formats of the server products. For example when you built a SSRS report or created an Analysis Services pivot table in Excel, in both cases I'd want the data by default to retain some record of what it was and where it had come from - the metadata would embedded in the document in the same way as if the user had marked up the document themselves. And this in turn would make it a lot more easily reusable and comprehensible by Gemini and other applications such as Enterprise Search. Metadata would accompany data as it travelled from document to document, data store to data store, format to format. This would then cover the scenario in the third bullet point, allowing you to integrate data from an SSRS report with data from the data warehouse it originally came from, or data from a user-generated Excel spreadsheet that had been marked up manually. So much for structured data; ideally we'd want to be able to include unstructured data too. Some of the applications of natural language processing mentioned in the semantic web report looked very interesting, especially OpenCalais (which already has integration with MOSS 2007 I see). If you were looking at sales figures for a particular customer in Gemini, wouldn't you also want to be able to look for documents and web pages that discussed sales for that customer too? And I've often thought that while the super-simple type-a-search-term approach for Search has worked brilliantly over the last few years, there's a niche for a power-user interface for search too, kind of like a Proclarity Desktop for search where you could drag and drop combinations of terms from the central metadata repository and see what you found; could that be Gemini too? A tool for searching, integrating, aggregating and manipulating all enterprise data, not just numeric data? Over the last few days I've seen the Gemini team (notably Amir) engaging with bloggers like me about our concerns through comments on our postings. I appreciate that, but my position hasn't changed: I think the technology is cool but there's too great a risk that it will be misused as it stands at the moment. Relying on oversight from the IT department isn't enough; if MS had a convincing metadata story extending to all data types and data sources, metadata that Gemini could use, it would go a long way to addressing my concerns. 10/14/2008 PALO supports MDX and XMLAThanks to Marco Groeneveld for sending me this link:
The open source OLAP server PALO now supports XMLA and MDX. Another victory for MDX! 10/12/2008 Last thoughts on Gemini for the momentDespite the worldwide financial meltdown, all talk on the MS BI blogosphere has been about Project Gemini this week. Even though no-one knows much about it - and certainly no-one knows everything about it either, as we've been told there are more announcements to come - the reaction has been pretty clear: the back-end technology looks cool, but the strategy of putting Gemini in Excel on the desktop is an invitation to bypass the data warehouse and create spreadmart hell whatever MS says about Sharepoint-based management of Gemini models. I've already linked to just about everyone who's expressed this opinion apart from Mick Horne, whose three posts here, here and here are well worth reading as they put the case against Gemini in its currently proposed form very clearly. Choice quote: "Where is the single version of the truth in this architecture? I’ve just spent 4 years of my life trying to convince users to stop using Excel as a data store and here are Microsoft positively encouraging it. Hell will freeze over before this capability is used responsibly in most organisations". So what would I like to see MS actually do with Gemini? From the comments on Marco's blog, at the moment it sounds like the priority is to get the Gemini storage mode working for local cubes and Excel rather than put everything in place for the full server version of Analysis Services. I would prefer the emphasis to be reversed and have Gemini storage mode ready as soon as possible on the server side. It's clearly going to provide a massive performance boost when it does arrive (and let's not forget it's at least two years away), and with other COP databases and data warehouse appliances improving every day there's going to be significantly more competitive pressure on Analysis Services and SQL Server-based data warehouse projects in the future. I'd hate to see current MS BI shops start abandoning their Analysis Services implementations because they can get better query performance and scalability elsewhere. But whatever us bloggers say I'm sure we'll get Gemini in the form we've already been shown. The reason is that whatever the rights and wrongs of it from a BI consultant's point of view, the people who use Excel will definitely want this and so there will be an overwhelming commercial case for it. This kind of desktop, DIY BI is in a way similar to illegal drugs: there are always some people that want it, a certain number of them are always going to do it even though they know they shouldn't, so you've got two choices - either legalise it and then hope to control it, as with Gemini, or throw all your efforts into outlawing it. With the first option you run the risk of encouraging what you wanted to discourage and ending up with a worse problem, with the second option you run the risk of people resenting your rules so much they end up being widely ignored. What's the best option? UPDATE: one last link for you - Nigel Pendse of the OLAP Report gives an unreservedly positive review here: 10/7/2008 More thoughts on Project GeminiI've now had a chance to watch the demos and read all the first-hand accounts of what was announced yesterday (see Marco, Mosha, Tim Kent, Jeremy Kashel, Richard Tkachuk). Here are some unstructured thoughts and questions:
10/6/2008 Kilimanjaro, Project Gemini, Project Madison - even more new cool stuffAh, October 6th at last - the date when I was promised All Would Be Revealed. I'd been hearing rumours of something very new and exciting in the world of Microsoft BI for a while but never had any details (they probably reasoned that telling an inveterate blogger like me something top secret would be asking for trouble, but honestly I can keep my mouth shut when I need to); Mosha and Marco both mentioned it recently but didn't give anything away either. Anyway, to coincide with the keynote at the BI Conference, more details have shown up on the web: Here's what I gather:
So, self-service cube design and in-memory capabilities. Sounds very, very reminiscent of Qlikview and other similar tools; and given that Qlikview is by all accounts growing rapidly, it's an obvious market for MS to get into. I guess what will happen is that end users will get a kind of turbo-charged version of the cube wizard where they choose some tables containing the data they want to work with, and it builds a cube that works in ROLAP-ish mode on top of this new in-memory data store. We'll also get even better query performance too (from COP? pointer-based? data structures). All in all, super-exciting and despite all the hype about end-user empowerment I'm sure there'll be even more opportunity for the likes of me to earn consultancy fees off this doing MDX work, tuning etc. But the point about end-user empowerment brings me back to Qlikview: I've never seen it, but it's interesting because I've heard some very positive reports about it and some very negative ones too. From what I can make out it is very fast and easy-to-use, and has some great visualisation capabilities, but I've also heard it's very limited in terms of the calculations you can do (at least compared to MDX); I've also heard that it's marketed on the basis that you don't need a data warehouse to use it - which perhaps explains some of its popularity, but also explains more of the negative comments that it's had, because of course if you don't build a data warehouse you're going to run into all kinds of data quality and data integration issues. Perhaps this last point explains why Qlikview does so appallingly in the BI Survey's rankings of how well products perform in a competitive evaluation. So something to be wary of if you're giving tools to end users... Anyway, if you're at the BI Conference and have any more details or thoughts on this, please leave a comment! 10/1/2008 Interesting stuff coming from Microsoft soonA couple of interesting (and possibly BI-related) technologies are coming soon from Microsoft:
Using AS Data Mining to Add Forecast Values to a CubeIt's ridiculous, really, that OLAP and data mining functionality have co-existed inside Analysis Services for years now yet they never seem to be used together. I only ever work with the OLAP side of things and just play around with data mining - I've never done a serious project with it, unfortunately - and from what I can see the people out there working with Analysis Services data mining don't tend to use it with cubes; it's two separate worlds. Anyway, to get to the point, while I was preparing for SQLBits a few weeks ago I came up with a simple idea of how you could use AS data mining to generate forecast data which could then be loaded into a cube, an approach that I think could be easily adapted for use in most real production systems. Now, having spent an evening in a Munich hotel room testing this out, I present my findings. The data set I'm using here was compiled by Tony Rogerson and put into a dimensional model by Allan Mitchell, and it contains the details of postings to SQL Server newsgroups over the past 8 years. For the purposes of this example I aggregated the data so it contained a single fact table with one measure, the number of postings to all newsgroups per day, which joined to just one Time dimension. It's interesting data in fact. Looking at the high level at the number of postings per year from 2002 to 2007, you can see that newsgroup postings hit a peak in 2004 and then went into decline: I guess this is probably a result of the introduction of the MSDN Forums and their growth at the expense of newsgroups. Looking at the day level, you can see an obvious pattern where the number of postings is much higher during the week than at weekends: The first thing I did was build a simple cube from this data, with a single sum measure representing the number of postings made and a simple Time dimension with Years, Quarters, Months and Dates. This allowed me to explore the data and create the above graphs in Excel. However I only had data up to April 29th 2008, and what I wanted to see was a forecast in my cube of postings for the rest of 2008 and 2009. Here's what I did to get that:
And of course if I didn't include the Scenario dimension in my query I could look at Actual and Forecast values aggregated up to say the Month level: Some thoughts on how you'd implement this in a real production system:
So overall a fun exercise and one I might work up into a presentation for a user group meeting or something similar. I'd be very interested to hear from anyone who is doing this for real though, to find out how well it works in practice; if I get the chance to implement it with one of my customers I'll blog about it. And wouldn't it be cool if, in the next version of Analysis Services, you could automatically add a Forecast partition to your cube in BIDS and have all this work done for you through a wizard? |
|
|