New Semi Join, Anti Join And Query Folding Functionality In Power Query

There are a couple of nice new features to do with table joins (or merges as they are known in M) and query folding in Power Query in the April release of Power BI Desktop that I want to highlight.

Anti Joins now fold

First of all, a few months ago I wrote a post about how the built-in anti join functionality didn’t fold in Power Query. The good news is that it now does on SQL Server-related sources, so no more workarounds are needed. For example, if you have two tables in a SQL Server database called Fruit1 and Fruit2 and two Power Query queries that get data from those tables:

…then the following M code:

let
  Source = Table.Join(
    Fruit1,
    {"Fruit1"},
    Fruit2,
    {"Fruit2"},
    JoinKind.LeftAnti
  )
in
  Source

…returns the following table of fruits that are in the Fruit1 table and not in the Fruit2 table:

Of course that’s what the code above returned in previous versions of Power Query too. The difference now is that query folding occurs and the following SQL code is generated:

select [$Outer].[Fruit1],
    cast(null as nvarchar(50)) as [Fruit2]
from 
(
    select [_].[Fruit] as [Fruit1]
    from [dbo].[Fruit1] as [_]
) as [$Outer]
where not exists 
(
    select 1
    from 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner]
    where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null
)

New join kind: semi joins

There are also two brand new join kinds you can use in the Table.Join and Table.NestedJoin functions: JoinKind.LeftSemi and JoinKind.RightSemi. Semi joins allow you to select the rows in one table that have matching values in another table. Using the Fruit1 and Fruit2 tables above, the following M code:

let
  Source = Table.Join(
    Fruit1, 
    {"Fruit1"}, 
    Fruit2, 
    {"Fruit2"}, 
    JoinKind.LeftSemi
  )
in
  Source

…returns all the rows in Fruit1 where there is a matching value in Fruit2:

Here’s the SQL that is generated:

select [$Outer].[Fruit1],
    cast(null as nvarchar(50)) as [Fruit2]
from 
(
    select [_].[Fruit] as [Fruit1]
    from [dbo].[Fruit1] as [_]
) as [$Outer]
where exists 
(
    select 1
    from 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner]
    where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null
)

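JoinKind.RightSemi works in the same way but in the other direction: it returns the rows in the second table that have a matching value in the first. As a quick sketch (I haven't shown the output or the SQL it generates here), the following M code should return the rows in Fruit2 that have a matching value in Fruit1:

let
  Source = Table.Join(
    Fruit1,
    {"Fruit1"},
    Fruit2,
    {"Fruit2"},
    JoinKind.RightSemi
  )
in
  Source
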
The ?? operator now folds

The M language’s ?? coalesce operator is used for replacing null values, and it now folds on SQL Server-related sources too. For example, the M query in the previous section that did a semi join on Fruit1 and Fruit2 returns a table where all the rows in the Fruit2 column contain null values. The following M query adds a new custom column that returns the text value “Nothing” when the Fruit2 column contains a null:

let
  Source = Table.Join(
    Fruit1, 
    {"Fruit1"}, 
    Fruit2, 
    {"Fruit2"}, 
    JoinKind.LeftSemi
  ), 
  ReplaceNulls = Table.AddColumn(
    Source, 
    "NullReplacement", 
    each [Fruit2] ?? "Nothing"
  )
in
  ReplaceNulls

Here’s the SQL generated for this, where the ?? operator is folded to a CASE statement:

select [_].[Fruit1] as [Fruit1],
    [_].[Fruit2] as [Fruit2],
    case
        when [_].[Fruit2] is null
        then 'Nothing'
        else [_].[Fruit2]
    end as [NullReplacement]
from 
(
    select [$Outer].[Fruit1],
        cast(null as nvarchar(50)) as [Fruit2]
    from 
    (
        select [_].[Fruit] as [Fruit1]
        from [dbo].[Fruit1] as [_]
    ) as [$Outer]
    where exists 
    (
        select 1
        from 
        (
            select [_].[Fruit] as [Fruit2]
            from [dbo].[Fruit2] as [_]
        ) as [$Inner]
        where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null
    )
) as [_]

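The ?? operator isn't limited to Table.AddColumn either. As a rough sketch (I haven't checked whether this particular pattern folds), you could also use it inside Table.TransformColumns to replace the nulls in the Fruit2 column in place rather than adding a new column:

let
  Source = Table.Join(
    Fruit1,
    {"Fruit1"},
    Fruit2,
    {"Fruit2"},
    JoinKind.LeftSemi
  ),
  ReplaceNullsInPlace = Table.TransformColumns(
    Source,
    {{"Fruit2", each _ ?? "Nothing", type text}}
  )
in
  ReplaceNullsInPlace
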
[Thanks to Curt Hagenlocher for the information in this post]

Displaying Azure Maps In A Power BI Paginated Report

The built-in mapping functionality in Power BI paginated reports is fairly basic. However, the integration of Power Query into Power BI paginated reports gives you an interesting new way of creating maps in paginated reports: you can call the Azure Maps API using Power Query and display the image returned in an Image report item. In this blog post I’ll show you how.

Here’s a quick summary of what I’m going to do:

  • Call the API from https://data.police.uk/ (specifically the Crimes At Location endpoint) using Power Query to get all the recorded crimes within a one mile radius of a given latitude and longitude in a given month for any location in England, Wales or Northern Ireland
  • Take this list of crimes and pass them to the Azure Maps API Get Map Static Image endpoint to return an image of a map with the crime locations on it
  • Display this image in an Image report part in a paginated report

And here’s an example of what the final paginated report will look like:

Step 1: Sign up for the Azure Maps API

In order to call the Azure Maps API you’ll need to go to the Azure Portal and create an Azure Maps resource. The pricing is very reasonable: the first 1000 calls to the endpoint used here are free and after that it’s $4.50 per month for up to 500,000 calls, which should be more than enough for BI purposes.

Step 2: Create Shareable Cloud Connections

To connect to data sources in Power Query in paginated reports you need to create Shareable Cloud Connections in the Power BI portal. You’ll need two connections for this report: one for the Azure Maps API with the URL https://atlas.microsoft.com/map/static/png and one for the Crime API with the URL https://data.police.uk/api/crimes-street/all-crime. Both SCCs should use the Anonymous authentication method and the Public privacy level, and have the Skip Test Connection option checked:

Step 3: Create a paginated report and Power Query query to call APIs

After creating a new paginated report in Power BI Report Builder you need to create a dataset (called AzureMap here) to get data from the APIs. This dataset uses Power Query as a source and has one main query (also called AzureMap) and four parameters:

  • lon and lat, to hold the latitude and longitude of the location to get crime data for, which will also be the centre point of the map
  • zoom, which is the zoom level of the map
  • yearmonth, which is the year and month in YYYY-MM format to get crime data for:

Here’s the M code for the query:

let
  CallCrimeAPI = Json.Document(
    Web.Contents(
      "https://data.police.uk/api/crimes-street/all-crime",
      [
        Query = [
          lat  = Text.From(lat),
          lng  = Text.From(lon),
          date = yearmonth
        ]
      ]
    )
  ),
  ToTable = Table.FromList(
    CallCrimeAPI,
    Splitter.SplitByNothing(),
    null,
    null,
    ExtraValues.Error
  ),
  First50 = Table.FirstN(ToTable, 50),
  ExpandColumn1 = Table.ExpandRecordColumn(
    First50,
    "Column1",
    {"location"},
    {"location"}
  ),
  Expandlocation = Table.ExpandRecordColumn(
    ExpandColumn1,
    "location",
    {"latitude", "street", "longitude"},
    {
      "location.latitude",
      "location.street",
      "location.longitude"
    }
  ),
  JustLatLon = Table.SelectColumns(
    Expandlocation,
    {"location.longitude", "location.latitude"}
  ),
  TypeToText = Table.TransformColumnTypes(
    JustLatLon,
    {
      {"location.longitude", type text},
      {"location.latitude", type text}
    }
  ),
  MergedColumns = Table.CombineColumns(
    TypeToText,
    {"location.longitude", "location.latitude"},
    Combiner.CombineTextByDelimiter(
      " ",
      QuoteStyle.None
    ),
    "LongLat"
  ),
  PrefixPipe = Table.TransformColumns(
    MergedColumns,
    {{"LongLat", each "|" & _, type text}}
  ),
  GetString = "|"
    & Text.Combine(PrefixPipe[LongLat]),
  QueryRecord = [
    #"subscription-key"
      = "InsertYourSubscriptionKeyHere",
    #"api-version" = "2022-08-01",
    layer = "basic",
    style = "main",
    #"zoom" = Text.From(zoom),
    center = Text.From(lon) & ", " & Text.From(lat),
    width = "768",
    height = "768"
  ],
  AddPins = try
    Record.AddField(
      QueryRecord,
      "pins",
      "default|sc0.5" & GetString
    )
  otherwise
    QueryRecord,
  CallAzureMapsAPI = Web.Contents(
    "https://atlas.microsoft.com/map/static/png",
    [Query = AddPins]
  ),
  ToText = Binary.ToText(
    CallAzureMapsAPI,
    BinaryEncoding.Base64
  ),
  OutputTable = #table(
    type table [image = text],
    {{ToText}}
  )
in
  OutputTable

You need to put all this code in a single M query to avoid the “Formula.Firewall: Query ‘Query1’ (step ‘xyz’) references other queries or steps, so it may not directly access a data source. Please rebuild this data combination.” error. You can find out more about this error by watching my data privacy video here.

A few things to note:

  • The CallCrimeAPI step calls the Get Crimes At Location API endpoint to get all the reported crimes within a one mile radius of the given latitude and longitude in the given year and month.
  • Because of the way I’m sending the crime location data to the Azure Maps API I limited the number of locations to 50, in the First50 step, to avoid hitting errors relating to the maximum length of a URL.
  • The GetString step returns a pipe delimited list of longitudes and latitudes of crime locations for the Azure Maps API to display as pins on the map. However, some error handling is needed in case there are no reported crimes in the given location or month, and that happens in the AddPins step (there’s a sketch of what the final pins value looks like after this list).
  • The QueryRecord step contains all the parameters to send to the Azure Maps Get Map Static Image endpoint. This docs page has more information on what’s possible with this API – I’m barely scratching the surface of what’s possible in this example.
  • Authentication to the Azure Maps API is via a subscription key which you’ll need to pass to the subscription-key parameter. You can get the key from the resource created in step 1 in the Azure Portal.
  • The API returns an image binary which is converted to text and returned in a table with one column and one row in the ToText and OutputTable steps. The code is similar to what I showed in this blog post but luckily I didn’t seem to need to break it up into multiple rows.
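
To make the pins parameter a bit more concrete, here’s a rough sketch of how the GetString and AddPins steps build up the string that gets sent to the Azure Maps API. The two coordinates used here are made up purely for illustration:

let
  // Two hypothetical crime locations, already prefixed with "|" as in the PrefixPipe step
  SampleLongLat = {"|-0.127 51.503", "|-0.142 51.501"},
  GetString = "|" & Text.Combine(SampleLongLat),
  // The value that ends up in the pins query parameter:
  // "default|sc0.5||-0.127 51.503|-0.142 51.501"
  Pins = "default|sc0.5" & GetString
in
  Pins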

Step 4: Create Power Query query to return values for Zoom parameter

The Zoom parameter of the Get Map Static Image API endpoint accepts a value between 0 and 20, which represents the zoom level of the displayed map. You need to create a separate dataset and M query to return a table containing those values with the following code:

let
  Source = {0 .. 20}, 
  #"Converted to table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ), 
  #"Changed column type"
    = Table.TransformColumnTypes(
    #"Converted to table", 
    {{"Column1", Int64.Type}}
  ), 
  #"Renamed columns" = Table.RenameColumns(
    #"Changed column type", 
    {{"Column1", "Zoom"}}
  )
in
  #"Renamed columns"

Step 5: Create paginated report parameters

Next you need to create four parameters in the paginated report for the longitude, latitude, zoom level and year month:

To make it easy for end users to select a zoom level, you need to bind the available values for the zoom parameter to the table returned by the dataset from the previous step:

Step 6: Display the map in an Image report part

In the paginated report itself the only interesting thing is the configuration of the Image report part in the centre of the report:

You need to set the image source to “Database”, bind it to the following expression

=First(Fields!image.Value, "AzureMap")

…which gets the text value from the sole row and column in the table returned by the AzureMap dataset created in step 3, and set the MIME type to be “image/png”.

And that’s it! After publishing you can enter any latitude and longitude in England, Wales or Northern Ireland, a year and month, and a zoom level, and get all the reported crimes on a map:

You can download the .rdl file with the paginated report here (remember to edit the AzureMap query to insert your own Azure Maps subscription key).

Power BI Paginated Reports That Connect To Web Services And Excel

By far the most exciting announcement for me this week was the new release of Power BI Report Builder that has Power Query built in, allowing you to connect to far more data sources in paginated reports than you ever could before. There’s a very detailed blog post and video showing you how this new functionality works here:

https://powerbi.microsoft.com/en-us/blog/get-data-with-power-query-available-in-power-bi-report-builder-preview

The main justification for building this feature was to allow customers to build paginated reports on sources like Snowflake or BigQuery, something which had only been possible before if you used an ODBC connection via a gateway or built a semantic model in between – neither of which is an ideal solution. However, it also opens up a lot of other possibilities.

For example, you can now build paginated reports on web services (with some limitations). I frequently get asked about building regular Power BI reports that get data from web services on demand – something which isn’t possible, as I explained here. To test using paginated reports on a web service I registered for Transport for London’s APIs and built a simple report on top of their Journey Planner API (Transport for London are the organisation that manages public transport in London). This report allows you to enter a journey starting point and ending point anywhere in or around London, calls the API and returns a table with different routes from the start to the destination, along with timings and instructions for each route. Here’s the report showing different routes for a journey from 10 Downing Street in London to Buckingham Palace:
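
I won’t go through the report’s M query in detail, but the pattern is the same as the Azure Maps example above: a Power Query dataset that calls Web.Contents against the API and returns a table for the report to display. Here’s a very rough sketch of the kind of query involved; the endpoint path and the field names used below are assumptions rather than the exact code from my report, so check the TfL API documentation before using anything like this:

let
  // In the real report these would be paginated report parameters
  StartPoint = "10 Downing Street, London",
  EndPoint = "Buckingham Palace, London",
  // Assumed endpoint path for the Journey Planner API
  CallJourneyAPI = Json.Document(
    Web.Contents(
      "https://api.tfl.gov.uk",
      [
        RelativePath = "Journey/JourneyResults/"
          & Uri.EscapeDataString(StartPoint)
          & "/to/"
          & Uri.EscapeDataString(EndPoint)
      ]
    )
  ),
  // Assumed shape of the response: a list of journey records
  Journeys = Table.FromList(
    CallJourneyAPI[journeys],
    Splitter.SplitByNothing(),
    null,
    null,
    ExtraValues.Error
  ),
  Expanded = Table.ExpandRecordColumn(
    Journeys,
    "Column1",
    {"startDateTime", "arrivalDateTime", "duration"},
    {"Start", "Arrival", "DurationMinutes"}
  )
in
  Expanded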

You can also build paginated reports that connect to Excel workbooks that are stored in OneDrive or OneLake, meaning that changes made in the Excel workbook show up in the report as soon as the workbook is saved and closed:

So. Much. Fun. I’ll probably develop a presentation for user groups explaining how I built these reports soon.

And yes, if you need to export data to Excel on a schedule, paginated reports are now an even better choice. You know your users want this.

Best Practices For Power BI On Databricks Webinar

I recently took part in a webinar with Denny Lee, Liping Huang and Marius Panga from Databricks on the subject of best practices for using Power BI on Databricks. You can view the recording on LinkedIn here:

https://www.linkedin.com/video/live/urn:li:ugcPost:7174102151407779841

…or on YouTube here:

My section at the beginning covering Power BI best practices for Import and DirectQuery doesn’t contain any new information – if you’ve been following the DirectQuery posts on this blog or read the DirectQuery guidance docs here and here then there won’t be any surprises. What I thought was really useful, though, was hearing the folks from Databricks talk about best practices on the Databricks side and this took up the majority of the webinar. Definitely worth checking out.

How Much Does Copilot Cost In Microsoft Fabric?

There’s a lot of excitement about Copilot in Power BI and in Fabric as a whole. The number one question I’m asked about Copilot by customers is “How much does it cost?” and indeed there have been two posts on the Fabric blog here and here attempting to answer this question. The point I want to make in this post, though, is that it’s the wrong question to ask.

Why? Well, to start off with, Copilot isn’t something you buy separately. Every time you use Copilot in Fabric it uses compute from a capacity, either a P SKU or an F SKU, just the same as if you opened a report or refreshed a semantic model. You buy the capacity and that gives you a pool of compute that you can use however you want, and using Copilot is just one of the things you can use that compute for. No-one ever asked me how much it cost in dollars to open a report in Power BI or refresh a semantic model, so why ask this question about Copilot?

Of course I understand why Copilot feels like a special case: customers know a lot of users will want to play with it and they also know how greedy AI is for resources. Which brings me back to the point of this post: the question you need to ask about Copilot is “How much of my Power BI/Fabric capacity’s compute will be used by Copilot if I turn it on?”. Answering this question in terms of percentages is useful because if you consistently go over 100% usage on a Fabric capacity then you will either need to buy more capacity or experience throttling. And if Copilot does end up using a lot of compute, and you don’t want to buy more capacity or deal with throttling, then maybe you do want to limit its usage to a subset of your users or even turn it off completely?

To a certain extent the question about percentage usage can be answered with the Premium Capacity Metrics app. For example, I opened up a gen2 dataflow on a Premium capacity that returns the following data:

I expanded the Copilot pane and typed the following prompt:

Filter the table so I only have products that were produced in the country France

And here’s what I got back – a filtered table showing the row with the Producer Country France:

So what impact did this have on my capacity? Since I know this was the only time I used Copilot the day I wrote this post it was easy to find the relevant line in the “Background operations for timerange” table on the TimePoint Detail page of the Premium Capacity Metrics app:

There are two important things to note:

  • For the 30-second window I selected the operation above used 0.02% of my capacity
  • Copilot operations are classed as “background operations” so their cost is smoothed out over 24 hours. Therefore the operation above used 0.02% of my capacity for 24 hours from a few minutes after the point I hit enter and the operation ran; in this particular case I ran my test just before 16:20 on Friday March 15th and the operation used 0.02% of my capacity from 16:20 on Friday the 15th until 16:20 on Saturday March 16th.

How can you extrapolate tests like this to understand the likely load on your capacity in the real world? With great difficulty. Almost all the Fabric/Power BI Copilot experiences available today are targeted at people developing rather than just consuming, so that naturally limits the opportunities people have to use Copilot. Different prompts will come with different costs (as the blog posts mentioned above explain), a single user will want to use Copilot more than once in a day and you’ll have more than one user wanting to use Copilot. What’s more, going forward there will be more and more opportunities to use Copilot in different scenarios and as Copilot gets better and better your users will want to use it more. The compute usage of different Fabric Copilots may change in the future too. So your mileage will vary a lot.

Is it possible to make a rough estimate? Let’s say you have 20 developers (which I think is reasonable for a P1/F64 – how many actively developed solutions are you likely to have on any given capacity?) writing 20 prompts per day. If each prompt uses 0.02% of your capacity then 20 * 20 * 0.02% = a maximum of 8% of your capacity used by Copilot for the whole day. That’s not inconsiderable and I’m sure someone will leave a comment saying I’m underestimating what usage will be.

Which brings me to my last point: should you even see Copilot as being different from anything else you can do in Fabric that consumes compute? Or as an optional extra or additional cost? After all, dataflows consume compute, you can enable or disable dataflows in your tenant in the same way you can enable or disable Copilot, but very few customers disable dataflows because they see the benefit of using them. Turning off dataflows would reduce the load on your capacity but it would also stop your users from being so productive, and why would you do that? If we at Microsoft deliver on the promise of Copilot (and believe me, we’re working hard on this) then the productivity gains it brings to your developers should offset the cost of any extra capacity you need to buy – if indeed you need to buy any extra capacity.

So, to sum up, if you enable Copilot in Fabric you will see additional load on your capacities and you may need to buy more capacity as a result – but the benefits will be worth it. Predicting that additional load is difficult but it’s no different from predicting how your overall Fabric capacity usage will grow over time, as more and more reports, semantic models, notebooks, warehouses and so on get created and used. Rather than doing lots of complex calculations based on vague assumptions to try to predict that load, my advice is that you should use the Capacity Metrics app to monitor your actual usage and buy that extra capacity when you see you’re going to need it.

Overhead Of Getting Relationship Columns In Power BI DirectQuery Mode

Many Power BI connectors for relational databases, such as the SQL Server connector, have an advanced option to control whether relationship columns are returned or not. By default this option is on. Returning these relationship columns adds a small overhead to the time taken to open a connection to a data source and so, for Power BI DirectQuery semantic models, turning this option off can improve report performance slightly.

What are relationship columns? If you connect to the DimDate table in the Adventure Works DW 2017 sample database using Power Query, you’ll see them on the right-hand side of the table. The following M code:

let
  Source = Sql.Database("localhost", "AdventureWorksDW2017"),
  dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
  dbo_DimDate

…shows the relationship columns:

Whereas if you explicitly turn off the relationship columns by deselecting the “Include relationship columns” checkbox:

…you get the following M code with the CreateNavigationProperties property set to false:

let
  Source = Sql.Database("localhost", "AdventureWorksDW2017", [CreateNavigationProperties=false]),
  dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data]
in
  dbo_DimDate

…and you don’t see those extra columns.

How much overhead does fetching relationship columns add? It depends on the type of source you’re using, how many relationships are defined and how many tables there are in your model (because the calls to get this information are not made in parallel). It’s also, as far as I know, impossible to measure the overhead from any public telemetry such as a Profiler trace or to deduce it by looking at the calls made on the database side. The overhead only happens when Power BI opens a connection to a data source and the result is cached afterwards, so it will only be encountered occasionally and not for every query that is run against your data source. I can say that the overhead can be quite significant in some cases though and can be made worse by other factors such as a lack of available connections or network/gateway issues. Since I have never seen anyone actually use these relationship columns in a DirectQuery model – they are quite handy in Power Query in general though – you should always turn them off when using DirectQuery mode.

[Thanks to Curt Hagenlocher for the information in this post]

Measuring The Total Time Spent Querying Sources In Power BI DirectQuery Mode

If you’re tuning a DirectQuery semantic model in Power BI one of the most important things you need to measure is the total amount of time spent querying your data source(s). Now that the queries Power BI generates to get data from your source can be run in parallel, you can’t just sum up the durations of the individual queries to get the end-to-end duration. The good news is that there are new trace events available in Log Analytics (though not in Profiler at the time of writing) which solve this problem.

The events have the OperationName ProgressReportBegin/ProgressReportEnd and the OperationDetailName ParallelSession. Here’s a simple Log Analytics query that you can use to see how this event works:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(10min)
| where OperationName in
("QueryBegin", "QueryEnd",
"DirectQueryBegin", "DirectQueryEnd",
"ProgressReportBegin", "ProgressReportEnd")
| project OperationName, OperationDetailName, EventText, TimeGenerated,
DurationMs,CpuTimeMs, DatasetMode, XmlaRequestId
| order by TimeGenerated asc

Here’s what this query returned for a single DAX query that generated multiple SQL queries (represented by DirectQueryBegin/End event pairs) against a relational data source:

Notice that there is just one ProgressReportBegin/End event pair per DAX query; this is always the case, no matter how many fact tables or data sources are used, at least as far as I know. The ProgressReportBegin event for ParallelSession comes before the DirectQueryBegin/End event pairs and the associated End event comes after the final DirectQueryEnd event. The DurationMs column for the ProgressReportEnd event gives you the total duration in milliseconds of all the DirectQuery events. In this case there were six SQL queries sent to the source (and so six DirectQueryBegin/End event pairs), each of which took between 1 and 2 seconds. However, since all of these queries ran in parallel, the overall duration was still just over 2 seconds.

dataCoverageDefinition: A New Optimisation For Hybrid Tables In Power BI

Hybrid tables – tables which contain both Import mode and DirectQuery mode partitions to hold data from different time periods – have been around for a while. They are useful in cases where your historic data doesn’t change but your most recent data changes very frequently and you need to reflect those changes in your reports; you can also have “reverse hybrid tables” where the latest data is in Import mode but your historic data (which may not be queried often but still needs to be available) is in DirectQuery mode. Up to now they had a problem though: even when you were querying data that was in the Import mode partition, Power BI still sent a SQL query to the DirectQuery partition and that could hurt performance. That problem is now solved with the new dataCoverageDefinition property on the DirectQuery partition.

You can find full documentation here:
https://learn.microsoft.com/en-us/analysis-services/tom/table-partitions?view=asallproducts-allversions

What dataCoverageDefinition does is tell Power BI what data is present in the DirectQuery partition so it knows whether to generate a SQL query to get data from your DirectQuery source or not. It takes the form of a simple DAX expression that returns True for rows that are stored in the DirectQuery partition, for example:

RELATED('DimDate'[CalendarYear]) IN {2017,2018,2019}

If you’re working with large fact tables in DirectQuery mode and don’t have a religious objection to some of that data being stored in Import mode, you should check out hybrid tables because they can really improve report performance!

Query Caching In Power BI Premium

The more you monitor something the more likely it is that you’ll see something strange you can’t explain. This is true of several customers I’ve spoken to recently who saw DAX queries run in the middle of the night by people who were definitely not online at the time and who were worried about the resulting load on their Premium capacities. What is the cause – hackers? Ghosts? In a lot of cases the culprit is actually Power BI’s query caching feature.

Query caching is a feature that is available in Premium and Fabric capacities for Import mode semantic models. It allows Power BI to pre-cache the data needed by some report visuals and dashboard tiles so the reports and dashboards open faster. The documentation here is pretty good but, as always, a few examples are useful to help understand how it works and there are a couple of extra things to point out that are not on the docs page I just linked to.

Let’s start with a simple pbix file called FruitSalesModel.pbix containing a single blank report page and one table in the Import mode semantic model:

The different fruit names will make it easy to link different visuals to different DAX queries. There’s also one measure called Sales Amount that sums the values in the Sales column.

If you publish this pbix to an otherwise empty workspace hosted on a Premium capacity, you’ll see that a semantic model and a blank report are created from it.

Query caching can be enabled on the semantic model by going to the model’s Settings page, expanding the Query Caching section and selecting “On”:

Enabling query caching now means that the Power BI Service may execute DAX queries to populate the cache immediately after a semantic model refresh takes place. If a query has been cached then, when an end user opens a report, Power BI can use the cached resultset for visuals which means that the report will open faster.

You can see how long all these queries take to run by looking at the Refresh History dialog and the details for an individual refresh (as mentioned in this blog post). The Query Cache line tells you when this query activity took place and how long it took:

At this point in our example, however, no queries will be executed because the only report connected to this semantic model is completely blank.

Which queries get cached? This is documented but I always prefer to run my own tests to see what happens; you can see the DAX queries being run after a refresh by running a Profiler trace on the semantic model.

Next, edit the currently-blank report in the workspace so that it has two pages: add a single card to the first page that shows the value of the Sales Amount measure for the product Apples, and a single card to the second page that shows Sales Amount for the product Grapes. Make sure the first page is the one that is displayed when a user opens the report, close the report without going back to Reading View, and then refresh the semantic model. A Profiler trace shows… no queries being executed after the refresh. This is because only visuals that have been viewed by an end user in Reading View (not Editing mode) in the browser have their queries cached.

If you then open the report in Reading View and view both pages of the report and then refresh the semantic model again, you’ll see one query appear in the trace:

This is the query for the card on the first page of the report, for Sales Amount for Apples. The query for the card on the second page is not cached – only the queries for the visuals on the page that the report opens on are cached.

It’s also interesting to note that the ApplicationContext column in the trace has a value for the DatasetId but not ReportId or VisualId for these queries. This makes sense because these queries are not executed as a result of a report rendering.

If you then edit the report so it opens on the second page (which shows Sales Amount for Grapes) and not the first page, then refresh the semantic model again, the Profiler trace shows queries for both cards on both pages. In fact, even if you delete the first page from the report and refresh, both queries are still run:

The query for Sales Amount for Grapes is run because it’s now the query on the page of the report that opens first. The query for Sales Amount for Apples is still run because, once a query has been added to the list of queries to be cached, it will stay there until you either delete the semantic model or you turn the Query Caching setting on the semantic model off and on again.

It’s not just the visuals on the opening page of a report that are cached. If you pin any visual, from any page of a report, to a dashboard and then view that dashboard, the query for that visual will be cached. Also, if you create a personal bookmark and make that your default view when you open a report, then the queries for that personal bookmark will also be cached. Finally, if an end user opens a report, changes a filter or slicer on the opening page of a report, then closes the report and reopens it, the report will show the filter or slicer selection that was made when the report was last closed (this behaviour, known as “persistent filters”, is explained here). Queries for each user’s persistent filters on the opening page of a report will also be cached.

The behaviour above applies to all reports connected to a semantic model.

As a result, if your end users like creating dashboards or personal bookmarks, or if you have a lot of filters or slicers on the opening page of your report, or if you have lots of reports connected to a single semantic model, you can end up with a lot of queries being generated after the semantic model has refreshed when query caching is turned on. The good news for Premium capacity admins is that the queries run to populate the cache are classed as background operations and not interactive operations (unlike the queries generated when a report renders, which are classed as interactive operations). The CU cost of background operations is smoothed over 24 hours which means that you won’t get big spikes in usage on your capacity when a refresh takes place and the cache is populated, although if a lot of expensive queries are run this is definitely something you still need to monitor. Here’s a screenshot from the Capacity Metrics App showing DAX queries appearing as background operations:

In conclusion, query caching is a very powerful feature. Turning it on can make a big difference to the performance of your reports but you do need to understand how caching works to get the full benefit of it. You also need to make sure you watch out for it when monitoring a Premium capacity to make sure that it doesn’t increase your background utilisation too much.

Reading Parquet Metadata In Power Query In Power BI

There’s a new M function in Power Query in Power BI that allows you to read the metadata from a Parquet file: Parquet.Metadata. It’s not documented yet and it’s currently marked as “intended for internal use only” but I’ve been told I can blog about it. Here’s an example of how to use it:

let
  Source = Parquet.Metadata(File.Contents("C:\myfile.snappy.parquet"))
in
  Source

…and here’s an example of the output:

This query shows how to expand the record returned by this function into a table:

let
m = Parquet.Metadata(File.Contents("C:\myfile.snappy.parquet")),
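// Build a record that maps each leaf column in the schema to its logical (or converted) type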
schema = List.Accumulate(Table.ToRecords(m[Schema]), [], (x, y) => if y[NumChildren] = null then Record.AddField(x, y[Name], y[LogicalType] ?? y[ConvertedType]) else x),
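// Flatten the row group and column chunk metadata into a single table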
expanded1 = Table.ExpandTableColumn(m[RowGroups], "Columns", {"MetaData"}),
renamed1 = Table.RenameColumns(expanded1, {{"Ordinal", "RowGroup"}, {"TotalCompressedSize", "RowGroupCompressedSize"}, {"TotalByteSize", "RowGroupSize"}}),
expanded2 = Table.ExpandRecordColumn(renamed1, "MetaData", {"Type", "Encodings", "PathInSchema", "Codec", "NumValues", "TotalUncompressedSize", "TotalCompressedSize", "KeyValueMetadata", "DataPageOffset", "IndexPageOffset", "DictionaryPageOffset", "Statistics", "EncodingStats"}),
renamed2 = Table.RenameColumns(expanded2, {{"Type", "PhysicalType"}}),
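// Add friendlier columns: full column name, cardinality, null count, approximate dictionary size and logical type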
added1 = Table.AddColumn(renamed2, "Column", each Text.Combine([PathInSchema])),
added2 = Table.AddColumn(added1, "Cardinality", each [Statistics][DistinctCount]),
added3 = Table.AddColumn(added2, "NullCount", each [Statistics][NullCount]),
added4 = Table.AddColumn(added3, "DictionarySize", each [DataPageOffset] - [DictionaryPageOffset]),
added5 = Table.AddColumn(added4, "LogicalType", each Record.FieldOrDefault(schema, [Column], null)),
selected = Table.SelectColumns(added5, {"RowGroup", "Column", "Codec", "NumValues", "Cardinality", "NullCount", "TotalCompressedSize", "TotalUncompressedSize", "DictionarySize", "PhysicalType", "LogicalType"})
in
selected

As you can see this gives you all kinds of useful information about a Parquet file such as the schema, the compression type used, column cardinality and so on.
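
If all you want is the schema, you don’t need all of the code above; a minimal query like this, which just returns the Schema field of the record that Parquet.Metadata returns, should be enough:

let
  m = Parquet.Metadata(File.Contents("C:\myfile.snappy.parquet")),
  SchemaOnly = m[Schema]
in
  SchemaOnly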

[Thanks to Curt Hagenlocher for the tip-off and the query above]