From cumulative numbers to increments

In this tutorial, we will learn how to convert cumulative numbers to incremental values using Power Query. In one of our earlier COVID-19 tutorials here, we learned how to convert cumulative values to incremental ones using DAX. We worked with Johns Hopkins University’s COVID-19 dataset that includes only the cumulative confirmed cases and applied the time intelligence logic in DAX.

You can read this tutorial as a standalone piece to learn how to transform cumulative amounts into incremental ones, by jumping to the Power Query solution section below, or read it as part of the COVID-19 tutorials here.

Our solution at the time included the following DAX measures to calculate total confirmed cases and daily confirmed cases:

Total Confirmed = 
VAR sum_confirmed =
SUMX (
    VALUES( 'Covid-19 Cases'[Country/Region] ),
    SUMX (
        VALUES ('Covid-19 Cases'[Province/State] ),
        CALCULATE ( MAX ( 'Covid-19 Cases'[Confirmed] ) )
    )
)
RETURN
IF (sum_confirmed = 0, BLANK(), sum_confirmed )
Daily New Cases = 
VAR current_day = [Total Confirmed]
VAR prev_day = 
CALCULATE (
    [Total Confirmed],
    DATEADD ( 'Calendar'[Date], -1, DAY )
)
RETURN
IF ( ISBLANK ( prev_day ), BLANK(), current_day - prev_day )

As you may see, the performance of Total Confirmed is not optimal. It uses two nested SUMX functions to find the maximal amount of confirmed cases by iterating over two levels of geo-location hierarchy. At the time, this measure helped us to handle data quality challenges and allowed us to apply the same measure in both cards and line charts no matter if you filter by dates or not.

To improve the performance of our report, let’s first optimize our Total Confirmed measure, and then we will learn how to improve the performance even more, by solving the incremental values using Power Query.

To motivate you to keep reading, here is a screenshot that will show you the performance improvement, with the faster performance achieved by storing the incremental amounts (Daily Confirmed Cases) as a column using Power Query.

The DAX Optimization

To start the tutorial you can download this pbix and open it using Power BI Desktop’s latest version. After you refresh the report, right-click on COVID-19 Cases in the Fields pane, and select New measure.

Enter the following measure and press Enter.

Simple SUM Confirmed Cases = SUM ( 'Covid-19 Cases'[Confirmed] )

This measure has a simple SUM on top of the Confirmed cumulative COVID-19 cases. Let’s assume for example, that we don’t care that in Day #2 the cumulative number is lower than in Day #1. The simple version of confirmed cases will be more efficient in performance and we can use it on visuals that are filtered by dates. The main value of the new measure is that we can now use it in our Daily New Cases measure, instead of using the slower Total Confirmed.

So, let’s create a new measure, Daily New Cases Faster, that will invoke our new Simple SUM Confirmed Cases instead of Total Confirmed. Here is the new measure:

Daily New Cases Faster = 
VAR current_day = [Simple SUM Confirmed Cases]
VAR prev_day = 
CALCULATE (
    [Simple SUM Confirmed Cases],
    DATEADD ( 'Calendar'[Date], -1, DAY )
)
RETURN
IF ( ISBLANK ( prev_day ), BLANK(), current_day - prev_day )

Running Performance analyzer on two visuals side by side and comparing Daily New Cases vs. Daily New Cases Faster revealed a performance improvement from 223 ms to 195 ms. The new measure is calculated 42% faster!

The Power Query Solution

Now that we are motivated to further optimize our report, would it be better if we would have the daily new cases presented as a column in our data? Can we manipulate the data in Power Query to extract the increments? In this part of the tutorial, you will learn how.

Before we start, I would like to share an important disclaimer. When you work with very large datasets that consist of multiple time-series sets, the technique below will be slow to refresh and may reach timeouts in the Power BI service. With large datasets, it is recommended to ensure that the incremental values can be stored on the external data source. This tutorial assumes that you cannot make changes to the data source and that the dataset is not too large. In the next post, I will publish an improved version of this technique that refreshes faster and can better handle larger datasets.

Our main challenge is that our data consists of different subsets of time series with cumulative numbers. The confirmed cases are accumulated by Country and Province/State. We would need to extract the daily confirmed cases in Power Query for each combination of Country and Province/State separately.

In this section, we will solve the problem for a specific country and later generalize it to all the different geo-locations. Open the Power BI report, and select Transform data. In Power Query Editor, select COVID-19 Cases query on the Queries pane. Select the Transform tab and then select Group By.

In the Group By dialog box, select Advanced option. Then, select Province/State in the first drop-down menu. Select Country/Region in the next drop-down menu. Enter Group in New column name. sSelect All Rows under Operation, and click OK.

You can see that we now have Table objects under the new Group column. Right click on the white space on one of the cells and select Add as New Query in the shortcut menu.

Rename the new query as Increments. Now we will start working on the selected geo-location. We have a time series that consists of the Date and Confirmed columns. Let’s sort it by the date in ascending order. We do it because we cannot rely on Power Query after the grouping that we have the correct order.

Click on the Filter control in the Date column header and select Sort Ascending.

Now that our time series is sorted, we will create a new column to store the confirmed cases of the previous day. This will allow us to subtract in a third column the previous number from the current cumulative number and get the increment. To help us, we will use an index column. In Add Column tab, click Index column and select From 0.

You can see that we have a new Index column now with numbers starting from zero. Now, let’s create a new column for the previous amount. In Add Column tab, select Custom Column.

In Custom Column dialog box, enter Previous as New column name, and copy and paste the following M expression into the Custom column formula and click OK:

if [Index] > 0 then #"Added Index"[Confirmed]{[Index] - 1} else 0

This is not the final formula. We will modify it for performance reasons, but first, let’s explain what we have so far. For the first row, we don’t know how many confirmed cases were reported in the day before, so we can set Previous to zero. You can see it applied in the Else section of the formula. Otherwise, when Index is greater than zero, we will return the previous value in the Confirmed column. #”Added Index” is the table from the previous query step. To accessing the Confirmed column as a list, we use the brackets and refer to the column name: #”Added Index”[Confirmed]. Now, to access the previous item in the list we use the index of the previous row with curly brackets. [Index] returns the current index, so [Index] -1 is the index of the previous row.

Now, using curly brackets, we can access the [Index] – 1 item in the list: #”Added Index”[Confirmed]{[Index] – 1}

You can see now that the Previous column includes the Confirmed cases of the previous row as highlighted in this screenshot:

Let’s subtract between Confirmed and Previous. Select the two columns (using the CTRL key) and then in Add Column tab, select Standard and then Subtract.

Next, let’s rename the new column as Incremental Confirmed and change its type to Whole Number. Congrats! By now you were able to extract the daily confirmed cases from the cumulative Confirmed column.

The Custom Function & List.Buffer

Our solution so far handles only a single geo location. To generalize it, it’s time to convert the Increments query into a function. In Home tab, select Advanced Editor.

We will start by ignoring all the steps before #”Sorted Rows”. These steps were created when we selected Add as New Query and chose one of the geo-locations. Next, let’s add the first line

(Group) =>

This line is our function declaration. It receives a table called Group and will run the steps inside the let statement to extract the daily new cases. Take a look at the #”Sorted Rows” line. Replace the first argument of Table.Sort, with Group as shown here:

    #"Sorted Rows" = Table.Sort(Group,{{"Date", Order.Ascending}}),
   

When you are done, you will have this M expression:

(Group) =>
let
    #"Sorted Rows" = Table.Sort(Group,{{"Date", Order.Ascending}}),
    #"Added Index" = Table.AddIndexColumn(#"Sorted Rows", "Index", 0, 1, Int64.Type),
    #"Added Custom" = Table.AddColumn(#"Added Index", "Previous", each if [Index] > 0 then #"Added Index"[Confirmed]{[Index] - 1} else 0),
    #"Inserted Subtraction" = Table.AddColumn(#"Added Custom", "Incremental Confirmed", each [Confirmed] - [Previous], type number),
    #"Removed Columns" = Table.RemoveColumns(#"Inserted Subtraction",{"Index", "Previous"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Removed Columns",{{"Incremental Confirmed", Int64.Type}})
in
    #"Changed Type1"

I mentioned earlier that we will also pause and focus on performance optimization. If you feel that this is too advanced for you, simply copy paste the next M expression and proceed.

When you work with custom columns and refer to an entire table as we did earlier (we referred to #”Added Index” in the code), Power Query is not optimized to refresh small-to-medium datasets. I explained it in more detail in Chapter 11 of my book. To solve this issue, you can use Table.Buffer or List.Buffer and refer to the new object instead of the original table or list. This will enforce Power Query to load the entire data to memory and access it faster for the values in the new column. The following M expression includes a list (ListBuffer) that stores in memory the entire Confirmed column. We access the members in the list by their index to retrieve the previous confirmed cases. Copy the new M expression and paste it in the Advanced Editor. It will run much faster in this tutorial.

(Group) =>
let
    #"Sorted Rows" = Table.Sort(Group,{{"Date", Order.Ascending}}),
    #"Added Index" = Table.AddIndexColumn(#"Sorted Rows", "Index", 0, 1, Int64.Type),

    ListBuffer = List.Buffer(#"Added Index"[Confirmed]),

    #"Added Custom" = Table.AddColumn(#"Added Index", "Previous", each if [Index] > 0 then ListBuffer{[Index] - 1} else 0),
    #"Inserted Subtraction" = Table.AddColumn(#"Added Custom", "Incremental Confirmed", each [Confirmed] - [Previous], type number),
    #"Removed Columns" = Table.RemoveColumns(#"Inserted Subtraction",{"Index", "Previous"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Removed Columns",{{"Incremental Confirmed", Int64.Type}})
in
    #"Changed Type1"

Notice the Increments has now a new fx icon in the Queries pane. Rename it to FnIncrements. The Fn prefix is optional. It helps me remember that I have a custom function in my Queries. Go back to Covid-19 Case query. In Add Column tab, select Invoke Custom Function.

In Function query drop-down menu, select FnIncrements. Under Groups drop-down, select Group as the column and click OK.

We now have a new column, FnIncrements, with the daily confirmed cases for each geo-location group. We can remove the Group column by right-clicking the Group column and selecting Remove in the shortcut menu.

Let’s extract the data from our FnIncrements tables. Click on the expand control in FnIncrements column header and uncheck Province/State and Country/Region. Then uncheck Use original column name as prefix and click OK.

We now have the daily new cases in the Incremental Confirmed column. You would need to change the column types of Date to Date, Confirmed to Whole Number and Incremental Confirmed to Whole Number by clicking the highlighted icons and selecting the relevant types n the drop-down menu.

That’s it. Select Close and Apply to load the data into the report and save the report. You can now create this new measure:

Daily New Cases Fastest = SUM ( 'Covid-19 Cases'[Incremental Confirmed] )

Below you can see the comparison of our three Daily New Cases being calculated in a line chart with Date in the X axis. On the left is the original Daily New Cases measure which was evaluated in 338 milliseconds. Next, is the optimized measure from the first section with 195 milliseconds, and finally our fastest measure above with 120 milliseconds – 64% faster than the original measure.

If you subscribed to DataChant, you can find the solution file in the subscribers folder (Link to the folder is provided in DataChant’s update email).

Bonus: Expand FnIncrements to support different date and amount columns

In case you would like to apply FnIncrements on your own data, you can see that the function we built relies on having “Date” and “Confirmed” as the column names. The function below enhances our logic by accepting two more arguments the date column name as DateColumn and the amount column name as AmountColumn.

(Group, DateColumn, AmountColumn) =>
let
    #"Sorted Rows" = Table.Sort(Group,{{DateColumn, Order.Ascending}}),
    #"Added Index" = Table.AddIndexColumn(#"Sorted Rows", "Index", 0, 1, Int64.Type),

    ListBuffer = List.Buffer(Table.Column(#"Added Index", AmountColumn)),

    #"Added Custom" = Table.AddColumn(#"Added Index", "Previous", each if [Index] > 0 then ListBuffer{[Index] - 1} else 0),
    #"Inserted Subtraction" = Table.AddColumn(#"Added Custom", "Incremental " & AmountColumn, each Record.Field(_, AmountColumn) - [Previous], type number),
    #"Removed Columns" = Table.RemoveColumns(#"Inserted Subtraction",{"Index", "Previous"}),
    #"Changed Type1" = Table.TransformColumnTypes(#"Removed Columns",{{"Incremental " & AmountColumn, Int64.Type}})
in
    #"Changed Type1"

The tricky part in the code above is in the Table.AddColumn step. We use Record.Field(_, AmountColumn) to return the amount. The underscore is used to refer to the record which represents the current row. I know this is a bit advanced, but you can learn more about it in Chapter 9 of my book.

To can invoke the new function, using Add Column > Invoke Custom Function. In the screenshot below, you can see how I set the parameters for the table we used in the tutorial.

That’s it for today. I hope you found this tutorial useful.

1 Comment

Leave a Reply