Find overlapping time periods using #PowerQuery – Part 3 (Fast & Furious)

In the last post here, we walked through the steps to detect overlapping time periods and calculate their durations. We used Cartesian Product to match all the pairs of tasks of the same resource, and even used the M function Table.Buffer to accelerate the load of the table.

In this article, we will learn how to improve the performance of the Cartesian Product even further. Special thanks to Imke Feldmann for inspiring me to write Part 3 (Check out her amazing work here).

The Challenge – Can you be Fast & Furious

Your input data is a very large table with tasks, resources and their start and end dates (including time). As the head of PMO you are responsible for tracking the progress of complex mission-critical projects in your organization. Your job is to find the over-allocated resources that are assigned to multiple tasks with overlapping time periods (as demonstrated in the figure below).

Find overlapping time periods using Power Query
Can you find the overlapping time periods and the over-allocated resources?

In Part 2 of the challenge, the sample data consisted of 5 tasks. As a result, it was relatively easy to apply the Cartesian Product to match a total of 25 combinations (5 x 5 tasks). The query ran very quickly. But what would be the implications of running the Cartesian product on a larger dataset?

The following diagrams illustrates the Cartesian product running on a simple dataset of 4 tasks and 2 resources. Applying the Cartesian product on the entire table will lead to the processing of 16 combinations.

But what is the point in matching between tasks of different resources Wouldn’t it make more sense to group the tasks by resource and then detect the overlapping periods for each resource? Can we run the Cartesian product on each separate nested table? Let’s see how many combinations we have when we do it on our example of 4 tasks and 2 resources.

When we first group the resources and then apply the Cartesian product on the nested tables as illustrated in the diagram below, we get 8 combinations. We moved from 16 to 8 combinations and can potentially reduce the refresh time by up to 50% in this example.

Now, let’s move to a bigger dataset. Let’s say we have 2500 tasks and 250 resources. What would be the point in applying Cartesian Product on all the combinations of 2500 x 2500 rows as we did in Part 2. Running through 6250000 combinations is going to take a lot of time.

Let’s see how we may improve the performance of the Cartesian product in our 2500 tasks and 250 resources example. By grouping the tasks by resource, we will have 250 sub tables. If each resource has an average of 10 tasks, we will run the Cartesian Product on 10 x 10 rows per resource. So, while the performance of the original solution would lead to a scan of 6250000 rows, the new solution will lead to a scan of 250 x 10 x 10 = 25000 rows. The new query may be up to 250 times faster!

Note: We didn’t consider the complexity of the grouping. We can safely assume that it is more efficient than the Cartesian product. If you are familiar with computation theory, the Cartesian product requires O(n^2) steps to solve while grouping can take O(n) steps using a dictionary, so it will not impact the performance.

Before we explain how to group the tasks by resources and run the Cartesian product per resource, let’s prepare our data for the challenge.

Data Preparations

To make our problem easy to share, let’s start with the original tasks table from Part 1 and duplicate it by 500 to reach 2500 fictitious tasks.

TaskResourceStartEnd
Task 1Resource 17/2/2019 9:00:00 AM7/2/2019 5:00:00 PM
Task 2Resource 27/2/2019 11:00:00 AM7/2/2019 9:00:00 PM
Task 3Resource 27/2/2019 2:00:00 PM7/2/2019 8:00:00 PM
Task 4Resource 37/2/2019 4:00:00 PM7/2/2019 11:00:00 PM
Task 5Resource 37/2/2019 1:00:00 PM7/2/2019 7:00:00 PM

If we will just duplicate the table, we will end up with multiple tasks and resources. The following logic will generate 2500 tasks with ~250 resources.

Start with a blank query, copy this code to the Advanced Editor and name the query Tasks.

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WCkotzi8tSk5VMFTSUTLXN9I3MjC0VLC0MjAAIgVHX2RRU6hogK9SrA6SViNkRYaGWPVaEqPXCKEISdQCh1ZjZEUmWLXCHYNXryFWveZIWmMB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [Resource = _t, Start = _t, End = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Resource", type text}, {"Start", type datetime}, {"End", type datetime}})
in
    #"Changed Type"

To duplicate the table to reach our target of 2500 tasks we can duplicate our 5 original tasks 500 times using the Table.Repeat function. To do it, click the fx button next to the formula bar and enter the following formula:

= Table.Repeat(#"Changed Type", 500)

In Add Column tab, select Index Column, From 1. Rename the column Task. While you are still in Add Column tab, select Custom Column. The Custom Column dialog box will open. Enter New Resource in New column name box.

Copy the following formula in the Custom column formula box:

if [Task] <= 5 then [Resource] else "New Resource " & Text.From(Number.IntegerDivide([Task],10))

The code above, will generate unique identifiers for the resources for every 10 tasks after the original 5 rows and generate resource names.

Now, let’s convert the New Resource column to Text, remove the old Resource column and change the name of New Resource column to Resource. Finally, you can order the table columns, but that is not required.

Here is the complete code (in case you want to skip the steps above):

let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("i45WCkotzi8tSk5VMFTSUTLXN9I3MjC0VLC0MjAAIgVHX2RRU6hogK9SrA6SViNkRYaGWPVaEqPXCKEISdQCh1ZjZEUmWLXCHYNXryFWveZIWmMB", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type text) meta [Serialized.Text = true]) in type table [Resource = _t, Start = _t, End = _t]),
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Resource", type text}, {"Start", type datetime}, {"End", type datetime}}),
    Custom1 = Table.Repeat(#"Changed Type", 500),
    #"Added Index" = Table.AddIndexColumn(Custom1, "Task", 1, 1),
    #"Added Conditional Column" = Table.AddColumn(#"Added Index", "New Resource", each if [Task] <= 5 then [Resource] else "New Resource " & Text.From(Number.IntegerDivide([Task],10))),
    #"Changed Type1" = Table.TransformColumnTypes(#"Added Conditional Column",{{"Task", type text}}),
    #"Removed Columns" = Table.RemoveColumns(#"Changed Type1",{"Resource"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"New Resource", "Resource"}}),
    #"Reordered Columns" = Table.ReorderColumns(#"Renamed Columns",{"Task", "Resource", "Start", "End"})
in
    #"Reordered Columns"

Following the instructions above we now have 2500 tasks divided to 254 resources with an average of 10 tasks per resource. The first 5 tasks are the original one we had. Then, we have 4 tasks for New Resource 0, and then a repetition of 10 tasks per new resource name until the 2500th task.

If you follow Part2 on the 2500 tasks, you will see that it can take ~12 seconds to detect and calculate the Overlapping Periods. You can download the slow performing report here.

After you experienced the slow refresh, you can assess the performance penalty when you double the number tasks (from 2500 to 5000). Open the report file, select Edit Queries to open Power Query Editor. Select the Tasks query and in Applied Steps, select Custom1. In the formula bar, change the second argument of Table.Repeat to 1000 to duplicate the 5 tasks into 5000 tasks.

When I doubled the number of tasks from 2500 to 5000, the import took 46 seconds. Running the ratio between the second and first imports (46 / 12 = 3.8) shows that as we increase the size by 2, we almost quadruple the refresh time.

I ran another test on 10000 tasks and was able to almost quadruple the refresh time again (refresh time increased by ~3.9 times)

This small test demonstrates that the performance of the Cartesian product acts as a square function on the number of rows. It is obvious that we would want to improve its performance, as the data increases. Learning how to calculate overlapping time periods on large number of tasks in an efficient way is what we will learn next.

The Solution

The heart of the solution is to generate a new custom function that applies to Cartesian product on a table, and then invoke it per resource and its associated tasks.

The Nested Cartesian Product

Create a blank query and open the Advanced Editor. Copy and paste the M code below and name the query FnMatchTasks

(resourceGroup as table) as table =>
let
    BufferedSource = Table.Buffer(resourceGroup),
    #"Added Custom" = Table.AddColumn(resourceGroup, "Source2", each BufferedSource),
    #"Expanded Source2" = Table.ExpandTableColumn(#"Added Custom", "Source2", {"Task", "Resource", "Start", "End"}, {"Source2.Task", "Source2.Resource", "Source2.Start", "Source2.End"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Source2", each ([Task] <> [Source2.Task]))
in
    #"Filtered Rows"

Note: Here I could have shown you step by step how to generate this function and explain it, but if you followed Part 2 of this series, you can see that this code resembles the manual steps we followed to perform the Cartesian product on the entire Tasks table. The only difference is that we converted it into a custom function using the first line.

Let’s quickly review the code: The function receives a table as an input (in our case it will be a subset of the Tasks table). It holds the tasks in memory using Table.Buffer and adds a new column with the input table in each cell. Then it expands the new column (which leads to the square expansion in the number of rows in the table). Finally, the rows that have the same task value in both sides of the Cartesian product are filtered out.

It’s time to group the tasks by resource. In Queries pane, right click on the Tasks query and select Reference in the shortcut menu. Name the new query Overlapping Tasks.

Select the Resource column and in Transform tab, select Group By. In the Group By dialog box, ensure that Resource is selected in the Group by drop-down menu. Enter Resource Group in the New column name. text box and select All Rows in the Operation drop-down menu. Then click OK.

In Add Column tab, select Invoke Custom Function.

In Invoke Custom Function, enter Overlapping Tasks by Resource. Select FnMatchTasks in the Function query menu. Then, select Resource Group in the last drop-down menu and click OK.

Remove the Resource Group column and click the expand column control in the header of Overlapping Tasks by Resource column. In the drop-down pane uncheck Resource. Check Use original column name as prefix and click OK.

Click the filter control in the Tasks column header and select Remove Empty in the filter pane. This action will filter out the tasks of resources that had only one task in the original Tasks table. From here you are ready to proceed with the calculation of overlapping duration that was described in Part 2. Let’s repeat it here.

Calculating Overlapping Duration

To set the stage for the overlap calculation we will create two custom functions MinDate and MaxDate that find the minimal and maximal dates of two given Date/Time values.

In Home tab, select New Source drop-down menu, and select Blank Query.

Name the new query as MinDate. and copy and paste this formula into the Formula Bar:

= (date1, date2)=>
    if date1 < date2 then date1 else date2

Now go back to Home tab, New Source and select Blank Query. Name the new query MaxDate. Copy and paste this formula into the Formula Bar:

(date1, date2)=>
    if date1 >= date2 then date1 else date2

In Queries pane, select the Overlapping Tasks query and in Add Column tab select Custom Column. In the Custom Column dialog box follow these steps:

  1. Enter Overlap Duration in the New column name box.
  2. Enter the following formula below and click OK.

Here is code you can copy and paste into the Custom column formula box:

Duration.TotalHours(
        MinDate([End], [Source2.End]) -
        MaxDate([Start], [Source2.Start])
)

This code uses Duration.TotalHours to return the total hours as a decimal number of the overlapping duration. The main calculation in the formula is done by subtracting the minimal end date with the maximal start date of the two tasks, as shown here:

MinDate([End], [Source2.End]) -
MaxDate([Start], [Source2.Start])

Subtracting two Date/Time values in M returns a value whose type is duration. In our case, if the duration is positive, we have an overlap, but if the duration is negative, we don’t. You can now see that all the rows have a positive duration value because in our sample data, all our tasks that has the same resource are overlapping.

Select the filter control of Overlapping Duration column and in the filter pane select Number Filters, then select Great Than…

In the Filter Rows dialog box, enter 0 as shown in the screenshot below and click OK.

You can now create two new columns for the overlapping period start and end dates. This can be done in multiple ways. Since we already have the MinDate and MaxDate functions, let’s do it using Invoke Custom Function.

In Add Column tab, select Invoke Custom Function. In the Invoke Custom Function dialog box, enter Overlap Start as New column name. Next, in the Function query drop-down menu select MaxDate.

Make sure that under date1 drop-down menu you choose Column Name. Select Start in the drop-down menu which is next to date1.

Under date2 drop-down menu select Column Name. Select Source2.Start in the drop-down menu which is next to date2 and click OK.

In Add Column tab, select Invoke Custom Function. In the Invoke Custom Function dialog box, enter Overlap End as New column name. Next, in the Function query drop-down menu select MinDate.

Make sure that under date1 drop-down menu you choose Column Name. Select End in the drop-down menu which is next to date1.

Under date2 drop-down menu select Column Name. Select Source2.End in the drop-down menu which is next to date2 and click OK.

We are almost done. We can now remove the unnecessary columns and keep only Task, Overlapping Duration, Overlap Start and Overlap End. and changed the types of Overlapping Duration, Overlap Start and Overlap End to the corresponding types: Decimal Number, Date/Time and Date/Time.

Now, it is time to test our fast & furious query. You can download the Power BI report here and try it out. It took me ~6 seconds to load it!

In conclusion, running Cartesian product on large datasets is not efficient. Grouping the data into relevant subsets is the way to go. You should note that the optimization that was described here can be applied in many other scenarios in which you would need to apply a Cartesian product – especially when you do it on two different data sources. In our case we applied the Cartesian product on same table to find all combinations, but you can find many other implications (Check out Chapter 11 of my book, or Imke Feldmann’s and Chris Webb’s posts here and here).

Hope you enjoyed this article. Feel free to share with us in the comments below how you are going to use it or already used it.

Leave a Reply