In our last episode of the Pitfalls series here, we were introduced to Pitfall #2 – The automatic detection of column types, which can trigger common refresh errors. In today’s post, we will learn about the third pitfall, which is probably more dangerous than all the pitfalls combined!
Pitfall #3 is so dangerous. It is almost invisible. You will easily miss it when you create the report, or when you refresh it. Pitfall #3 will create a bias in your reports. But when you finally detect the wrong results, it may be too late. Pitfall #3 is fatal. It exposes you to biased reports, wrong insights, and potentially disastrous business decisions.
And it all starts with a filter in Power BI Query Editor. Such a trivial and common UI control, right? Almost all the queries that are out there include filter steps.
Let’s build the suspense, and reveal the third pitfall in a memorable way – through a short story. A fiction story about a Power BI / Excel analyst, called Bob.
It was a stormy morning at FakeCompany Inc. Bob, was working on his cool Power BI report, when his boss stormed into his cubicle.
“Look at this report”, said the boss, and handed a sheet of paper to Bob. With trembling hands, Bob unfolded the paper, and read this spreadsheet:
“Oh, I know this report, from last year! Every year all the employees receive a rank from 1 to 5”, said Bob, “and those who gets a rank of 1 are being laid off”.
“Yes, but this time, it’s different”, said the boss, with a sinister look. “There is a bigger layoff this time. The layoff will also include those with rank 2. Please get the list from HR, and send me a Power BI report of all the employees that stays with us”, demanded the boss. “Oh, and don’t worry, Bob. With your outstanding reporting talent, you got a rank of 5 this year, so get ready for your promotion as our new VP of Business Intelligence!”.
An adrenaline rush swept down on Bob. He opened Power BI Desktop and started to work.
Bob got the following Excel file from HR (Hey, this is an interactive story. You can download it from here as well).
He opened his latest version of Power BI Desktop, clicked the Get Data button and selected Excel.
After selecting the HR file (Hey, you can follow Bob’s steps. Haven’t you already download the HR file from here?), Bob selected the EmployeePerformance table in the Navigator window and clicked Edit.
Next, Bob selected the column Performance Rank, clicked on the filter control, and unchecked the ranks 1 and 2.
Finally, with a triumphant click on OK, he created a list of all the employees that will survive the layoffs, and emailed the Power BI Report to his boss.
“Dear boss, Attached is the Power BI report with the employees that will remain with the company this year. Looking forward to my VP promotion. Bob”.
Few hours later, Bob’s boss sent the file to Legal. Patrick, the Power BI specialist on the Legal team, applied Merge Queries, to combine Bob’s list, with the list of all the employees’ email addresses:
Using a Left Anti merge, Patrick created a list of all the employees that should receive a layoff letter.
A thunder stroke outside the narrow rusty windows in the open space of the BI department. But Bob didn’t flinch. He has been staring with horror at his Power BI report. On his second 30″ flickering monitor, a dramatic email has been staring back at Bob. A layoff letter.
Before Bob left the building, he sent his boss one last Power BI report:
Do you know what was done wrong in Bob’s report?
Power Query Editor allows you to filter the data by clicking on the highlighted control.
When you check or uncheck specific values in the filter pane, you should know that in different circumstances, your decision will be interpreted differently by the Query Editor.
Checking the formula bar, you’ll see that the following condition was auto-generated:
[Column1] <> 1 and [Column1] <> 2 and [Column1] <> 3
(Let’s call this condition type “the negative condition”, because it uses the not-equal <> sign).
But, an equivalent condition could be written as follows:
[Column1] = 4 or [Column1] = 5 or [Column1] = 6 or [Column1] = 7
(Let’s call this condition type “the positive condition”, because it uses the equal = sign).
Which formula is better? If your dataset can only have the values from 1 to 7 in Column1, both conditions are equal. But in many scenarios, you only see a preview of the data, so if you expect to find other values in the external data source, which are not seen in the preview, the two conditions will yield two different filtering results.
For example: If we have a new row with value 8 in Column1, the negative condition above will include that row, and the positive condition will filter it out.
Let’s see another example. This time Column1 contains only the values 1-6. On purpose I use a pair number of values.
Let’s filter Column1, and uncheck value 1.
The results make sense. We filtered out the rows with value 1, and the formula contains the negative condition:
[Column1] <> 1
Let’s repeat the test. This time, we will uncheck values 1 and 2.
Again, the results make sense. We have a negative condition that excludes values 1 and 2.
And now comes the surprise:
Let’s uncheck values 1, 2 and 3.
We expected to find a negative condition, right?
[Column1] <> 1 and [Column1] <> 2 and [Column1] <> 3
Instead, we got the positive condition:
[Column1] = 4 or [Column1] = 5 or [Column1] = 6
And this is where we can fall down the third pitfall. When we try to filter values in filter pane, and the number of unchecked values is equal to the number of checked values, Power Query will generate a positive condition. Even when we may need a negative condition.
Further tests that you can make with the filter pane, will reveal the following logic:
- When the unchecked and checked value are equal in number, Power Query will generate a positive condition.
- When the number of checked values is smaller than the number of unchecked values, Power Query will generate a positive condition.
- When the number of checked values is greater than the number of unchecked values, Power Query will generate a negative condition.
So, Bob unchecked the ranks 1 and 2, to exclude these ranks from his report. Unfortunately, the Query Editor loaded the first 1000 rows, excluding rank 5 from the preview. As a result, Power Query generated a positive condition, which filtered out all the employees with the rank of 5.
How to avoid the third pitfall
- Ensure that the Formula Bar is visible (Pitfall #1 here), and check the condition on the formula, after each filtering step.
- Make sure the auto-generated formula is correct.
- Use negative conditions, if you want to exclude specific values (For example: to exclude the years 2010 and 2011 in column Date use the negative condition: [Date] <> 2010 and [Date] <> 2011.
- Make sure you use positive conditions, if you want to include specific values (For example: to include the years 2012 and 2013 in column Date use the positive condition: [Date] = 2012 or [Date] = 2013.
- If the formula is incorrect, convert the formula from a positive condition to a negative one or vice versa.
- Pay attention to the warning “List may be incomplete”. If your data is relatively small, click Load more before you select the values for the filter.
- Consider using the Filters drop down menu (e.g. Number Filters), to gain more control on the filter condition, or when the values that you need in the filter, are not available in the preview.
Bob stepped into the big conference room in FakeCompany Inc. His former boss waved him to sit, as his CEO swooped in from a second door. “Bob, We are sorry for last year’s layoffs” said the CEO. “I am excited to tell you, that we will hire back all of our former employees.”, continued the CEO, “We applied your filtering fixes on all of our reporting solutions, and the results are staggering!!! Welcome back, Bob!”.
The story, all names, characters, and incidents portrayed in this article are fictitious. No identification with actual persons (living or deceased), places, buildings, and products is intended or should be inferred.
Hope you found this blog post useful, and beware of the third pitfall.
Stay tuned for Pitfall #4 – Hint: Column ordering can be a real chaos.