In our last episode of the Pitfalls series here, we were introduced to Pitfall #2 – The automatic detection of column types, which can trigger common refresh errors. In today’s post, we will learn about the third pitfall, which is probably more dangerous than all the pitfalls combined!
Pitfall #3 is so dangerous. It is almost invisible. You will easily miss it when you create the report, or when you refresh it. Pitfall #3 will create a bias in your reports. But when you finally detect the wrong results, it may be too late. Pitfall #3 is fatal. It exposes you to biased reports, wrong insights, and potentially disastrous business decisions.
And it all starts with a filter in Power BI Query Editor. Such a trivial and common UI control, right? Almost all the queries that are out there include filter steps.
Let’s build the suspense, and reveal the third pitfall in a memorable way – through a short story. A fiction story about a Power BI / Excel analyst, called Bob.
It was a stormy morning at FakeCompany Inc. Bob, was working on his cool Power BI report, when his boss stormed into his cubicle.
“Look at this report”, said the boss, and handed a sheet of paper to Bob. With trembling hands, Bob unfolded the paper, and read this spreadsheet:
“Oh, I know this report, from last year! Every year all the employees receive a rank from 1 to 5”, said Bob, “and those who gets a rank of 1 are being laid off”.
“Yes, but this time, it’s different”, said the boss, with a sinister look. “There is a bigger layoff this time. The layoff will also include those with rank 2. Please get the list from HR, and send me a Power BI report of all the employees that stays with us”, demanded the boss. “Oh, and don’t worry, Bob. With your outstanding reporting talent, you got a rank of 5 this year, so get ready for your promotion as our new VP of Business Intelligence!”.
An adrenaline rush swept down on Bob. He opened Power BI Desktop and started to work.
Bob got the following Excel file from HR (Hey, this is an interactive story. You can download it from here as well).
He opened his latest version of Power BI Desktop, clicked the Get Data button and selected Excel.
After selecting the HR file (Hey, you can follow Bob’s steps. Haven’t you already download the HR file from here?), Bob selected the EmployeePerformance table in the Navigator window and clicked Edit.
Next, Bob selected the column Performance Rank, clicked on the filter control, and unchecked the ranks 1 and 2.
Finally, with a triumphant click on OK, he created a list of all the employees that will survive the layoffs, and emailed the Power BI Report to his boss.
“Dear boss, Attached is the Power BI report with the employees that will remain with the company this year. Looking forward to my VP promotion. Bob”.
Few hours later, Bob’s boss sent the file to Legal. Patrick, the Power BI specialist on the Legal team, applied Merge Queries, to combine Bob’s list, with the list of all the employees’ email addresses:
Using a Left Anti merge, Patrick created a list of all the employees that should receive a layoff letter.
A thunder stroke outside the narrow rusty windows in the open space of the BI department. But Bob didn’t flinch. He has been staring with horror at his Power BI report. On his second 30″ flickering monitor, a dramatic email has been staring back at Bob. A layoff letter.
Before Bob left the building, he sent his boss one last Power BI report:
Do you know what was done wrong in Bob’s report?
Power Query Editor allows you to filter the data by clicking on the highlighted control.
When you check or uncheck specific values in the filter pane, you should know that in different circumstances, your decision will be interpreted differently by the Query Editor.
Let’s see what happens when we uncheck the values 1, 2 and 3 in the filter pane of the table above (with the values 1, 2, 3, 4, 5, 6, 7 in Column1).
Checking the formula bar, you’ll see that the following condition was auto-generated:
[Column1] <> 1 and [Column1] <> 2 and [Column1] <> 3
(Let’s call this condition type “the negative condition”, because it uses the not-equal <> sign).
But, an equivalent condition could be written as follows:
[Column1] = 4 or [Column1] = 5 or [Column1] = 6 or [Column1] = 7
(Let’s call this condition type “the positive condition”, because it uses the equal = sign).
Which formula is better? If your dataset can only have the values from 1 to 7 in Column1, both conditions are equal. But in many scenarios, you only see a preview of the data, so if you expect to find other values in the external data source, which are not seen in the preview, the two conditions will yield two different filtering results.
For example: If we have a new row with value 8 in Column1, the negative condition above will include that row, and the positive condition will filter it out.
Let’s see another example. This time Column1 contains only the values 1-6. On purpose I use a pair number of values.
Let’s filter Column1, and uncheck value 1.
The results make sense. We filtered out the rows with value 1, and the formula contains the negative condition:
[Column1] <> 1
Let’s repeat the test. This time, we will uncheck values 1 and 2.
Again, the results make sense. We have a negative condition that excludes values 1 and 2.
And now comes the surprise:
Let’s uncheck values 1, 2 and 3.
We expected to find a negative condition, right?
[Column1] <> 1 and [Column1] <> 2 and [Column1] <> 3
Instead, we got the positive condition:
[Column1] = 4 or [Column1] = 5 or [Column1] = 6
And this is where we can fall down the third pitfall. When we try to filter values in filter pane, and the number of unchecked values is equal to the number of checked values, Power Query will generate a positive condition. Even when we may need a negative condition.
Further tests that you can make with the filter pane, will reveal the following logic:
- When the unchecked and checked value are equal in number, Power Query will generate a positive condition.
- When you used the search box in the filter pane, and then checked or unchecked certain values, Power Query will always generate a positive condition.
- When the number of checked values is smaller than the number of unchecked values, Power Query will generate a positive condition.
- When the number of checked values is greater than the number of unchecked values, Power Query will generate a negative condition.
So, Bob unchecked the ranks 1 and 2, to exclude these ranks from his report. Unfortunately, the Query Editor loaded the first 1000 rows, excluding rank 5 from the preview. As a result, Power Query generated a positive condition, which filtered out all the employees with the rank of 5.
How to avoid the third pitfall
- Ensure that the Formula Bar is visible (Pitfall #1 here), and check the condition on the formula, after each filtering step.
- Make sure the auto-generated formula is correct.
- Use negative conditions, if you want to exclude specific values (For example: to exclude the years 2010 and 2011 in column Date use the negative condition: [Date] <> 2010 and [Date] <> 2011.
- Make sure you use positive conditions, if you want to include specific values (For example: to include the years 2012 and 2013 in column Date use the positive condition: [Date] = 2012 or [Date] = 2013.
- If the formula is incorrect, convert the formula from a positive condition to a negative one or vice versa.
- Pay attention to the warning “List may be incomplete”. If your data is relatively small, click Load more before you select the values for the filter.
- Consider using the Filters drop down menu (e.g. Number Filters), to gain more control on the filter condition, or when the values that you need in the filter, are not available in the preview.
Bob stepped into the big conference room in FakeCompany Inc. His former boss waved him to sit, as his CEO swooped in from a second door. “Bob, We are sorry for last year’s layoffs” said the CEO. “I am excited to tell you, that we will hire back all of our former employees.”, continued the CEO, “We applied your filtering fixes on all of our reporting solutions, and the results are staggering!!! Welcome back, Bob!”.
The story, all names, characters, and incidents portrayed in this article are fictitious. No identification with actual persons (living or deceased), places, buildings, and products is intended or should be inferred.
Hope you found this blog post useful, and beware of the third pitfall.
Keep reading Pitfall #4 – Hint: Column ordering can be a real chaos.
Very clear thanks 🙂
Look forward to reading the other articles of your serie !!
Thank you Tristan. Get ready to learn how to rename columns in a robust way. That is my next part.
My first attempt to combine fiction with tech writing. Thank you for the feedback.
To give your thrilling story an extra dimension you should have named your main character Bob De Morgan. Bob for short. Thank you for the insight and for sharing it with the community.
Thank you Oxenskiold,
I thought of Bob Sponge. Who is Bob De Morgan? Sorry if it’s a dummy question.
Augustus De Morgan came up with this wonderful law or theorem in propositional logic: NOT(A and B) NOT(A) or NOT(B)
This is the very law that the Power Query team is using in arriving at what you’re are calling the ‘the positive condition’. The team wants to apply all those ranks that are NOT different from the ranks 3,4 and 5.
That wish in propositional logic would be:
NOT(rank 3 and rank 4 and rank 5)
By applying De Morgan’s law this would be equal to:
NOT(rank3) or NOT(rank4) or NOT(rank5)
… and taking it one step further:
(rank=3 or rank=4 or rank=5)
Thus arriving at ‘the positive condition’.
Now imagine if Bob really was the grand-grand-son of Augustus De Morgan!
Sorry it seems that all inequality and implicate operators are stripped when I post the reply. That makes it a little difficult to see what I explain. Well anyway it was a good story 🙂
Oh, this De Morgan 🙂 Now I get it. The one behind the famous De Morgan Theorem https://en.wikipedia.org/wiki/De_Morgan's_laws
And I was googling for Bob De Morgan…
Augustus was turning in his grave, when he had heard about his grand-grand son mistake.
Pingback: Split & Merge Columns - Pitfalls #7 and #8 - DataChant
Interesting, and very well explained 🙂
Thank you Mahima
Pingback: The Ten Pitfalls of the Data Wrangler (aka the Wielder of M/Power Query) - PowerPivotPro
Great posts Gil! thanks so much for sharing 🙂
These are great! Nice job! Where can I find pitfalls 4-10?
Thank you. Here you go: https://datachant.com/tag/pitfalls/
Where’s the next pitfall?
You can find them all here: https://datachant.com/tag/pitfalls/
Very interestng. May be, a good advice would be “never use filters in Power query?”
Why never? Can you elaborate?
Great post, highly appreciated. Can you please elaborate on what to do if all collumns display “List may be incomplete”? Should you really run through all columnms and display all data before filtering? Thank you in advance.
If you know the data, I recommend using the Text Filters instead of manually checking values in the list. Displaying all data may help, but will not solve future potential errors when new values are added to your data and the M logic was not generated correctly.