Remove Duplicates, Lookup Tables and Pitfall #10 in #PowerBI and #PowerQuery

In today’s post, we will learn about the tenth pitfall in Power BI and Power Query in Excel (Read all pitfalls here). For this special occasion, we’ll get some help from R2-D2 and his friends, and demonstrate the pitfall with Star Wars data. Are you a Power BI Jedi? Can you guess what is the 10th Pitfall from this clue?

r2-d2 + space <> R2-D2

Background

You can encounter Pitfall #10 whenever you create lookup tables in Power BI. Whether you import the lookup table directly from an external data source, or create it from a fact table using the Remove Duplicates transformation in the Query Editor, there are several steps that are crucial to ensure that your report won’t fail to refresh.

You are probably aware of Power Query’s great feature that allows you to remove duplicates. This functionality comes in handy when you need to create a lookup table with a relationship to the fact table. On many self-service scenarios, you start with a single big and messy table, and create an efficient and well-designed data model, with a fact table and connecting lookup tables.

In case the Remove Duplicates is new to you, you can find it in Power BI –> Edit Queries –> Transform tab –> Remove Rows –> Remove Duplicates:

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

You can also select any column in the Query Editor, right click on the header and select Remove Duplicates:

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

The Star Wars Example

So, what’s the issue with Remove Duplicates, and how it is all related to R2-D2? Let’s find out.

Before we start, you can download the Excel workbook from here to follow the steps. In this section, we will build a Power BI Report and create the necessary steps that will lead to the pitfall. You can skip this section, if you are just too curious.

We have an Excel file with two worksheets. In the first, we have names of Star Wars characters with their height, mass, and film id. The film id will be unique on each row, so Luke Skywalker appears in five rows, each with a different film id, as shown here:

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10
Star Wars Characters and their film number (From swapi.co, The Star Wars API)

The second worksheet includes data of the Star Wars films (Nothing on Rogue One, Sorry… I haven’t watched it yet).

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

Let’s import the data to Power BI Desktop. Click Get Data icon, select Excel, and click Connect.

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

Save the Excel file from here and select it.

In the Navigator window, select Sheet1 and Sheet2 and click Load.

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

Rename the two tables to Characters and Films, and Films.

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

Click Relationships view. You’ll notice that Power BI automatically detected a relationship between the film id fields in both tables. Using this model will not be efficient, as we have duplicate rows with the same character names. While many basic reporting needs can be answered at this stage, without further improvements, this model will break, and will require over-complicated measures, when we start scaling it up to meet advanced reporting needs.

Taking care for a good design for your model, using lookup and fact tables, can boost up the effectiveness, re-usability and agility of your reporting solution. So, let’s modify our main table Characters and Films and extract the character names and their stats out of it.

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

Click Edit Queries in Power BI Home tab, and create a reference of the first query, as shown in this screenshot:

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

Now, let’s disable the load by unchecking Enable Load.

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

In the new query that we have just created, let’s click Choose Columns in Home tab, and select the columns name and film id.

Next, rename the first query as Characters and Films – Base, and the third and new query as Characters and Films.

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

The new query Characters and Films, now represents the association between characters and the films they appeared on. Now, it’s time to create a new query with a reference to the base query to extract the characters. Right click on the first query, and select Reference.

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

Rename the new query as Characters. Select the column name, right click its header and select Remove Duplicates.

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

Remove the column film_id (We have already kept the association of characters to films in the query Characters and Films). Note that we could remove the column film_id before we remove duplicates. For large datasets, we may improve the refresh time.

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

After you click Close & Apply in the Query Editor, you will see the following relationship created automatically between Characters and Characters and Films:

You can also create the relationship by dragging and dropping the field name from Characters to Characters and Films.

Note: In Power BI Desktop February 2017 Update, you may get the following error when you try to create relationships through the drag and drop experience.

To avoid the bug above, you can switch to Data view, click Manage Relationships, and click New.. in Manage Relationships window.

Then, you can create a one-to-many relationship between Characters and Characters and Films in the Create Relationship window, as shown in the following screenshot:

We can now create a nice report that shows the number of appearances in Star Wars by characters. For example: The visual below shows that R2-D2 appeared in 6 movies.

Remove Duplicates in Power BI and Power Query in Excel - Pitfall #10

 

The Tenth Pitfall – When R2-D2 turns into r2-d2

Drilling down to the data, you will notice that R2-D2 isn’t associated to the 7th movie – Star Wars: The Force Awakens. My data was wrong. R2-D2 did appear in that movie. There is also a big mystery around the reason he had awaken from his big slumber in that movie here.

 

So, let’s go back to our data source and add a new row for r2-d2 with film id 7. We will use the lowercase version intentionally to show you what would happen next.

After we added the new row, clicking Refresh in Power BI Desktop, will yield the following error:

Looking in the query Characters, you can see that the merge duplicates step was case sensitive. We now have two rows. One for R2-D2 and one for r2-d2. But from the error message above, you can see that on the relationship side, Power BI (and the implicit Power Pivot engine) is case insensitive. When R2-D2 and r2-d2 are detected in the refresh, Power BI aborts the refresh, as it expects unique values in the column name of table Characters, as this table has a one-to-many relationship with the table Characters and Films.

How to avoid the Pitfall – Part 1

Assuming you don’t have a good “data contract” that will ensure that your lookup table has unique case-insensitive values, you can apply a transformation step prior to the removal of duplicates. You can choose between three transformation steps: lowercase, UPPERCASE, and Capitalize Each Word.

In our example, select the base query Characters and Films – Base, select the column name, right click on its header and select Transform. Then click UPPERCASE.

You can now refresh the query. The report will get fixed. The only caveat is that all the character names will now be in uppercase letters. You can use lowercase, or Capitalize Each Word instead.

As an exercise, you can try to apply an uppercase transformation on the robots, and capitalize each word for the other characters. Try the SWAPI API to get the species of the characters and apply this conditional transformation.

Pitfall #10 Part 2 – Trailing Space

Now let’s see what happens when we add a space for one of the names. In our example, I’ve added a space after r2-d2 in cell A19, as shown here:

Clicking Refresh will lead to the same error:

Removing duplicates is applied with case sensitive logic, and leaves R2-D2 and R2-D2[space] as separate unique values, but the Power Pivot engine trims the trailing space, and consider the two values as the same. Thus, the one-to-many relationship is broken, and the refresh fails.

How to avoid Pitfall #10 – Part 2

Fortunately, there is an easy fix. We should also add a Trim step prior to the removal of duplicates. In our example, we can select the base query, right click on column name, and select Transform and Trim.

The third part of Pitfall #10 occurs when you perform Remove Duplicates to create your lookup tables on large data sets. If you have a solid “data contract” or confidence that your lookup table is relatively static, and is rarely changed. For example: a lookup table of countries.

To improve the performance time, you can exclude the lookup table from the refresh sequence, by un-checking Include in Report Refresh in the query level.

Conclusion

When handling your lookup tables, always apply lowercase, uppercase or capitalize each word on the column with unique values, and don’t forget to trim them as well, before you apply Remove Duplicates.

Stay tuned for the Pitfalls Wrap Up on PowerPivotPro.com.

Hope you enjoyed this series.

Gil

3 comments

  1. Pingback: The Ten Pitfalls of the Data Wrangler (aka the Wielder of M/Power Query) - PowerPivotPro

  2. Danny DennisonReply

    GIL*,
    I finally had a chance to review the series in-depth… lots of useful info! I especially like “Data Contract” over my terminology: “Future Proofing” 🙂

    In reference to the “Power BI Desktop February 2017 Update” relationship bug, I have discovered a simpler solution… try dragging joins in the opposite direction.

    Appreciate you posting the series,
    Danny

    *apologies for accidentally “Transforming” your name to UPPERCASE. 😉

    • Gil Raviv Post authorReply

      Thank you for the feedback, Danny.
      BTW, March update fixed the relationship problem.

Leave a Reply