In today’s episode of the Pitfalls Series (here), we will briefly discuss the fifth pitfall, which is one of the most common mistakes that you always do, when you import data to Power BI, or to Excel (using Power Query / Get & Transform) – You remove columns.
While the removal of unnecessary columns is a crucial part in building efficient and effective data model in Power BI, each time you delete a column, you weaken your query, and expose it to a future refresh failure.
In the following screenshot, in the Query Editor of Power BI or Power Query in Excel, I deleted Column2. As a result, the Query Editor created a formula that explicitly refers to Column2.
It makes sense, right? After all, I have just deleted Column2, so it shouldn’t be a big surprise to find it in the formula above. Right.
Still, each time you remove a column in the Query Editor, you take the risk that the removed column will be missing in the external data source. Would you like your query to fail refreshing on a production environment, or on your CEO’s dashboard, because of a careless deletion of columns?
The Simple Solution – See the glass half full
The following solution is also applicable to many other things in life. See the glass half full, and focus on the columns you want to keep, rather than the ones you want to remove.
Whenever you find unnecessary columns, try to establish a “data contract”, even a virtual one, and ask yourself if these unnecessary columns will always remain in the external data source. In many cases, you would do better in selecting the few columns you want to keep, rather than selecting the ones you want to remove.
For example, in the table below, you can see that I have four columns which only contain null values. In the following step, I remove the four columns.
The removal of the last four columns had created a formula that explicitly refers to those columns. So, we have 4 weaknesses in our query – 1 weakness for each column name we refer to.
The solution in our case, is to select the first two columns, which we need, and select Remove Other Columns.
The selection of the first two columns, had created a different formula which refers to the first two columns instead of the last four. While the results are same, the second query is much more robust.
The following query has only 2 weaknesses, in comparison to the 4 weaknesses above, and I could even argue that there is a higher chance for the columns you keep to remain untouched, in comparison to the columns you remove. So, all in all, selecting columns is usually better than removing columns, if you need to create a robust query.
In many cases, you will find it easier to establish a “data contract” with the owner of the external data source, to commit on keeping the columns you need, rather than to maintain the columns that you don’t need.
Another option to select the columns to keep is by clicking Choose Columns in Home tab.
Solution #2 – Ignore the missing column
Often times, it doesn’t matter if you either remove or select columns. You will eventually deal with external data sources that are likely to change. Fortunately (for some), there is a third and optional argument in Table.RemoveColumns and Table.SelectColumns that allows you to ignore errors, instead of failing to refresh. The third argument is MissingField.Ignore or MissingField.UseNull.
The optional field is also available in Table.ReorderColumns which was covered here.
Let’s quickly demonstrate the use of each MissingField option.
In the query above, we selected Column1 and Column2. Let’s see what happens when Column1 was turned into New Column1.
The Table.SelectColumns step fails with the following error:
Now, if we add the third optional parameter MissingField.Ignore, the step won’t fail, but if there are further steps in the query, that rely on Column1, this query will probably fail down the road, as Column1 is missing.
If we add the third optional parameter MissingField.UseNull, the step won’t fail, and we will have null values in Column1.
I think that MissingField.UseNull is more practical than his sibling MissingField.Ignore. But both options may expose to query to other failures, which may be difficult to debug and fix.
Solution #3 – Use Column Indices
Sometimes, we have a pretty solid “data contract” for some crucial columns in our tables, but not by their name, but by their position in the table. For example, let’s use our example above, and select the first two columns by their indices.
We will replace the original formula that was created as Solution #1 above:
= Table.SelectColumns(Source,{"Column1", "Column2"})
With a robust query that uses the function Table.ColumnNames to get the list of all column names, and uses List.Range to retrieve a subset of the columns (In our example, we will keep the first two columns by using index 0 as the second argument of List.Range, and 2 as the third argument for the number of items we keep):
= Table.SelectColumns ( Source, List.Range(Table.ColumnNames(Source), 0, 2) )
To keep the code simple for our specific case, we can use List.FirstN to keep the first two columns.
= Table.SelectColumns ( Source, List.FirstN(Table.ColumnNames(Source), 2) )
You can also select individual columns. The formula below is equivalent to the one above.
= Table.SelectColumns ( Source, { Table.ColumnNames(Source){0}, Table.ColumnNames(Source){1} } )
The code above is practical when you need to select nonadjacent columns. For example: Selecting the first the third columns can be written by using Table.ColumnNames with index 0 and 2:
= Table.SelectColumns ( Source, { Table.ColumnNames(Source){0}, Table.ColumnNames(Source){2} } )
Oh, and you can also apply the techniques above on Table.RemoveColumns. For example, we can remove the last four columns with this formula:
= Table.RemoveColumns ( Source, List.LastN(Table.ColumnNames(Source), 4) )
Conclusions
Removing and selecting columns is probably the most common transformation step you do in Power BI and Power Query. You have the choice to select between the two, and ignoring that choice is a common mistake we do. A mistake, which may lead to weak queries that fail when the column names change. In many cases the selection of columns can help you to gain a robust query.
You can use the optional field MissingField.Ignore or MissingField.UseNull to avoid errors, but you will find out that the use of Table.ColumnNames and List.Range, can literally be your best friends, when you have a solid “data contract” which is based on the columns’ positions, rather than their names.
Coming Next – The sixth pitfall – Column Renaming.
Pingback: Column Renaming and List.Zip To The Rescue - Pitfall #6 in #PowerBI #PowerQuery - DataChant
Absolutely awesome ! Thank you Gil.
I love your concept of “Data contract”. I have spent quite a few months learning DAX and put aside Power Query Formula language. But I realise now how powerful it can be to combine these 2… 🙂
This is fantastic. I am not sure I used it in the right way or if there is a better way, but after getting my data the way I wanted it, the last step was to PIVOT the data based on the month end, so I would get 12 columns of data in a flat table for a human to look at. We didn’t want a Pivot Table in Excel.
This created a problem with a bunch of nulls in the resulting table, and of course when I replace them, PQ by default named every column explictly in the replace – so “1/31/2017′, “2/28/2017″, etc,
But of course the dates would constantly change as we moved through time, so it would break next month.
I got rid of that part of the Table.ReplaceValue and replace it with Table.ColumnNames(#”Pivoted Column”){4},Table.ColumnNames(#”Pivoted Column”){5},Table.ColumnNames(#”Pivoted Column”){6}, etc. based on your post above.
What if you work with data of 100+ fields and those field indexes keep changing. How do you automate finding the right Index value based on if the field name contains “Day,” “Tech”, “Fin”, etc?
This expression returns the index of the column “Day”:
List.PositionOf(Table.ColumnNames(Source), “Day”)
Excellent – Answered my question of how to reference the first few columns without hard coding column names – thanks!.
Excellent! Thank you for sharing your comment.
I thank you for all this exemples
I would like to combine 2 of your cases
Get columns form position 0 to 3 and from position 17 to 29
I tried this which is not working :
= Table.SelectColumns
(
Source,
{
List.Range(Table.ColumnNames(Source), 0, 3),
List.Range(Table.ColumnNames(Source), 17, 29)
}
)
Thanks a lot
= Table.SelectColumns
(
Source, List.Combine(
{
List.Range(Table.ColumnNames(Source), 0, 3),
List.Range(Table.ColumnNames(Source), 17, 29)
})
)
Thaaaaaaaaaaaaaaaaaank you so much 😉
Chris