Guest Story: Unpivot Pairs of Columns

Our guest today is Sandeep Pawar. A Simulation & Data Analytics Engineer at Cree Lighting in Racine, WI. Sandeep applies Data Analytics, Statistics, and Machine Learning to improve Manufacturing operations. You can contact him here.

I wanted to share with you today an interesting challenge that I solved recently. I received quality inspection data (You can download the sample CSV file here) and was asked to create a report that will enable users to select from hundreds of different attributes (that are represented as columns in the source table) and use a dynamic measure against the selected attribute, so all the visuals in the report should update dynamically. For example, in the screenshot below, the Quality_Measure slicer represents four of these attributes (Of course in the real data, I had hundreds of these attributes).

When you have a limited set of columns that you want to choose from as your measure, you can create a dynamic measure in DAX using SWITCH and SELECTEDVALUE on top of a disconnected table (See examples here and here), but in my particular case, I had hundreds of unique attributes, so I was looking for a different way.

I knew how this could be achieved in Power Query using Unpivot.

But in my case, the data was in a format that was not easily unpivoted. For each attribute, I had a second dependent attribute with two optional values In-Spec / Out-of-Spec, as shown in this table:

This diagram illustrates the paired columns and the desired transformation.

Because of the dependent attribute (which is marked as C1…CN in the diagram above), I could not unpivot the columns. A simple unpivot will lose the In-Spec / Out-of-spec context altogether.

Approach 1: Create temporary separate tables for each pair and append them

I have more experience with complex data manipulation using Python Pandas than M. In Pandas, I am used to creating data frames as needed and merging them together to create the final table. I thought of doing the same thing in PowerQuery using M. I followed chapters 3 & 4 of Gil’s book (link) to select the attribute columns and appended the tables together to get the final result following these high-level steps:

  1. Keep the source query as your base query, and create a reference.
  2. Select the anchored columns (e.g. Data, SKU, Vendor as shown in the table above). Then, select Color and Color_Check.
  3. Rename Color_Check to Check.
  4. Save this query as Query 1.
  5. Go back to your source query and create a reference. Select the anchored columns (e.g. Data, SKU, Vendor as shown in the table above). Then, select Oxidation_Level and Oxidation_Check.
    Rename Oxidation_Check to Check.
    Save this query as Query 2.
  6. Go back to your source query and create a reference. Select the anchored columns (e.g. Data, SKU, Vendor as shown in the table above). Then, select Film_Thickness and Film_Thickness_Check.
    Rename Film_Thickness_Check to Check.
    Save this query as Query 3.
  7. Append Query 1, Query 2 and Query 3.

Since I preferred to keep a minimal number of queries in my report, I decided to avoid using Power Query GUI, instead, I created this M expression that consolidates all the logic above into a single query:

let
    // Create Color Table, Add Columns for Color to create Color sub-table
    ColorTable1 = Table.SelectColumns(Source,{"ID","Date","SKU","Vendor", "Color","Color_Check"}),
    ColorTable2 = Table.AddColumn(ColorTable1,"Quality_Measure", each "Color"),
    ColorTable = Table.RenameColumns(ColorTable2,{{"Color", "Quality_value"}, {"Color_Check", "Quality_Check"}}),

    // Create Oxidation Table, Add Columns for Oxidation to create Oxidation sub-table
    OxidationTable1 = Table.SelectColumns(Source,{"ID","Date","SKU","Vendor", "Oxidation_level","Oxidation_Check"}),
    OxidationTable2 = Table.AddColumn(OxidationTable1,"Quality_Measure", each "Oxidation"),
    OxidationTable = Table.RenameColumns(OxidationTable2,{{"Oxidation_level", "Quality_value"}, {"Oxidation_Check", "Quality_Check"}}),

    // Create Concentration Table, Add Columns for Concentration to create Concentration sub-table
    ConcentrationTable1 = Table.SelectColumns(Source,{"ID","Date","SKU","Vendor", "Concentration","Concentration_Check"}),
    ConcentrationTable2 = Table.AddColumn(ConcentrationTable1,"Quality_Measure", each "Concentration"),
    ConcentrationTable = Table.RenameColumns(ConcentrationTable2,{{"Concentration", "Quality_value"}, {"Concentration_Check", "Quality_Check"}}),

    // Create Film Thickness Table, Add Columns for Film Thickness to create Film Thickness sub-table
    FilmthkTable1 = Table.SelectColumns(Source,{"ID","Date","SKU","Vendor", "Film_Thickness","Film_Thickness_Check"}),
    FilmthkTable2 = Table.AddColumn(FilmthkTable1,"Quality_Measure", each "Film_Thickness"),
    FilmthkTable = Table.RenameColumns(FilmthkTable2,{{"Film_Thickness", "Quality_value"}, {"Film_Thickness_Check", "Quality_Check"}}),

    // Append all tables to create Final Table
    FinalTable = Table.Combine({ColorTable,OxidationTable,ConcentrationTable,FilmthkTable})
in
    //Return Final Table
    FinalTable

And here is the M query for my Source (Pointing to the CSV file I shared with you earlier).

let
    Source = Csv.Document(Web.Contents("https://datachant.com/public/UnpivotPairs.csv"),[Delimiter=",", Columns=12, Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
    #"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"ID", Int64.Type}, {"Date", type date}, {"SKU", type text}, {"Vendor", type text}, {"Color", type text}, {"Color_Check", type text}, {"Oxidation_level", type number}, {"Oxidation_Check", type text}, {"Film_Thickness", type number}, {"Film_Thickness_Check", type text}, {"Concentration", type number}, {"Concentration_Check", type text}})
in
    #"Changed Type"

Now the data is in a format I could use. I named the query “By Combining in M” load the data and built the following dynamic measure to count the number of SKUs that are out of spec.


CALCULATE (
    COUNT ( ‘By Combining in M'[Quality_Value] ),
    ‘By Combining in M'[Quality_Check] = “Out-of-Spec”
)

You can download the solution file from here and see how I used this measure to find the SKUs that are out-of-spec based on the selections in the slicers.

When I used this technique, I found out two main drawbacks: The query is very time-consuming when used on large amount of pairs; The process to separate the table into multiple tables is manual and error-prone. I reached out to Gil for his help to see if there is a better, faster approach than mine. Below is Gil’s elegant solution.

Approach 2: Unpivoting Column Pairs

The following technique is preferred if you have a high number of column pairs. For example, in this PBIT file, we generated 100 rows with 16 column pairs and places the two techniques side-by-side. You will find out that the Unpivot technique is significantly faster when you load the template.

However, if you only use a few pairs, you can stick with the first method.

To start the second technique, load the source data and follow these steps for each column pairs: Select the current pair of columns. Right-click on one of their column headers and select Merge Columns in the shortcut menu.

In the Merge Columns dialog box, select Comma as the separator (or another separator that is not used in the values of your data). Enter the attribute name in New column name text box and click OK.

After you merge all the column pairs, you are ready for the unpivot. Select the anchor columns, and then right-click Unpivot Other Columns.

Right-click on the Value column, and select Split Column, By Delimiter.. in the shortcut menu.

Rename the columns Value.1 and Value.2. In my case, since I deal with quality attributes I chose the column names Quality_Value and Quality_Check. Here is the final table.

You can download the PBIX file from here.

Is the second technique faster? Well, it depends on your data. In the data that I shared with you here, we have 148800 rows and only 4 pairs of columns. As you can find out, with this dataset the first approach is faster.

But in the PBIT file, we selected only the first 100 rows, and duplicated the 4 column pairs 4 times, resulting in 16 column pairs. In this dataset, the second approach was significantly faster – even when we used Table.Buffer to ensure that the access to the data source is done only once in the first approach.

You are welcome to share your feedback in the comments below, and try to explain the performance differences of the two approaches. In my next blog post, I will share with you a dynamic M code to merge the column pairs dynamically instead of the manual merge that we applied here. I hope you find this story useful.

Leave a Reply