Automatic Detection of Column Types in #PowerQuery

Working with column types can be challenging in Power Query. The default conversion of types may not always work well for you, and may create future refresh errors, when the column names change. I have blogged about it in details here, and usually recommend to use explicit type conversion, and remove the automatic Changed Type detection step.

Nevertheless, there are advanced scenarios (especially for me when I explore new datasets with dozens of numeric columns), where the explicit manual conversion of types is too time consuming, but going back to the automatic detection is not feasible (For example: When the first rows are not sufficient to detect the column types, and fill with null values, Power Query will end up with an Any type, which is not very helpful).

Wouldn’t it be useful to have your own automatic detection of columns? A technique that you can control, and will not fail to refresh, when the column names changes?

Here is the solution:

You can copy paste the following function into a blank query:

(SourceTable)=>
let
    
FnDetectType =
    (SourceTable, ColumnName)=>
    let
        ListColumn = Table.Column(SourceTable,ColumnName),
        NumberPercentage =
            List.Count(
                List.Select(ListColumn , each _ is number)
            ) / List.Count(ListColumn),
        IntPercentage =
            List.Count(
                List.Select(ListColumn , each _ is number and _ = Number.Round(_, 0))
            ) / List.Count(ListColumn),
        DatePercentage =
            List.Count(
                List.Select(ListColumn , each _ is date)
            ) / List.Count(ListColumn),
        DateTimePercentage =
            List.Count(
                List.Select(ListColumn , each _ is datetime)
            ) / List.Count(ListColumn),
        Max = List.Max(
            {NumberPercentage, DatePercentage, DateTimePercentage, IntPercentage}
        ),
        Result =
            if Max < .95 then
                {ColumnName, type any}
            else if DateTimePercentage = Max then
                {ColumnName, type datetime}
            else if DatePercentage = Max then
                {ColumnName, type date}
            else if IntPercentage = Max then
                {ColumnName, Int64.Type}
            else
                {ColumnName, type number}
    in
        Result,

    ConvertTypesList =
        List.Transform(
            Table.ColumnNames(SourceTable),
            each FnDetectType(SourceTable, _)
        ),
    ConvertedTable = Table.TransformColumnTypes(SourceTable, ConvertTypesList)
in
    ConvertedTable

You can run it on any table, and it will convert columns to DateTime, Date, Whole Number and Decimal Number types, if more than 95% of the values are matching the relevant type. You can also change the threshold from 95% in the code.

To run this function, follow these steps:

Create a blank query in Power Query Editor, and paste this function into the Advanced Editor. Name the function as FnDetectAndTransformTypes. Select the query with the table you want to convert, and select the fx button (right to the formula bar in Power Query Editor. Oh, and if you don’t see the formula bar, enable it in View tab).

Now, in the formula bar, let’s say that you see the formula bar:

= Source

All you need to do, is change it to:

= FnDetectAndTransformTypes(Source)

You are done. You can now see that the type detected automatically.

Automatic Detection of Column Types in Power Query using M

Note: This technique will be slow on large datasets. To run it on a preview of the data – let’s say 1000 rows – you can change the following line:

ListColumn = Table.Column(SourceTable,ColumnName),

to

ListColumn = List.FirstN(Table.Column(SourceTable,ColumnName), 1000),

Deep Dive – Understanding the M Expression

The FnDetectAndTransformTypes function receives a table as an argument, to detect the column types and convert them. In its inner implementation, there is another function – FnDetectType, which receives the table, and a single column to detects its type. FnDetectType is divided into two sections. In the first section, we calculate the percentages of decimal numbers, whole numbers, dates, and datetime in the column. For example, if your column consists of 10 rows, with two date values in it, the percentage of dates is 20%, and the DatePercentage identifier below will hold the 20%.

        ListColumn = Table.Column(SourceTable,ColumnName),
        NumberPercentage =
            List.Count(
                List.Select(ListColumn , each _ is number)
            ) / List.Count(ListColumn),
        IntPercentage =
            List.Count(
                List.Select(ListColumn , each _ is number and _ = Number.Round(_, 0))
            ) / List.Count(ListColumn),
        DatePercentage =
            List.Count(
                List.Select(ListColumn , each _ is date)
            ) / List.Count(ListColumn),
        DateTimePercentage =
            List.Count(
                List.Select(ListColumn , each _ is datetime)
            ) / List.Count(ListColumn),

In the preceding section, you can see that for each percentage type, we count the number of items in ListColumn (which is the list-representation as the column) in the denominator. In the numerator, we count the number of items in ListColumn, after we apply a filter on the list, and keep only items of a specific type. For example: We apply the condition is number to filter the list, and keep only decimal numbers in the expression:

List.Select(ListColumn , each _ is number)

To calculate the percentage of whole numbers, we first check if the value is a number, and then check if it is equal to the value of we round it down (Because when you round down a whole number, you get the same whole number as a result).

List.Select(ListColumn , each _ is number and _ = Number.Round(_, 0))

The reason why we first check if the value is a number, before we round it down, is to avoid errors, when Number.Round will be applied on values which are not numbers. When the first condition is not met, M will not evaluate the second condition.

In the next section of FnDetectType round, we find the maximal percentage that was detected. Then, if the maximal percentage is lower than 0.95 (You can set a different threshold if you wish, by replacing .95 with a different threshold), we’ll return a strange artifact: A list with two elements: The column name, and type any, for example: { Column1, type any }. Why do we need to return such a list? You will understand it at the end of this post.

If the maximal percentage is above or equal to 0.95, and the DateTimePercentage is the maximal percentage, we will return the list with the column name, and the datetime type. Otherwise, if the DatePercentage is the maximal percentage, we will return a list with the column name with the date type. Otherwise, if IntPercentage is the maximal percentage, we will return the list with the column name and Int64.Type (whole number). Otherwise, we will return the list with the column name and the number type (decimal number).

Do you know why, we first check if the maximal percentage is a whole number, before we check if the maximal percentage is decimal number?

        Max = List.Max(
            {NumberPercentage, DatePercentage, DateTimePercentage, IntPercentage}
        ),
        Result =
            if Max < .95 then
                {ColumnName, type any}
            else if DateTimePercentage = Max then
                {ColumnName, type datetime}
            else if DatePercentage = Max then
                {ColumnName, type date}
            else if IntPercentage = Max then
                {ColumnName, Int64.Type}
            else
                {ColumnName, type number}

Now, when FnDetectType returns the a list with the column name and the type of the column, we can run it on all the columns. This is done using List.Transform. which is invoked with the list of column names (Table.ColumnNames) as the first argument, and FnDetectType as the second argument. The result in ConvertTypesList is a list of nested lists, each with the column name and corresponding column type.

Finally, we apply column type conversion by running the function Table.TransformColumnTypes. This function is used, when you apply a manual column conversion. The second argument of Table.TransformColumnTypes, requires the exact format that we created in ConvertTypesList

(SourceTable)=>
let
...
    ConvertTypesList =
        List.Transform(
            Table.ColumnNames(SourceTable),
            each FnDetectType(SourceTable, _)
        ),
    ConvertedTable = Table.TransformColumnTypes(SourceTable, ConvertTypesList)
in
    ConvertedTable

The result is the converted column types. Hope you find it helpful.

Looking forward for your feedback in the comments section.

12 comments

  1. GMF Reply

    A good way to demonstrate the abilities of PowerQuery. I have data sets with a fair number of null values however so the 95% target is hard to match even if the non-nulls are a consistent data type. Is there a way to eliminate nulls from the denominator? It’s probably a variation of the List.Count command but I’m not sure which side of the parentheses it should be on…

  2. GMF Reply

    Here’s something that appears to work. But I would like to be able to distinguish between Any and Text types…

    (SourceTable, NumRows)=>
    let

    FnDetectType =
    (SourceTable, ColumnName)=>
    let
    ListColumn = if NumRows = 0 then
    Table.Column(SourceTable,ColumnName)
    else List.FirstN(Table.Column(SourceTable, ColumnName), NumRows),
    ListNulls = List.Count(List.Select(ListColumn , each _ is null)),
    ListNonNullCols = List.Count(ListColumn) – ListNulls,
    NumberPercentage =
    List.Count(
    List.Select(ListColumn , each _ is number)
    ) / ListNonNullCols,
    IntPercentage =
    List.Count(
    List.Select(ListColumn , each _ = Number.Round(_, 0))
    ) / ListNonNullCols,
    DatePercentage =
    List.Count(
    List.Select(ListColumn , each _ is date)
    ) / ListNonNullCols,
    DateTimePercentage =
    List.Count(
    List.Select(ListColumn , each _ is datetime)
    ) / ListNonNullCols,
    Max = List.Max({NumberPercentage, DatePercentage, DateTimePercentage}),
    Result =
    if Max < .95 then
    {ColumnName, type any}
    else if DateTimePercentage = Max then
    {ColumnName, type datetime}
    else if DatePercentage = Max then
    {ColumnName, type date}
    else if IntPercentage = Max then
    {ColumnName, Int64.Type}
    else
    {ColumnName, type number}
    in
    Result,

    ConvertTypesList =
    List.Transform(
    Table.ColumnNames(SourceTable),
    each FnDetectType(SourceTable, _)
    ),
    ConvertedTable = Table.TransformColumnTypes(SourceTable, ConvertTypesList)
    in
    ConvertedTable

    • Gil Raviv Post authorReply

      Thank you for sharing. What is your logic to differentiate between Any and Text?

  3. GMF Reply

    I have no idea how to distinguish between them. I suppose like Sherlock Holmes, once you have eliminated the impossible (numbers and dates) whatever remains is text! Text may be a safer bet and require fewer reversals than Any… Just about any data type will load as Text. I don’t know if you could test for Boolean, Duration, or Binary either since Boolean would be more likely to be a textual “TRUE” “FALSE” choice. Duration or Boolean would more likely be the outcome of a Power Query operation too rather than an input.

  4. John Thomas Reply

    So I’m a bit confused as to why this function is needed. It’s a nice trick and shows off some cool Power Query M code, however, because you only need to change data types once when creating the query, why not just use the Change Types step and be done with it. I use Ctrl+Click to select all “text” columns and change them, then use Ctrl+Click to select all the Numbers, then Currency, then Percent, and finally Dates. You only have to do this once and you are finished. So what gives?

    • Gil Raviv Reply

      Thank you John for sharing your approach. Most of the times I will convert the types exactly the same way as you do.

      Still, there are four reasons to consider this code:
      1- You suspect that some column names may change in the source table, but you still want to build a robust query that will work. Your approach will require the use of explicit references to the column names in M. Refresh will fail.
      2- You have plenty of columns, and you are too lazy to handpick all columns, as you described. If you have 100 columns, and the columns with dates and numbers are randomly spread in the table, this is a tedious work.
      3- You are a consultant working with new datasets all the time, and need to find some insights quickly. You want to explore the data and prepare nice visuals and find some interesting correlations between different columns, but you have too many columns to work with. The automatic detection can be a first step, for an unpivot by column types. Then you can find cool things and relationships between columns in scatter charts.
      4- The default automatic detection doesn’t work well for you. Because you have mixed types. You want to customize the detection by your own logic.

      But in the end, it is all a matter of taste and preferences. I am not recommending to prefer this technique over yours. Most of the time, when the data is known and well defined, I would prefer your approach. Still, the readers can learn M from this example, even if they will hardly use it as-is.

      Hope is now clear.

  5. Cameron Wallace Reply

    Hi Gil, I’ve tried this function out but it crashes on the Number.Round function where the value is not a number. Replacing it with (try Number.Round(_, 0) otherwise false) fixes it. Also List.NonNullCount could also be used instead of List.Count, but it would need extra logic in case that was 0. Also the “else if DatePercentage” should be above “else if DateTimePercentage”.

    Colin Banfeild made a really good function at https://social.technet.microsoft.com/Forums/en-US/ee911661-6cb1-48ac-ae46-d70979b35cb7/homogeneous-list-types-in-m . You will need to search for “Table_ConvertColumnTypes” in the forum post. It doesn’t do the 95% thing but looks at the first 200 rows (and that number can be changed). It is very effective and allows for other cultures as well.

    • Cameron Wallace Reply

      Just to make a correction, the comment: “Also the “else if DatePercentage” should be above “else if DateTimePercentage”” is not true, it doesn’t actually matter.

    • Gil Raviv Post authorReply

      Thank you Cameron for sharing this feedback. I have fixed the Number.Round following your finding. Instead of try/otherwise, I added a condition (_ is number) prior to the Number.Round condition.

      • Cameron Wallace

        That’s pretty cool. I sometimes forget that M is a lazy language and will only evaluate statements if it needs to.

Leave a Reply