“If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.” Andrew Ng
Data quality is one of the highest challenges in any BI project. ETL developers tend to spend a lot of time creating data preparation processes with complex and expensive pro-code tools or long manual testing procedures that are constantly hammered by messy data, edge cases, and lack of good business requirements. These issues result in major delays in project release, and failure to deliver the highest value to the users over time. But this can all be mitigated if we choose to use Power Query low-code ETL capabilities to deliver agile and low-cost data quality automation.
Encouraged by my readers and Andrew NG’s recent article, I have decided to kick off a new series to discuss data quality automation. And the perfect way to start it is to announce the release of an awesome and free app on Microsoft AppSource. Dataflow Snapshots Analysis, which is available to download here. The app allows you to use a connected ADLS Gen2 Power BI dataflow as the source for auditing and data quality automation use cases. Every time you refresh the dataflow, a snapshot is created on ADLS Gen2 and the app will profile it and detect changes in records from the previous version.
In today’s post, I will guide you through the installation steps and describe how you can connect the app to your dataflow’s snapshots to monitor data changes over time and find trends on key data profiling metrics such as Min, Average, Max, and Standard Deviations.
If you have any technical issues connecting to the app, please read the known limitations below and contact DataChant with more details here.
To install the app you need to have a Power BI Pro account and the permissions from your Power BI administrator to install Power BI template apps from Microsoft AppSource. This app is a Power BI Template App. You can learn more about Power BI Template Apps here.
After you install the app, you can review it with the built-in demo data before you connect to your data. In the Summary page you can view the main data changes, missing or new records, and find trends of key profiling metrics.
In the Differences view, you can review all data changes in more details in a tabular format and slice & dice the changes by tables, change types, or time periods. For exmaple: You can review only missing records, or changes to a specific table in a given period. You can expand the Filter pane on the right to filter the report by specific dates and tables.
The Profiles view highlights selected profiling metrics by tables, columns, and snapshot dates, and allows you to find anomalies in any column in your tables based on min, average, max, standard deviation, null count, distinct count, and row count.
The app was designed to track the data in your Power BI dataflow’s snapshots. When you connect your workspace to ADLS Gen2, every time you refresh the dataflows it creates a snapshot on ADLS Gen2. This app takes leverage of the snapshots to track changes and profiles your data. To connect your workspace to ADLS Gen2 go here.
Now that you have a workspace that is connected to ADLS Gen2 and you have a dataflow that is refreshed, you can follow these steps to monitor the selected dataflow’s snapshots.
Download the Excel file from here.
This workbook is important to define the keys in your dataflow’s tables to track the changes in the snapshots. For example, in the demo data of the app, our dataflow imports data from 4 tables: Revenues, Colors, Cities, and Categories. The Revenues table has no single column that can be used as a unique key, but every row in the Revenues table has a unique combination of the columns: Color, Parent Category, Category, Country, State, and City. So, the Excel table includes the rows below to define a composite key that will help the app to compare records with the matching values of these columns.
Let’s go over two more tables to make sure the format is clear: The Colors table has a key column. Its name is Index. Categories table has no unique key but the combinations of Parent Category and Category can serve us as the composite key.
Now, you can review the tables of your dataflow, and fill in the Excel table with the relevant table names and their associated single or combined columns that can be associated as unique keys. Once you complete the Excel table, please save it on your OneDrive for Business or a SharePoint site and follow the instructions here to get the file path for the Excel file. You will need to provide it as a value to the parameter Table Keys Path.
Connecting the App
In Power BI service, go to the Apps section and select Dataflow Snapshots Analysis app.
Click Connect your data.
Enter the paramaters in the Connect to Dataflow Snapshots dialog box, following the instructions below.
The first parameter is ADLS Gen2 Account Name. Enter the account name of your ADLS Gen2 storage account. You can find it on the Azure Portal when you select the ADLS Gen2 resource under Storage accounts as highlighted below.
The account name you should use is also available on Azure Storage Explorer as highlighted below. Make sure you see powerbi container under the Blob Containers section.
The next parameter is Table Keys Path. Enter the file path (without “?web=1 suffix”), following the Initial Preparations sections above.
Scroll down to find the next parameter: Top Rows. Select the maximal number of rows to analyze in each snapshot. For performance reasons, it is advised to monitor small tables. Very large snapshots may time out.
Then, set Max Recent Snapshots. Select the number of latest snapshots to monitor. For performance reasons, it is advised to monitor a small number of snapshots.
Show Only Changes: You can set this parameter to Yes if you want to look only for changes and ignore new and missing records. Set the parameter to No, if you wish to monitor both changes, missing records, and new records.
Workspace Name: Your Power BI monitored workspace that is connected to ADLS Gen2 and contains a dataflow.
Dataflows Name: Your Power BI dataflow in the workspace you monitor.
For exmaple, in the screenshot above, you see the Azure Storage Explorer with the parameters:
- ADLS Gen2 Account Name: datachantlake
- Workspace Name: Test Automation
- Dataflow Name: C07E01
As the owner of the app, you can edit the parameters at any later time from the Workspace settings. Learn more here.
The Parameters section will be available in the Settings of the dataset. From there you can also set a scheduled refresh to have the app constantly refreshed.
Known Limitations & Best Practices
The Dataflow Snapshots Analysis app will not work well on very large data sets. It will most likely timeout due to the refresh limitations of the Power BI Service. The app is looking for differences by matching the keys in records in each pair of consecutive snapshots. This matching can be significantly time-consuming on large datasets or a high number of snapshots. Follow these recommendations to avoid timeouts:
- It is recommended to connect this app to small datasets (less than 1 million rows). If you have large data sets, you can use Power Query online to sample the data in another dataflow and connect the new dataflow to the app. It is better to audit a sampled data than avoid automated auditing altogether.
- For refresh performance improvements it is recommended to use tables with a single unique key, and update it in the Excel file that was described above.
- If you set the app’s dataset for a scheduled refresh, it is advised to set a limit to the parameter Max Recent Snapshots.
If you receive errors during the refresh of the app, please confirm that the keys are set correctly and that Power BI can connect to the Excel file and Azure Data Lake. It is also recommended to confirm using Azure Storage Explorer that your dataflow has snapshots stored on the associated ADLS Gen2. Note: The current solution uses a Pipe character (“|”) to concatenate multiple columns that are defined as keys.
If your keys contain the pipe character (“|”) in one of their columns, it is recommended that you create a new single-column key for your data. Otherwise, the app will fail to refresh or get wrong results. In future versions, we will remove this limitation.
The app connects to a single dataflow. To connect to multiple dataflows or even multiple workspaces you can install multiple instances of the app, or contact me in this form.
If you still have technical issues connecting to the app or have specific requirements that the app doesn’t meet, please contact DataChant via this form and share feedback. Make sure you check out our other apps here.