Now that we have introduced the technology in the previous blog post, Introduction to Power BI Dataflows, this post will cover topics about the authoring experience, some important considerations, basic architecture, and Do’s and Dont’s.
Options to create tables
As you can see in the image above, there are 4 available options to choose from when authoring a new Dataflow.
Define new tables:
If it’s the first time you’re adding tables to a Dataflow, this is the option you would choose. Very similar experience to the Power BI Desktop “Get Data” dialog box, you can choose the data source(s) and tables, and apply any transformations.
Tip: If you have created this table in the Power BI Desktop and want to leverage Dataflows to share the logic across other reports, you can choose blank query in the data sources page and copy and paste the M code in the Advanced Editor in the Power BI Desktop. Most of the transformations available in the Power BI Desktop app are also supported in Dataflows.
Link tables from other Dataflows:
If you have access to Power BI Premium (Dedicated Capacity or Premium Per User license) and have previously added the tables in another Dataflow then choose this option to reference those tables. You can continue to apply more complex transformations to the tables you have already loaded to the cloud by creating computed tables. To create a computed table you would only need to reference/duplicate the linked table added to your Dataflow.
Some of the benefits of using linked tables are: reducing the load on source systems, utilizing the data that is already loaded on the cloud (removing the time Power BI would take to connect to the data source and load time), and leveraging the Power BI Dataflows Enhanced Compute Engine.
Linked Tables and the Enhanced Compute Engine
The Enhanced Compute Engine was designed by the Engineering Team to decrease the time it takes to apply transformations in a Dataflow. You can leverage this engine by creating computed tables and making sure the Enhanced Compute Engine is enabled for the Dataflow. You can create a computed table by applying transformations to linked tables referenced in the Dataflow.
As you start authoring Dataflows and create your own complex solution, you may start to think about backing up your work or deploying your solution from different environments (Development, Test, Production).
If Power BI Deployment Pipelines are not part of your Power BI Architecture and you need to deploy your Dataflow to another environment, need to back up your progress, or simply looking to recreate a Dataflow that will have similar logic, you should consider this option.
This option provides the option to upload a .json file. Each time you create a Dataflow, you can download the .json file. This file stores the metadata and code for the Dataflow you created.
When you select this option, you would be recreating that Dataflow defined in the .json file.
*Whenever you export and import a Dataflow, it gives the Dataflow a new ID. If you are using this feature to deploy from one environment to the next, make sure that the .pbix file is pointing to the new Dataflow ID. This also applies to Linked tables, if the Dataflow was recreated using this feature, you would need to manually update the ID in the query that connects to the tables.
How to export a .json file:
Reference .json file
Let’s take a closer look at the sections outlined in the image below.
Section 1: This section represents the M code generated by the Power Query Online engine when you create a Dataflows. Essentially you could copy and paste this query in the Advanced Editor to create this same table defined in the code. For this example, this is only a simple table connecting to an Excel spreadsheet.
Section 2: This section lists all the entities (tables) that were defined in the Dataflow and include details such as the names and attributes.
*Queries metadata above Section 1 also lists table names. Also, the table name is defined in Section 1. Section 2 was highlighted to show the detail of the columns metadata that is stored as part of the .json file.
Section 3: This section is part of the Entities list and defines the refresh type (full load or incremental) and the location. In this specific case, the location is the Microsoft cloud since for this example, there was no ADLS Gen 2 storage account added for Dataflows storage
Attach a Common Data Model folder
If you have other Azure Services that write data to the Azure Data Lake Gen 2 account instance and you would like to create a Dataflows to then share with other Power BI users, you can choose this option.
In order to implement this option, the following requirements need to be met:
– Power BI requires read access to the Azure Data Lake Gen 2 account.
– User creating the Dataflow requires read access to the Azure Data Lake Gen 2 account.
– URL should be a direct file path to the .json file and use the ADLS Gen 2 endpoint.
Remember the basic architecture for Dataflows referenced in the first blog post? Link to blog post.
Let’s add another layer of complexity to explain how you can implement Dataflows and some important considerations.
Above, you can see there are essentially three different workspaces:
- WORKSPACE #1 (Orange shade) which hosts 4 Dataflows, 1 Dataset, 1 Report and 1 Dashboard.
- WORKSPACE #2 (Pink shade) which hosts 1 Dataflow, 1 Dataset, 2 Reports and 1 Dashboard.
- WORKSPACE #3 (Red shade) which hosts 1 Dataflow.
Referencing a Table across Workspaces
In any organization there is data that will be shared across departments but also data that is very specific to a department and wouldn’t be shared. In the image above, you can visualize a common scenario where Dataflows are shared across workspaces to be used in specific Datasets.
*When referencing a Dataflow from another workspace remember you need to set a refresh schedule on the Dataflow where you added the linked tables since it behaves as an external data source.
How Refresh works
In the common Architecture referenced above, tables and linked tables are used across Dataflows and in different workspaces.
For linked tables in the same workspace, the recommendation by Microsoft is to set the Refresh schedule at the Extract Dataflow, which is the one that connects directly to the data source. The refresh triggers a refresh process on any Dataflow referencing tables from the Extract Dataflow.
When a Refresh is triggered at the Extract Dataflow, all downstream Dataflows are locked to ensure data consistency. If you would prefer to avoid this locking behavior, you have the option to move the upstream Dataflows to a separate Workspace and set a Refresh schedule.
For linked tables across different workspaces, a separate refresh schedule would need to be added as the Dataflows are treated as external sources.
Collaboration and Permissions in Dataflows
In the Power BI Service, there are currently 4 different roles that can be assigned to users: Admin, Member, Contributor, and Viewer. As you may know, these are set at the Workspace level.
To author/edit any Dataflow, Contributor level permissions or higher are required and same as Datasets, only one user can be listed as an owner at any point in time. To make any modification, the user would first need to take over the ownership of the Dataflow.
With these access constraints, you may feel comfortable that only the users granted Contributor or above permission, would edit the extraction and transformation logic created in your Dataflows. To reference any Dataflow in datasets or another workspace, only the Viewer role is required in the workspace that hosts the Dataflows.
In this scenario, you likely have two types of workspaces:
- Back-end workspace: Created to host Dataflows that extract, apply transformations and may even filter the data for others to connect to.
- Self-Service workspace: Created to provide users with the ability to connect to Dataflows and create their own content.
*To provide users with Read access to the Dataflows in this scenario, there are two options:
- Grant viewer access to users in the back-end workspace so they can create their own reports connecting to the Dataflows.
- Create Dataflows in the Self-Service workspace that connect to the data hosted in the Back-end workspace by leveraging Linked tables.
*Remember that Linked Tables are a Premium feature.
Do’s and Dont’s
- DO separate tables from Cloud and On-Premises sources in 2 different Dataflows. You wouldn’t want to force your Dataflow to connect to a Cloud source from the Gateway and causing more traffic. Also it will only help troubleshooting when you experience any issue.
- DON’T connect to two different On-Premises sources from the same Dataflow. Only one Gateway is supported at the Dataflow level. You can connect to different sources that leverage the same Gateway, however, to make troubleshooting easier you can split the sources on different Dataflows. So essentially you would have 1 Dataflow that connects to on-premises Teradata and 1 Dataflow that connects to on-premises Oracle.
- DON’T blame Sharepoint/Dataflows connection when experiencing issues. SharePoint throttles requests by design and if you are trying to connect to multiple files or SharePoint lists at the same time, your requests will time out.
- Following the above statement, DO stagger your Dataflow Refreshes that connect to SharePoint to avoid the Throttling issues.
- DO create separate Dataflows for each Cloud source, it will only help troubleshoot when you experience any issue during a Refresh. It would be easier to find the source and investigate accordingly.
- DO set Refresh schedules at the Extract Dataflow level. When the Refresh for this Dataflow is completed, it will trigger the Refresh on all Dataflows referencing linked tables.
- DO set a Refresh schedule for Dataflows referencing linked entities on another Dataflow.
I hope you found this article helpful as you continue your journey with Power BI Dataflows! Please feel free to reach out directly on LinkedIn if you have any questions or comments.
Follow me on Twitter