Web Scraping Power BI Custom Visuals from Office Store


As promised here, today we will web-scrap Office Store website, and extract the information to build a catalog of Power BI Custom Visuals. You can share this tutorial with colleagues, and use it as an exercise to improve your Power BI skills. Another great benefit of the tutorial, is its outcome. Following the steps below, you will build the queries which are used by the report, and gain new ways to explore the Custom Visuals of Power BI.

You can use such report, whenever you want to find new insights, and explore how new visuals will highlight your data. You can also use this report, when you are pitching Power BI to your clients or decision makers in your organization.

Want to have the full solution? You can now buy the full report here for $9.99.

Introduction

In the first blog post on Web Scraping, here, we learned how to scrap tables from the Web, by navigating through the HTML tags. Unfortunately, this approach doesn’t allow us to extract links from tags. In Web Scraping Part 2, here, we learned how to extract links in a different method – by parsing the HTML as a text file. In today’s post, we will repeat this technique and scrap all the Power BI Custom Visuals from Office Store here.

Office Store Power BI Custom Visuals

Ready to Start?

Creating the base query

Open Power BI Desktop, click Get Data, and select Web.

In From Web, paste the following URL, and click OK.

Ensure you are selecting the Anonymous tab, in Access Web content dialog, and click Connect.

In Navigator dialog, select Document, and click Edit.

In Query Editor window, delete the Navigation step in Query Settings right pane.

In Home tab, click Advanced Editor, and delete the text “Web.Page(” from the query expression.

Then, delete the closing parenthesis at the end.

Here is the modified query expression.

Click Done in the Advanced Editor, and right click on the office.store.com icon.

In the context menu, select Text.

Now, that we have lines of text. We should find where the Custom Visual tiles are located. We will apply a filter to find these rows. Click on the filter control of Column1, click Text Filters and click Contains.

Reviewing the source code, we can see that the prefix AppCell is used for each tile. Let’s filter the table by this prefix.

In Filter Rows, set AppCell in the highlighted text box, and click OK.

You would expect now to have multiple rows – one for each Custom Visuals. But all our visuals are on the same row. 

Right click on the header of Column1 and select Split Column, then click By Delimiter...

In Split Column by Delimiter, set <div as a –Custom– delimiter. Expand Advanced options, and select Rows. Then click OK.

We now have a base table with multiple rows for the Custom Visuals. Some of the rows contain the image of the Custom Visual, other contain the link to the visual web page. Let’s keep our query as a base query, and use it as a reference by the other queries to come.

Getting the Custom Visual Links

Right click on Document in the queries pane, and click Reference.

Click the filter control and select Text Filters. Then click Contains…

In Filter Rows set href in the highlighted text box and click OK.

Right click on the header of Column1 and select Split Column, then click By Delimiter...In Split Column by Delimiter, set href= as a –Custom– delimiter, and set Split at to the left-most delimiter, and click OK.

Right click on Column1.2, and select Split Column, then click By Delimiter...

In Split Column by Delimiter, set as a –Custom– delimiter, and set Split at to the left-most delimiter, and click OK.

Remove the first and third columns. 

You may now notice that we have two type of href links in the preview. We need the ones that don’t lead to the page search.aspx. Let’s filter those out of our query. Click the filter control, select Text Filters and click Does Not Begin With…

Set the highlighted box as https://store.office.com/search.aspx, and click OK.

Let’s rename our column to Link.

Now can clean our links and remove unnecessary parameters.

Right click on the header of our column, and select Split Column, then click By Delimiter...In Split Column by Delimiter, set & as a –Custom– delimiter, and click OK.

Remove the second column.

We now have a set of links. Each lead to a Custom Visual. Let’s test it. Click on the first cell, and copy and paste the link to your browser.

It works!

Getting the Custom Visuals Image Links, Description & Names

You can repeat the last part, create 3 more reference queries from Document, and apply different filters to locate the Custom Visuals names, description and links to their images.

If you find it difficult, use the following query expressions:

The Images query extracts the image links.

The Names query extracts the names of the custom visuals.

The Description query extracts the description of the custom visuals.

Paginating over the entire set of Custom Visuals

At this stage you have a Document base query and four other queries: Links, Images, Names & Description. But your queries can only load a single page of results.

Let’s learn how to load the entire dataset. The first step is to create a parameter that will replace the static page parameter.

Right click anywhere in the background of the Queries pane, and select New Parameter…

In Parameters dialog, set PageNum as Name, enable Required, set Text as Type, and set 1 as Current Value. When you are done, click OK.

Click on Document in Queries pane, and select the step Source in the Query Settings right pane. Now, modify the formula, and replace page=1 with page=” & PageNum & “

Before:

After:

Now right click on Document in Queries pane, and click Create Function…

In Create Function dialog, set FnLoadPage as Function name, and click OK.

Create a new Blank Query, by right clicking anywhere in the background of the Queries pane, and select New Query, and then Blank Query

In the formula bar, create the following list

= {1..3}

Note: for simplicity reasons, we assume that the entire dataset can be loaded by fetching three consecutive pages. You can change the formula to larger number of pages. For example:

={1..10} will create the list {1,2,3,4,5,6,7,8,9,10}.

We will use this list to paginate over the Custom Visual result sets. You can buy the full template that dynamically knows the number of pages needed to fetch the entire data set.

Now, let’s convert the list into a table. Click To Table in the List Tools / Transform tab.

In To Table dialog, click OK.

Rename the column to PageNum and change its type to Text.

Click Invoke Custom Function in Add Column tab.

Select FnLoadPage in the Function query drop down menu, and click OK.

Expand the column FnLoadPage.

In the expand pane, select Column1 and click OK.

Remove the first column, and rename the query Query1 as AllDocument.

Now, uncheck Enable Load for AllDocument.

On each of the queries Images, Links, Names & Description, repeat the following steps:

  1. Click on the query in the Queries pane.
  2. Select the first step in the Queries Settings pane.
  3. Change the formula from Document to AllDocument.
  4. Rename the column as Value.

For each of the queries Images, Links, Names & Description, let’s add an index column. To do it, select each query in the Queries left pane, and click Index Column in Add Column tab. Ensure that all indices starts with zero.

On each of the queries above, let’s add a custom column with their name as the value, and the header Column Name.

For example, in Links, click the Custom Column in Add Column tab, set Column Name as New column name and =”Link” as the formula. When you are done, click OK, and repeat on the queries: Images, Names & Description.

Next step, is to append the four queries together. Ensure that each query as a Value, Index and a Column Name headers.

Select any of the queries, and click Append Queries in Home tab, then select Append Queries as New.

In the Append dialog, add the highlighted queries to the right box, and click OK. By the way, if the OK button is disabled, change the order of the tables, and the button will be enabled (I think there is a bug in this dialog).

Rename the new query as Custom Visuals. Select the third column, and click Pivot Column in Transform tab.

Select Value as Values Column, and select Don’t Aggregate in Advanced options, then click OK.

remove Index.

Select the highlighted queries and uncheck Enable Load.

And finally, we have a single table that loads the entire set of Custom Visuals with their names, image URL, page URL and description.

From here, you can click Close & Apply to load the entire table to Power BI, and start building your stunning Custom Visuals report.

Conclusions

This article can walk you through the web scraping of Office Store to build a catalog of the Custom Visuals of Power BI. You can buy the full report (including many hidden gems that will not be published elsewhere), learn more here.

2 comments

Leave a Reply