Web Scraping in Power BI and Excel Power Query


There are abundant public data sources on the Web, and it is so tempting to import and mash them up with your existing data sets in Power BI and Excel Power Query to gain new correlations and insights. While it’s easy to import static tables from many websites in Power BI and Excel Power Query, you’ve probably found out by now that in too many cases, the tables you want to import are illusive, and inaccessible in Power BI.

Have you ever encountered the following dead end experience, with no tables to import?

In today’s blog post, I will share a technique that will allow you to scrap tables from web pages, by navigating the web page elements in the Query Editor, and mastering the craftsmanship of Children expansion. The good news is that you’ll not need prior knowledge of HTML.

We will demonstrate the technique by web scraping the Microsoft Find an MVP search results here.

Web Scraping in Power BI and Power Query

To motivate you to keep reading, here is the Power BI report, that I could prepare with the data that was extracted in the page above.

The simple scenario

Before we start our challenge, let’s briefly review the simple scenario, when the import of tables is easy. We will demonstrate it with this population table from Wikipedia (Yes, most, if not all, of the tables in Wikipedia are easily imported to Power BI). Feel free to skip this part if you are familiar with the Web connector in Power BI or Power Query in Excel.

So, here is a screenshot of the table we are going to import from here:

Web Scraping in Power BI and Power Query

Open Power BI Desktop, and click Get Data, then click Web. If you are using Excel (with Power Query Add-In installed) you can find the Web connector in the Power Query tab. If you have Excel 2016, click Data –> New Query –> From Other Sources –> From Web. We will use Power BI Desktop in this tutorial, but the steps are the same in Excel.

Web Scraping in Power BI and Power Query

In From Web dialog, paste the URL below and click OK.

https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)

Web Scraping in Power BI and Power Query

In Navigator dialog, select the table Countries and areas ranked by population… and click Edit, to clean the data in the Query Editor before it is loaded to your report.

Web Scraping in Power BI and Power Query

The example above was simple. Power Query identifies the table in the HTML page, and allows you to import it in a relatively easy experience. You can find many other simple web pages to import. These web pages are usually public. They will not require to log in, and their tables are implemented using the <table>…</table> tags.

Unfortunately, many tables and search results in HTML pages are implemented in different ways.

When you try to import the challenging websites in Power BI or Excel Power Query, you will end up having the Document entity in the Navigator. In the next part of this article, we will demonstrate how to extract tables through by editing the query and extracting the relevant elements in the HTML.

Web Scraping in Power BI and Power Query

Web Scraping

In this part, we will learn how to import a table from web pages, and demonstrate it on the Microsoft MVP website: https://mvp.microsoft.com/en-us/MvpSearch

In this website, you can search for Microsoft MVPs (Most Valuable Professional) by a keyword, award category and country or region.

Before we dive into the actual extraction, let’s focus on getting the entire data set (and not just a subset).

Pagination Trick

When we search for “Power BI” MVPs, we can see in the screenshot there are 161 results, but we only see the first page.

Web Scraping in Power BI and Power Query

In such web pages, it is recommended to navigate to the next page, and confirm that you can find the page number as a parameter in the URL. In our example, when you click on the second page, you will find this part in the URL pn=2

With a simple reverse engineering, you can find out that this attribute is used to navigate to the relevant page (pn = page number).

When you find the page number in the URL, it’s a good sign. You can iterate over all the pages with Power Query. There is a great post by Matt Masson here that will show you how to do it, and if you want to iterate over multiple pages with minimal manipulations of the query expression, you can find my blog post here useful. Fortunately, we will not need it in our example.

Web Scraping in Power BI and Power Query

The Page Size Trick

The second trick, is to find in the web page how you can define the page size. In our example, there are 3 links at the bottom right corner of the search results, that allow you to define the search results. Clicking in 40 will expand the page results to 40 MVPs. Luckily, in URL will now include the argument ps=40. We can now try to change the number from 40 to 161 (or more), to get the entire table.

Note: As I am writing this article, the number of Power BI MVP keep growing. We now have 164 MVPs, so my screenshots are not up to date.

Web Scraping in Power BI and Power Query

Now that we have our URL, we can use it in Power BI or Excel Power Query:

https://mvp.microsoft.com/en-us/MvpSearch?&kw=”Power BI”&ps=164&pn=1

Let’s open Power BI Desktop, click Get Data –> Web and paste the URL above. When you are done, click OK.

Web Scraping in Power BI and Power Query

In the Access Web content, select Anonymous and click Connect.

Navigating the HTML Document

In the Navigator dialog, check the Document table and click Edit.

Web Scraping in Power BI and Power Query

From here, we will embark in a journey of drill down to Table objects till we reach our desired table.

Each time we will click a Table object, we will get new rows with Kind, Name, Children and Text columns. To find out the correct path in our tree, we should apply a quick review of the HTML source. The following steps were done using Google Chrome. You can follow the steps on any other browser.

Open Google Chrome with this URL: https://mvp.microsoft.com/en-us/MvpSearch?&kw=”Power BI”&ps=164&pn=1

Hover over one of the MVP names, right click and select Inspect.

Web Scraping in Power BI and Power Query

You can now see how a single element (MVP) in the table is represented in HTML. Notice below how many <div> elements we have. Each <div> node can be considered as a row in the table we would like to import.

Web Scraping in Power BI and Power Query

Let’s scroll up till we identify the <html> tag. The children of <html> are the tags between <html> and the closing tag </html>. We will now drill down to the child <body>, as this is the tag that contains the search results. Before we drill down to <body> in the Query Editor, you can see in the following screenshot (inside the red rectangle) the entire tree-like structure of the tags that encapsulate our data. In the next series of steps, we will navigate the tree-like maze.

Web Scraping in Power BI and Power Query

Back in the Query Editor, let’s click the Table element (We are now in the HTML tag).

Web Scraping in Power BI and Power Query

Remember the <body> tag above? In the Query Editor, find the row with the BODY in column Name. Click on the Table object to drill down to the underlying elements.

Web Scraping in Power BI and Power Query

Now let’s drill down to the children of <body> in Google Chrome. We have a div, header, nav, div and another div, which is the parent tag of our table. So, we will need to drill down to that element in the Query Editor. Note: The collapse/expand icons in Google Chrome will help you to identify the parent tag.

Web Scraping in Power BI and Power Query

Back in the Query Editor, let’s count the children of <body> till we find our div tag. Notice that in Name column we have the values: DIV, HEADER, NAV, DIV, DIV  The latter is our guy. Let’s click on the Table object of that DIV tag (As highlighted in red rectangle below).

Web Scraping in Power BI and Power Query

Now, back in the browser, you will notice that there are four nested DIV tags, each is the first child of the parent DIV.

Web Scraping in Power BI and Power Query

Now we can see a series of tags: link, link, link, link, script, script and div. The div contains our search results. Find the first DIV in the Query Editor and click its Table object.

To match the next three nested DIV tags, click in Table object of the Children column in the first row three times. If by mistake, you click the Table element one more time, you can click the settings icon in APPLIED STEPS pane (as highlighted below), and then in the Navigation dialog, you can roll back to the 7th Children node below Document (as highlighted in the left red rectangle).

Web Scraping in Power BI and Power Query
Click the settings icon to roll back, if you drilled down too many steps.

We can now see in the browser that the next tag that contains our desired data is the second div tag.

Let’s click on the second Table element in the Query Editor that is corresponding to the div tag in the browser.

Next, in the browser we can identify the first div, as the element we should drill down to.

So, let’s click on the one and only Table element in the Query Editor.

Next, in the browser we can identify the second div, as the element we should drill down to.

Back again in the Query Editor, let’s find the second DIV and click its Table element.

Finally, by drilling down the Table objects in the Query Editor that match the tags in the browser, we reached a crucial step. The Children column contains our data.

For the first time, we have reached into a step with a large set of Table objects in the Children column. Now it is a good time to delete the columns Kind, Name and Text.

And let’s remove all the nulls. Since we have Table objects in the column, the Query Editor shows the Expend control instead of the Filter control, so there is no easy access to the filter pane. We can still filter out the nulls. In Home tab, click Remove Rows, and then select Remove Blank Rows.

Now, when we have a single Children column, we have two different method to reach our data.

Method 1 – Expand Children and Text, Remove Blanks and Repeat

In this method, we keep expanding the Children and Text fields from the parent Children column.

If Text column contains only nulls or empty values, we remove that column. Don’t apply a filter on Text to remove blank values. We need to keep the Table objects in Children.

After the second expand, we will get the award category and countries in Text column.

Another expand of Children, and we have the MVP names in Text.1

 

From here you can remove the Children column, merge the Text columns (without using a delimiter), trim the merged column, and reach a single list of MVP names, award categories and countries. We still need to filter out empty values, “Award Category”, “Country or region” and “In the Spotlight” to get the following result:

Now, when we have a single columns with trios of name, award category and country, we can apply a variation of the technique that I shared here.

First, let’s create a new column that will map the data we have into columns of Name, Award Category and Country.

In Add Column tab, click Index Column.

Select the Index column, and click Standard, and then Modulo in Transform tab.

In the Modulo dialog enter 3 as the Value and click OK. Rename the Index column to Column.

Now, let’s define a new column that will map our records into row numbers. In Add Column tab, click Index Column. Select the Index column, and click Standard, and then Integer-Divide in Transform tab.

In the Integer-Divide dialog enter 3 as the Value and click OK. Rename the Index column to Row.

Now, let’s pivot the table. Select the Column column, and click Pivot Column in Transform tab.

In Pivot Column dialog, select Merged as the Values Column, expand Advanced options, and select Don’t Aggregate in Aggregate Value Function drop down menu. When you are done, click OK.

We are almost ready. Remove the first column and rename column 1 to Name, column 2 to Award Category and column 3 to Country.

We can now click Close and Apply in Home tab, and load our MVPs to Power BI.

Method 2 – Apply Query Function

Here is a cleaner method to extract our MVPs. Let’s roll back our query to the first step we removed the blank rows. Here is the query expression after the roll back:

We will now drill down to the first Table object to work on a single MVP entry, and extract the MVP name. Then we will create a generic function that can be reused on all the MVP entries to extract the names.

Right click on the first Table object, and select Add as New Query.

Now, let’s find the MVP name in the HTML. We can see that we need to drill down to the second div tag.

Back to Query Editor, we will click the second Table object that matches our finding above.

Back in Chrome, we can see that the next tag is span.

In the Query Editor, you can find the SPAN tag. Let’s click its Table object.

Finally, we reached the actual tag that contains the MVP name. It’s an <a> tag.

The name of the MVP is wrapped between the opening <a …> and the closing </a>.

Note: In this article, we will not cover the method to get hyperlinks. There is a different method to do it. Unfortunately, in this technique the href attribute is ignored by the Query Editor.

Back in the Query Editor we can click the Table object of tag A

Finally, we found our MVP name in the Text column. We can right click in the cell and select Drill Down.

Now, let’s click Advanced Editor in Home tab. 

Here is the query expression that you will find:

Let’s delete all the rows from Source to Children7:

Now, let’s add a first line:

… and replace Children7 with Source:

We can now rename the new function query as FnGetMVPName, return to the first query and click Invoke Custom Function in Add Column tab.

In Invoke Custom Function dialog, select FnGetMVPName as the Function query, ensure that Children column is selected as Source (optional) and click OK.

Now we have the MVP names in the new column.

To obtain the Award Categories, we can repeat the method that was applied for the names. Use the browser to learn how to find the relevant tag that contains the award category, and navigate to it in the Query Editor. Then create a function that extracts the award.

Here is the final function to get the award category. Let’s call it FnGetAwardCategory:

Repeat similar steps to obtain the country. Here is the final function. Let’s call it FnGetCountry.

Now, use the Invoke Custom Function on the two new functions. Remove the Children column and we are done.

And here is the final expression:

Wrap Up & Conclusions

For basic tables that are implemented with the <table> tag (as typically found on Wikipedia), you can enjoy the built-in capability in Power BI and Power Query in Excel, and navigate directly to the tables. However, in many cases, the tables are dynamic in nature and implemented in a tree-like structure which is not easily accessible through the Query Editor.

In this blog post, we figured out how to navigate the tree-like maze of Children/Table elements and extract dynamic table-like search results from web pages. We demonstrated the technique on the Microsoft MVP website, and showed two methods to extract the data.

We briefly discussed how to refine the search to include all the results in a single page, and shared some pointers that will show you how to iterate over multiple pages, if you cannot get the entire data set in a single call.

Finally, two disclaimers that are worth sharing:

  1. Don’t scrap data from a web page, if it violates the terms of use and copyright of the page owner.
  2. Understand that almost any minor change in the structure of the HTML page, will lead to refresh errors. You will often need to go back to the queries and fix the Children/Table navigation sequence. So, web scraping is perfect for ad-hoc reporting, and less suitable for wide scale production-level reporting.

What’s next in web scraping? How to extract href attributes. Hope you enjoyed this blog post.

 

24 comments

  1. OlaReply

    I’ve done this so many times; drill down/up retry…clean up etc.

    I just wish there was a simpler solution.
    F.ex. an graphical interface – to just point and click at the right level.

  2. STReply

    Hey Gil!

    Nice article!
    I wonder if there is a solution to access password protected websites with power query?
    In case of negative answer, do you know if there is any resolution on this by power query develop team?

    Thank you!

  3. RajeshReply

    Thanks for the walk through.

    Is there a way to scrape the web if the URL is embedded in the dataset I’m loading?

    Regards,
    Rajesh

  4. samReply

    Can this technique be extended to pull data from a Excel File stored on One-drive Personal

  5. STReply

    Hey Gil! Thank you for your feedback. Unfortunately, my data providers are “Giants” in the market they operate in, and so are quite inflexible to the idea of a web API. I’ve had that experience in the past, when we asked for a web service to one of our suppliers to transfer data from they website to our databases with the support of my IT colleagues and all (I work in a financial department). So the web API isn’t a solution for us. Though, when you say “… import to CSV that you can automate using UI capturing tools.” what are those UI tools? Best I could do, was develop some macros in Excel. Those macros login in the website and download the files into our local network. As you can imagine, this code is very very sensitive to website changes and sometimes I have to adjust the code here and there to keep it working. Also, in one of the websites, and because I couldn’t come up to any better solution, I have to use send Keys instructions ( and if you’re a macro coder you know that’s not the best of practices!!). Thank you once more! =)

  6. Pingback: Exploring the best of April update of #PowerBI Desktop - DataChant

  7. samReply

    @Gil – Used to Work – Microsoft blocked this around a year ago

    • Gil Raviv Post authorReply

      Hi Sam,
      Could you elaborate what’s not working, and blocked by Microsoft?

  8. JulianReply

    Hi Gil,

    Remember I’ve been waiting for this post from you since early Feb? Finally I found it the other day. However, to my expectation it’s not enough : In addition to the 3 items covered in this post could you please add two more columns – image url and profile url? You know even with the simple scenario I still wondered how to get the country icons from Wikipedia not to mention the Microsoft MVP website.

    Julian

  9. JulianReply

    I could not get “find the MVP” dataset. Could you share it with me by publishing it?

    • Gil Raviv Post authorReply

      Sorry, not yet. But you can get it by following the steps of this tutorial, and then extract the profile and photo URLs as shown in the next post.

  10. JulianReply

    Good, That’s what I really need to learn instead of just harvesting a finished goods from expert like you. I can’t wait to see your next post. Thank you again.

  11. samReply

    Hi Gil – Are you able to pull data from an Excel File – stored on one drive personal using the URL hack mentioned by Melissa’s blog – Could you share the steps.
    If now – Is it possible to use the Web scrapping techniques mentioned in this blog one Drive personal

    Sam

    • Gil Raviv Post authorReply

      Hi Sam,
      It’s better to move the file to OneDrive for Business and avoid this technique. To check if it is possible, try to find your Excel data in the browser’s HTML source, when you view the source.

  12. Pingback: Vote for Your Favorite Power BI Custom Visuals - DataChant

  13. Pingback: Web Scraping In Power BI And Excel Power Query - Part 2 - Extracting links - DataChant

  14. samReply

    @Frank WOW – WOW – WOW !!!
    This is brilliant !!!! – How on earth did you figure out the way to phrase this kind of a URL from the one that is displayed by default !!! – Genius

Leave a Reply