Web Scraping in Power BI and Excel Power Query


There are abundant public data sources on the Web, and it is so tempting to import and mash them up with your existing datasets in Power BI and Excel Power Query to gain new correlations and insights. While it’s easy to import static tables from many websites in Power BI and Excel Power Query, you’ve probably found out by now that in too many cases, the tables you want to import are illusive, and inaccessible in Power BI.

Have you ever encountered the following dead-end experience, with no tables to import?

In today’s blog post, I will share a technique that will allow you to scrap tables from web pages, by navigating the web page elements in the Query Editor, and mastering the craftsmanship of Children expansion. The good news is that you’ll not need prior knowledge of HTML.

We will demonstrate the technique by web scraping the Microsoft Find an MVP search results here.

Web Scraping in Power BI and Power Query

To motivate you to keep reading, here is the Power BI report, that I could prepare with the data that was extracted in the page above.

The simple scenario

Before we start our challenge, let’s briefly review the simple scenario, when the import of tables is easy. We will demonstrate it with this population table from Wikipedia (Yes, most, if not all, of the tables in Wikipedia are easily imported to Power BI). Feel free to skip this part if you are familiar with the Web connector in Power BI or Power Query in Excel.

So, here is a screenshot of the table we are going to import from here:

Web Scraping in Power BI and Power Query

Open Power BI Desktop, and click Get Data, then click Web. If you are using Excel (with Power Query Add-In installed) you can find the Web connector in the Power Query tab. If you have Excel 2016, click Data –> New Query –> From Other Sources –> From Web. We will use Power BI Desktop in this tutorial, but the steps are the same in Excel.

Web Scraping in Power BI and Power Query

In From Web dialog, paste the URL below and click OK.

https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)

Web Scraping in Power BI and Power Query

In Navigator dialog, select the table Countries and areas ranked by population… and click Edit, to clean the data in the Query Editor before it is loaded to your report.

Web Scraping in Power BI and Power Query

The example above was simple. Power Query identifies the table in the HTML page, and allows you to import it in a relatively easy experience. You can find many other simple web pages to import. These web pages are usually public. They will not require to log in, and their tables are implemented using the <table>…</table> tags.

Unfortunately, many tables and search results in HTML pages are implemented in different ways.

When you try to import the challenging websites in Power BI or Excel Power Query, you will end up having the Document entity in the Navigator. In the next part of this article, we will demonstrate how to extract tables through by editing the query and extracting the relevant elements in the HTML.

Web Scraping in Power BI and Power Query

Web Scraping

In this part, we will learn how to import a table from web pages, and demonstrate it on the Microsoft MVP website: https://mvp.microsoft.com/en-us/MvpSearch

In this website, you can search for Microsoft MVPs (Most Valuable Professional) by a keyword, award category and country or region.

Before we dive into the actual extraction, let’s focus on getting the entire data set (and not just a subset).

Pagination Trick

When we search for “Power BI” MVPs, we can see in the screenshot there are 161 results, but we only see the first page.

Web Scraping in Power BI and Power Query

In such web pages, it is recommended to navigate to the next page, and confirm that you can find the page number as a parameter in the URL. In our example, when you click on the second page, you will find this part in the URL pn=2

With a simple reverse engineering, you can find out that this attribute is used to navigate to the relevant page (pn = page number).

When you find the page number in the URL, it’s a good sign. You can iterate over all the pages with Power Query. There is a great post by Matt Masson here that will show you how to do it, and if you want to iterate over multiple pages with minimal manipulations of the query expression, you can find my blog post here useful. Fortunately, we will not need it in our example.

Web Scraping in Power BI and Power Query

The Page Size Trick

The second trick, is to find in the web page how you can define the page size. In our example, there are 3 links at the bottom right corner of the search results, that allow you to define the search results. Clicking in 40 will expand the page results to 40 MVPs. Luckily, in URL will now include the argument ps=40. We can now try to change the number from 40 to 161 (or more), to get the entire table.

Note: As I am writing this article, the number of Power BI MVP keep growing. We now have 164 MVPs, so my screenshots are not up to date.

Web Scraping in Power BI and Power Query

Now that we have our URL, we can use it in Power BI or Excel Power Query:

https://mvp.microsoft.com/en-us/MvpSearch?&kw=”Power BI”&ps=164&pn=1

Let’s open Power BI Desktop, click Get Data –> Web and paste the URL above. When you are done, click OK.

Web Scraping in Power BI and Power Query

In the Access Web content, select Anonymous and click Connect.

Navigating the HTML Document

In the Navigator dialog, check the Document table and click Edit.

Web Scraping in Power BI and Power Query

From here, we will embark in a journey of drill down to Table objects till we reach our desired table.

Each time we will click a Table object, we will get new rows with Kind, Name, Children and Text columns. To find out the correct path in our tree, we should apply a quick review of the HTML source. The following steps were done using Google Chrome. You can follow the steps on any other browser.

Open Google Chrome with this URL: https://mvp.microsoft.com/en-us/MvpSearch?&kw=”Power BI”&ps=164&pn=1

Hover over one of the MVP names, right click and select Inspect.

Web Scraping in Power BI and Power Query

You can now see how a single element (MVP) in the table is represented in HTML. Notice below how many <div> elements we have. Each <div> node can be considered as a row in the table we would like to import.

Web Scraping in Power BI and Power Query

Let’s scroll up till we identify the <html> tag. The children of <html> are the tags between <html> and the closing tag </html>. We will now drill down to the child <body>, as this is the tag that contains the search results. Before we drill down to <body> in the Query Editor, you can see in the following screenshot (inside the red rectangle) the entire tree-like structure of the tags that encapsulate our data. In the next series of steps, we will navigate the tree-like maze.

Web Scraping in Power BI and Power Query

Back in the Query Editor, let’s click the Table element (We are now in the HTML tag).

Web Scraping in Power BI and Power Query

Remember the <body> tag above? In the Query Editor, find the row with the BODY in column Name. Click on the Table object to drill down to the underlying elements.

Web Scraping in Power BI and Power Query

Now let’s drill down to the children of <body> in Google Chrome. We have a div, header, nav, div and another div, which is the parent tag of our table. So, we will need to drill down to that element in the Query Editor. Note: The collapse/expand icons in Google Chrome will help you to identify the parent tag.

Web Scraping in Power BI and Power Query

Back in the Query Editor, let’s count the children of <body> till we find our div tag. Notice that in Name column we have the values: DIV, HEADER, NAV, DIV, DIV  The latter is our guy. Let’s click on the Table object of that DIV tag (As highlighted in red rectangle below).

Web Scraping in Power BI and Power Query

Now, back in the browser, you will notice that there are four nested DIV tags, each is the first child of the parent DIV.

Web Scraping in Power BI and Power Query

Now we can see a series of tags: link, link, link, link, script, script and div. The div contains our search results. Find the first DIV in the Query Editor and click its Table object.

To match the next three nested DIV tags, click in Table object of the Children column in the first row three times. If by mistake, you click the Table element one more time, you can click the settings icon in APPLIED STEPS pane (as highlighted below), and then in the Navigation dialog, you can roll back to the 7th Children node below Document (as highlighted in the left red rectangle).

Web Scraping in Power BI and Power Query
Click the settings icon to roll back, if you drilled down too many steps.

We can now see in the browser that the next tag that contains our desired data is the second div tag.

Let’s click on the second Table element in the Query Editor that is corresponding to the div tag in the browser.

Next, in the browser we can identify the first div, as the element we should drill down to.

So, let’s click on the one and only Table element in the Query Editor.

Next, in the browser we can identify the second div, as the element we should drill down to.

Back again in the Query Editor, let’s find the second DIV and click its Table element.

Finally, by drilling down the Table objects in the Query Editor that match the tags in the browser, we reached a crucial step. The Children column contains our data.

For the first time, we have reached into a step with a large set of Table objects in the Children column. Now it is a good time to delete the columns Kind, Name and Text.

And let’s remove all the nulls. Since we have Table objects in the column, the Query Editor shows the Expend control instead of the Filter control, so there is no easy access to the filter pane. We can still filter out the nulls. In Home tab, click Remove Rows, and then select Remove Blank Rows.

Now, when we have a single Children column, we have two different method to reach our data.

Method 1 – Expand Children and Text, Remove Blanks and Repeat

In this method, we keep expanding the Children and Text fields from the parent Children column.

If Text column contains only nulls or empty values, we remove that column. Don’t apply a filter on Text to remove blank values. We need to keep the Table objects in Children.

After the second expand, we will get the award category and countries in Text column.

Another expand of Children, and we have the MVP names in Text.1

From here you can remove the Children column, merge the Text columns (without using a delimiter), trim the merged column, and reach a single list of MVP names, award categories and countries. We still need to filter out empty values, “Award Category”, “Country or region” and “In the Spotlight” to get the following result:

Now, when we have a single columns with trios of name, award category and country, we can apply a variation of the technique that I shared here.

First, let’s create a new column that will map the data we have into columns of Name, Award Category and Country.

In Add Column tab, click Index Column.

Select the Index column, and click Standard, and then Modulo in Transform tab.

In the Modulo dialog enter 3 as the Value and click OK. Rename the Index column to Column.

Now, let’s define a new column that will map our records into row numbers. In Add Column tab, click Index Column. Select the Index column, and click Standard, and then Integer-Divide in Transform tab.

In the Integer-Divide dialog enter 3 as the Value and click OK. Rename the Index column to Row.

Now, let’s pivot the table. Select the Column column, and click Pivot Column in Transform tab.

In Pivot Column dialog, select Merged as the Values Column, expand Advanced options, and select Don’t Aggregate in Aggregate Value Function drop down menu. When you are done, click OK.

We are almost ready. Remove the first column and rename column 1 to Name, column 2 to Award Category and column 3 to Country.

We can now click Close and Apply in Home tab, and load our MVPs to Power BI.

Method 2 – Apply Query Function

Here is a cleaner method to extract our MVPs. Let’s roll back our query to the first step we removed the blank rows. Here is the query expression after the roll back:

let
    Source = Web.Page(Web.Contents("https://mvp.microsoft.com/en-us/MvpSearch?&kw=""Power BI""&ps=164&pn=1")),
    Data0 = Source{0}[Data],
    Children0 = Data0{0}[Children],
    Children2 = Children0{2}[Children],
    Children10 = Children2{10}[Children],
    Children13 = Children10{13}[Children],
    Children1 = Children13{1}[Children],
    Children3 = Children1{1}[Children],
    Children4 = Children3{1}[Children],
    Children = Children4{3}[Children],
    Children5 = Children{1}[Children],
    Children6 = Children5{3}[Children],
    #"Removed Columns" = Table.RemoveColumns(Children6,{"Kind", "Name", "Text"}),
    #"Removed Blank Rows" = Table.SelectRows(#"Removed Columns", each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {"", null})))
in
    #"Removed Blank Rows"

We will now drill down to the first Table object to work on a single MVP entry, and extract the MVP name. Then we will create a generic function that can be reused on all the MVP entries to extract the names.

Right click on the first Table object, and select Add as New Query.

Now, let’s find the MVP name in the HTML. We can see that we need to drill down to the second div tag.

Back to Query Editor, we will click the second Table object that matches our finding above.

Back in Chrome, we can see that the next tag is span.

In the Query Editor, you can find the SPAN tag. Let’s click its Table object.

Finally, we reached the actual tag that contains the MVP name. It’s an <a> tag.

The name of the MVP is wrapped between the opening <a …> and the closing </a>.

Note: In this article, we will not cover the method to get hyperlinks. There is a different method to do it. Unfortunately, in this technique the href attribute is ignored by the Query Editor.

Back in the Query Editor we can click the Table object of tag A

Finally, we found our MVP name in the Text column. We can right click in the cell and select Drill Down.

Now, let’s click Advanced Editor in Home tab. 

Here is the query expression that you will find:

let
    Source = Web.Page(Web.Contents("https://mvp.microsoft.com/en-us/MvpSearch?&kw=""Power BI""&ps=164&pn=1")),
    Data0 = Source{0}[Data],
    Children0 = Data0{0}[Children],
    Children2 = Children0{2}[Children],
    Children10 = Children2{10}[Children],
    Children13 = Children10{13}[Children],
    Children1 = Children13{1}[Children],
    Children3 = Children1{1}[Children],
    Children4 = Children3{1}[Children],
    Children = Children4{3}[Children],
    Children5 = Children{1}[Children],
    Children6 = Children5{3}[Children],
    #"Removed Columns" = Table.RemoveColumns(Children6,{"Kind", "Name", "Text"}),
    #"Removed Blank Rows" = Table.SelectRows(#"Removed Columns", each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {"", null}))),
    Children7 = #"Removed Blank Rows"{0}[Children],
    Children8 = Children7{3}[Children],
    Children9 = Children8{1}[Children],
    Children11 = Children9{1}[Children],
    Text = Children11{0}[Text]
in
    Text

Let’s delete all the rows from Source to Children7:

let
    Children8 = Children7{3}[Children],
    Children9 = Children8{1}[Children],
    Children11 = Children9{1}[Children],
    Text = Children11{0}[Text]
in
    Text

Now, let’s add a first line:

(Source)=>

… and replace Children7 with Source:

(Source)=>
let
    Children8 = Source{3}[Children],
    Children9 = Children8{1}[Children],
    Children11 = Children9{1}[Children],
    Text = Children11{0}[Text]
in
    Text

We can now rename the new function query as FnGetMVPName, return to the first query and click Invoke Custom Function in Add Column tab.

In Invoke Custom Function dialog, select FnGetMVPName as the Function query, ensure that Children column is selected as Source (optional) and click OK.

Now we have the MVP names in the new column.

To obtain the Award Categories, we can repeat the method that was applied for the names. Use the browser to learn how to find the relevant tag that contains the award category, and navigate to it in the Query Editor. Then create a function that extracts the award.

Here is the final function to get the award category. Let’s call it FnGetAwardCategory:

(Source) =>
let
    Children8 = Source{5}[Children],
    Children9 = Children8{5}[Children],
    Text1 = Children9{0}[Text],
    #"Trimmed Text" = Text.Trim(Text1)
in
    #"Trimmed Text"

Repeat similar steps to obtain the country. Here is the final function. Let’s call it FnGetCountry.

(Source)=>
let
    Children8 = Source{7}[Children],
    Children9 = Children8{5}[Children],
    Text = Children9{0}[Text]
in
    Text

Now, use the Invoke Custom Function on the two new functions. Remove the Children column and we are done.

And here is the final expression:

let
    Source = Web.Page(Web.Contents("https://mvp.microsoft.com/en-us/MvpSearch?&kw=""Power BI""&ps=164&pn=1")),
    Data0 = Source{0}[Data],
    Children0 = Data0{0}[Children],
    Children2 = Children0{2}[Children],
    Children10 = Children2{10}[Children],
    Children13 = Children10{13}[Children],
    Children1 = Children13{1}[Children],
    Children3 = Children1{1}[Children],
    Children4 = Children3{1}[Children],
    Children = Children4{3}[Children],
    Children5 = Children{1}[Children],
    Children6 = Children5{3}[Children],
    #"Removed Columns" = Table.RemoveColumns(Children6,{"Kind", "Name", "Text"}),
    #"Removed Blank Rows" = Table.SelectRows(#"Removed Columns", each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {"", null}))),
    #"Invoked Custom Function" = Table.AddColumn(#"Removed Blank Rows", "MVP Name", each FnGetMVPName([Children])),
    #"Invoked Custom Function1" = Table.AddColumn(#"Invoked Custom Function", "Award Category", each FnGetAwardCategory([Children])),
    #"Invoked Custom Function2" = Table.AddColumn(#"Invoked Custom Function1", "Country", each FnGetCountry([Children])),
    #"Removed Columns2" = Table.RemoveColumns(#"Invoked Custom Function2",{"Children"})
in
    #"Removed Columns2"

Wrap Up & Conclusions

For basic tables that are implemented with the <table> tag (as typically found on Wikipedia), you can enjoy the built-in capability in Power BI and Power Query in Excel, and navigate directly to the tables. However, in many cases, the tables are dynamic in nature and implemented in a tree-like structure which is not easily accessible through the Query Editor.

In this blog post, we figured out how to navigate the tree-like maze of Children/Table elements and extract dynamic table-like search results from web pages. We demonstrated the technique on the Microsoft MVP website, and showed two methods to extract the data.

We briefly discussed how to refine the search to include all the results in a single page, and shared some pointers that will show you how to iterate over multiple pages, if you cannot get the entire data set in a single call.

Finally, two disclaimers that are worth sharing:

  1. Don’t scrap data from a web page, if it violates the terms of use and copyright of the page owner.
  2. Understand that almost any minor change in the structure of the HTML page, will lead to refresh errors. You will often need to go back to the queries and fix the Children/Table navigation sequence. So, web scraping is perfect for ad-hoc reporting, and less suitable for wide scale production-level reporting.

What’s next in web scraping? How to extract href attributes. Hope you enjoyed this blog post.

67 comments

  1. Ola Reply

    I’ve done this so many times; drill down/up retry…clean up etc.

    I just wish there was a simpler solution.
    F.ex. an graphical interface – to just point and click at the right level.

  2. ST Reply

    Hey Gil!

    Nice article!
    I wonder if there is a solution to access password protected websites with power query?
    In case of negative answer, do you know if there is any resolution on this by power query develop team?

    Thank you!

  3. Rajesh Reply

    Thanks for the walk through.

    Is there a way to scrape the web if the URL is embedded in the dataset I’m loading?

    Regards,
    Rajesh

  4. sam Reply

    Can this technique be extended to pull data from a Excel File stored on One-drive Personal

  5. ST Reply

    Hey Gil! Thank you for your feedback. Unfortunately, my data providers are “Giants” in the market they operate in, and so are quite inflexible to the idea of a web API. I’ve had that experience in the past, when we asked for a web service to one of our suppliers to transfer data from they website to our databases with the support of my IT colleagues and all (I work in a financial department). So the web API isn’t a solution for us. Though, when you say “… import to CSV that you can automate using UI capturing tools.” what are those UI tools? Best I could do, was develop some macros in Excel. Those macros login in the website and download the files into our local network. As you can imagine, this code is very very sensitive to website changes and sometimes I have to adjust the code here and there to keep it working. Also, in one of the websites, and because I couldn’t come up to any better solution, I have to use send Keys instructions ( and if you’re a macro coder you know that’s not the best of practices!!). Thank you once more! =)

  6. Pingback: Exploring the best of April update of #PowerBI Desktop - DataChant

  7. sam Reply

    @Gil – Used to Work – Microsoft blocked this around a year ago

    • Gil Raviv Post authorReply

      Hi Sam,
      Could you elaborate what’s not working, and blocked by Microsoft?

  8. Julian Reply

    Hi Gil,

    Remember I’ve been waiting for this post from you since early Feb? Finally I found it the other day. However, to my expectation it’s not enough : In addition to the 3 items covered in this post could you please add two more columns – image url and profile url? You know even with the simple scenario I still wondered how to get the country icons from Wikipedia not to mention the Microsoft MVP website.

    Julian

  9. Julian Reply

    I could not get “find the MVP” dataset. Could you share it with me by publishing it?

    • Gil Raviv Post authorReply

      Sorry, not yet. But you can get it by following the steps of this tutorial, and then extract the profile and photo URLs as shown in the next post.

  10. Julian Reply

    Good, That’s what I really need to learn instead of just harvesting a finished goods from expert like you. I can’t wait to see your next post. Thank you again.

  11. sam Reply

    Hi Gil – Are you able to pull data from an Excel File – stored on one drive personal using the URL hack mentioned by Melissa’s blog – Could you share the steps.
    If now – Is it possible to use the Web scrapping techniques mentioned in this blog one Drive personal

    Sam

    • Gil Raviv Post authorReply

      Hi Sam,
      It’s better to move the file to OneDrive for Business and avoid this technique. To check if it is possible, try to find your Excel data in the browser’s HTML source, when you view the source.

  12. Pingback: Vote for Your Favorite Power BI Custom Visuals - DataChant

  13. Pingback: Web Scraping In Power BI And Excel Power Query - Part 2 - Extracting links - DataChant

  14. sam Reply

    @Frank WOW – WOW – WOW !!!
    This is brilliant !!!! – How on earth did you figure out the way to phrase this kind of a URL from the one that is displayed by default !!! – Genius

  15. Chris Speights Reply

    @Gil – Using this method, when the table you want is in the default landing page, it works great. But, let’s say that you are going to a site that has a different table for each month of the year (and the URL doesn’t change based on the month). The default landing page and table is January. But, if I am wanting to drill into the data and find the table for March, how is that done?

    • Gil Raviv Post authorReply

      Can you share the page? If the page URL is missing the month information, I would try some reverse engineering, and check the HTTP headers. Perhaps the data will be in a cookie. But most likely, you will not be able to get this data.

  16. Philippe Reply

    @Gil – Thanks for your blog and detailed explanation! I am facing the following challenge: I need to scrap the data from over 1,000 webpages. Each page covers technical data of a certain vessel. I have the URLs of each webpage and if the pages would have been built up of tables… no problem. But the webpages do not use tables. Your tutorial helped me to grab the data of one webpage and now I duplicate the query and replace the URL in the source line of the query code. But, well, doing this a thousand times seems ridiculous. Do you know a way?

  17. Jeff Weir Reply

    This is fantastic. Here’s a problem though: different computers might see different element structure in the HTML, meaning a file that works on one might not work on another.

    A query I wrote to scrape info from the Excel Uservoice site works on the PC I constructed it on, but not on the other PC I subsequently sent the file to. Looking into this, I see that HTML structure for the same site differs between computers, meaning the query bombs out.

    On the first computer, Inspect Element shows that the HTML structure is identical between IE and Chrome browsers. Whereas on the second computer the IE structure differs from that seen under either IE or Chrome on the first comptuer.

    So I presume both machines use some kind of IE browser, and that one of browsers was an earlier version. Subsequently:
    * A file that works on PC1 gives the error message “[Expression.Error] We cannot convert the value null to type List.” when refreshed on PC2
    * A file that works fine on PC2 says “[Expression.Error] There weren’t enough elements in the enumeration to complete the operation” when refreshed on PC1

    • Gil Raviv Post authorReply

      Hi Jeff,
      Thank you for the feedback. You shared a great scenario. First thing you can try, is to use Web Sniffer like Fiddler to compare the HTTP requests and responses in the two computers. The immediate suspect is the use of Cookies. You can reset the cookies, or use the same one in all computers by setting hte HTTP header. In my blog post in the link below, I showed how to set the HTTP headers to remove cached content, which by the way, can also cause your problem, if one computer is using a coprorate web proxy with cached data. https://datachant.com/2016/12/20/how-to-enforce-power-querypower-bi-to-import-uncached-data-from-the-web/

      To ensure you use the same cookie across different computer, you can detect the desired cookie using Fiddler, and then setup the cookie in Web.Contents(“http://your-website.com”, [Headers=[Cookie=”my-cookie”]]). The easiest approach will be to rmeove all cookies and ansure that it works for you. I think you can remove the cookies as follows: Web.Contents(“http://your-website.com”, [Headers=[Cookie=””]])

      • Jeff Weir

        Okay…a little over my head with this. Adding [Headers=[CacheControl=”no-cache, no-store, must-revalidate”]] to my query doesn’t break it, but doesn’t fix the issue on the other computer when I email the file across. And replacing that snippet with [Headers=[CacheControl=”no-cache, no-store, must-revalidate”]] returns the error “[DataSource.Error] An unknown error occurred when navigating to the web page.” on both computers, so I presume the syntax isn’t quite right.

        I also note that on the second computer, using Inspect Element shows that *all* the web browsers I have tested differ from the ‘schema’ structure shown in PowerQuery. So from this, I presume that PowerQuery uses a built in browser in Excel, and so I’m unsure whether it is possible to manually clear cookies from that browser.

        Any thoughts? Happy to email you a couple of files. I’m trying to scrape all the data from the Excel UserVoice site…thought it would make for an interesting blogpost. The complete function I’m using is below. Note that I’ve pramatized it so that it accepts both a PageNumber and a Category. I’m using 143376 to test this (PivotTables) and you can see how the url is constructed by the below link:

        https://excel.uservoice.com/forums/304921-excel-for-windows-desktop-application/category/143376/filters/top?page=2

        (pageNum as number, catNum as number)
        =>
        let
        Source = Web.Page(Web.Contents(“https://excel.uservoice.com/forums/304921-excel-for-windows-desktop-application/category/” & Number.ToText(catNum) & “/filters/top?page=” & Number.ToText(pageNum), [Headers=[Cookies=””]] )),
        Data0 = Source{0}[Data],
        Children0 = Data0{0}[Children],
        Children2 = Children0{2}[Children],
        Children3 = Children2{3}[Children],
        Children1 = Children3{3}[Children],
        Children4 = Children1{1}[Children],
        Children5 = Children4{1}[Children],
        Children6 = Children5{2}[Children],
        Children10 = Children6{10}[Children],
        Children7 = Children10{3}[Children],
        #”Removed Other Columns” = Table.SelectColumns(Children7,{“Children”}),
        #”Removed Blank Rows” = Table.SelectRows(#”Removed Other Columns”, each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {“”, null}))),
        #”Expanded Children” = Table.ExpandTableColumn(#”Removed Blank Rows”, “Children”, {“Children”, “Text”}, {“Children.Children”, “Children.Text”}),
        #”Removed Columns” = Table.RemoveColumns(#”Expanded Children”,{“Children.Text”}),
        #”Expanded Children.Children” = Table.ExpandTableColumn(#”Removed Columns”, “Children.Children”, {“Children”, “Text”}, {“Children.Children.Children”, “Children.Children.Text”}),
        #”Removed Columns1″ = Table.RemoveColumns(#”Expanded Children.Children”,{“Children.Children.Text”}),
        #”Expanded Children.Children.Children” = Table.ExpandTableColumn(#”Removed Columns1″, “Children.Children.Children”, {“Children”, “Text”}, {“Children.Children.Children.Children”, “Children.Children.Children.Text”}),
        #”Expanded Children.Children.Children.Children” = Table.ExpandTableColumn(#”Expanded Children.Children.Children”, “Children.Children.Children.Children”, {“Children”, “Text”}, {“Children.Children.Children.Children.Children”, “Children.Children.Children.Children.Text”}),
        #”Removed Columns2″ = Table.RemoveColumns(#”Expanded Children.Children.Children.Children”,{“Children.Children.Children.Children.Children”}),
        #”Trimmed Text” = Table.TransformColumns(#”Removed Columns2″,{{“Children.Children.Children.Children.Text”, Text.Trim}, {“Children.Children.Children.Text”, Text.Trim}}),
        #”Cleaned Text” = Table.TransformColumns(#”Trimmed Text”,{},Text.Clean),
        #”Renamed Columns” = Table.RenameColumns(#”Cleaned Text”,{{“Children.Children.Children.Children.Text”, “1”}, {“Children.Children.Children.Text”, “2”}}),
        #”Filtered Rows” = Table.SelectRows(#”Renamed Columns”, each ([2] “” and [2] “responded” and [2] “·”) and ([1] “” and [1] “Admin →” and [1] “Delete…” and [1] “Flag idea as inappropriate…” and [1] “votes” and [1] “vote” and [1] “·”)),
        #”Added Custom” = Table.AddColumn(#”Filtered Rows”, “Custom”, each if [1] null then [1] else [2]),
        #”Removed Other Columns1″ = Table.SelectColumns(#”Added Custom”,{“Custom”}),
        #”Filtered Rows1″ = Table.SelectRows(#”Removed Other Columns1″, each ([Custom] null)),
        #”Added Index” = Table.AddIndexColumn(#”Filtered Rows1″, “Index”, 0, 1),
        #”Inserted Modulo” = Table.AddColumn(#”Added Index”, “Inserted Modulo”, each Number.Mod([Index], 4), type number),
        #”Renamed Columns1″ = Table.RenameColumns(#”Inserted Modulo”,{{“Custom”, “Merged”}, {“Inserted Modulo”, “Column”}}),
        #”Integer-Divided Column” = Table.TransformColumns(#”Renamed Columns1″, {{“Index”, each Number.IntegerDivide(_, 4), Int64.Type}}),
        #”Renamed Columns2″ = Table.RenameColumns(#”Integer-Divided Column”,{{“Index”, “Row”}}),
        #”Reordered Columns” = Table.ReorderColumns(#”Renamed Columns2″,{“Merged”, “Column”, “Row”}),
        #”Pivoted Column” = Table.Pivot(Table.TransformColumnTypes(#”Reordered Columns”, {{“Column”, type text}}, “en-NZ”), List.Distinct(Table.TransformColumnTypes(#”Reordered Columns”, {{“Column”, type text}}, “en-NZ”)[Column]), “Column”, “Merged”),
        #”Removed Columns3″ = Table.RemoveColumns(#”Pivoted Column”,{“Row”}),
        #”Renamed Columns3″ = Table.RenameColumns(#”Removed Columns3″,{{“0”, “Title”}, {“1”, “Votes”}, {“2”, “Comments”}, {“3”, “Category”}}),
        #”Replaced Value” = Table.ReplaceValue(#”Renamed Columns3″,” comments”,””,Replacer.ReplaceText,{“Comments”}),
        #”Replaced Value1″ = Table.ReplaceValue(#”Replaced Value”,”comment”,””,Replacer.ReplaceText,{“Comments”}),
        #”Changed Type” = Table.TransformColumnTypes(#”Replaced Value1″,{{“Comments”, Int64.Type}, {“Votes”, Int64.Type}})
        in
        #”Changed Type”

  18. jeffrey Weir Reply

    Awesome…thanks Gil. Just after I posted this, one of my colleagues noticed that on his PC there was a banner to do with cookies, and that was the next avenue I was going to tackle. I’ll let you know how I get on. Awesome resource, this.

    • Gil Raviv Post authorReply

      Are you asking about the ability to post comments below the home page? Or to see all recent comments across all pages? Are you looking for some information, or want to share?

  19. Kevin Reply

    This is great, but I am wondering if anyone has any shortcuts on getting to the right place on the tree. I am pulling from a website that I just can’t get the correct branch on.

  20. Anonymous Reply

    Hi Gil,
    Thank you for the post, Can I grab data from a published Power BI report to web by using this technique?, or maybe shall I use some extension to the browser, if it is so, do you know any?

  21. Majd Yazigi Reply

    Thank you Gil, can I use this technique to extract data from Power BI published reports to web, or should I use a capture extension on Chrome for example, in case yes do you know any good one?
    thank you again

    • Gil Raviv Post authorReply

      I am not sure you can. With the rapid changes you can expect to have in the page, what is the point in scraping it?

  22. LENZY Reply

    Great Article! Thank You!
    The following is a link to a website that I use to find “interchangeable or equivalent” parts. For example, if I enter another company’s part number on the Federal Mogul website, it will give me the equivalent Federal Mogul part number.
    1. http://www.fme-cat.com/index.aspx

    My issue is that I have to search 100s of part numbers, and can only do one part number at a time on the website. Is there a way to use web scraping to get a “table” of Federal Mogul part numbers and the “interchangeable or equivalent” part numbers from the other companies for which Federal Mogul has a matching part number?

    Thanks

  23. Anonymous Reply

    Hi, thank you very much for the article. I am a beginner and would like to extract data from google reviews. Please assist with an example of how to get this data into a table?
    Thank you 🙂

    https://www.google.co.za/search?q=cafe%20capellini&oq=cafe+c&aqs=chrome.1.69i57j0l5.4768j0j7&sourceid=chrome&ie=UTF-8&safe=active&npsic=0&rflfq=1&rlha=0&rllag=-33964988,25549126,2622&tbm=lcl&rldimm=11251643037476930108&ved=0ahUKEwjUqbf14t7bAhWF1RQKHbnQAncQvS4ITDAA&rldoc=1&tbs=lrf:!2m4!1e17!4m2!17m1!1e2!2m1!1e2!2m1!1e3!3sIAE,lf:1,lf_ui:9#rlfi=hd:;si:11251643037476930108;mv:!1m3!1d16476.42056361036!2d25.54546425!3d-33.964988600000005!2m3!1f0!2f0!3f0!3m2!1i333!2i239!4f13.1;tbs:lrf:!2m1!1e2!2m1!1e3!2m4!1e17!4m2!17m1!1e2!3sIAE,lf:1,lf_ui:9

  24. Fah Zah Reply

    Quite an interesting post for web scraping and thank you for sharing!

    As of now, the total number of MVPs is 233 persons. From what I understand, the results loading is manual in the suggested process so in your example the ps=164 and now it should be ps=233

    what if we assume that the number of MVPs is updated regularly e.g. weekly and we want to capture the latest #. if we do it like this, we would be missing the new MVPs details. correct? also, don’t you think that this “drilling” is time consuming vs the scraping using parameters?

    What if we used parameters and capture all the pages ? how we could capture the latest #?

    • Fah Zah Reply

      I just realized that the MVP results does not load at all so the parameter solution I guess it would not work in that page. So, we definitely need to master your suggested approach! 🙂

  25. Norman Reply

    Great guide.
    I tried to follow it, however I could not get to the result I wanted.
    When I drill down to the section I am after, I just get to a dead end… the section I want doesn’t have a table…
    I can drill down through the five DIV’s, but then it stops
    I am trying to grab the 5 rows from “Dividend growth summary” on this page:
    https://seekingalpha.com/symbol/MSFT/dividends/dividend-growth

    • Gil Raviv Post authorReply

      Hi Norman

      When I tried solving this challenge, I found out that for some reason Web.Page is not consistent and misses the relevant DIV elements in your page. Then I found out that the new method Web.BrowserContents is doing a better job https://docs.microsoft.com/en-us/powerquery-m/web-browsercontents

      I sent you my solution including two methods to extract the data and an additional function to get more than one stock.
      Hope you find it useful. I may blog about it someday 🙂

  26. Christian Reply

    Hi Gil-

    Want to say thank you for your excellent book. Its content has helped a few people around here substantially. Thank you!

    I have a scenario. We’re trying to scrape company reviews off of GlassDoor for an ad hoc HR project. GlassDoor has a feature that generates some relevant evaluative phrases from the review body, stored in an HTML element called a “FlexGrid.” Each phrase (“Recommends”, “Great Pay”, “Bad Management”, etc.) is a separate element in the FlexGrid element, and there’s 1 FlexGrid per review. I’ve attempted to pull those phrases with a List.Generate calc column that evaluates the number of keyword Children in the FlexGrid for every review and applies a drilldown sequence to get at the phrases in each Child and make that an item in a Flexgrid phrase list for that review.

    In the below, [Review Body] is a calc column containing a table with elements for the FlexGrid (if present), review body text and elements for other stuff. My List.Generate calc column is returning lists with 1 error item that runs “Expression.Error. The field ‘Review Body’ of the record wasn’t found. Details: count =0).” I think the issue has to do with the use of brackets and with scope changes when referring to a Column inside of List.Generate. Any comments on how I can modify the code to get what I want? Thanks.

    if List.Count(List.FindText([Review Body][Name], “DIV”)) > 1 then // 2 DIVs mean FlexGrid is present

    List.Generate(

    ()=>[count = 0],

    each [count] < List.Count([Review Body][Children]{0}[Children]), // # of children in FlexGrid

    each [count = [count] + 1],

    each [Review Body][Children]{0}[Children]{[count]}[Children]{0}[Children]{1}[Text]{0}) //drillthrough to FlexGrid child text

    else null

    • Gil Raviv Post authorReply

      Hi Christian,
      First of all, thank you for reading the book. I am glad it helped you and your colleagues. Would be awesome if you share your review on Amazaon.
      Regarding the web scraping. I am not sure I can find sufficient time to look at your code and help, but feel free to send it by email: gilra@datachant.com.
      I recommend that you won’t do the web scarping on Glassdoor, and instead consider using their APIs https://www.glassdoor.com/developer/index.htm. For two reasons: 1. Not sure web scraping is meeting their terms of use. 2. Web sites can change. APIs are a better choice.

  27. Christian Reply

    Your book’s been a real resource. Let me know if you have a preferred method of shout-out or social media support!

    I think my question above on methods to run multiple parallel drill-drowns on one table to get at similar elements that are at the same stepwise depth is important here. Thanks again.

  28. Mark Reply

    Hi Gil Raviv, I’m trying to get data from yahoo finance into PBI. I need the Income Statement (Quarterly data) for Apple from this link: https://finance.yahoo.com/quote/AAPL/financials?p=AAPL&amp;.tsrc=fin-srch

    Any idea on how to manipulate the table selector (Annual /Quarterly) to extract Quarterly data instead of Annual data which appears in the initial page?

    Regards,
    Mark

    • Gil Raviv Post authorReply

      Hi Mark
      I am sorry, but I don’t have the bandwidth to analyze this website and provide any advise, beyond this one: Please consider using Web APIs instead of web scraping. Especially for public financial data.
      Good luck,
      Gil

  29. Rajender Singh Negi Reply

    Hi Gil,

    I have a query , Suppose if are dynamically importing the data from multiple web pages. But in my case there may be cases where particular site link no more valid/ exist in that case , How can I modify the query so that it ignore if any source data is not exist ?
    Below is the M-Code which I am using :-
    ——————————————————————————————————————————————-
    let GetResults=(URL) =>
    let
    Source = Json.Document(Web.Contents(“https://XX-XX.XXXXXX.XXXXXXX.com/unknown/unknown/people/search/”& URL)),
    #”Converted to Table” = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #”Expanded Column1″ = Table.ExpandRecordColumn(#”Converted to Table”, “Column1”, {“profileID”, “chargedCC”, “managerProfileID”, “managerFullName”, “legalManagerFullName”, “assistantProfileID”, “assistantFullName”, “companyName”, “companyType”, “jobArea”, “jobFamily”, “job”, “firstName”, “lastName”, “localFirstName”, “localLastName”, “fullName”, “localName”, “company”, “department”, “telephone”, “telephoneSource”, “mobile”, “mobileSource”, “fax”, “ecnExtension”, “email”, “emailSource”, “displayName”, “roomNumber”, “jobTitle”, “positionId”, “costCentre”, “XXXXXXXConnection”, “employeeNumber”, “payrollNumber”, “status”, “initials”, “country”, “street”, “city”, “postalCode”, “countryCode”, “buildingNumber”, “operationalUnit”, “organizationalUnitShortName”, “organizationalUnitId”, “employeeType”, “isOpManager”, “hrContact”, “signType”, “nickName”, “funcidowner”, “distinguishedName”, “owner”, “member”, “mailNickname”, “managedBy”, “memberOf”, “otherTelephone”, “info”, “assistantPhone”, “ownerDisplayname”, “objectClass”, “objectCategory”, “authOrig”, “authOrigBL”, “unixUid”, “destinationIndicator”, “srchMOD”, “srchONLYMGR”, “srchONLYEMP”, “functionalUserIdList”, “pdlDddlList”, “emailHelp”, “telephoneHelp”, “mobileHelp”, “faxHelp”, “cityHelp”, “zipHelp”, “officeHelp”, “countryHelp”, “streetHelp”, “contractEndDateHelp”, “office”, “resignDate”, “manager”, “modifyTimestamp”, “funcIdOwner”, “operationalManagerFullName”, “operationalManager”, “id”}, {“profileID”, “chargedCC”, “managerProfileID”, “managerFullName”, “legalManagerFullName”, “assistantProfileID”, “assistantFullName”, “companyName”, “companyType”, “jobArea”, “jobFamily”, “job”, “firstName”, “lastName”, “localFirstName”, “localLastName”, “fullName”, “localName”, “company”, “department”, “telephone”, “telephoneSource”, “mobile”, “mobileSource”, “fax”, “ecnExtension”, “email”, “emailSource”, “displayName”, “roomNumber”, “jobTitle”, “positionId”, “costCentre”, “XXXXXXXConnection”, “employeeNumber”, “payrollNumber”, “status”, “initials”, “country”, “street”, “city”, “postalCode”, “countryCode”, “buildingNumber”, “operationalUnit”, “organizationalUnitShortName”, “organizationalUnitId”, “employeeType”, “isOpManager”, “hrContact”, “signType”, “nickName”, “funcidowner.1”, “distinguishedName”, “owner”, “member”, “mailNickname”, “managedBy”, “memberOf”, “otherTelephone”, “info”, “assistantPhone”, “ownerDisplayname”, “objectClass”, “objectCategory”, “authOrig”, “authOrigBL”, “unixUid”, “destinationIndicator”, “srchMOD”, “srchONLYMGR”, “srchONLYEMP”, “functionalUserIdList”, “pdlDddlList”, “emailHelp”, “telephoneHelp”, “mobileHelp”, “faxHelp”, “cityHelp”, “zipHelp”, “officeHelp”, “countryHelp”, “streetHelp”, “contractEndDateHelp”, “office”, “resignDate”, “manager”, “modifyTimestamp”, “funcIdOwner”, “operationalManagerFullName”, “operationalManager”, “id”})
    in
    #”Expanded Column1″

    in GetResults
    ——————————————————————————————————————————————-

    Second Query :-
    ———————————————————————————————————————————————
    let
    Source = Excel.CurrentWorkbook(){[Name=”Table1″]}[Content],
    #”Changed Type” = Table.TransformColumnTypes(Source,{{“SIGNUM”, type text}}),
    #”Renamed Columns” = Table.RenameColumns(#”Changed Type”,{{“SIGNUM”, “URL”}}),
    #”Added Custom” = Table.AddColumn(#”Renamed Columns”, “Result Data”, each FPeople_Finder([URL])),
    #”Expanded Result Data” = Table.ExpandTableColumn(#”Added Custom”, “Result Data”, {“profileID”, “chargedCC”, “managerProfileID”, “managerFullName”, “legalManagerFullName”, “assistantProfileID”, “assistantFullName”, “companyName”, “companyType”, “jobArea”, “jobFamily”, “job”, “firstName”, “lastName”, “localFirstName”, “localLastName”, “fullName”, “localName”, “company”, “department”, “telephone”, “telephoneSource”, “mobile”, “mobileSource”, “fax”, “ecnExtension”, “email”, “emailSource”, “displayName”, “roomNumber”, “jobTitle”, “positionId”, “costCentre”, “ericssonConnection”, “employeeNumber”, “payrollNumber”, “status”, “initials”, “country”, “street”, “city”, “postalCode”, “countryCode”, “buildingNumber”, “operationalUnit”, “organizationalUnitShortName”, “organizationalUnitId”, “employeeType”, “isOpManager”, “hrContact”, “signType”, “nickName”, “funcidowner.1”, “distinguishedName”, “owner”, “member”, “mailNickname”, “managedBy”, “memberOf”, “otherTelephone”, “info”, “assistantPhone”, “ownerDisplayname”, “objectClass”, “objectCategory”, “authOrig”, “authOrigBL”, “unixUid”, “destinationIndicator”, “srchMOD”, “srchONLYMGR”, “srchONLYEMP”, “functionalUserIdList”, “pdlDddlList”, “emailHelp”, “telephoneHelp”, “mobileHelp”, “faxHelp”, “cityHelp”, “zipHelp”, “officeHelp”, “countryHelp”, “streetHelp”, “contractEndDateHelp”, “office”, “resignDate”, “manager”, “modifyTimestamp”, “funcIdOwner”, “operationalManagerFullName”, “operationalManager”, “id”}, {“profileID”, “chargedCC”, “managerProfileID”, “managerFullName”, “legalManagerFullName”, “assistantProfileID”, “assistantFullName”, “companyName”, “companyType”, “jobArea”, “jobFamily”, “job”, “firstName”, “lastName”, “localFirstName”, “localLastName”, “fullName”, “localName”, “company”, “department”, “telephone”, “telephoneSource”, “mobile”, “mobileSource”, “fax”, “ecnExtension”, “email”, “emailSource”, “displayName”, “roomNumber”, “jobTitle”, “positionId”, “costCentre”, “ericssonConnection”, “employeeNumber”, “payrollNumber”, “status”, “initials”, “country”, “street”, “city”, “postalCode”, “countryCode”, “buildingNumber”, “operationalUnit”, “organizationalUnitShortName”, “organizationalUnitId”, “employeeType”, “isOpManager”, “hrContact”, “signType”, “nickName”, “funcidowner.1”, “distinguishedName”, “owner”, “member”, “mailNickname”, “managedBy”, “memberOf”, “otherTelephone”, “info”, “assistantPhone”, “ownerDisplayname”, “objectClass”, “objectCategory”, “authOrig”, “authOrigBL”, “unixUid”, “destinationIndicator”, “srchMOD”, “srchONLYMGR”, “srchONLYEMP”, “functionalUserIdList”, “pdlDddlList”, “emailHelp”, “telephoneHelp”, “mobileHelp”, “faxHelp”, “cityHelp”, “zipHelp”, “officeHelp”, “countryHelp”, “streetHelp”, “contractEndDateHelp”, “office”, “resignDate”, “manager”, “modifyTimestamp”, “funcIdOwner”, “operationalManagerFullName”, “operationalManager”, “id”})
    in
    #”Expanded Result Data”
    ———————————————————————————————————————————————

    • Gil Raviv Post authorReply

      In the second query in the Table.AddColumn line make this change instead of the section “each FPeople_Finder([URL])”

      each try FPeople_Finder([URL]) otherwise null

  30. Luc Reply

    Morning Gil,
    Great tutorial, easy to follow and very helpful thank you!
    I tried to replicate the methodology using https://www.morningstar.com/stocks/xnas/aapl/valuation as an example, however I could not get to the result I wanted. Indeed, say I want to drill down until I make the “2017” cell of the table visible in the Power Query. I can drill down a few , however at a point I reach a which Power Query does not seem to find and just gives a “null” for it.
    Any idea where this could come from and how to make it work? Thanks a lot in advance!

    • Gil Raviv Post authorReply

      Hi Luc,
      I can take a look at it, but it can take me time. I recommend getting a consultant to try it out. Contact me at gilra@datachant.com for good references or to help me prioritize it 🙂

  31. Anthony Snow Reply

    Hi – Similar to Luc, using PowerQuery in Excel not BI, can only Transform not Edit then only get so through the tables before you just get nulls. Is it possible to specify the DIV ID in the URL and pull from there

  32. Anthony Snow Reply

    Hi – Same issue as Luc, Using Excel Power Query not BI, no option to edit, just Transform, then go down through the page div elements to find correct table (it’s in Confluence) and end up at nulls and no actual data. Presumably it is not as easy as appending the DIV ID to the url (here’s hoping!)

  33. Joe Reply

    Thanks For the information, but it’s not working now. 23.05.2020

    • Gil Raviv Post authorReply

      Hi Joe,
      This is an old blog post. The website has probably changed since I wrote it. I may publish a new version in the future.

  34. Sergio Reply

    Best post I found on the subject. Rather complicated for a newbie. I am trying to get to the data of the second chart on this webpage. I can find the info when I do inspect in a browser, I can see there is a table for it, but I get lost in the multiple div reference that it has. I’ve spent hours trying to figure it out without success. I am trying to get that info into excel.
    Could you please give me some pointers ?
    Thanks

    • Gil Raviv Post authorReply

      You may consider importing to Power Bi and from there to Excel using Analyze in Excel feature. Power BI has a better web scraping capabilities in its version of Power Query that were not migrated to Excel

  35. Sergio Reply

    Thanks ! I had a look at it, but it actually doesn’t list the entry table in Power Bi. If I can see the data when doing an inspect in the browser, it should eventually be possible to find the table in Excel, right ? Just don’t want to try something that could be impossible. Is there any way they could still hide the table somehow ?

  36. Anonymous Reply

    Great article! I follow through the example and it works. Many thanks. However, when I apply to the following website, https://www1.hkexnews.hk/listedco/listconews/index/lci.html?lang=en, I got the dead end as when I drill down the Table, I did not get to a step with a large set of Table objects in the Children column. Instead, they all showed null. They were supposed to be the data that I would like to extract.

  37. YW Reply

    Great article! I follow through the example and it works. Many thanks. However, when I apply to the following website, https://www1.hkexnews.hk/listedco/listconews/index/lci.html?lang=en, I got the dead end as when I drill down the Table, I did not get to a step with a large set of Table objects in the Children column. Instead, they all showed null. They were supposed to be the data that I would like to extract.

    Could you please show me how I can proceed and extract the information?

  38. gn Reply

    Hey, just leaving a comment saying you rock buddy. Easy to understand and even easier to follow. Leaving a comment is the least I can do. Thanks!!!

Leave a Reply