Import WHO COVID-19 data from PDF

In the last tutorials of the COVID-19 dashboard series, you learned how to build your own COVID-19 dashboard from CSV files that were provided by Johns Hopkins University. In today’s tutorial, we will import one of the WHO situation reports and learn how to apply Power Query PDF connector.

Ready to start?

Importing the PDF

Go here to WHO COVID-19 situation reports and select Situation report – 177.

The PDF will open on your browswer. Scroll down to find Table 1. This is the data we are going to import to Power BI. Copy the URL address.

Remove the parameter section from the URL (trim everything after the question mark). Here is the modified URL:

Open Power BI Desktop and select Get Data. Here you may be tempted to select PDF. In the next section, I will show you what happens when you select PDF. I recommend that you select Web instead and then paste the URL.

But for the sake of following some bad practices to help us learn, let’s select the PDF connector and see what happens.

The following dialog box will open. If you are wise, you would understand that this is not the best option for you, and revert back to the previous step to select the Web connector.

But let’s try and see what happens if we paste the URL in the File name. Paste the URL above and click Open.

It works! The Navigator dialog opens. Here you can select Table002 (Page 6-15) and select Transform Data. You should note that if you select Web instead of PDF, you will get to this screen and can follow the same step.

In Power Query Editor, rename the query to COVID-19.

If you selected PDF earlier instead of Web, you should go to the Source step in Applied Steps and check out the formula bar. Can you see what is wrong in the formula?

The formula uses File.Contents to access the PDF from a local cache folder. Refreshing this report in the Power BI service will be a challenge.

= Pdf.Tables(File.Contents("C:\Users\[username]\AppData\Local\Packages\Microsoft.MicrosoftPowerBIDesktop_8wekyb3d8bbwe\AC\INetCache\5OD31K0S\20200715-covid-19-sitrep-177[1].pdf"), [Implementation="1.1"])

You can fix it by replacing File.Contents with Web.Contents and then use the URL instead of the file path. Here is the modified formula:

= Pdf.Tables(Web.Contents(""), [Implementation="1.1"])

Click Edit Credentials.

Since the PDF is available online on WHO website and does not require users to authenticate for access, we can keep the default settings in the next screen. In the Access Web content dialog, select Anonymous, and click Connect.

You can now see the data in Power Query Editor preview pane.

Cleaning the table

Before we start, let’s delete the Changed Type step in Applied Steps. This step was created automatically and converted two columns to Whole Number. But there are three additional columns that were not converted. We will need to shape the data a bit before we change the column types and it is a better practice to avoid having the automatically changed type step altogether.

Note: I have written about the challenge of working with the automatic type changes here and in Chapter 10 of my book. Microsoft has just released this month the option to disable this feature. Make sure you have the July 2020 version, or above (It is currently the latest version when I write this article).

To disable the auto-type change from all your future reports, in File, select Options and settings and then select Options. In the Options dialog box under Global, Data Load. Select Never detect column types and headers for unstructured sources, and click OK.

Now comes the fun part. It is common to get messy data when you import tables from PDF documents. In this section, you will learn how easy it is to clean the data.

The first challenge you may notice is that some rows represent titles or headers in the original PDF and are missing values. For example, in the first row you can find Africa with null values. We can filter rows that have blank values in the Total confirmed cases column.

Select the filter control in the header of Total confirmed cases column and select Remove Empty in the filter pane.

The second challenge is that space characters are used as the thousands separator. There are several ways to solve it. Here is one:

Select all the numerical columns using the CTRL button. Right-click on the header of one of the columns and select Replace Values.

In the Replace Values dialog box, enter a single Space character in Value to Find and click OK. This action will replace the spaces with an empty string and fix the format of the numbers.

Now you can convert the numerical columns. Click the left icon in each of the numerical column headers and select Whole Number.

Finally, you can scroll down in the preview and see that the last three rows represent total numbers. It is not advised to load such records into Power BI. Let Power BI calculate these numbers for you and avoid situations where you do double or triple summation due to having subtotals and totals in the same table as the items you want to aggregate.

To remove the last three rows, in Home tab, select Remove Rows, and then select Remove Bottom Rows.

In Remove Botttom Rows dialog, enter 3 as Number of rows and click OK.

What’s next?

In the next part of this tutorial, you will learn how to improve the query code and convert it into a function that can read multiple PDF files from the WHO COVID-19 situation reports.

1 Comment

Leave a Reply

Your email address will not be published. Required fields are marked *