Book Excerpt: Basic Text Analytics using Power BI and Power Query in Excel

Today I am excited to share a second excerpt from the draft manuscript of my book. Today’s excerpt is the first part of chapter 11, which discusses basic text analytics, and a cool technique to detect multiple keywords in textual feeds using Cartesian Product. To read more excerpts, as they will be published on the blog go here.

You can pre-order the book on Amazon here.

 

Excerpted from the draft manuscript of Collect, Transform and Combine Data using Power BI and Power Query in Excel (Pearson Education, 2018).

Chapter 11

Basic Text Analytics

Words empty as the wind are best left unsaid.
–Homer

IN THIS CHAPTER, YOU WILL:

  • Learn how to search for keywords in textual columns and apply this technique as a simple topic detection method.
  • Learn how to load external list of keywords and apply them dynamically in the search.
  • Use Cartesian Product or custom function to search for keywords.
  • Use Table.Buffer and List.Buffer to improve keyword detection performance.
  • Learn several methods to split words, ignore common words, and count word occurrences in your analysis.
  • Learn how to prepare your textual data and load it in Word Cloud custom visual in Power BI.

In this chapter, you will harness Power Query to gain fundamental insights into textual feeds. Many tables in your reports may already contain abundant textual columns that are left barren, untapped and devoid of meaning. As a paraphrase to Homer’s saying that empty words are best left unsaid, you may find many words in your reports that are “empty as the wind”. These textual columns are most likely ignored in your report, or are kept in, just for some context. Fortunately, in next two chapters, you will learn very useful methods to take leverage of textual columns and tap into new insights as you unravel these hidden treasure troves.

The techniques in this chapter will be extremely useful in analyzing tables with textual input, such as client feedback in surveys, operational comments, or social network feeds. You will learn how to apply common transformations to extract meaning from words. We will start the journey with basic techniques to detect keywords, and gradually improve our toolset to solve common text-related analytical challenges. Our text analytics journey will continue in the next chapter with advanced text analytics on Azure, including text translation, sentiment analysis and detection of key phrases, with no help from data scientists or software developers.

Searching for Keywords in Textual Columns

Our first text analytics challenge is very common – How to find specific keywords in textual feeds. While this task may seem trivial, it will become more challenging when there are many keywords to search, or when you would like to dynamically load the keywords from an external source, without modifying your queries. In the first exercise you will learn how to detect a small set of predefined keywords.

Exercise 11-1: Basic Detection of Keywords

In this exercise you will search for keywords in messages that are posted by Microsoft Press on their official Facebook page (https://www.facebook.com/microsoftpress). Once you detect the keywords, you will be able to find out which topics are promoted the most by Microsoft Press’ social team.

Search for keywords using Power Query in Microsoft Press Facebook page

To analyze Microsoft Press’ Facebook posts, you will need to download the source workbook C11E01.xlsx from here, and save it in C:\Data\C11.

You can open the file and review its main table. It includes the following four columns: Facebook messages (Message column); The date that the post was published (Date column); the post’s unique identifier (Post ID column) and the URL to the actual post on Facebook (URL column).

  • You can click Refresh All in Data tab, to update the workbook with the latest Facebook posts.
  • In chapter 13, you will learn how to import the actual Facebook posts with Power Query. For the time being, our focus in this chapter is text analytics. As such, we will start our exercise with the actual dataset, stored in the Excel file C11E01.xslx. If you cannot wait, you can review the query, which imports the posts from Facebook.

Imagine you are a data analyst, working on Microsoft Press social media team. Your first assignment is to estimate which of the following book topics are promoted the most on Facebook:

  • Microsoft Excel
  • Visual Studio
  • Azure
  • Windows

Can you detect these four keywords in the Message column of your dataset?

  1. Open a blank workbook or Power BI Desktop report and import the workbook C11E01.xslx to Query Editor.
  2. In Navigator dialog, select Sheet1 and click Edit.
  3. The Query Editor will open. You will notice that Microsoft Press Facebook posts are stored in Message column.

 

Due to the fact, that Power Query is case-sensitive in searches, we can simplify our method by working on a duplicated lowercased version of the messages. Thus, you will keep the original messages intact, for cases where you should include the original text in your report and maintain the specific uppercased characters in the messages.

  1. Select Message column. In Add Column tab, click Duplicate Column.
  2. Select the Message – Copy column. In Transform tab, click Format, then select Lowercase.
  3. Rename the column Message – Copy to Lowercased.
We will now create the column Topic and populate it with one of the target topics: Microsoft Excel, Visual Studio, Azure and Windows.
  1. In Add Column tab, click Conditional Column. The Add Conditional Column dialog box will open. As shown in the figure below, you will now create a new column that will return one of the four topics, if they are found in the Lowercased column.

  1. Repeat the following steps:
    1. Set Topic as New Column Name.
    2. Set Lowercased as Column Name.
    3. Set Contains as Operator
    4. Set microsoft excel as Value
    5. Set Microsoft Excel as Output.

Note In step 8.4 we used a lowercase version for the topic (e.g. microsoft excel), as the search is case sensitive. In step 8.5 we used the capitalized version, for reporting purposes.

  1. While you are still on Add Conditional Column dialog box, click Add Rule. A new condition row will be added in the dialog box. Repeat step 8, with the following exceptions: In step 8.4, use the keyword visual studio as Value, and in step 8.5 set Visual Studio as Output.
  2. Click Add Rule. Repeat step 8 with the following exceptions: In step 8.4, use the keyword azure as Value, and in step 8.5, set Azure as Output.
  3. Repeat step 8 with the following exceptions: In step 8.4, use the keyword windows as Value, and in step 8.4, set Windows as Output.
  4. Set Other as Otherwise. Verify that you have configured the dialog box as shown in the figure above and click OK.

The preview pane will include the new Topic column. Obviously, this method is not very effective in detecting all the book topics in Microsoft Press Facebook posts. Many rows are classified as Other. Some of the posts promote books in topics such as Office 365, or SQL Server. Other posts promote general sales discount. For example:

“visit www.microsoftpressstore.com today—and discover deals, new books, and special promotions!”

To keep things simple, let’s assume that this approach met our objectives to detect the main book topics. You can now filter out the rows with Other in Topic column.

  1. Click on the filter control of Topic column header, then uncheck Other in the filter pane and click OK.
  2. Finally, remove Lowercased column, rename the query to Microsoft Press Posts and load the report.

Now, when we have accomplished our goal, let’s review the main caveats of the method we used in this exercise:

  • Caveat #1: Topic detection is limited to short list of keywords: If we have a massive number of keywords, it will not be practical to manually add many keywords in the Add Conditional Column dialog box. We should find an alternative method that can approach this challenge in a scalable way.
  • Caveat #2: Topic detection is limited to a static list of keywords. In our method, we had to type the keywords in the dialog box. Alternatively, we could edit the underlying M formula, but the result was the same. The keywords were hardcoded in the formula. If you are required to load the list of keywords from an external source, and you don’t know in advance which keywords should be detected, this method will not be effective. We will need to find a method that will allow us to dynamically load a list of keywords from an external source (and ensure that the performance of this operation doesn’t slow us down).
  • Caveat #3: Topic detection is limited to one keyword per message. Some Facebook posts may contain multiple keywords. For example, look at the following post:

“New book! Programming for the Internet of Things: Using Windows 10 IoT Core and Azure IoT Suite”.

In this post we missed the Windows keyword, as our logic prioritized the Azure keyword in the Add Conditional Column dialog, and in its derived M formula. In some cases, you would do better if you allocate multiple keywords to the same message, to ensure that the message will be counted for each keyword.

  • Caveat #4: False detection of words: Since we applied the Contains operator, we may detect keywords in that are substrings in longer words. For example: Let’s consider the keyword Excel. What will be the impact on these two fake messages?

“This SQL Server book is excellent.”

“Read this C# book to excel in software development”.

In both examples, we have missed the real topic of the book. To avoid false detection of topics, you will need to split the messages into words, and match the keywords with the split words. We will address this challenge later in this chapter.

Let’s address the first three caveats and introduce a great method to compare multiple datasets – The Cartesian Product.

Using Cartesian Product to Detect Keywords

In this exercise we will learn an effective method to detect large number of keywords. In the previous exercise we used a conditional column to detect four keywords. Would you consider using the same approach to search for a large set of keywords? Adding a long list of keywords manually, will be time consuming, and quite challenging in case you need to periodically detect different keywords.

To scale up our solution used in exercise 11-1 and detect large number of topics from a dynamic list of keywords, we will apply a Cartesian Product between Microsoft Press posts and the list of keywords. Using the Cartesian Product, a temporary larger table will be created. The table will consist of all the combinations between Facebook posts and keywords.

A screenshot of three tables. At the top two tables with an X icon in between. The first table with Post ID and Message columns, with three rows and the following values in Message column: "New Book! Master Powerpoint", "Special Offer: Microsoft Word" and "Excel the art of Microsoft Excel". The second table on the top right side include a single Keyword column, and the values: "excel", "powerpoint" and "microsoft word". An arrow points down to a table at the bottom, which represents all the combinations between single rows from the top tables.

  • Note: In Set theory, Cartesian Product is a method that is applied on multiple sets to produce a set of all the combinations of members from the different sets – one from each set. In Power Query you can apply a Cartesian Product on two tables (and even on the same table), by adding a Custom Column into the first table with a reference to the second table. Then, by expanding the new column you can reach the Cartesian product, as shown in the figure above.

In the figure above, you can find three messages in the top left table. To the right, you can see the keywords table. The Cartesian Product will create a new table with all the combinations of message-keyword pairs. This technique can help you to explore all the combinations, and then apply further computation on each pair. In our case, the computation will be to search for the keyword in the message, and then filter the matching rows.

  • Tip: There are many applications to Cartesian Product, which can help you to gain new insights to your data. By mapping all the combinations between multiple datasets or entities, you can apply what-if calculations, basket analysis, and computations of methods from Graph theory (e.g. finding the shortest path).

If you are not sure yet how to apply a Cartesian Product, or why it is useful in achieving a large-scale keyword search, follow the next exercise to answer these questions.

Exercise 11-2: Implementing Cartesian Product

As the Chief Analyst of Microsoft Press social team (In case you missed the change in your title since exercise 11-1, you have just been promoted to Chief Analyst – Congratulations), your next challenge is to analyze Microsoft Press Facebook posts by topic, as demonstrated in the figure below. To detect the topics, you will apply a Cartesian Product between the Facebook posts and a large set of keywords, as seen in the slicer of this figure:

A screenshot of a PivotChart and a Slicer. The PivotChart shows four trend lines of Microsoft Press Facebook Posts by Topic. The four topics are: dax, excel, power bi and power query. To the right a slicer with many topics. The four topics are selected.
Analyzing Microsoft Press Facebook Posts by Topic – Available to download here

Initial Preparations

  1. In case you skipped Exercise 11-1, download the workbook C11E01.xslx and save it to C:\Data\C11\
  2. Download the text file Keywords.txt from here and save it in your local folder C:\Data\C11\
  3. Open a blank workbook or Power BI Desktop report and import the workbook C11E01.xslx from C:\Data\C11\ to Query Editor.
  4. In Navigator dialog, select Sheet1 and click Edit.
  5. Rename the query to Microsoft Press Posts.
  6. You will now create a reference to Microsoft Press Posts query to build a new table of posts and their topic, where the same Post ID can have multiple rows.
  7. In Queries pane, right click on Microsoft Press Posts query and click Reference.
  8. Rename the new query to Post Topics
  9. While Post Topics is selected in Queries pane, click on Select Columns in Home tab.
  10. In Choose Columns dialog box, select Post ID and Message and click OK.
  11. Select Message column. In Add Column tab, click Duplicate Column.
  12. Select the column Message – Copy. In Transform tab, click Format, then select Lowercase.
  13. Rename the column Message – Copy to Lowercased.
  14. It’s time to load our keywords as a new query. While the Query Editor is still open, click New Source in Home tab, and import Keywords.txt using File, Text/CSV.
  15. The Keywords.txt dialog box will open with a preview of the file. Click OK.
  16. The Keywords query will be loaded. Rename its column to Keyword.
  17. To lowercase all the keywords, select the Keyword column, then in Transform tab, click Format, and select Lowercase.

The Cartesian Product

  1. In Queries pane, select Post Topics query.
  2. In Add Column tab, click Custom Column. The Custom Column dialog box will open.
    1. Set Cartesian as New Column Name.
    2. Set the following formula in Custom Column Formula and click OK.

  1. The Cartesian column will be added, with the Keywords table in each row. At this stage each Keywords table is represented as [Table] object in the Query Editor. It’s time to expand it.
  2. Click on the expand icon in the Cartesian header name (As an alternative, you can select the Cartesian column, and click Expand in Transform tab). In the expand pane, keep Keyword checked, and uncheck Use Original Column Name As Prefix. Then, click OK.

You can now see in the preview pane that the Cartesian Product was successfully implemented. It is time to learn why the Cartesian Product is so helpful. With all the combinations between messages and keywords, you can apply a filter to keep only the matching rows. We can use the Add Conditional Column transformation step, that will be followed by a filter on the new conditional column.

  1. In Add Column, click Conditional Column. The Add Conditional Column dialog box will open. Set Topic as New Column Name.
    1. Set Lowercased as Column Name. Set Contains as Operator. In the drop down below Value click on Select a Column.
    2. Set Keyword as Value. In the drop down below Output click on Select a Column. Set Keyword as Output.
    3. Set null as Otherwise and click OK.
A screenshot of Add Conditional Column dialog box. Topic is set in the top text box of New column name. Lowercased is set as Column Name, contains as the Operator, Keyword column is set as both Value and Output. Null is set as Otherwise.
Add Conditional Column handling the matching between each Microsoft Press Facebook post and a Keyword value in the Cartesian Product.

 

Following the preceding step, the Topic column was added. It contains many null values for all the combinations of posts that didn’t contain the given keywords. It’s time to filter out all the rows with null values.

  1. Click on the filter icon of Topic column and select Remove Empty in the filter pane.
  2. Remove the columns Message, Lowercased and Keyword.

Let’s review what you have achieved so far. Using the Cartesian Product, you were able to map posts to specific topics from a dynamic list. You have two tables which will be needed in your report. The first table is Microsoft Press Posts, with the columns Post ID, Message, and other columns. The second table is Post Topics with the columns Post ID and Topic.

Relationships

You can now create the relationship between Microsoft Press Posts and Post Topics

(The instructions were removed from this blog post to fit the scope of a blog post).

A screenshot of the Diagram View of Power Pivot in Excel. Two boxes that are labeled as Microsoft Press Posts and Post Topics are connected via Post ID fields which are highlighted in both tables.

You are ready to start the analysis. Download the solution workbook C11E02 – Solution.xlsx and see how these connected tables in Data Model can be used in a PivotChart and a Slicer to find out how many Facebook posts are published by topic over time.

  • Please save your report. There is one important performance improvement that comes next.

Before we do a deep dive on performance considerations, let’s see what we were able to achieve so far. Going over the caveats that were described at the end of exercise 11-1, the Cartesian Product method addressed the following caveats:

  • We were able to load a larger list of keywords, without applying each one of them manually in the Add Conditional Column dialog box. Caveat #1 was addressed.
  • We were able to dynamically load the list of keywords from an external file, without knowing in advance which keywords will be used. Caveat #2 was addressed, but can be improved performance-wise, so please keep the Query Editor open, as we will discuss the performance improvement in the next section.
  • We were able to associate multiple topics to single posts in cases where two or more keywords contain in a single message. Caveat #3 was addressed.

Performance Improvement

Let’s consider for a moment the performance of our solution. When you refresh the report, you will notice that the query loads relatively fast. The Cartesian Product, combined with the filtering of matching rows is implemented in an efficient method that doesn’t require high memory consumption. When we created the Cartesian Product, and expanded all the combinations between Facebook posts and keywords, the expanded table was not entirely consumed in memory. Instead, the M engine performed the transformations on small portions of two tables.

This is an important and powerful capability of Power Query. Whenever possible, multiple transformation steps are applied on a small portion of the data and iterate over the dataset, till the entire dataset is transformed. This approach ensures that the M engine will not load the entire dataset in memory, unless it is necessary required. As a result, the Cartesian Product doesn’t leave significant memory blueprint.

Nevertheless, we can still significantly improve the performance of the Cartesian Product.

  1. Open your workbook or Power BI report following exercise 11-2 steps 1-24.
  2. In Query Editor, select the Post Topics query, and click on Added Custom in Applied Steps. You will find the following formula in the formula bar:

In this step, the M engine accesses the keywords for each row. Surprisingly disappointing, the implementation of this step along with the expand step that follows lead to redundant loads of the keywords from the external text file.

To fix this step and enforce the M engine to load the Keywords from the external text file only once, you can use the M function Table.Buffer on Keywords. By doing so, you explicitly instruct the M engine to store the entire Keywords table in memory and ensure that the M engine will not access the text file multiple times.

  1. To add Table.Buffer, click the Advanced Editor, and add this row after the let line.

  1. Next, find the following line:

  1. Change the line above to the following one by replacing Keywords with KeywordsBuffer:

When you load the query now, you may not notice the improvement in load time, as the dataset is relatively small. To compare the load time of Post Topics query before and after Table.Buffer, you can download the workbook C11E02 – Refresh Comparison.xlsx from here.

In C11E02 – Refresh Comparison.xlsx workbook, you can measure how fast the query is loading with Table.Buffer. As demonstrated in the screenshot below, you can compare the load time of Post Topics query (without Table.Buffer) and Post Topics – Faster query (with Table.Buffer) and witness the improvement in load time of the second query.

To perform the comparison, open the workbook, click in Queries & Connections in Data tab and then click Refresh All. You can notice that Post Topics – Faster loads faster than Post Topics.

A screenshot of an Excel spreadsheet. The Queries & Connections pane is open. There is a spinner icon in Post Topics query, with the label: "42,347 rows loaded". Below, the Post Topics - Faster query finished loading, with the label: "78,100 rows loaded"
The query Book Reviews – Faster is loaded faster using Table.Buffer
  • To measure the improvement in refresh time, it is recommended that you test it on large-enough datasets. To artificially increase your dataset, you can multiply your table using the M function Table.Repeat.

In C11E02 – Refresh Comparison.xlsx workbook, we have multiplied Microsoft Press Posts 100 times, using the following function as the last step of the query:

In the next exercise you will learn an alternative method to detect keywords. Using a custom function.

 


That’s it for today. Hope you liked the preview of my book. In the next book excerpt, I will share cool techniques to perform split words, and use List.Accumulate function to detect multi-word phrases in textual feeds.

This blog post was excerpted from the draft manuscript of Collect, Transform and Combine Data using Power BI and Power Query in Excel (Pearson Education, 2018). Pre-order the book on Amazon here.

 

 

 

 

 

 

1 Comment

  1. Afzal khan Reply

    Love the excerpts….. when is the release date of the book..

Leave a Reply