Today I am excited to share a second excerpt from the draft manuscript of my book. Today’s excerpt is the first part of chapter 11, which discusses basic text analytics, and a cool technique to detect multiple keywords in textual feeds using Cartesian Product. To read more excerpts, as they will be published on the blog go here.
You can pre-order the book on Amazon here.
Excerpted from the draft manuscript of Collect, Transform and Combine Data using Power BI and Power Query in Excel (Pearson Education, 2018).
Basic Text Analytics
Words empty as the wind are best left unsaid.
IN THIS CHAPTER, YOU WILL:
- Learn how to search for keywords in textual columns and apply this technique as a simple topic detection method.
- Learn how to load external list of keywords and apply them dynamically in the search.
- Use Cartesian Product or custom function to search for keywords.
- Use Table.Buffer and List.Buffer to improve keyword detection performance.
- Learn several methods to split words, ignore common words, and count word occurrences in your analysis.
- Learn how to prepare your textual data and load it in Word Cloud custom visual in Power BI.
In this chapter, you will harness Power Query to gain fundamental insights into textual feeds. Many tables in your reports may already contain abundant textual columns that are left barren, untapped and devoid of meaning. As a paraphrase to Homer’s saying that empty words are best left unsaid, you may find many words in your reports that are “empty as the wind”. These textual columns are most likely ignored in your report, or are kept in, just for some context. Fortunately, in next two chapters, you will learn very useful methods to take leverage of textual columns and tap into new insights as you unravel these hidden treasure troves.
The techniques in this chapter will be extremely useful in analyzing tables with textual input, such as client feedback in surveys, operational comments, or social network feeds. You will learn how to apply common transformations to extract meaning from words. We will start the journey with basic techniques to detect keywords, and gradually improve our toolset to solve common text-related analytical challenges. Our text analytics journey will continue in the next chapter with advanced text analytics on Azure, including text translation, sentiment analysis and detection of key phrases, with no help from data scientists or software developers.
Searching for Keywords in Textual Columns
Our first text analytics challenge is very common – How to find specific keywords in textual feeds. While this task may seem trivial, it will become more challenging when there are many keywords to search, or when you would like to dynamically load the keywords from an external source, without modifying your queries. In the first exercise you will learn how to detect a small set of predefined keywords.
Exercise 11-1: Basic Detection of Keywords
In this exercise you will search for keywords in messages that are posted by Microsoft Press on their official Facebook page (https://www.facebook.com/microsoftpress). Once you detect the keywords, you will be able to find out which topics are promoted the most by Microsoft Press’ social team.
To analyze Microsoft Press’ Facebook posts, you will need to download the source workbook C11E01.xlsx from here, and save it in C:\Data\C11.
You can open the file and review its main table. It includes the following four columns: Facebook messages (Message column); The date that the post was published (Date column); the post’s unique identifier (Post ID column) and the URL to the actual post on Facebook (URL column).
Imagine you are a data analyst, working on Microsoft Press social media team. Your first assignment is to estimate which of the following book topics are promoted the most on Facebook:
- Microsoft Excel
- Visual Studio
Can you detect these four keywords in the Message column of your dataset?
- Open a blank workbook or Power BI Desktop report and import the workbook C11E01.xslx to Query Editor.
- In Navigator dialog, select Sheet1 and click Edit.
- The Query Editor will open. You will notice that Microsoft Press Facebook posts are stored in Message column.
Due to the fact, that Power Query is case-sensitive in searches, we can simplify our method by working on a duplicated lowercased version of the messages. Thus, you will keep the original messages intact, for cases where you should include the original text in your report and maintain the specific uppercased characters in the messages.
Select Message column. In Add Column tab, click Duplicate Column.
Select the Message – Copy column. In Transform tab, click Format, then select Lowercase.
Rename the column Message – Copy to Lowercased.
In Add Column tab, click Conditional Column. The Add Conditional Column dialog box will open. As shown in the figure below, you will now create a new column that will return one of the four topics, if they are found in the Lowercased column.
- Repeat the following steps:
- Set Topic as New Column Name.
- Set Lowercased as Column Name.
- Set Contains as Operator
- Set microsoft excel as Value
- Set Microsoft Excel as Output.
Note In step 8.4 we used a lowercase version for the topic (e.g. microsoft excel), as the search is case sensitive. In step 8.5 we used the capitalized version, for reporting purposes.
- While you are still on Add Conditional Column dialog box, click Add Rule. A new condition row will be added in the dialog box. Repeat step 8, with the following exceptions: In step 8.4, use the keyword visual studio as Value, and in step 8.5 set Visual Studio as Output.
- Click Add Rule. Repeat step 8 with the following exceptions: In step 8.4, use the keyword azure as Value, and in step 8.5, set Azure as Output.
- Repeat step 8 with the following exceptions: In step 8.4, use the keyword windows as Value, and in step 8.4, set Windows as Output.
- Set Other as Otherwise. Verify that you have configured the dialog box as shown in the figure above and click OK.
The preview pane will include the new Topic column. Obviously, this method is not very effective in detecting all the book topics in Microsoft Press Facebook posts. Many rows are classified as Other. Some of the posts promote books in topics such as Office 365, or SQL Server. Other posts promote general sales discount. For example:
“visit www.microsoftpressstore.com today—and discover deals, new books, and special promotions!”
To keep things simple, let’s assume that this approach met our objectives to detect the main book topics. You can now filter out the rows with Other in Topic column.
- Click on the filter control of Topic column header, then uncheck Other in the filter pane and click OK.
- Finally, remove Lowercased column, rename the query to Microsoft Press Posts and load the report.
Now, when we have accomplished our goal, let’s review the main caveats of the method we used in this exercise:
“New book! Programming for the Internet of Things: Using Windows 10 IoT Core and Azure IoT Suite”.
In this post we missed the Windows keyword, as our logic prioritized the Azure keyword in the Add Conditional Column dialog, and in its derived M formula. In some cases, you would do better if you allocate multiple keywords to the same message, to ensure that the message will be counted for each keyword.
“This SQL Server book is excellent.”
“Read this C# book to excel in software development”.
In both examples, we have missed the real topic of the book. To avoid false detection of topics, you will need to split the messages into words, and match the keywords with the split words. We will address this challenge later in this chapter.
Let’s address the first three caveats and introduce a great method to compare multiple datasets – The Cartesian Product.
Using Cartesian Product to Detect Keywords
In this exercise we will learn an effective method to detect large number of keywords. In the previous exercise we used a conditional column to detect four keywords. Would you consider using the same approach to search for a large set of keywords? Adding a long list of keywords manually, will be time consuming, and quite challenging in case you need to periodically detect different keywords.
To scale up our solution used in exercise 11-1 and detect large number of topics from a dynamic list of keywords, we will apply a Cartesian Product between Microsoft Press posts and the list of keywords. Using the Cartesian Product, a temporary larger table will be created. The table will consist of all the combinations between Facebook posts and keywords.
In the figure above, you can find three messages in the top left table. To the right, you can see the keywords table. The Cartesian Product will create a new table with all the combinations of message-keyword pairs. This technique can help you to explore all the combinations, and then apply further computation on each pair. In our case, the computation will be to search for the keyword in the message, and then filter the matching rows.
If you are not sure yet how to apply a Cartesian Product, or why it is useful in achieving a large-scale keyword search, follow the next exercise to answer these questions.
Exercise 11-2: Implementing Cartesian Product
As the Chief Analyst of Microsoft Press social team (In case you missed the change in your title since exercise 11-1, you have just been promoted to Chief Analyst – Congratulations), your next challenge is to analyze Microsoft Press Facebook posts by topic, as demonstrated in the figure below. To detect the topics, you will apply a Cartesian Product between the Facebook posts and a large set of keywords, as seen in the slicer of this figure:
- In case you skipped Exercise 11-1, download the workbook C11E01.xslx and save it to C:\Data\C11\
- Download the text file Keywords.txt from here and save it in your local folder C:\Data\C11\
- Open a blank workbook or Power BI Desktop report and import the workbook C11E01.xslx from C:\Data\C11\ to Query Editor.
- In Navigator dialog, select Sheet1 and click Edit.
- Rename the query to Microsoft Press Posts.
- You will now create a reference to Microsoft Press Posts query to build a new table of posts and their topic, where the same Post ID can have multiple rows.
- In Queries pane, right click on Microsoft Press Posts query and click Reference.
- Rename the new query to Post Topics
- While Post Topics is selected in Queries pane, click on Select Columns in Home tab.
- In Choose Columns dialog box, select Post ID and Message and click OK.
- Select Message column. In Add Column tab, click Duplicate Column.
- Select the column Message – Copy. In Transform tab, click Format, then select Lowercase.
- Rename the column Message – Copy to Lowercased.
- It’s time to load our keywords as a new query. While the Query Editor is still open, click New Source in Home tab, and import Keywords.txt using File, Text/CSV.
- The Keywords.txt dialog box will open with a preview of the file. Click OK.
- The Keywords query will be loaded. Rename its column to Keyword.
- To lowercase all the keywords, select the Keyword column, then in Transform tab, click Format, and select Lowercase.
The Cartesian Product
- In Queries pane, select Post Topics query.
- In Add Column tab, click Custom Column. The Custom Column dialog box will open.
- Set Cartesian as New Column Name.
- Set the following formula in Custom Column Formula and click OK.
- The Cartesian column will be added, with the Keywords table in each row. At this stage each Keywords table is represented as [Table] object in the Query Editor. It’s time to expand it.
- Click on the expand icon in the Cartesian header name (As an alternative, you can select the Cartesian column, and click Expand in Transform tab). In the expand pane, keep Keyword checked, and uncheck Use Original Column Name As Prefix. Then, click OK.
You can now see in the preview pane that the Cartesian Product was successfully implemented. It is time to learn why the Cartesian Product is so helpful. With all the combinations between messages and keywords, you can apply a filter to keep only the matching rows. We can use the Add Conditional Column transformation step, that will be followed by a filter on the new conditional column.
- In Add Column, click Conditional Column. The Add Conditional Column dialog box will open. Set Topic as New Column Name.
- Set Lowercased as Column Name. Set Contains as Operator. In the drop down below Value click on Select a Column.
- Set Keyword as Value. In the drop down below Output click on Select a Column. Set Keyword as Output.
- Set null as Otherwise and click OK.
Following the preceding step, the Topic column was added. It contains many null values for all the combinations of posts that didn’t contain the given keywords. It’s time to filter out all the rows with null values.
- Click on the filter icon of Topic column and select Remove Empty in the filter pane.
- Remove the columns Message, Lowercased and Keyword.
Let’s review what you have achieved so far. Using the Cartesian Product, you were able to map posts to specific topics from a dynamic list. You have two tables which will be needed in your report. The first table is Microsoft Press Posts, with the columns Post ID, Message, and other columns. The second table is Post Topics with the columns Post ID and Topic.
You can now create the relationship between Microsoft Press Posts and Post Topics
(The instructions were removed from this blog post to fit the scope of a blog post).
You are ready to start the analysis. Download the solution workbook C11E02 – Solution.xlsx and see how these connected tables in Data Model can be used in a PivotChart and a Slicer to find out how many Facebook posts are published by topic over time.
Before we do a deep dive on performance considerations, let’s see what we were able to achieve so far. Going over the caveats that were described at the end of exercise 11-1, the Cartesian Product method addressed the following caveats:
Let’s consider for a moment the performance of our solution. When you refresh the report, you will notice that the query loads relatively fast. The Cartesian Product, combined with the filtering of matching rows is implemented in an efficient method that doesn’t require high memory consumption. When we created the Cartesian Product, and expanded all the combinations between Facebook posts and keywords, the expanded table was not entirely consumed in memory. Instead, the M engine performed the transformations on small portions of two tables.
This is an important and powerful capability of Power Query. Whenever possible, multiple transformation steps are applied on a small portion of the data and iterate over the dataset, till the entire dataset is transformed. This approach ensures that the M engine will not load the entire dataset in memory, unless it is necessary required. As a result, the Cartesian Product doesn’t leave significant memory blueprint.
Nevertheless, we can still significantly improve the performance of the Cartesian Product.
- Open your workbook or Power BI report following exercise 11-2 steps 1-24.
- In Query Editor, select the Post Topics query, and click on Added Custom in Applied Steps. You will find the following formula in the formula bar:
= Table.AddColumn(#"Renamed Columns", "Cartesian", each Keywords),
In this step, the M engine accesses the keywords for each row. Surprisingly disappointing, the implementation of this step along with the expand step that follows lead to redundant loads of the keywords from the external text file.
To fix this step and enforce the M engine to load the Keywords from the external text file only once, you can use the M function Table.Buffer on Keywords. By doing so, you explicitly instruct the M engine to store the entire Keywords table in memory and ensure that the M engine will not access the text file multiple times.
- To add Table.Buffer, click the Advanced Editor, and add this row after the let line.
KeywordsBuffer = Table.Buffer(Keywords),
- Next, find the following line:
#"Added Custom" = Table.AddColumn(#"Renamed Columns", "Cartesian", each Keywords),
- Change the line above to the following one by replacing Keywords with KeywordsBuffer:
#"Added Custom" = Table.AddColumn(#"Renamed Columns", "Cartesian", each KeywordsBuffer),
When you load the query now, you may not notice the improvement in load time, as the dataset is relatively small. To compare the load time of Post Topics query before and after Table.Buffer, you can download the workbook C11E02 – Refresh Comparison.xlsx from here.
In C11E02 – Refresh Comparison.xlsx workbook, you can measure how fast the query is loading with Table.Buffer. As demonstrated in the screenshot below, you can compare the load time of Post Topics query (without Table.Buffer) and Post Topics – Faster query (with Table.Buffer) and witness the improvement in load time of the second query.
To perform the comparison, open the workbook, click in Queries & Connections in Data tab and then click Refresh All. You can notice that Post Topics – Faster loads faster than Post Topics.
In C11E02 – Refresh Comparison.xlsx workbook, we have multiplied Microsoft Press Posts 100 times, using the following function as the last step of the query:
= Table.Repeat(#"Changed Type", 100)
In the next exercise you will learn an alternative method to detect keywords. Using a custom function.
That’s it for today. Hope you liked the preview of my book. In the next book excerpt, I will share cool techniques to perform split words, and use List.Accumulate function to detect multi-word phrases in textual feeds.
This blog post was excerpted from the draft manuscript of Collect, Transform and Combine Data using Power BI and Power Query in Excel (Pearson Education, 2018). Pre-order the book on Amazon here.