Better late than never. Prior to 2017 Power BI Data Insights Summit, Microsoft published here three challenges, including a complex data preparation. So, I had to try it.
At the time, I was too busy to submit my solution, but I planned to share it at DataChant. You can also learn how the challenge was solve by the winner of the challenge, Rishi Sapra, here.
Download my solution pbix file here.
Here is the challenge in a nutshell:
Hawaii Tourism Authority provides detailed data for visitors. Unfortunately, this data is buried in somewhat cumbersome table layouts. For this challenge, prep the data to visualize trends in Hawaii tourism:
- Where are people visiting from?
- Has there been any interesting shifts in what regions visit Hawaii in general or specific islands in particular?
- What months are the busiest by Island?
The source data is an Excel workbook (available here), which contains a stack of summarized tables per year on a single worksheet. While the initial obstacle is to find out how to append all the tables together, and unpivot the tables, you will soon, discover that many values in this workbook need a proper normalization, as each island and some countries are represented with different names.
To solve the mismatching island names, I created a conversion table for the islands, which contains the pairs of source islands and their normalized name. By transposing the table conversion, as shown in the diagram below, we can quickly convert any island value in our fact table to its proper normalized value.
And here is the function that normalize the island values, based on the conversion table #”Island Conversion Table” (the right table in the diagram above).
(txt) => try Table.Column(#"Island Conversion Table", Text.From(txt)){0} otherwise txt
To create a good star schema in the Data Model, I used Reference + Remove Duplicates in Query Editor on the fact tables to create the tables Countries and Flight Types. As an alternative approach, I also used the Enter Data feature, to create the Islands lookup table.
And here is the Relationships view of the star schema. The two fact tables are located at the bottom. Both tables are connected in many to one relationship to Calendar, Islands, Flight Types. The Countries table is connected to Visitors by Origin.
I intended to write a more detailed post with the steps I used to resolve this challenge, but I am sure you will figure it out by yourself, after you give a good look at the pbix file. Download it here.
Have questions, or better ideas how to prepare this data for analysis? Please share it in the comments below.
Nice Gil, thanks for the mention and I like the approach of doing it by year (it doesn’t require the helper file that my approach does). For the normalisation of the island/country names though, I think it’s better to use a mapping table which you then merge onto an appropraite column, returning only the island/country index number – there’s some strange ones in there (E.g. Switzerland called SWITZ-LAND or something in one of the sheets…so you can have all the various combinations as different columns!). Have a look at the file at https://1drv.ms/u/s!Ap_WTTa_A1G_iNBJwW9G-F26-Jb28g and let me know what you think!
Hi Rishi,
Thank you for sharing your work here. I agree that a mapping table and a Merge are easier, but theoretically speaking, a Merge will require higher computation time O(N^2), because each key value in the left table is matched against the keys in all the rows of the second table, while the function that I used is only O(N). With regards to the country names, I just used Replace value to show the simplest solution.
Thanks for posting this. Hopefully this exercise will be covered in your book upcoming book, where you take novice users through the entire process.
Thank you, Tom. I plan to create a different challenge in my book, and go over the steps with the readers. Perhaps one day, I will also share the steps here.