Did you know that when Bill Gates enters a bar, the average customer becomes a billionaire? That is what I call the Bill Gates Effect. (Many thanks to Introductory Statistics)
Pss! There is a Power BI dashboard at the end of the blog post, don’t miss it!
The Bill Gates Effect Explained
A business owner is trying to understand the average income of the clients in the bar next door so he hires a student to interview the customers.
When the student had finished the interviews and the analysis, the person with the highest income leaves the bar and Bill Gates comes in. Frustrated, he re-runs the analysis this time with Bill Gates data instead.
The student comes back to the business owner and reports that the average customer in the next door bar is a billionaire!!
Average income= $ 7 900 042 500
The business owner ask the student to show the calculations and after reviewing the data understands the ”mistake”.
Income average Before Bill enters the bar:
Income average After Bill enters the bar:
Most people would agree that $47,500 (the median) is a better reflection of the income amounts for the people sitting in the bar and that the amount of $8 billion has nothing to do with the reality of the 9 people in the bar who are not Bill Gates.
Side Note: I was refreshing my statistical knowledge for my Lean Six Sigma certification when I found that example, and since then I have never had to search for those terms again: average, mean and median. This clearly demonstrated for me how powerful Story Telling can be.
But, how does the “Bill Gates Effect” relates to Google Analytics?
You might already see where this is going, but let me give you an example using Site speed metrics given by Google Analytics.
The Calc. Load time (sec) is a manual calculation of Avg. Page Load Time according to Google Analytics formula in Power BI:
Sum(Page Load Time (ms))/1000/sum(Page Load Sample),
and of course, it yields the same results as if you would take Avg. Page Load Time (sec) directly from Google Analytics.
Let’s calculate the “averages” and Median for both days:
Let me explain the metrics:
1. Avg. Page Load Time (sec) is the metric you will get from Google Analytics
2. Avg Load time is the average the column Avg. Page Load Time (sec)
3. Median is the Median of the Avg. Page Load Time (sec)
Here is an example with Google Analytics Average Page Time and our calculated Power BI Median segmented by country:
There is a huge difference between the metrics when you have outliers in your data, right?
What metric should you use?
Now that you understand the problems using average, you will be horrified by all the averages Google Analytics provides without any consideration to outliers.
Between the average and the mean, here is the Rule of thumb of which metric to use:
• For data sets with outliers (remember, Bill Gates in the club), it is better to use Median
• For symmetric data sets, use the mean.
It is a bit more complex than that, refer to Introductory Statistics if you want to dig deeper, but this is good enough for our analysis.
Do I have outliers?
I can almost guarantee you that you do, but, let’s found out.
Using Power BI (Power Query), we categorize the Page Load Time in bins:
- a) 0-1
- b) 1-3
- c) 3-7
- d) 7-13
- e) 13-21
- f) 21-25
- g) 25-35
- h) 55-60
- i) 60-80
- j) +180
Add a column in the Query Editor with the following formula:
Bins = if [#”Avg. Page Load Time (sec)”]<=1 then “a) 0-1″ else if [#”Avg. Page Load Time (sec)”]<=3 then “b) 1-3″ else if [#”Avg. Page Load Time (sec)”]<=7 then “c) 3-7″ else if [#”Avg. Page Load Time (sec)”]<=13 then “d) 7-13″ else if [#”Avg. Page Load Time (sec)”]<=21 then “e) 13-21″ else if [#”Avg. Page Load Time (sec)”]<=25 then “f) 21-25″ else if [#”Avg. Page Load Time (sec)”]<=35 then “g) 25-35″ else if
[#”Avg. Page Load Time (sec)”]<=60 then “h) 55-60″ else if [#”Avg. Page Load Time (sec)”]<=180 then “i) 60-180” else “j) +180”
And visualize them in a bar chart:
Or you can use one of the latest custom visualizations, the box plot: (the outliers are the red dots)
Can I kick Bill out from my club?
You cannot remove your outliers, but to understand why, we need to understand how site speed is measured. Here is a timeline for how Google Analytics measures site speed and page load:
Simplifying things a lot, you have to take into account:
• People’s internet speed, that varies by location
• your webserver and domain name service,
• how well your website is built,
• and which browser is being used.
Looking back at our example above, 28 seconds might not be an outlier, but a normal performance at some locations and or performance with some browsers.
What should you do then?
Segmentation, segmentation, segmentation
What I normally do is to segment my data by those factors and look for patterns there.
Here is a great example, using our bar chart and filtering them by continent, you can see that the majority of the data in Africa falls in the 7-13 secs bin, while in America falls into the 3-7 secs category.
Makes sense, right? America has probably better internet conenctions than Africa.
So, an outlier in America might not be an outlier in Africa.
Does this mean that I don’t have outliers? Looking at the tail in the bar chart above, we still have “true” outliers, and those, you can remove.
Let’s dig deeper to the Top Offenders to uncover our true outliers:
One person from Russia, had a Page Load Time of 551 Secs!! And the median for Europe is 5,93, so that is definitely an outlier.
It would be great if we could eliminate those outliers easily in Power BI, and of course, somebody thought about it already, so Vote it up, but if you are in a hurry, DataSavvy solved this and posted the solution in his blog, thanks!.
In the meantime, Mean instead of averages will give you a better picture of reality. Remember our chart Page Loads by Location?
When we talk about Page Loads, seconds are as important as milliseconds for a 100 meters runner, so it is important to get as close to the truth as possible.
Are you still with me? Great, because here is where the magic happens.
This weeks dashboard will help you monitor your website speed.
I am not going to go into the ongoing discussion about how site speed, is or is not, important in Google search ranking, there are tons of sources for and against it and none I have found that are conclusive.
What I worry about is my visitors and not Google. If they are happy, and there are ways to measure that, Google will be happy too. I obsess about right speed at the visitor’s location, and the time it takes before my visitors can interact with the content, instead of how fast the entire page loads.
Google Analytics has a metric for that: Document Interactive Time and you should monitor that one.
What is the right speed at the right location?
Use your data to find out: segment by important locations for your business and monitor them closely. If you are having issues there, then go to browser, device, connection times, etc, to improve those interaction load times.
What insights can we get with this dashboard?
As usual, we start with the insights from the dashboard:
• In this dashboard you will be able to monitor trends on your site speed performance
• You will be able to see if important locations are performing well
• If your site loads as fast on all device types (mobile, tablet, desktop)
• And if your most used browsers are performing well.
This is a monitoring dashboard, but with Power BI’s cross filtering capabilities, you can gain insights as to where you have performance issues pretty quickly.
If you would like to see a live version of the dashboard with an explanation on how to use it? You can watch it in our YouTube Channel.
Download the Power BI file
Don’t want to create the Power BI file? If you are a member on our site (it’s free), you can download it below.
You will need also to change the data source from my account to your Google account.
Do you need help creating the file or setting up your account? Contact me for more details.
Creating the dashboard: Power BI Community
Do you want to create the dashboard yourself?
A step-by-step guide is available at the Power BI community.