
Status of Web Analytics: It's Complicated

My data visualization website, populationpyramid.net, has a fair number of users every day. I would like to know exactly how many, and this turns out to be a fairly hard question to answer.

My go-to source of information for this is Google Analytics. I added the Analytics tracking code more than 7 years ago. Here are my traffic numbers over these years:

Screen-Shot-2018-06-17-at-23.03.14

So far, the site has had more than 8 million users and more than 11 million sessions. According to Google, a new session starts either when a new user arrives on the site or when a user who has been inactive for more than 30 minutes comes back. So, if I trust the Google estimate of 4 minutes and 22 seconds spent per session, I can compute that more than 95 years of human time have been spent on the site. This seems magical to me, but that should be the subject of another post.
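The arithmetic behind that figure can be sketched as follows. The session count used here is the rounded lower bound of 11 million (the exact total is only visible in the screenshot), and `humanYears` is a name made up for this illustration, so treat the result as a rough check rather than the exact computation.

```typescript
// Rough check of the "years of human time" estimate.
// Assumption: 11,000,000 sessions is a rounded lower bound, not the exact total.
const SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60;

function humanYears(sessions: number, secondsPerSession: number): number {
  return (sessions * secondsPerSession) / SECONDS_PER_YEAR;
}

const avgSessionSeconds = 4 * 60 + 22; // 4 min 22 s, as reported by Google Analytics
const years = humanYears(11_000_000, avgSessionSeconds);
console.log(years.toFixed(1)); // about 91 years for exactly 11 million sessions
```

With exactly 11 million sessions this gives roughly 91 years, so the 95-year figure corresponds to a session count closer to 11.4 million.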

Now, this gives a first impression of the success of the site. But is it accurate? The answer is complicated.

To explain that clearly, let me focus on a single day: the 12th of June 2018. For that day, Google Analytics reports 11,530 users and 12,759 sessions.

Screen-Shot-2018-06-17-at-23.06.55

There are at least two reasons for me to believe that these numbers are not very accurate.

First, according to Analytics, 70% of the visitors come from "Organic search", which means search engines, which mainly means ... Google.

Screen-Shot-2018-05-26-at-10.33.35

But Google also provides a tool for tracking how many people reach your site through its search engine. In the webmaster tools, you find a "Search Analytics" section. For the 12th of June, the number of clicks on Google search results leading to populationpyramid.net is 10,320.

Screen-Shot-2018-06-17-at-23.09.19

Google is not the source of all search engine traffic, but it accounts for more than 96% of it for me, and I would guess that this is true for most of the English-speaking internet. So, if 10,320 people reached the site through Google and that is less than 70% of the traffic (according to Analytics), I should see more than 14,742 visitors in my analytics, while I actually see 11,530. (You could argue about whether I should compare with the number of sessions or the number of visitors, but in both cases the figures are nowhere near what Search Analytics gives.) To thicken the mystery, Google Analytics reports only 8,246 visitors coming from Google search on that day. That is 20% fewer than what Google Search Analytics reports.
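For readers who want to follow the arithmetic, here is a small sketch using only the figures quoted above. The variable names are mine, chosen for this illustration.

```typescript
// Figures quoted above for 12 June 2018.
const searchConsoleClicks = 10_320; // clicks reported by Search Analytics
const organicShare = 0.70;          // share of "Organic search" in Google Analytics
const gaUsers = 11_530;             // users reported by Google Analytics
const gaGoogleVisitors = 8_246;     // visitors from Google search per Google Analytics

// If the clicks are at most 70% of all traffic, total traffic is at least clicks / 0.70.
const impliedMinimumVisitors = searchConsoleClicks / organicShare;

// Relative shortfall of Google Analytics against Search Analytics for Google traffic.
const shortfall = (searchConsoleClicks - gaGoogleVisitors) / searchConsoleClicks;

console.log(Math.floor(impliedMinimumVisitors));           // 14742
console.log(Math.floor(impliedMinimumVisitors) - gaUsers); // 3212 visitors unaccounted for
console.log((shortfall * 100).toFixed(1) + "%");           // 20.1%
```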

Screen-Shot-2018-06-17-at-23.13.10

Second, I have set up Cloudflare, a proxy service, in front of populationpyramid.net, and it provides its own analytics. For the 12th of June, Cloudflare reports 22,370 users. That is almost 11 thousand more users than reported by Google Analytics!

Screen-Shot-2018-06-17-at-23.15.50

Notice also that the hour at which the counting for a given day starts is not too clear. What does that 2:00:00 mean? Your guess is as good as mine.

Here, the definition of a visitor is different from the one in Google Analytics. Cloudflare says in its help: "This graph shows you the number of unique users that have visited your website. Cloudflare calculates this number of users based on the unique IP addresses requesting content from your site."

But you have to know that many users can share the same IP address from an external point of view. Usually, all users within an enterprise or school LAN are visible as a single IP address from Cloudflare's perspective. In my mind, this should give smaller numbers than what Google Analytics reports, since a good chunk of my traffic comes from schools accessing the site in class, where most probably all students share the same IP, at least as seen from the outside. But the opposite happens: Cloudflare's numbers are actually higher.
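To make the shared-IP effect concrete, here is a tiny sketch of counting by unique IP address. The records are entirely hypothetical (the addresses come from ranges reserved for documentation), and this only illustrates the idea; it is not Cloudflare's actual implementation.

```typescript
// Hypothetical request records; the addresses come from documentation ranges.
interface Visit {
  userId: string;
  publicIp: string;
}

const visits: Visit[] = [
  { userId: "student-1", publicIp: "203.0.113.7" }, // three students behind one school NAT
  { userId: "student-2", publicIp: "203.0.113.7" },
  { userId: "student-3", publicIp: "203.0.113.7" },
  { userId: "home-user", publicIp: "198.51.100.24" },
];

function countDistinct(values: string[]): number {
  return new Set(values).size;
}

const actualUsers = countDistinct(visits.map(v => v.userId));
const uniqueIps = countDistinct(visits.map(v => v.publicIp));
console.log(actualUsers, uniqueIps); // 4 2
```

Four people show up as two "unique users" under an IP-based definition, which is why I expected Cloudflare to undercount rather than overcount.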

Now, how to explain the discrepancies?

I have a lot of ideas about this, but no definitive answer:

First, Google Analytics uses a little script, written in JavaScript, that is run in the users' browsers. Some users have JavaScript disabled, but I would say that they are a very, very small share of internet users. Populationpyramid.net is unusable without JavaScript, for that matter. No, the most plausible cause of Google Analytics not reporting users is the use of ad blockers.

Ad blockers have been designed to ... drum roll ... block ads, which are indeed often very annoying and consume more resources (CPU, memory and network) than the content you were loading the page for in the first place. Now, one of the first settings that you see in the most well-known one, Adblock Plus, is a checkbox to also block "additional tracking", which means Google Analytics, among other possible trackers.

Screen-Shot-2018-06-17-at-23.29.13

Once the checkbox is checked, the loading of the Google Analytics script is blocked as you can see in the following screenshot.

Screen-Shot-2018-06-17-at-23.31.34

I guess that other ad blockers have the same kind of options. As an anecdote, these tools will often block any URL that contains the word "ads" or "analytics", which can lead to strange bugs, as we discovered in a company where I worked: the admin interface was called "adsomething", where "ad" actually stood for "admin", and it was impossible to load on some computers. It took us some time to figure out that the problem was the ad blocker of some users.
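The kind of false positive we ran into can be reproduced with a deliberately naive keyword filter. This is only an illustration of the failure mode: real ad blockers rely on much more elaborate filter lists, and the `example.com` URLs and the function name are made up.

```typescript
// Naive keyword blocking, NOT how any particular ad blocker is implemented.
const blockedKeywords = ["ads", "analytics"];

function isBlockedNaively(url: string): boolean {
  const lower = url.toLowerCase();
  return blockedKeywords.some(keyword => lower.includes(keyword));
}

console.log(isBlockedNaively("https://www.google-analytics.com/analytics.js")); // true
console.log(isBlockedNaively("https://example.com/adsomething/login"));         // true (false positive)
console.log(isBlockedNaively("https://example.com/about"));                     // false
```

The second URL is blocked simply because "adsomething" happens to start with the letters "ads".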

Ok, let us go back to our main question: why are the numbers of users reported by the various analytics tools so different? My assumption is that ad blockers are the main reason. If they were, they should account for the 20% fewer users that Google Analytics reports compared with Google Search Analytics. But according to what you can find through one quick Google search, the share of internet users who have installed an ad blocker varies between 10 and 20%. And you have to remember that, normally, far fewer than that should have disabled all tracking (unless the option is ON by default in lots of cases, which I do not believe).

So, it appears that my assumption is false. To be sure, I did a little experiment: every time a page loads, I send, using JavaScript, a little "page_load" event to my backend. On top of that, if the Google Ads script fails to load, I send an "ad_blocker_detected" event to my backend through the "onerror" callback of that script. This is probably naive, since the errors might be caused by something other than an ad blocker, and I do not even know how well that onerror callback works in different browsers, but here are the numbers I got out of that experiment:

I had 40,750 "page_load" events for the 12th of May and 2,798 "ad_blocker_detected" events. That's just under 7% of the users who would have an ad blocker, only a small percentage of whom should have disabled all tracking. We are far from the expected 20%.
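The experiment can be sketched roughly like this. It is a reconstruction under assumptions, not the code that runs on the site: the names `instrumentPage`, `adBlockerShare` and `ErrorSource` are hypothetical, the transport is left to a `send` callback, and the sketch listens for the "error" event with `addEventListener` rather than assigning `onerror` directly.

```typescript
type EventName = "page_load" | "ad_blocker_detected";
type SendEvent = (name: EventName) => void;

// Anything that can emit an "error" event, such as a <script> element.
interface ErrorSource {
  addEventListener(type: "error", listener: () => void): void;
}

// Report the page load, and report a suspected ad blocker if the ad script fails to load.
// The listener must be attached before the script starts loading, or the error is missed.
function instrumentPage(adScript: ErrorSource, send: SendEvent): void {
  send("page_load");
  adScript.addEventListener("error", () => send("ad_blocker_detected"));
}

// Share of page loads on which the ad script failed to load.
function adBlockerShare(pageLoads: number, detections: number): number {
  return pageLoads === 0 ? 0 : detections / pageLoads;
}

console.log((adBlockerShare(40_750, 2_798) * 100).toFixed(1) + "%"); // 6.9%
```

In a browser, `adScript` would be the script element that loads the ads library, and `send` could wrap something like `navigator.sendBeacon` pointed at whatever endpoint your backend exposes.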

Notice also that Google Analytics reports 19,105 "pageviews" for that day, while I had 40,750 "page_load" events. Another mystery. For this one, I might have an explanation: crawler bots. For example, for the 12th of May, according to my server logs, 7,029 requests were made by Googlebot, the crawler that builds the index Google uses to provide you with those useful search results. It turns out that, for a few years now, Googlebot has been executing the JavaScript on your pages and evaluating the result to include it in the search results. The problem is that every page load done by Googlebot will also be counted in my little experiment, while Google Analytics is, hopefully, smart enough not to count Googlebot visits. Except that it is, obviously, a little bit more complicated than that. It is thought that, depending on the page, Google might not load your JavaScript. This is part of their secret sauce, and they have reasons not to be too transparent on this kind of question, since it might help people game Google's algorithms, which can be very lucrative if you manage to do it. Indeed, being the first Google result for a frequently searched term can bring you a lot of traffic.
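Counting crawler requests in server logs usually comes down to matching the User-Agent header. Here is a minimal sketch, assuming the log lines have already been parsed into objects; the `LogEntry` shape, the sample paths and the list of markers are illustrative, and since user agents can be spoofed this is only a first approximation.

```typescript
// Hypothetical, already-parsed access log entries; real log formats vary.
interface LogEntry {
  path: string;
  userAgent: string;
}

// Substrings that identify some well-known crawlers in the User-Agent header.
const crawlerMarkers = ["Googlebot", "bingbot", "Baiduspider"];

function isKnownCrawler(entry: LogEntry): boolean {
  return crawlerMarkers.some(marker => entry.userAgent.includes(marker));
}

const sampleLog: LogEntry[] = [
  {
    path: "/world/2018/",
    userAgent: "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
  },
  {
    path: "/japan/2018/",
    userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0",
  },
];

const crawlerRequests = sampleLog.filter(isKnownCrawler).length;
console.log(crawlerRequests); // 1
```

Note that Googlebot's 7,029 requests alone would not close the gap between 40,750 and 19,105, so other crawlers or other factors would have to be involved as well.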

The crawler bots might explain the difference in accounting between Cloudflare and Google. Google might not count these as visitors, while Cloudflare does. Assuming that the bots use a wide range of IPs, this would make sense given Cloudflare's IP-based definition of visitors. On the other hand, the Google Analytics documentation basically says that if a crawler loads scripts on your page, it will appear as a normal user in the stats. At that point, my only conclusion is 🤷.

But after thinking about all this, something else struck me. It might be obvious to some, but if I have 20% fewer visitors on the site than people who clicked on the results in Google Search, it could be because people give up between their click on the Google result and the loading of my page. According to another Google tool, PageSpeed Insights, the page speed and optimization of the site seem good, with 1.6s to "First Contentful Paint", but I do have a lot of traffic from South America and Asia, and it might be that for them the site is too slow. Not to mention that 1.6 seconds, for all of us suffering more and more from some kind of attention deficit disorder, can sometimes feel like an eternity during which we may change our mind about how we are going to spend the next 10 seconds of our lives. But how can I know whether this happens? Yet again, 🤷.

Conclusion

Frankly, I was tempted to just write in big bold letters:

I HAVE NO IDEA FOR A CONCLUSION, THIS IS WAY TOO COMPLICATED.

or just:

🤷

But let us try to be a bit more constructive: what conclusion can we draw from all this? I think that having exact visitor numbers is an exercise in frustration. There are too many definitions of what a session or a visitor can be and no way to be sure that you are not including traffic from various bots in your numbers.

So, in the end, I would tend to say that you should worry more about two points than the exact number of visitors:

  • The first important point, rather than the exact numbers, is the trend in these numbers, whatever the source (for me, Google Analytics will do), even if you have to be aware of important gotchas. For example, you might see decreases in your visitor count that are due not to a real loss of visitors but to the fact that many of them installed ad blockers, or increases due to the sudden activity of bots (the Baidu bot once brought my site to its knees, for example, with tens of requests per second). Now, how do you estimate the impact of such factors precisely? Again, I HAVE NO IDEA. This is really frustrating, and I am quite sure that whole teams must be devoted to such endeavours at big publications, given the complexity of the question.
  • The second important point is the actual revenue (if your site brings in revenue) that you get at the end of the day. Usually, visitor counts are only a proxy for this. It is also important to notice that increasing the number of visitors will not always bring you more revenue. I, for one, have increased my traffic a lot on populationpyramid.net over the last year, but most of it came from countries where Google Ads do not bring in a lot of revenue. In my case, roughly 90% of the revenue comes from English-speaking countries; other countries and languages simply do not bring the same money per click.

I will probably delve a little more into that last question in another post: the revenue that an ad-funded site can bring per user. But let me already tell you that the current trend seems rather bleak, with numbers falling a little every year, and that you have to get a massive number of visits to make a good living, or even a living that could justify spending a lot of time on a site like mine when you have a family to provide for.

This will probably lead me to look for another revenue source for populationpyramid.net in the near future, but that is okay. I never liked ads anyway, especially for a site that I would like to be used mainly in education, as a free resource.

P.S.: the numbers you see on a site like SimilarWeb are very far from accurate, but they seem to stay within an order of magnitude of the truth and to get the general trends right. 🤷
