Weak visibility measurements require robust procedures

In my previous post I showed that visibility measures taken from News Search Engines (NSEs) appeared to be very reliable, with a test-retest Mean Absolute Percent Accuracy in excess of 99% and a scale Cronbach’s Alpha in excess of 0.992. I also warned that these results had to be tested on a larger sample of concepts.

In this post I’ll focus on the robustness issue; in my next post I’ll follow up with a detailed summary of my findings on the reliability of NSEs.

First, recall that a reliable measure is one that is consistent. We can assess reliability by taking repeated measures of the same metric. Test-retest consistency is certainly desirable but clearly insufficient, as a defective instrument returning constant results would be consistent yet unreliable. We must therefore take repeated measures of a metric on different subjects: a reliable instrument will consistently discriminate between subjects (i.e. if you weigh people and rank them from heaviest to lightest, a perfectly reliable scale will consistently return identical ranks). Better still is to take repeated measures using different instruments. That is the idea behind "convergent reliability".
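To make the rank-consistency idea concrete, here is a minimal sketch in Python. The weights and the Gaussian error model are hypothetical illustrations, not taken from the actual NSE data:

```python
import random

random.seed(0)
true_weights = [55, 62, 70, 78, 85, 93]  # hypothetical subjects, in kilos

def measure(weights, noise_sd):
    """One pass of the scale: the true weight plus a small random error."""
    return [w + random.gauss(0, noise_sd) for w in weights]

def ranks(values):
    """Rank of each value, from lightest (0) to heaviest (n - 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

first = measure(true_weights, noise_sd=0.5)   # test: errors of ~0.5 kilo
second = measure(true_weights, noise_sd=0.5)  # retest
print(ranks(first) == ranks(second))  # a reliable scale preserves the ordering
```

With errors far smaller than the gaps between subjects, the two passes return identical rankings; a noisier or defective scale would not.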

These ideas come largely from the field of psychometrics, where the distribution of errors can reasonably be assumed to be roughly normal. This is not the case with web metrics, where errors can be extreme.

So let me introduce the idea of robustness and contrast it with the idea of reliability (see the Wikipedia article on robust statistics). A robust instrument always returns reasonable values, with normal errors. A weak instrument occasionally breaks down and may return extreme values. Think of a bathroom scale: a robust scale always gives your weight, give or take a couple of kilos; a non-robust scale will sometimes return erratic values like zero kilos or 900 kilos. A non-robust instrument could be conceived of as "delicate", where a deliberate trade-off has been made such that the result is usually very accurate but occasionally something goes wrong, or merely as "weak", where freak events occur in addition to normal errors.
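The distinction is easy to simulate. Below is a sketch with made-up numbers: two instruments that are equally accurate on average, one making small errors on every reading, the other exact except for a single freak reading:

```python
import random

random.seed(1)
true_value = 100.0
n = 100

# "Robust" instrument: every reading is off by a small random amount.
robust = [true_value + random.uniform(-1, 1) for _ in range(n)]

# "Delicate" (or "weak") instrument: exact, except one freak reading.
delicate = [true_value] * n
delicate[42] = 150.0  # a single extreme error

def mean_abs_pct_error(readings):
    """Average absolute error, as a fraction of the true value."""
    return sum(abs(r - true_value) for r in readings) / (len(readings) * true_value)

# Both average out to roughly 0.5% error, i.e. ~99.5% accuracy...
print(mean_abs_pct_error(robust), mean_abs_pct_error(delicate))
# ...but the worst single error differs by two orders of magnitude.
print(max(abs(r - true_value) for r in robust),
      max(abs(r - true_value) for r in delicate))
```

Averages hide the difference entirely; only the shape of the error distribution tells the two instruments apart.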

Consider the figure below, where 100 consecutive measurements are reported. Both instruments are identically reliable (99.5% accurate); the difference is that errors are small and random in one case, and rare but extreme in the other. Now, let me ask: which would you rather use, the robust instrument or the delicate one?

[Figure: 100 consecutive measurements from a robust instrument (series 1, small random errors) and a delicate instrument (series 2, occasional extreme errors)]
My answer is that it depends. If you use a robust estimation technique, you should be able to detect extreme values, and you will prefer series 2, produced by a delicate instrument that is usually right on target. If, on the other hand, you cannot tell whether a value is extreme or not, then you may prefer series 1, produced by a robust instrument that is never far off the mark.
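One standard way to "tell if a value is extreme" is robust outlier flagging based on the median and the median absolute deviation (MAD). A sketch, using illustrative counts rather than actual NSE data:

```python
def median(xs):
    """Middle value of a list."""
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def mad_outliers(xs, threshold=5.0):
    """Indices of values more than `threshold` scaled MADs from the median."""
    m = median(xs)
    mad = median([abs(x - m) for x in xs])
    if mad == 0:  # all values identical: anything different is extreme
        return [i for i, x in enumerate(xs) if x != m]
    # 1.4826 rescales the MAD to match the standard deviation under normality.
    return [i for i, x in enumerate(xs) if abs(x - m) / (1.4826 * mad) > threshold]

counts = [980, 1010, 1071, 995, 585000]  # one "freak" count among normal ones
print(mad_outliers(counts))  # → [4]
```

Because the median and the MAD are themselves barely affected by the freak value, the extreme point sticks out clearly, where a mean-and-standard-deviation rule would be dragged toward it.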

Extreme values occur quite frequently on the web. In my next post I will report on a set of 9 queries for each of 1 000 concepts. Out of the 9 000 queries, 39 were obvious outliers (no result after 6 tries). But this is only the tip of a rather large iceberg, as there are hundreds of suspect values, including some truly extreme outliers.

This is neither really surprising nor trivial. Not surprising, as extreme errors may have several causes, either at the source (an NSE returns an erroneous value) or during processing (a parsing error). Not trivial, as the magnitude of these extreme values dwarfs the true correlation. Consider the scattergram below, where counts for 1 000 queries to Google News and Yahoo! News are displayed.

[Figure: scattergram of Google News vs Yahoo! News counts for 1 000 queries, including the two extreme outliers]
For some reason, on October 7th, Yahoo! reported a count of 585 000 news items for the word "Loveridge", compared to just one instance in Google News. This is not a parsing error, as AllTheWeb, owned by Yahoo!, returned 750 000 instances, while the other NSEs returned very low counts.

At the other end of the spectrum, Google News reported a count of 72 000 news items for "Lies", compared to 1 071 at Yahoo! News. Other NSEs also reported counts in the neighborhood of 1 000.

Keeping these two extremes in the dataset yields a very weak correlation of .02 between Google News and Yahoo! News. Removing just 2 observations brings the correlation close to .75!
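The effect is easy to reproduce on synthetic data. The sketch below uses made-up counts, not the actual Google News / Yahoo! News data: two strongly correlated series, plus two "Loveridge"/"Lies"-style extremes:

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(2)
# 998 well-behaved pairs: Yahoo! counts track Google counts closely.
google = [random.randint(100, 2000) for _ in range(998)]
yahoo = [g + random.randint(-100, 100) for g in google]

# Add two extremes: one NSE huge where the other is tiny, and vice versa.
google_all = google + [1, 72000]
yahoo_all = yahoo + [585000, 1071]

print(round(pearson(google_all, yahoo_all), 2))  # swamped by the two outliers
print(round(pearson(google, yahoo), 2))          # the underlying correlation
```

Two points out of a thousand are enough to flatten a near-perfect correlation, because their squared deviations dominate both variance terms.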

The figure below shows the same points as above, after removal of these 2 extreme outliers.

[Figure: the same scattergram, after removal of the 2 extreme outliers]
So what? First, take more than one measure: three at least, because if you are confronted with a "Loveridge case", where one NSE says "very many" and another says "very few", you’ll need a third if not a fourth data point to be able to tell which is which. Second, eyeballing extreme values will not do: there are just too many data points, many of which are not obvious calls. Since NSEs are not robust instruments, robust estimation is a must.
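As a final sketch of the "take at least three measures" advice, with illustrative counts and a plain median standing in for a full robust estimator:

```python
def median(xs):
    """Middle value; immune to a single freak reading when len(xs) >= 3."""
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

# Hypothetical counts for one concept from three NSEs, one of them freak.
counts = [1, 585000, 3]

print(sum(counts) / len(counts))  # the mean is dragged toward the freak value
print(median(counts))             # the median stays with the majority
```

With only two measures there is no majority to side with: when one NSE says "very many" and the other "very few", the median simply splits the difference, which is exactly why a third, and sometimes a fourth, source is needed.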