The accuracy of web-based visibility metrics

Just a quick post to introduce a research note I am about to file. It presents web-based visibility metrics harvested during the US presidential race (vote held on November 4th, 2008) and the Canadian federal election (vote held on October 14th, 2008).

The most striking finding is the almost supernatural precision of visibility metrics culled from the news search engines. Visibility shares on the eve of the vote were off by 0.2% and 0.3% in the Canadian and US elections respectively.

Table 1: Summary of recent results


Below are radar charts of various metrics computed between 03:00 and 05:00 on the day of the election, where day-to-day indicators (essentially news and blogs) refer to one day earlier (i.e. on the morning of November 4th, news search engines were queried on content published/indexed on November 3rd).

Further below are time series showing how visibility has evolved during the races. It should be noted that indicators other than news are mere averages of a collection of ratios computed for each relevant search engine (see the list at the bottom of this post). For the news index, the metric is the arithmetic average of the returns reported by Factiva, Google News and Yahoo! News for the Canadian race. In addition, NorthernLight and Bloglines were used to compute an average score for the US race. These rules were established based on a reliability assessment made on independent data (no data fishing here).
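As a rough sketch of how such an index comes together, here is a minimal computation of visibility shares averaged across engines. The engine names are real, but the counts below are invented for illustration:

```python
# A visibility-share index: each engine yields a share, and the index is
# the arithmetic average of those shares. Counts are made up.
counts = {
    "Google News": {"Obama": 5200, "McCain": 4100},
    "Yahoo! News": {"Obama": 4800, "McCain": 3900},
    "Factiva":     {"Obama": 3100, "McCain": 2700},
}

def visibility_share(candidate):
    shares = []
    for engine_counts in counts.values():
        total = sum(engine_counts.values())
        shares.append(engine_counts[candidate] / total)
    return sum(shares) / len(shares)

print(round(visibility_share("Obama"), 3))
```

By construction, the two candidates' shares within each engine sum to one, so the averaged shares do too.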

I plan to recalibrate each index before the end of the year, and publish revised time series.

As a final note, the last chart traces the news visibility index published by Factiva in 2004. This is perhaps what is most important — just a few years ago, media were both off and slow to react. My inclination is to think that blogging has changed the nature of news reporting, although one could also argue that indexing and searching have improved.

Fulltext is here.

Figure 1: Visibility shares, Canada, October 13th


Figure 2: Visibility shares, US, November 3rd


Figure 3: Visibility shares, Stephane Dion


Figure 4: Visibility shares, Barack Obama


Figure 5: Visibility and delegates shares, Howard Dean & John Kerry, 2004


Table 2: List of search engines


Don’t we ever learn anything?

I've been quite busy traveling, crunching numbers and writing. I am late in finishing a post on what has been eating my days of late: visibility indicators. It will probably be another two weeks before I can find the time. So let me apologize for venting some frustration. On oil prices. And on the procession of grave-looking analysts sharing their projections on how low the barrel will go.

Am I the only one cringing whenever a commentator shakes his head in disbelief over the falling price of oil? Has anyone noticed that prices are falling to the level at which they were 2-3 years ago? Has anyone wondered how emerging economies could sustain the recent prices at which oil was traded?

I am not saying that I understand or can forecast better than anybody else. But I am profoundly irritated by those who do (say that they know).

So, for my education, I searched for historical data. Thanks to the Energy Information Administration, here it is, all the way back to 1859. This is not a typo. More than a century of prices. And here is what it looks like:


Series like that are never easy to understand, because exponential growth obliterates older data points. Better to take the logs. In case you have no idea what a logarithm is, let's just say that it removes the main curvature in a series. It's easier if you look at the chart below (starting in 1900):


Same data as the first chart, except that this time it is expressed as a power of 10 (the base-10 logarithm, actually). So when you read 1, it means a barrel at $10. When you read zero, it means a price of $1. When you read 2, a price of $100. You probably get the idea.
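For the curious, the transformation is nothing more exotic than this tiny sketch:

```python
import math

# log10 maps $1 to 0, $10 to 1, $100 to 2; the $147 peak lands near 2.17.
for price in [1, 10, 100, 147]:
    print(price, "->", round(math.log10(price), 2))
```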

Magically, the chart starts to be intelligible. It is more or less a straight line, moving around in irregular cycles.


Well. For one, when the price hit $147 (2.17 on a log10 scale), that was newsworthy. Such a value is off the chart, especially if one notices that 2007 had not jumped much higher than 2006.

Two, it "looks as if" when prices fall significantly, they take a long while before climbing back to the previous level.

But then, obviously, it all depends on your understanding of the data. I read tonight that some major broker had forecast a rebound to $100 (2 on the log scale) within a year. That could make sense if you use only recent history (i.e. from 1998 onwards) to extrapolate. You would say that 2007 was a slow year, 2008 a spike, and that 2009 will return to the "trend line".

On the other hand, there were not that many years of free-falling oil prices during the past century. I see three bad years following the peak of 1920, and 1997. I also see two major peaks (1920 and 1982) after which the world economy has adapted.

So my personal inclination would be to bet on a rather long contraction. With repercussions everywhere.

But what worries me the most is to read predictions such as the one I mentioned above, out of the blue, and with this amazing quote: "We're in a global recession now, and you've got to be close to the bottom […]" reported by Bloomberg, as carried by CNBC, as made by Boone Pickens…

Last time I looked, the world economy was still growing. We might certainly be entering a phase of contraction of the global economy, but we are not there yet. Not at all.

For reference, the latest IMF forecast was global growth of 2.2% in 2009. And it was made on November 6th.

Weak visibility measurements require robust procedures

In my previous post I showed that visibility measures taken from News Search Engines (NSE) appeared to be very reliable, with a test-retest Mean Absolute Percent Accuracy in excess of 99% and a scale Cronbach’s Alpha in excess of 0.992. I also warned that these results had to be tested on a larger sample of concepts.

In this post I'll focus on the robustness issue. I'll follow up with a detailed summary of my findings on the reliability of NSEs in my next post.

First, recall that a reliable measure is a measure that is consistent. We can assess reliability by taking repeated measures of the same metric. Test-retest consistency is certainly desirable but clearly insufficient, as a defective instrument returning constant results would be consistent yet unreliable. We must therefore take repeated measures of a metric on different subjects. A reliable instrument will consistently discriminate between subjects (i.e. if you weigh people and rank them from heaviest to lightest, a perfectly reliable scale will consistently return identical ranks). A better way still is to take repeated measures using different instruments. That is the idea behind "convergent reliability".
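To make the rank-consistency idea concrete, here is a small sketch (the weights are made up): a reliable scale returns the same ordering of subjects on both weighings, even if individual readings differ slightly.

```python
# Two weighings of the same five people on a reliable (made-up) scale.
first  = [82.1, 64.3, 95.0, 71.8, 58.2]
second = [82.4, 64.1, 94.6, 72.0, 58.5]

def ranks(xs):
    # Rank each subject from heaviest (0) to lightest.
    order = sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

print(ranks(first) == ranks(second))  # identical orderings
```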

These ideas come largely from the field of psychometrics, where the distribution of errors could reasonably be assumed to be somewhat normal. But this is not the case with web metrics, where errors can be extreme.

So let me introduce the idea of robustness and contrast it with the idea of reliability (see a wikipedia article here). A robust instrument always returns reasonable values, with normal errors. A weak instrument occasionally breaks down and may return extreme values. Think of a bathroom scale — a robust scale always gives your weight, give or take a couple of kilos; a non-robust scale will sometimes return erratic values like zero kilos or 900 kilos. A non-robust instrument could be conceived as "delicate", where a deliberate trade-off has been made such that the result is usually very accurate but occasionally something goes wrong, or merely as "weak", where freak events occur in addition to normal errors.

Consider the figure below, where 100 consecutive measurements are reported. Both instruments are identically reliable (99.5% accurate). The difference is that errors are random in one case, and extreme in the other. Now, let me ask: which scale would you rather use, the robust scale or the delicate scale?


My answer is that it depends — if you use a robust estimation technique, you should be able to detect extreme values and you will prefer series 2, produced by a delicate instrument which is usually right on. If, on the other hand, you cannot tell if a value is extreme or not, then you may prefer series 1, produced by a robust instrument which is never far off the mark.
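A small simulation may make the trade-off concrete. Assuming a true weight of 70 kg (my invention), the sketch below contrasts a robust instrument (modest errors everywhere) with a delicate one (tiny errors, plus two freak readings), and shows how a robust estimator such as the median shrugs off the breakdowns:

```python
import random
import statistics

random.seed(42)
true_weight = 70.0

# Robust instrument: every reading is close, none wildly off.
robust = [true_weight + random.gauss(0, 0.35) for _ in range(100)]

# Delicate instrument: usually spot-on, but it occasionally breaks down.
delicate = [true_weight + random.gauss(0, 0.05) for _ in range(100)]
delicate[17] = 0.0    # a freak reading
delicate[63] = 900.0  # another one

# The median shrugs off the two freak values,
# whereas the mean is pulled far from the true weight.
print(statistics.median(delicate))
print(statistics.mean(delicate))
print(statistics.mean(robust))
```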

Extreme values occur quite frequently on the web. In my next post I will report on a set of 9 queries about 1 000 concepts. Out of 9 000 queries, 39 were obvious outliers (unable to get a result after 6 tries). But this is the very tip of a quite large iceberg, as there are hundreds of suspect values, including some really extreme outliers.

This is neither really surprising nor trivial. Not surprising, as extreme errors may have several causes, either at the source (an NSE returns an erroneous value) or during the treatment (a parsing error). Not trivial, as the magnitude of these extreme values dwarfs the true correlation. Consider the scattergram below, where counts for about 1 000 queries to Google News and Yahoo! News are displayed.


For some reason, on October 7th, Yahoo! reported a count of 585 000 news items for the word "Loveridge" compared to just one instance in Google News. This is not a parsing error, as AllTheWeb, owned by Yahoo!, returned 750 000 instances, with other NSEs returning very low counts.

At the other end of the spectrum, Google News reported a count of 72 000 news items for "Lies" compared to 1 071 at Yahoo! News. Other NSEs also reported counts in the neighborhood of 1 000.

Keeping these two extremes in the dataset yields a very weak correlation of .02 between Google News and Yahoo! News. Removing just 2 observations brings the correlation close to .75!
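The sensitivity is easy to reproduce. The sketch below simulates two engines that broadly agree on 998 hypothetical concepts, then adds two "Loveridge"-style breakdowns; the counts are invented, but the effect on Pearson's r mirrors what happens in the real data:

```python
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(1)
# 998 hypothetical concepts on which both engines broadly agree...
google = [random.randint(100, 2000) for _ in range(998)]
yahoo = [int(g * random.uniform(0.8, 1.2)) for g in google]
# ...plus two "Loveridge"-style breakdowns.
google += [1, 72000]
yahoo += [585000, 1071]

print(round(pearson(google, yahoo), 2))            # near zero
print(round(pearson(google[:-2], yahoo[:-2]), 2))  # strongly positive
```

Two points out of a thousand are enough to annihilate the correlation, because they dominate both the covariance and the standard deviations.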

The figure below shows the same points as above, after removal of these 2 extreme outliers.


So what? First, take more than one measure. Three at least, because if you are confronted with a "Loveridge case" where one NSE says "very many" and another says "very few", you'll need a third if not a fourth data point to tell which is which. Second, eyeballing extreme values will not do. There are just too many data points, many of which are not obvious calls. Since NSEs are not robust instruments, robust estimation is a must.
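With three or more readings, a robust summary is a one-liner. In the sketch below, the 585 000 (Yahoo! News), 750 000 (AllTheWeb) and 1 (Google News) counts are the "Loveridge" figures from above; the two low counts are made-up stand-ins for the other NSEs that returned very few items:

```python
import statistics

# Counts for "Loveridge" across five NSEs (two stand-in values).
loveridge = {"Yahoo! News": 585000, "AllTheWeb": 750000,
             "Google News": 1, "Factiva": 3, "Live": 2}

# The median discounts the two inflated readings.
print(statistics.median(loveridge.values()))
```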

Visibility metrics in the news media – a few observations

Before I present and briefly discuss the data, let me clarify a few notions.

By metric, I mean "some abstraction that can be measured." What you weigh is a simple metric. How happy you are is not. In this post we deal with a simple metric — how visible is a concept (say… football) in the news media.

By measure I mean applying an instrument to a metric. Step on a scale and we can take a reading of how much you weigh. In this post, the instruments are news search engines.

By index I mean an aggregate of measures. The main purpose of an index is to yield a more reliable figure. In this post I report on the following news engines: Ask, AllTheWeb, Bloglines, Google, Factiva, Live, Topix and Yahoo!

By reliable I mean that an instrument produces consistent estimates of a metric for a specific concept, i.e. if you take the measure twice, you should get the same value.

By robust I mean that an index is designed in such a way as to appropriately discount bogus values. More about this in a future post.

To follow on the sports example I introduced a couple of posts back, Table 1 shows the number of results returned by each news search engine for 5 sports. Whenever possible, the search specified a single day (October 1st), but no region or language restriction.

Two observations should be made. First, there are fairly large differences across engines. Setting aside Ask's, Live's and Topix's figures (see notes (1) and (3)), counts range from a low of 2,293 news items to a high of 6,740 news items for football. Second, all engines produce consistent estimates — on successive requests, counts returned by an engine will generally not vary by much.

Consistent yet different counts need not be a matter of concern if counts vary merely by some fixed proportion (i.e. Yahoo! always returning more items than Factiva). If that were not the case (i.e. sometimes Yahoo! claims more items, sometimes Factiva does), then there would be a problem. To answer this, we must correlate instruments (the search engines) across concepts (the sports).

Table 2 shows how results correlate across news search engines. We can readily see that Ask and Live are poor indicators in our example, because their counts have reached their ceilings, therefore providing no information on relative visibility — according to Ask, cricket is the most visible sport, which is dubious as all other engines but one put it at the bottom. We can also see that Topix's overall stock of news items correlates as well as any other engine's.

Table 3 shows the customary reliability statistics for our array of measures (which could become an index). The Cronbach's alpha value (more or less similar to the average correlation between items, varying between zero for pure noise and one for a totally reliable scale) can be very high — as high as 0.992 if we merely remove the counts returned by Ask and Live.
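For reference, Cronbach's alpha is easy to compute from raw counts with the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of totals). The three engines' counts below are invented, but chosen to broadly agree the way real NSE counts do:

```python
def cronbach_alpha(items):
    # items: one list of counts per instrument, same concepts, same order.
    k = len(items)
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    totals = [sum(col) for col in zip(*items)]
    return k / (k - 1) * (1 - sum(var(i) for i in items) / var(totals))

# Invented counts for 5 concepts from 3 hypothetical engines:
engine_a = [2293, 410, 120, 3500, 980]
engine_b = [2600, 450, 150, 3900, 1000]
engine_c = [2500, 380, 100, 3600, 950]
print(round(cronbach_alpha([engine_a, engine_b, engine_c]), 3))
```

Because the three series track each other closely, alpha comes out very close to one.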

So far so good, but not good enough, as the concepts used in this example are in no way representative of the search universe. In my next post I'll present results derived from a much larger and reasonably diverse sample of concepts.


Table 1


(1) Notice that Ask and Live cap results. Ask will not return counts above about 800 pages; Live will not return counts above 1,100 pages or so.
(2) The results figure provided by Google News appears to refer to the total stock of active news items (presumably a month). If you click to list news items of the past week, day or hour, the results count stays the same. The numbers I report are my own estimates, based on the number of items available or the rate at which the past 1,000 items have been published.
(3) Topix doesn't allow narrowing the search to a specific day or date. Figures refer to the stock of active news items.
(4) MAPA stands for Mean Absolute Percent Accuracy. It is computed as (1 – MAPE), where MAPE is the well-known Mean Absolute Percent Error, routinely used to compare forecasting methods. These figures were computed based on 5 bursts of 50 consecutive requests made at intervals of less than 1 second (because over longer intervals, differences might arise from real changes in the underlying metric, i.e. new news coming in, old news moving out).
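The MAPA computation can be sketched in a few lines. The readings below are invented, and the reference value is taken to be the mean of the burst (a simplifying assumption on my part):

```python
def mapa(readings):
    # MAPA = 1 - MAPE; reference taken as the mean of the burst.
    ref = sum(readings) / len(readings)
    mape = sum(abs(r - ref) / ref for r in readings) / len(readings)
    return 1 - mape

# One burst of 5 repeated counts for the same query (invented):
print(round(mapa([2293, 2301, 2288, 2295, 2290]), 4))
```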

Table 2


Table 3


Note: (5) The 6-item index excludes Topix, Ask and Live; the 7-item index excludes Ask and Live.

Job posting – Interactive marketing coordinator

A job posting relayed by a former student


Position: Web and interactive marketing coordinator – Maternity leave replacement
Position based in Québec City, at the Hôtel ALT Québec.
Start date: December 2008

Job description

Reporting to the Marketing Director, your main role will be to develop interactive strategies that meet the needs of internal clients, while acting as a consultant on Web and interactivity matters.

You have a background in communications, marketing or another relevant field. Perfectly bilingual, you are skilled at understanding the needs and requests of different types of clients. An effective communicator, you have leadership and a knack for maintaining good interpersonal relationships. In addition to being autonomous, you know how to manage change effectively. Finally, you have an excellent knowledge of the Internet (Web 2.0) as well as an interest in traditional marketing (writing and translation, promotions, etc.)

Your role does not stop there! Once in the position, you will:

•    manage Web projects, schedules, resources and budgets;
•    follow up on projects and coordinate with the internal team and external stakeholders;

Requirements / desired skills

•    Excellent knowledge of Web marketing and of all stages of carrying out a Web project;
•    Good knowledge of the Internet, blogs, transactional websites and the development of (social/professional) networks such as Facebook;
•    University degree in communications or marketing;
•    2 to 5 years of experience in interactive marketing;
•    Ability to plan, develop and coordinate several projects and events;
•    Good organizational skills, dynamic and able to work in a fast-moving environment;
•    Recognized for interpersonal skills and the ability to work in a multidisciplinary team (sales, marketing and hospitality).

Other information

If you keep up with interactive marketing news and would like to develop and grow with projects as innovative as you are, send us your CV at:

Only selected candidates will be contacted. Working conditions will be discussed at the interview.

Sphere’s blog search deprecated

Yesterday, AOL pulled the plug on Sphere's blog search engine. AOL had purchased Sphere in April 2008 (see here).

Sphere launched in 2006 but quickly moved to expand its services by providing content consolidation. Their idea was to offer a widget that would display related posts and news. They phased out their blog search, which was still working — if you knew the URL.

The strange thing is that now, if you go to the old Sphere search window, you end up on AOL's search portal, where you can search the Web, news, images and videos, and several other specialized domains (i.e. music), but, paradoxically, not the blogs.

This could be a sign that the "blogosphere" is in some kind of trouble. There are so many splogs (fake blogs) that searching for blog-specific content doesn’t have much appeal.

Technorati is doing fairly well according to Alexa (see below), maybe because you can restrict your search to blogs with "some" authority, weeding out the splogs. Other blog search tools show much less volume, and a decline.


Is visibility everything? My 2-bit sentiment analysis

Today’s question is whether it makes sense to rely on raw visibility metrics, without any consideration given to what is visible. As I pointed out earlier, raw visibility metrics correlate quite well with the apparent popularity of candidates in various electoral races, so why bother with content?

Well… John Edwards was highly visible in the news last August, for events that have had a major, negative, impact on his political fortunes. So any tracking system worth its salt should provide some information on content as well as on the visibility of that content, shouldn't it? And several start-ups promise to do just that: Jodange, Sentimine, Corpora. And there is this amazing academic site.

But sentiment analysis — as it is often called — is not an easy task (read this entry for a feel).

If you are to rely on automated procedures, there are syntactic traps. For instance, would a Natural Language Processor (NLP) understand that "XYZ is not good, he (she) is unbelievable!" carries positive sentiment, or would it be fooled by the "not good"?

And context matters (this phone is very small / this apartment is very small). A smaller phone is preferable to a larger one, at least up to a point, and a large apartment is worth more than a small one. Yes, we can teach machines, but there are so many variables that the task is herculean.

And then we have to consider that a fact may be interpreted differently by different persons. "This politician is pro-life" (i.e. opposed to abortion) could carry a positive, neutral or negative sentiment. That depends on the writer's views; the writer may support, attack or merely state the politician's position. And then, suppose a pro-lifer writes (favorably) about a pro-life candidate. That doesn't mean that the reader will interpret the statement favorably. And then consider the case of a pro-choice writer writing favorably about a pro-life candidate?

Then what? One way is to forget about the bases for sentiment. After all, facts are not sentiments (even though several facts carry obvious sentiments — headlines such as "XYZ convicted of first-degree murder"). Ignoring facts, we may focus on valence and care only for statements such as "[so and so] is cool, [so and so] is good, I like [so and so]".

So? My 2-bit sentiment analysis goes like this: compute the ratio of pages including the word "good" over pages including the word "bad", in addition to the concept of interest. For instance, how many pages are returned for the query ["Microsoft Vista" +good]? How many pages for ["Microsoft Vista" +bad]? Compute the ratio (or the share if you prefer). Et voilà! It cannot be simpler.
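As a sketch (the hit counts below are invented; in practice they would come from querying a search engine):

```python
# Hypothetical hit counts for ["Microsoft Vista" +good] and [+bad].
good_hits = 412000
bad_hits = 388000

def two_bit_share(good, bad):
    # Share of "good" pages among pages matching either word.
    return good / (good + bad)

print(round(two_bit_share(good_hits, bad_hits), 3))
```

A share above 0.5 means "good" outnumbers "bad" for the concept; below 0.5, the reverse.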

Is this a joke or a useful metric? Well, consider this comparison between Microsoft's Vista and Apple's OSX. The graphs below suggest that Vista is more visible than OSX, whereas OSX has a higher 2-bit sentiment score on all dimensions except the NEWS vector. Plausible.

And how does such 2-bit sentiment analysis fare in the political realm? Surprisingly well. Below are scattergrams of sentiment scores for leading candidates in the US and Canadian races, comparing the 2-bit sentiment metric with textMap's NLP-derived scores (textMap doesn't provide scores for Sarah Palin or Elizabeth May, two relatively recent figures in the political realm).


We can see that both metrics agree quite well on relative sentiment. In fact, it would be difficult to decide which scale is more appropriate. On the other hand, a case-by-case examination of the 2-bit sentiment metrics is not for the faint of heart, as the words "good" and "bad" are very often used in completely irrelevant contexts (a candidate is interviewed on "Good Morning America"), or with diametrically opposed sentiments (a candidate was "not good").

But let's not worry for the moment, as the first order of business is to ascertain the general reliability of www-based metrics. If we can establish that they are — not to worry, they can be — it will certainly be interesting to explore the validity of "automated n-bit sentiment metrics" 🙂