Swine flu — there is at least one good thing about it…

We are in the midst of an information storm about an apprehended flu pandemic. The data clearly shows a spike starting 4-5 days ago, in the news as well as in the blogs, where the number of indexed documents referring to flu, of any kind, shows a ten to hundred-fold increase. Not surprising, you might think. But what is surprising, at least to me, is that Google Trends' special flu microsite shows absolutely no sign of increased activity.

This is odd — I would have bet the house that people are searching [flu] in near record numbers. In fact, I added [flu] to the list of concepts I track precisely in order to examine the lead-lag relationship between search and visibility (i.e. do people search for complementary information on what has recently become visible, or does search signal interest that is a precursor of visibility?).

Unfortunately, "search for search" is much more difficult than searching for visibility. Looking at Yahoo! Buzz, I see that swine flu was the top search during the past hour. Looking at Google trends, the current spike is obvious.

So, yes, people do search for complementary information, like crazy. And this is certainly a case of reactive search (i.e. a news report initiated the process). The next question is why this surge is not showing up on the flu microsite. Is it sophisticated enough to distinguish between search terms revealing actual infections (such as [I am sick with the flu]) and queries motivated by curiosity?

One plausible answer is that the flu microsite is updated once a week and that the spike will show up at the next update. If that is the case, we will have to reconsider the punch line that Google Trends provides a two-week lead on the number of flu cases reported by the CDC. What was "white magic" (as in white hat hackers — the use of the "infinite power of IT" to do good) would take a blow, as the number of searches would be mere "experiential coincidence" (i.e. people search for flu simply because they anticipate getting sick, just as people are more likely to search for [sunburns] during the summer; people search for Caribbean beaches *before* they go on holidays, and so on).

So I am eagerly awaiting the microsite's next update, as I am curious to see how sophisticated this search analysis is. If it turns out that it is, I will worry that this information is proprietary. And if it isn't, I will be disappointed by the limited wisdom of the crowds… So it is win-win 🙂

Wikia Search is closing

In my previous post I reported that the open-search index Wikia had stopped working. Earlier today I stumbled across this blog entry from Jimmy Wales, yes — Wikipedia's founder. Wales says that Wikia has stopped funding Wikia Search.

Search.wikia.com is still live, but likely to become useless in a short while.

This is very unfortunate, as we need diversity in indexing if webometrics (the science of analyzing the web corpus) is to prosper. With Wikia Search going under, and Cuil looking as if it might follow (according to Alexa, its traffic has all but vanished), triangulation becomes more difficult.

Is there such a thing as Spring fever for search engines?

On April 1st, Alexa changed its search logic, moving from document-level to site-level searching. Probably a good thing, actually.

Since April 4th I have been losing Nutch (Wikia's open index). The first problem occurred in March, when open-index.visvo.com vanished. And now search.isc.org has gone silent. This index was very small compared to the field leaders, but it did provide a useful validity check (if two leading indexes give radically different counts, even an imprecise third party helps single out extreme errors).


And now Sphere is changing its search tool. It is no longer possible to discriminate between news items and blogs. More importantly, it is no longer possible to perform quoted searches.


A quoted search is one in which multiple terms are enclosed in quotes (sometimes called a phrase search). As you may guess, there are relatively few news/blog items about the Canadian Green Party leader, Elizabeth May. Google reports 11 blog and 3 news items. But that is if you search for ["Elizabeth May"]. Remove the quotes and you jump to 17k blog and 10k+ news items, about an assortment of Elizabeths but mostly about things that happen in May or things that *may* happen. The ability to perform quoted searches of people's names is crucial if you want relevant results.
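For those who script their queries, the difference is simply whether the quotation marks travel with the query string. A minimal Python sketch (the `q` parameter name is an assumption; each engine names its parameters differently):

```python
from urllib.parse import urlencode

phrase_query = urlencode({"q": '"Elizabeth May"'})  # phrase search: both words together, in order
loose_query = urlencode({"q": "Elizabeth May"})     # loose search: any Elizabeth, any May

print(phrase_query)  # q=%22Elizabeth+May%22
print(loose_query)   # q=Elizabeth+May
```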


I can only hope that Sphere will restore its quoted search capability (I would guess that the current situation is just an oversight). Here again, not because the world relies on Sphere to get news about various topics, but because sites like theirs are becoming more and more important as validation tools.

[EDIT]

Typing [Elizabeth+May] will (apparently) yield the desired result. 

Alexa provides a new and potentially interesting visibility metric

Alexa (an Amazon company) used to return the number of entries found in the top n sites. I routinely queried it to get the number of mentions for specific concepts appearing in the top 100 sites. 

Since April 1st, it instead returns the number of sites "about" a concept, within a user-defined bracket running from the top 1 to the top 10M sites.

Their FAQ indicates that their crawler is designed to find sites rather than documents. There is no doubt that Alexa was doing a rather poor job of indexing documents. On March 31st it returned 337K documents pertaining to "Barack Obama" vs 4,140K pertaining to "John McCain", a rather surprising ratio of 12-to-1 in favor of McCain. (Google returned 172M documents mentioning Barack Obama vs 34.5M mentioning John McCain.)

Alexa's new site search returns a plausible 3,000 sites worldwide "about" Barack Obama vs 176 "about" John McCain. 

A very preliminary investigation raises questions about the value of such information (Alexa reports close to 52,000 sites about the iPhone…) but opens tantalizing opportunities, such as finding out whether sites about a given concept are more or less prevalent as we move up in the rankings.
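To make the idea concrete, here is a toy sketch of the kind of tabulation that bracketed site counts would allow; every number in it is invented for illustration:

```python
# Share of sites "about" a concept within successively larger slices of the ranking.
bracket_sizes = {"top 1K": 1_000, "top 10K": 10_000, "top 100K": 100_000,
                 "top 1M": 1_000_000, "top 10M": 10_000_000}
hypothetical_hits = {"top 1K": 12, "top 10K": 70, "top 100K": 410,
                     "top 1M": 2_600, "top 10M": 52_000}  # invented site counts

for bracket, size in bracket_sizes.items():
    prevalence = hypothetical_hits[bracket] / size
    print(f"{bracket:>8}: {prevalence:.4%} of sites")
```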

The accuracy of web-based visibility metrics

Just a quick post to introduce a research note I am about to file. It presents web-based visibility metrics harvested during the US presidential race (vote held on November 4th, 2008) and the Canadian federal election (vote held on October 14th, 2008).

The most striking finding is the almost supernatural precision of visibility metrics culled from the news search engines. Visibility shares on the eve of the vote were off by 0.2% and 0.3% in the Canadian and US elections, respectively.

Table 1: Summary of recent results

CM Capture 1

Below are radar charts of various metrics computed between 03:00 and 05:00 on the day of each election, where day-to-day indicators (essentially news and blogs) refer to one day earlier (i.e. on the morning of November 4th, news search engines were queried for content published/indexed on November 3rd).

Further below are time series showing how visibility evolved during the races. It should be noted that indicators other than news are simple averages of a collection of ratios computed for each relevant search engine (see the list at the bottom of this post). For the news index, the metric is the arithmetic average of the returns reported by Factiva, Google News and Yahoo! News for the Canadian race. In addition, NorthernLight and Bloglines were used to compute an average score for the US race. These rules were established on the basis of a reliability assessment made on independent data (no fishing here).
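For what it's worth, here is a minimal sketch of one way to read the construction described above: convert each engine's raw counts into shares, then average the shares across engines. The engine names are real, but the counts are invented for illustration:

```python
counts = {  # engine -> {candidate -> number of news items returned}; invented numbers
    "Factiva":     {"Obama": 5200, "McCain": 4100},
    "Google News": {"Obama": 9800, "McCain": 7300},
    "Yahoo! News": {"Obama": 8700, "McCain": 6900},
}

def shares(engine_counts):
    """Turn one engine's raw counts into visibility shares that sum to 1."""
    total = sum(engine_counts.values())
    return {name: n / total for name, n in engine_counts.items()}

per_engine = [shares(c) for c in counts.values()]
index = {name: sum(s[name] for s in per_engine) / len(per_engine)
         for name in per_engine[0]}
print(index)  # averaged visibility shares, summing to 1 (Obama is about 0.56 here)
```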

I plan to recalibrate each index before the end of the year and publish revised time series.

As a final note, the last chart traces the news visibility index published by Factiva in 2004. This is perhaps the most important point — just a few years ago, the media were both off the mark and slow to react. My inclination is to think that blogging has changed the nature of news reporting, although one could also argue that indexing and searching have improved.

The full text is here.


Figure 1: Visibility shares, Canada, October 13th

CM Capture 2

Figure 2: Visibility shares, US, November 3rd

 
CM Capture 6

Figure 3: Visibility shares, Stephane Dion


CM Capture 7


Figure 4: Visibility shares, Barack Obama


CM Capture 8


Figure 5: Visibility and delegates shares, Howard Dean & John Kerry, 2004

CM Capture 9

Table 2: List of search engines

CM Capture 10


Weak visibility measurements require robust procedures

In my previous post I showed that visibility measures taken from News Search Engines (NSE) appeared to be very reliable, with a test-retest Mean Absolute Percent Accuracy in excess of 99% and a scale Cronbach’s Alpha in excess of 0.992. I also warned that these results had to be tested on a larger sample of concepts.

In this post I'll focus on the robustness issue. I'll follow up with a detailed summary of my findings on the reliability of NSEs in my next post.

First, recall that a reliable measure is one that is consistent. We can assess reliability by taking repeated measures of the same metric. Test-retest consistency is certainly desirable but clearly insufficient, as a defective instrument returning a constant value would be perfectly consistent yet useless. We must therefore take repeated measures of a metric on different subjects: a reliable instrument will consistently discriminate between subjects (i.e. if you weigh people and rank them from heaviest to lightest, a perfectly reliable scale will consistently return identical rankings). A better way still is to take repeated measures using different instruments. That is the idea behind "convergent reliability".
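In code, the distinction boils down to correlating an instrument with itself across repeated passes versus correlating two different instruments over the same set of concepts. A toy illustration with invented counts:

```python
import numpy as np

# Counts for the same 6 concepts: engine A measured twice, engine B once (invented numbers).
engine_a_t1 = np.array([120, 4300, 75, 980, 15000, 640])
engine_a_t2 = np.array([118, 4310, 74, 985, 14950, 642])  # a second pass, minutes later
engine_b    = np.array([200, 6100, 130, 1500, 22000, 900])

test_retest = np.corrcoef(engine_a_t1, engine_a_t2)[0, 1]  # same instrument, repeated
convergent  = np.corrcoef(engine_a_t1, engine_b)[0, 1]     # two different instruments

print(f"test-retest r = {test_retest:.3f}, convergent r = {convergent:.3f}")
```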

These ideas come largely from the field of psychometrics, where the distribution of errors can reasonably be assumed to be roughly normal. This is not the case with web metrics, where errors can be extreme.

So let me introduce the idea of robustness and contrast it with the idea of reliability (see the Wikipedia article here). A robust instrument always returns reasonable values, with normal errors. A weak instrument occasionally breaks down and may return extreme values. Think of a bathroom scale — a robust scale always gives your weight, give or take a couple of kilos; a non-robust scale will sometimes return erratic values such as zero kilos or 900 kilos. Non-robust can be conceived of as "delicate", where a deliberate trade-off has been made such that the result is usually very accurate but occasionally something goes wrong, or merely as "weak", where freak events occur on top of normal errors.

Consider the figure below, where 100 consecutive measurements are reported. Both instruments are identically reliable (99.5% accurate). The difference is that errors are random in one case and extreme in the other. Now, let me ask: which scale would you rather use, the robust one or the delicate one?

Pic0

My answer is that it depends — if you use a robust estimation technique, you should be able to detect extreme values, and you will prefer series 2, produced by a delicate instrument that is usually right on. If, on the other hand, you cannot tell whether a value is extreme or not, then you may prefer series 1, produced by a robust instrument that is never far off the mark.
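To make the contrast concrete, here is a rough simulation in the same spirit as the figure (the numbers are invented, not taken from it): a robust instrument that is always slightly off, and a delicate one that is exact except for two freak readings, tuned so that both come out around 99.5% accurate.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value, n = 100.0, 100

robust = true_value + rng.normal(0, 0.63, n)   # always a little off (mean abs error ~0.5)
delicate = np.full(n, true_value)
delicate[[10, 60]] = [75.0, 125.0]             # exact, except for two freak readings

def mapa(x):  # Mean Absolute Percent Accuracy, as used elsewhere on this blog
    return 1 - np.mean(np.abs(x - true_value) / true_value)

for name, x in [("robust", robust), ("delicate", delicate)]:
    worst = np.max(np.abs(x - true_value))
    print(f"{name:>8}: MAPA = {mapa(x):.2%}, worst single error = {worst:.1f}")
```

Both instruments score about the same on average accuracy; what differs is the size of the worst single error, which is exactly what a robust estimation step has to catch.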

Extreme values occur quite frequently on the web. In my next post I will report on a set of 9 queries about 1,000 concepts. Out of 9,000 queries, 39 were obvious outliers (unable to get a result after 6 tries). But this is the very tip of a rather large iceberg, as there are hundreds of suspect values, including some truly extreme outliers.

This is neither really surprising nor trivial. Not surprising, because extreme errors can have several causes, either at the source (an NSE returns an erroneous value) or during processing (a parsing error). Not trivial, because the magnitude of these extreme values swamps the true correlation. Consider the scattergram below, which displays the counts returned by Google News and Yahoo! News for 1,000 queries.

Pic1
 

For some reason, on October 7th, Yahoo! reported a count of 585,000 news items for the word "Loveridge", compared to just one instance in Google News. This is not a parsing error, as AllTheWeb, owned by Yahoo!, returned 750,000 instances, while other NSEs returned very low counts.

At the other end of the spectrum, Google News reported a count of 72,000 news items for "Lies", compared to 1,071 at Yahoo! News. Other NSEs also reported counts in the neighborhood of 1,000.

Keeping these two extremes in the dataset yields a very weak correlation of .02 between Google News and Yahoo! News. Removing just these 2 observations brings the correlation close to .75!

The figure below shows the same points as above, after removal of these 2 extreme outliers.

Pic2

So what? First, take more than one measure. Three at least, because if you are confronted with a "Loveridge case", where one NSE says "very many" and another says "very few", you'll need a third if not a fourth data point to tell which is which. Second, eyeballing extreme values will not do. There are just too many data points, many of which are not obvious calls. Since NSEs are not robust instruments, a robust estimation procedure is a must.
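Here is a minimal sketch of what such a robust screen could look like: take the median across engines as the working estimate and flag anything implausibly far from it. The counts are invented, loosely patterned on the Loveridge case, and the 10x threshold is an arbitrary illustrative choice.

```python
from statistics import median

def screen(counts, ratio=10):
    """Median as the working estimate; flag counts more than `ratio` times away from it."""
    m = median(counts.values())
    flagged = {engine: n for engine, n in counts.items()
               if n > ratio * max(m, 1) or n * ratio < m}
    return m, flagged

# Invented counts, loosely patterned on the "Loveridge" case described above.
loveridge = {"Google News": 1, "Yahoo! News": 585_000, "AllTheWeb": 750_000,
             "Factiva": 3, "Live": 2}
print(screen(loveridge))  # median = 3; Yahoo! News and AllTheWeb get flagged as suspect
```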


Visibility metrics in the news media – a few observations

Before I present and briefly discuss the data, let me clarify a few notions.

By metric, I mean "some abstraction that can be measured." What you weigh is a simple metric. How happy you are is not. In this post we deal with a simple metric — how visible a concept (say… football) is in the news media.

By measure I mean applying an instrument to a metric. Step on a scale and we can take a reading of how much you weigh. In this post, the instruments are news search engines.

By index I mean an aggregate of measures. The main purpose of an index is to yield a more reliable figure. In this post I report on the following news engines: {Ask, AllTheWeb, Bloglines, Google, Factiva, Live, Topix and Yahoo!}

By reliable I mean that an instrument produces consistent estimates of a metric for a specific concept, i.e. if you take the measure twice, you should get the same value.

By robust I mean that an index is designed in such a way as to appropriately discount bogus values. More about this in a future post.

To follow up on the sports example I introduced a couple of posts back, Table 1 shows the number of results returned by each news search engine for 5 sports. Whenever possible, the search specified a single day (October 1st), but no region or language restriction.

Two observations should be made. First, there are fairly large differences across engines. Leaving aside the figures from Ask, Live and Topix (see notes (1) and (3)), counts range from a low of 2,293 news items to a high of 6,740 for football. Second, all engines produce consistent estimates — on successive requests, the counts returned by an engine generally do not vary by much.

Consistent yet different counts need not be a matter of concern if counts vary merely by some fixed proportion (i.e. Yahoo! always returning more items than Factiva). If that were not the case (i.e. sometimes Yahoo! claims more items, sometimes Factiva does), then there would be a problem. To answer this, we must correlate instruments (the search engines) across concepts (the sports).

Table 2 shows how results correlate across news search engines. We can readily see that Ask and Live are poor indicators in our example because their counts have hit their ceilings, providing no information on relative visibility — according to Ask, cricket is the most visible sport, a dubious result given that all other engines but one put it at the bottom. We can also see that Topix's overall stock of news items correlates as well as any other engine.

Table 3 shows the customary reliability statistics for our array of measures (which could become an index). The Cronbach's alpha value (more or less similar to the average correlation between items, ranging from zero for pure noise to one for a totally reliable scale) can be very high — as high as 0.992 if we merely remove the counts returned by Ask and Live.
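For readers who want to reproduce this kind of figure, here is a short sketch of the standard Cronbach's alpha computation on a concepts-by-engines matrix of counts; the matrix below is invented, not the Table 1 data.

```python
import numpy as np

def cronbach_alpha(x):
    """x: 2-D array, rows = concepts (cases), columns = engines (items)."""
    x = np.asarray(x, dtype=float)
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)       # variance of each engine's counts
    total_var = x.sum(axis=1).var(ddof=1)   # variance of the summed counts
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

counts = np.array([  # 5 concepts x 3 engines, invented numbers
    [6740, 5900, 6200],
    [2293, 2500, 2400],
    [4100, 3900, 4300],
    [1200, 1100, 1150],
    [ 800,  900,  850],
])
print(f"alpha = {cronbach_alpha(counts):.3f}")
```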

So far so good, but not good enough, as the concepts used in this example are in no way representative of the search universe. In my next post I'll present results derived from a much larger and reasonably diverse sample of concepts.


Table 1

Untitled1

Notes:
(1) Ask and Live cap results. Ask will not return counts above about 800 pages; Live will not return counts above 1,100 pages or so.
(2) The results figure provided by Google News appears to refer to the total stock of active news items (presumably a month's worth). If you click to list news items from the past week, day or hour, the results count stays the same. The numbers I report are my own estimates, based on the number of items available or the rate at which the past 1,000 items were published.
(3) Topix does not allow narrowing the search to a specific day or date. Figures refer to the stock of active news items.
(4) MAPA stands for Mean Absolute Percent Accuracy. It is computed as (1 – MAPE), where MAPE is the well-known Mean Absolute Percent Error routinely used to compare forecasting methods. These figures were computed from 5 bursts of 50 consecutive requests made at intervals of less than 1 second (at longer intervals, differences might arise because real changes have affected the underlying metric, i.e. new news coming in, old news moving out).
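For the record, here is a sketch of the MAPA computation described in note (4), applied to a single burst; treating the burst's own mean as the reference value is my assumption, since the note does not spell out the baseline.

```python
import numpy as np

def mapa(burst):
    """Mean Absolute Percent Accuracy = 1 - MAPE over one burst of repeated counts."""
    burst = np.asarray(burst, dtype=float)
    reference = burst.mean()                 # assumed baseline: the burst's own mean
    return 1 - np.mean(np.abs(burst - reference) / reference)

burst = [6740, 6738, 6741, 6740, 6735]       # e.g. five rapid-fire requests for [football]
print(f"MAPA = {mapa(burst):.4%}")           # close to 100% when counts barely move
```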

Table 2

Untitled2

Table 3

Untitled3

Note: (5) The 6-item index excludes Topix, Ask and Live; the 7-item index excludes Ask and Live.