Swine flu — there is at least one good thing about it…

We are in the midst of an information storm about an apprehended flu pandemic. The data clearly show a spike starting 4-5 days ago, in the news as well as in the blogs, where the number of indexed documents referring to flu of any kind shows a ten- to hundred-fold increase. Not surprising, you would think. But what is surprising, at least to me, is that Google Trends' special flu microsite shows absolutely no sign of increased activity.

This is odd — I would have bet the house that people are searching for [flu] in near-record numbers. In fact, I added [flu] to the list of concepts I track precisely in order to examine the lead-lag between search and visibility (i.e. do people search for complementary information on what has recently become visible, or does search signal interest that is a precursor of visibility?).

Unfortunately, "searching for search" is much more difficult than searching for visibility. Looking at Yahoo! Buzz, I see that swine flu was the top search during the past hour. Looking at Google Trends, the current spike is obvious.

So, yes, people do search for complementary information, like crazy. And this is certainly a case of reactive search (i.e. a news report initiated the process). The next question is why this surge is not showing up on the flu microsite. Is the site sophisticated enough to distinguish between search terms revealing actual infections (such as [I am sick with the flu]) and queries motivated by curiosity?

One plausible answer is that the flu microsite is updated once a week and that the spike will show up at the next update. If that is the case, we will have to reconsider the punch line that Google Trends provides a two-week lead on the number of flu cases reported by the CDC. What was "white magic" (as in white-hat hackers — the use of the "infinite power of IT" to do good) would take a blow, as the number of searches would be mere "experiential coincidence" (i.e. people search for flu because they anticipate getting sick, just as people are more likely to search for [sunburns] during the summer, or search for Caribbean beaches *before* they go on holidays, and so on).

So I am eagerly anticipating the microsite's next update, as I am curious to see how sophisticated this search analysis is. If it turns out that it is sophisticated, I will worry that this information is proprietary. And if it isn't, I will be disappointed by the limited wisdom of the crowds… So it is win-win 🙂

Wikia search is closing

In my previous post I reported that the open-search index Wikia had stopped working. Earlier today I stumbled across this blog entry from Jimmy Wales, yes — Wikipedia's founder. Wales says that funding for Wikia Search has been discontinued.

Search.wikia.com is still live, but likely to become useless in a short while.

This is very unfortunate, as we need diversity in indexing if webometrics (the science of analyzing the web corpus) is to prosper. With Wikia going under, and Cuil looking as if it might (according to Alexa, its traffic has all but vanished), triangulation becomes more difficult.

Is there such a thing as Spring fever for search engines?

On April 1st, Alexa changed its search logic, moving from document-level to site-level searching. Probably a good thing, actually.

Since April 4th I have been losing Nutch (Wikia's open index). The first problem occurred in March, when open-index.visvo.com vanished. And now search.isc.org has gone silent. This index was very small compared to the field leaders, but it did provide a useful validity check (if two leading indexes give radically different counts, even an imprecise third party helps single out extreme errors).


And now Sphere is changing its search tool. It is no longer possible to discriminate between news items and blogs. More importantly, it is no longer possible to perform quoted searches.


A quoted search is one in which several words are enclosed in quotation marks, sometimes called a phrase search. As you may guess, there are relatively few news/blog items about the Canadian Green Party leader, Elizabeth May. Google reports 11 blog and 3 news items. But that is if you search for ["Elizabeth May"]. Remove the quotes and you jump to 17k blog and 10k+ news items, about an assortment of Elizabeths but mostly about things that happen in May, or things that *may* happen. The ability to perform a quoted search on a person's name is crucial if you want relevant results.
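To make the distinction concrete, here is a tiny Python sketch over a toy corpus of invented headlines. It mimics neither Sphere's nor Google's actual matching logic; it only shows why the unquoted query over-counts.

```python
# Toy illustration of quoted (phrase) search vs unquoted search.
# The headlines are made up; no search engine is queried here.
headlines = [
    "Elizabeth May outlines the Green Party platform",
    "Elizabeth Taylor retrospective opens in May",
    "Elections may be called in May, sources say",
    "Budget vote may slip to late May",
]

phrase = "elizabeth may"
terms = phrase.split()

# Unquoted search: a document matches if it contains every term, anywhere.
unquoted = [h for h in headlines if all(t in h.lower() for t in terms)]

# Quoted search: a document matches only if it contains the exact phrase.
quoted = [h for h in headlines if phrase in h.lower()]

print(len(unquoted), "unquoted matches")  # also counts the unrelated Elizabeth
print(len(quoted), "quoted match(es)")    # only the Green Party leader
```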


I can only hope that Sphere will restore its quoted search capability (I would guess that the current situation is just an oversight). Here again, not because the world relies on Sphere to get news about various topics, but because sites like theirs are becoming more and more important as validation tools.

[EDIT]

Typing [Elizabeth+May] will (apparently) yield the desired result. 

Alexa provides a new and potentially interesting visibility metric

Alexa (an Amazon company) used to return the number of entries found in the top n sites. I routinely queried it to get the number of mentions for specific concepts appearing in the top 100 sites. 

Since April 1st, they return the number of sites "about" a concept, within a user-defined bracket running from the top site down to the top 10M sites.

Their FAQ indicates that their crawler is designed to find sites rather than documents. There is no doubt that Alexa was doing a rather poor job of indexing documents. On March 31st, it returned 337K documents pertaining to "Barack Obama" vs 4,140K pertaining to "John McCain", a rather surprising ratio of 12-to-1 in favor of McCain (Google returned 172M documents mentioning Barack Obama vs 34.5M mentioning John McCain).

Alexa's new site search returns a plausible 3,000 sites worldwide "about" Barack Obama vs 176 "about" John McCain. 

A very preliminary investigation raises questions about the value of such information (Alexa reports close to 52,000 sites about the iPhone…) but opens tantalizing opportunities, such as finding whether sites about a given concept are more or less prevalent as we move up in the rankings.
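As a rough sketch of what that last idea could look like, here is a minimal computation of how dense a concept is within successive ranking slices. The bracket sizes mimic the top-1-to-10M range Alexa exposes, but the cumulative counts are entirely invented and no Alexa query is made.

```python
# Sketch: are sites "about" a concept denser near the top of the rankings?
# Bracket sizes mimic Alexa-style cut-offs; the counts are invented
# purely for illustration (no Alexa query is made here).
brackets = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]   # top-N cut-offs
sites_about = [4, 25, 180, 1_100, 5_200]                     # cumulative counts (hypothetical)

prev_n, prev_count = 0, 0
for n, count in zip(brackets, sites_about):
    new_sites = count - prev_count        # sites added within this ranking slice
    slice_size = n - prev_n
    density = new_sites / slice_size      # prevalence within the slice
    print(f"ranks {prev_n + 1:>10,} to {n:>10,}: {density:.4%} of sites are about the concept")
    prev_n, prev_count = n, count
```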

The accuracy of web-based visibility metrics

Just a quick post to introduce a research note I am about to file. It presents web-based visibility metrics harvested during the US presidential race (vote held on November 4th, 2008) and the Canadian federal election (vote held on October 14th, 2008).

The most striking finding is the almost supernatural precision of the visibility metrics culled from the news search engines. Visibility shares on the eve of the vote were off by 0.2% and 0.3% in the Canadian and US elections, respectively.

Table 1: Summary of recent results


Below are radar charts of various metrics computed between 03:00 and 05:00 on the day of each election, where day-to-day indicators (essentially news and blogs) refer to one day earlier (i.e. on the morning of November 4th, news search engines were queried on content published/indexed on November 3rd).

Further below are time series showing how visibility evolved during the races. It should be noted that indicators other than news are mere averages of a collection of ratios computed for each relevant search engine (see the list at the bottom of this post). For the news index, the metric is the arithmetic average of the counts reported by Factiva, Google News and Yahoo! News for the Canadian race; NorthernLight and Bloglines were added to compute the average score for the US race. These rules were established based on a reliability assessment made on independent data (no fishing here).
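For readers curious about the mechanics, here is a minimal Python sketch of the share-of-ratios idea: compute each candidate's share within each engine, then average the shares across engines. The engine names follow the post, but the counts and candidate figures are invented, and this is not the exact pipeline behind the charts.

```python
# Minimal sketch: visibility share of each candidate as the average of
# per-engine shares. All counts below are invented for illustration only.
counts = {
    "Factiva":     {"Harper": 1200, "Dion": 800, "Layton": 500},
    "Google News": {"Harper": 5200, "Dion": 3300, "Layton": 2100},
    "Yahoo! News": {"Harper": 4100, "Dion": 2500, "Layton": 1900},
}

candidates = ["Harper", "Dion", "Layton"]

def shares(engine_counts):
    """Each candidate's share of the counts reported by one engine."""
    total = sum(engine_counts[c] for c in candidates)
    return {c: engine_counts[c] / total for c in candidates}

# Average the share ratios across engines (not the raw counts), so that a
# large engine cannot dominate the index.
per_engine = [shares(c) for c in counts.values()]
index = {c: sum(s[c] for s in per_engine) / len(per_engine) for c in candidates}

for c in candidates:
    print(f"{c}: {index[c]:.1%}")
```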

I plan to recalibrate each index before the end of the year and publish revised time series.

As a final note, the last chart traces the news visibility index published by Factiva in 2004. This is perhaps what is most important — just a few years ago, the media were both off the mark and slow to react. My inclination is to think that blogging has changed the nature of news reporting, although one could also argue that indexing and searching have improved.

The full text is here.


Figure 1: Visibility shares, Canada, October 13th


Figure 2: Visibility shares, US, November 3rd

 

Figure 3: Visibility shares, Stephane Dion




Figure 4: Visibility shares, Barack Obama




Figure 5: Visibility and delegates shares, Howard Dean & John Kerry, 2004


Table 2: List of search engines



Weak visibility measurements require robust procedures

In my previous post I showed that visibility measures taken from News Search Engines (NSE) appeared to be very reliable, with a test-retest Mean Absolute Percent Accuracy in excess of 99% and a scale Cronbach’s Alpha in excess of 0.992. I also warned that these results had to be tested on a larger sample of concepts.

In this post I'll focus on the robustness issue. I'll follow up with a detailed summary of my findings on the reliability of NSEs in my next post.

First, recall that a reliable measure is a consistent measure. We can assess reliability by taking repeated measures of the same metric. Test-retest consistency is certainly desirable but clearly insufficient, as a defective instrument returning constant results would be consistent yet unreliable. We must therefore take repeated measures of a metric on different subjects: a reliable instrument will consistently discriminate between subjects (i.e. if you weigh people and rank them from heaviest to lightest, a perfectly reliable scale will consistently return identical ranks). A better way still is to take repeated measures using different instruments. That is the idea behind "convergent reliability".

These ideas come largely from the field of psychometrics, where the distribution of errors can reasonably be assumed to be roughly normal. This is not the case with web metrics, where errors can be extreme.

So let me introduce the idea of robustness and contrast it with the idea of reliability (see a Wikipedia article here). A robust instrument always returns reasonable values, with normal errors. A weak instrument occasionally breaks down and may return extreme values. Think of a bathroom scale — a robust scale always gives your weight, give or take a couple of kilos; a non-robust scale will sometimes return erratic values such as zero or 900 kilos. Non-robust can be conceived as "delicate", where a deliberate trade-off has been made such that the result is usually very accurate but occasionally something goes wrong, or merely as "weak", where freak events occur on top of normal errors.

Consider the figure below, where 100 consecutive measurements are reported. Both instruments are identically reliable (99.5% accurate). The difference is that errors are random in one case and extreme in the other. Now let me ask: which scale would you rather use, the robust scale or the delicate scale?

[Figure: 100 consecutive measurements from a robust instrument (series 1) and a delicate instrument (series 2)]

My answer is that it depends — if you use a robust estimation technique, you should be able to detect extreme values, and you will prefer series 2, produced by a delicate instrument that is usually right on. If, on the other hand, you cannot tell whether a value is extreme or not, then you may prefer series 1, produced by a robust instrument that is never far off the mark.
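A small simulation makes the trade-off concrete. The noise levels and the two injected breakdowns below are my own assumptions, not the exact series plotted above, and the median/MAD rule is just one possible robust detector.

```python
import random
import statistics

random.seed(0)
TRUE_VALUE = 100.0

# "Robust" instrument: always reasonably close, never wildly off.
robust = [TRUE_VALUE + random.gauss(0, 0.5) for _ in range(100)]

# "Delicate" instrument: usually spot-on, but a couple of readings break down
# completely (the breakdowns are injected by hand, purely for illustration).
delicate = [TRUE_VALUE + random.gauss(0, 0.05) for _ in range(100)]
delicate[17], delicate[63] = 0.0, 900.0

def flag_outliers(values, k=5.0):
    """Flag values far from the median, scaled by the median absolute deviation."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return [abs(v - med) / mad > k for v in values]

clean = [v for v, bad in zip(delicate, flag_outliers(delicate)) if not bad]

print("robust instrument, mean reading:  ", round(statistics.mean(robust), 2))
print("delicate instrument, raw mean:    ", round(statistics.mean(delicate), 2))
print("delicate instrument, robust mean: ", round(statistics.mean(clean), 2))
```

The raw mean of the delicate series is pulled far off by two freak readings, while the median/MAD filter recovers a figure that is closer to the truth than the robust instrument ever gets.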

Extreme values occur quite frequently on the web. In my next post I will report on a set of 9 queries about 1,000 concepts. Out of 9,000 queries, 39 were obvious outliers (no result could be obtained after 6 tries). But this is the very tip of a rather large iceberg, as there are hundreds of suspect values, including some really, really extreme outliers.

This is neither really surprising nor trivial. Not surprising, because extreme errors can have several causes, either at the source (an NSE returns an erroneous value) or during processing (a parsing error). Not trivial, because the magnitude of these extreme values dwarfs the true correlation. Consider the scattergram below, where counts for about 1,000 queries to Google News and Yahoo! News are displayed.

[Figure: scattergram of Google News vs Yahoo! News counts, outliers included]

For some reason, on October 7th, Yahoo! reported a count of 585,000 news items for the word "Loveridge", compared to just one instance in Google News. This is not a parsing error, as AllTheWeb, owned by Yahoo!, returned 750,000 instances, while other NSEs returned very low counts.

At the other end of the spectrum, Google News reported a count of 72,000 news items for "Lies", compared to 1,071 at Yahoo! News. Other NSEs also reported counts in the neighborhood of 1,000.

Keeping these two extremes in the dataset yields a very weak correlation of .02 between Google News and Yahoo! News. Removing just these 2 observations brings the correlation close to .75!

The figure below shows the same points as above, after removal of these 2 extreme outliers.

[Figure: the same scattergram after removal of the 2 extreme outliers]

So what? First, take more than one measure. Three at least, because if you are confronted with a "Loveridge case", where one NSE says "very many" and another says "very few", you will need a third if not a fourth data point to tell which is which. Second, eyeballing extreme values will not do: there are just too many data points, many of which are not obvious calls. Since NSEs are not robust instruments, robust estimation is a must.
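Here is a minimal sketch of that rule of thumb: pool several NSE counts, take the median as the consensus, and flag engines that are wildly off it. The counts are invented, loosely shaped after the Loveridge case, and the 50x cut-off is an arbitrary assumption.

```python
import statistics

# Hypothetical counts for one concept across several news search engines,
# loosely shaped after the "Loveridge" case discussed above.
counts = {
    "Google News": 1,
    "Yahoo! News": 585_000,
    "AllTheWeb": 750_000,
    "Factiva": 3,
    "Topix": 7,
}

def consensus(engine_counts, max_ratio=50):
    """Median of the reported counts, plus the engines that are wildly off it."""
    med = statistics.median(engine_counts.values())
    suspects = {
        name: n for name, n in engine_counts.items()
        if max(n, 1) / max(med, 1) > max_ratio or max(med, 1) / max(n, 1) > max_ratio
    }
    return med, suspects

med, suspects = consensus(counts)
print("consensus (median) count:", med)   # the low-count camp wins with five engines
print("suspect engines:", suspects)       # Yahoo! News and AllTheWeb are flagged
```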


Visibility metrics in the news media – a few observations

Before I present and briefly discuss the data, let me clarify a few notions.

By metric, I mean "some abstraction that can be measured." What you weigh is a simple metric. How happy you are is not. In this post we deal with a simple metric — how visible a concept (say… football) is in the news media.

By measure I mean applying an instrument to a metric. Step on a scale and we can take a reading of how much you weigh. In this post, the instruments are news search engines.

By index I mean an aggregate of measures. The main purpose of an index is to yield a more reliable figure. In this post I report on the following news engines: {Ask, AllTheWeb, Bloglines, Google, Factiva, Live, Topix and Yahoo!}

By reliable I mean that an instrument produces consistent estimates of a metric for a specific concept, i.e. if you take the measure twice, you should get the same value.

By robust I mean that an index is designed in such a way as to appropriately discount bogus values. More about this in a future post.

To follow on the sports example I introduced a couple of posts back, Table 1 shows the number of results returned by each news search engine for 5 sports. Whenever possible, the search specified a single day (October 1st) but no region or language restriction.

Two observations should be made. First, there are fairly large differences across engines. Setting aside Ask's, Live's and Topix's figures (see notes (1) and (3)), counts range from a low of 2,293 news items to a high of 6,740 for football. Second, all engines produce consistent estimates — on successive requests, the counts returned by an engine will generally not vary by much.

Consistent yet different counts need not be a matter of concern if counts merely vary by some fixed proportion (i.e. Yahoo! always returning more items than Factiva). If that were not the case (i.e. sometimes Yahoo! claims more items, sometimes Factiva does), then there would be a problem. To answer this, we must correlate instruments (the search engines) across concepts (the sports).

Table 2 shows how results correlate across news search engines. We can readily see that Ask and Live are poor indicators in our example because their counts have reached their ceilings, and therefore provide no information on relative visibility — according to Ask, cricket is the most visible sport, which is dubious since all other engines but one put it at the bottom. We can also see that Topix's overall stock of news items correlates as well as any other engine.

Table 3 shows the customary reliability statistics for our array of measures (which could become an index). The Cronbach's alpha value (more or less similar to the average correlation between items, varying from zero for pure noise up to one for a totally reliable scale) can be very high: as high as 0.992 if we merely remove the counts returned by Ask and Live.
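For reference, here is how that alpha can be computed from a concepts-by-engines table. The figures below are invented (they are not the Table 1 counts), so treat this as a sketch of the formula rather than a reproduction of Table 3.

```python
import statistics

# Minimal sketch of Cronbach's alpha for a visibility "index": rows are
# concepts (the five sports), columns are instruments (news search engines).
# The counts are invented for illustration, not the Table 1 figures.
counts = [
    # Factiva, Google, Yahoo!, AllTheWeb
    [2300, 5100, 4800, 6700],   # football
    [1900, 4200, 3900, 5400],   # baseball
    [1500, 3300, 3100, 4600],   # basketball
    [1100, 2500, 2300, 3500],   # hockey
    [ 300,  700,  650, 1000],   # cricket
]

k = len(counts[0])                               # number of instruments
item_vars = [statistics.pvariance(col) for col in zip(*counts)]
total_scores = [sum(row) for row in counts]      # each concept's summed score
total_var = statistics.pvariance(total_scores)

# alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha: {alpha:.3f}")
```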

So far so good, but not good enough, as the concepts used in this example are in no way representative of the search universe. In my next post I'll present results derived from a much larger and reasonably diverse sample of concepts.

Table 1


Notes:
(1) Ask and Live cap results: Ask will not return counts above roughly 800 pages; Live will not return counts above 1,100 pages or so.
(2) The results figure provided by Google News appears to refer to the total stock of active news items (presumably a month): if you click to list news items from the past week, day or hour, the results count stays the same. The numbers I report are my own estimates, based on the number of items available or on the rate at which the past 1,000 items have been published.
(3) Topix does not allow narrowing the search to a specific day or date; figures refer to the stock of active news items.
(4) MAPA stands for Mean Absolute Percent Accuracy. It is computed as (1 – MAPE), where MAPE is the well-known Mean Absolute Percent Error routinely used to compare forecasting methods. These figures were computed from 5 bursts of 50 consecutive requests made at intervals of less than 1 second (differences over longer intervals might arise because real changes have affected the underlying metric, i.e. new news coming in, old news moving out). A small sketch of the computation follows the notes.
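Here is that sketch of the MAPA computation described in note (4). The counts are invented, and I use the burst's own mean as the reference value, which is an assumption on my part.

```python
import statistics

# Sketch of the MAPA figure from note (4): repeated counts for the same query,
# compared against their own mean. The counts below are invented.
burst = [4212, 4212, 4198, 4212, 4230, 4212, 4207, 4212, 4212, 4191]

reference = statistics.mean(burst)
mape = statistics.mean(abs(x - reference) / reference for x in burst)
mapa = 1 - mape

print(f"MAPE: {mape:.4%}   MAPA: {mapa:.4%}")
```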

Table 2


Table 3


Note: (5) The 6-item index excludes Topix, Ask and Live; the 7-item index excludes Ask and Live.


Is visibility everything? My 2-bit sentiment analysis

Today’s question is whether it makes sense to rely on raw visibility metrics, without any consideration given to what is visible. As I pointed out earlier, raw visibility metrics correlate quite well with the apparent popularity of candidates in various electoral races, so why bother with content?

Well… John Edwards was highly visible in the news last August, for events that have had a major, negative impact on his political fortunes. So any tracking system worth its salt should provide some information on content as well as on the visibility of that content, shouldn't it? Several start-ups promise to do just that: Jodange, Sentimine, Corpora. And there is this amazing academic site.

But sentiment analysis — as it is often called — is not an easy task (read this entry for a feel).

If you are to rely on automated procedures, there are syntactic traps. For instance, would a Natural Language Processor (NLP) understand that "XYZ is not good, he (she) is unbelievable!" carries positive sentiment, or would it be fooled by the "not good"?

And context matters (this phone is very small / this apartment is very small): a smaller phone is preferable to a larger one, at least up to a point, while a large apartment is worth more than a small one. Yes, we can teach machines, but there are so many variables that the task is herculean.

And then we have to consider that a fact may be interpreted differently by different people. "This politician is pro-life" (i.e. opposed to abortion) could carry a positive, neutral or negative sentiment, depending on the writer's views: the writer may support, attack or merely state the politician's position. Then suppose a pro-lifer writes (favorably) about a pro-life candidate; that does not mean the reader will interpret the statement favorably. And what about a pro-choicer writing favorably about a pro-life candidate?

Then what? One way is to forget about the bases for sentiment. After all, facts are not sentiments (even though several facts carry obvious sentiments, as in headlines such as "XYZ convicted of first-degree murder"). Ignoring facts, we may focus on valence and care only for statements such as "[so and so] is cool, [so and so] is good, I like [so and so]".

So? My 2-bit sentiment analysis goes like this: compute the ratio of pages including the word "good" to pages including the word "bad", in addition to the concept of interest. For instance, how many pages are returned for the query ["Microsoft Vista" +good]? How many pages for ["Microsoft Vista" +bad]? Compute the ratio (or the share if you prefer). Et voilà! It could not be simpler.
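A sketch of the recipe in Python. The count-fetching function is a stand-in (the real thing would query a search engine of your choice); the hard-coded numbers are invented for illustration.

```python
# The 2-bit sentiment metric, as a sketch. result_count() is a placeholder:
# plug in whichever search-engine counts you trust. The numbers hard-coded
# below are invented for illustration only.
def result_count(query: str) -> int:
    fake_counts = {
        '"Microsoft Vista" good': 1_200_000,
        '"Microsoft Vista" bad':    900_000,
    }
    return fake_counts[query]

def two_bit_sentiment(concept: str) -> float:
    """Share of 'good' pages among pages mentioning the concept with 'good' or 'bad'."""
    good = result_count(f'"{concept}" good')
    bad = result_count(f'"{concept}" bad')
    return good / (good + bad)

print(f"Microsoft Vista: {two_bit_sentiment('Microsoft Vista'):.1%} 'good' share")
```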

Is this a joke or a useful metric? Well, consider this comparison between Microsoft's Vista and Apple's OSX. The graphs below suggest that Vista is more visible than OSX, whereas OSX has a higher 2-bit sentiment score on all dimensions except the NEWS vector. Plausible.

[Figure: Vista vs OSX visibility and 2-bit sentiment scores]

And how does such 2-bit sentiment analysis fare in the political realm? Surprisingly well. Below are scattergrams of sentiment scores for leading candidates in the US and Canadian races, comparing the 2-bit sentiment metric with textMap's NLP-derived scores (textMap doesn't provide scores for Sarah Palin or Elizabeth May, two relatively recent figures in the political realm).

[Figure: 2-bit sentiment scores vs textMap scores for leading candidates]

We can see that both metrics agree quite well on relative sentiment. In fact, it would be difficult to decide which scale is more appropriate. On the other hand, a case-by-case examination of the 2-bit sentiment metric is not for the faint of heart, as the words "good" and "bad" are very often used in completely irrelevant contexts (a candidate is interviewed on "Good Morning America"), or with diametrically opposed sentiments (a candidate was "not good").

But let's not worry about that for the moment, as the first order of business is to establish the general reliability of www-based metrics. If we can establish that they are reliable (not to worry, they can be), it will certainly be interesting to explore the validity of "automated n-bit sentiment metrics" 🙂


The visibility game

I had written that I would do my best to update visibility charts for the leading political candidates. So here they are: results for the Canadian race, and results for the US presidential race. You may subscribe to daily updates by clicking on the XML tag at the bottom of those pages.

News visibility scores for the Canadian candidates put Harper ahead of Dion, at 41.8% vs 26.7%. This is strikingly similar to the latest Canadian Press Harris-Decima survey published this morning, which puts Harper/Dion at 41%/26%.

On the US scene, the scores suggest that the Republican ticket is ahead, largely because of the Palin sensation, who dominates the "social" vector (a composite of such things as BackType, Twitter, Delicious, Digg, etc.), where McCain does not do well. Here also, the results are matched by the polls.

This is still very "alpha", so expect several changes. In particular, the results are currently nothing more than averages from several distinct sources. Still, there is no doubt we can learn from this.


Four reasons to be concerned about visibility metrics

Google "baseball", "basketball", "cricket", "football" and "hockey", using the advanced search preferences to obtain the number of results for pages located in Canada, India, the UK and the USA. Here's what I got on September 3rd, 2008.

[Table: result counts by sport and by country]

Not so bad. Hockey is Canada's favorite. Cricket is India's. Football leads in the UK and the USA. Makes a lot of sense. Visibility metrics appear to be useful.

But wait! Football doesn't mean the same thing in North America and elsewhere: North Americans say "soccer" when everybody else says "football". So what do we have here? And then look at this: the first page of results for cricket in Canada. Oh yes, cricket also has several meanings (there is the sport, there is the insect, there is the Disney character, etc.). But these are validity issues. With hard and intelligent work it is conceivable that we could weed out irrelevant results and zero in on our target concepts. And with more hard work it is conceivable that we could identify the underlying relationship between "www visibility" and "popularity", whatever we mean by these terms.

But wait some more! Is it really possible that Canadians prefer cricket to football (26,000 to 16,400)? Even accounting for the Jiminy effect, something must have gone wrong somewhere. These figures may not be reliable. (Remember that reliable measures are a prerequisite for valid inferences, and that reliability means that measures are consistently reproducible.)

There are important reliability concerns:

1) The numbers are estimates. Fair enough: it would be next to impossible to return exact counts, as the web is in constant flux, queries are handled by different servers, and so on. (I will illustrate later how much of a problem this can be — in a nutshell, it is significant when big news stories erupt.)

2) Repeating the exact same query several times in a row will often return different results. In general the differences are small, but they can be surprisingly large (less than half of the estimate obtained 2 minutes earlier), and sometimes totally off the scale. Visibility metrics are not very robust.

3) Pages do not materialize. Try a Google search for "cricket" (no quotes) in the USA, updated within the past 24 hours. I got 67k results, but only three pages of results (!?). Clicking through to the last page returned results 201-211 (fair enough), but also a revised estimate of the total to match (bummer!). I am now told that there are 211 web pages about "cricket" from the USA updated within the past 24 hours, down from 67,700. This is not a trivial difference — the revised figure, no longer an estimate, is a paltry 0.3% of the initial figure. We can guess at various explanations, but the bottom line is that the results figure will dramatically inflate the visibility score of the more visible concepts (whose scores will not be truncated for consistency). To put it differently, if Google says that there are 67k pages yet can show no more than 211, should I use the 67k figure? Or should I use estimates for some concepts and hard results for others, knowing that hard results may be a tiny fraction of estimates?

4) Different engines give wildly different results. AlltheWeb (now part of Yahoo!) returns no less than 1.4 million pages for a search for "football" in Canada, updated during the past 24 hours, compared to 16,400 pages from Google. A hundred-fold difference. Is it a difference caused by the engine (i.e. does ATW systematically report much larger estimates than Google)? A sign that estimates are unreliable (i.e. for other queries, Google could return estimates that are much larger than ATW's)? Or just bad luck (i.e. for most queries, ATW and Google roughly agree)?

—-

What to make of this?

Even though the number of results returned by a query contains "some" information, as the sports-by-country example shows, these numbers vary considerably and sometimes unexpectedly. In the absence of a reliability gauge, reaching conclusions based on visibility scores is a risky business.

In my next post I will explain how I am trying to build a reliable and robust measurement scale of… let’s call this "visibility" for the moment, even though the term is misleading.