Today’s question is whether it makes sense to rely on raw visibility metrics, without any consideration given to what is visible. As I pointed out earlier, raw visibility metrics correlate quite well with the apparent popularity of candidates in various electoral races, so why bother with content?
Well… John Edwards was highly visible in the news last August, for events that have had a major, negative impact on his political fortunes. So any tracking system worth its salt should provide some information on content as well as on the visibility of that content, shouldn’t it? And several start-ups promise to do just that: Jodange, Sentimine, Corpora. And there is this amazing academic site.
But sentiment analysis — as it is often called — is not an easy task. (read this entry for a feel).
If you are to rely on automated procedures, there are syntactic traps. For instance, would a Natural Language Processor (NLP) understand that "XYZ is not good, he (she) is unbelievable!" carries positive sentiment, or would it be fooled by the "not good"?
And context matters (this phone is very small / this apartment is very small). A smaller phone is preferable to a larger one, at least up to a point, and a large apartment is worth more than a small one. Yes, we can teach machines, but there are so many variables that the task is herculean.
And then we have to consider that a fact may be interpreted differently by different persons. "This politician is pro-life" (i.e. opposed to abortion) could carry a positive, neutral or negative sentiment, depending on whether the writer supports, attacks or merely states the politician’s attitude. Then suppose a pro-lifer writes favorably about a pro-life candidate: that doesn’t mean that the reader will interpret the statement favorably. And what of a pro-choicer writing favorably about a pro-life candidate?
Then what? One way is to forget about the bases for sentiment. After all, facts are not sentiments (even though several facts carry an obvious sentiment: headlines such as "XYZ convicted of first degree murder"). Ignoring facts, we may focus on valence and care only for statements such as "[so and so] is cool, [so and so] is good, I like [so and so]".
So? My 2-bit sentiment analysis goes like this: compute the ratio of pages including the word "good" to pages including the word "bad", each in combination with the concept of interest. For instance, how many pages are returned for the query ["Microsoft Vista" +good]? How many pages for ["Microsoft Vista" +bad]? Compute the ratio (or the share if you prefer). Et voilà! It couldn’t be simpler.
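The arithmetic behind the recipe fits in a few lines of Python. The function name and the hit counts below are hypothetical placeholders for whatever figures your search engine of choice returns; only the computation is the point:

```python
def two_bit_score(good_hits: int, bad_hits: int) -> float:
    """Share of "good" pages among pages matching either "good" or "bad"."""
    total = good_hits + bad_hits
    if total == 0:
        raise ValueError("no matching pages for either query")
    return good_hits / total

# Hypothetical counts for ["Microsoft Vista" +good] and ["Microsoft Vista" +bad]
vista_share = two_bit_score(1_200_000, 800_000)  # 0.6, i.e. 60% "good"
```

Using the share (good over good-plus-bad) rather than the raw ratio keeps the score bounded between 0 and 1, which makes concepts of very different visibility easier to compare.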
Is this a joke or a useful metric? Well, consider this comparison between Microsoft’s Vista and Apple’s OSX. The graphs below suggest that Vista is more visible than OSX, whereas OSX has a higher 2-bit sentiment score on all dimensions except the NEWS vector. Plausible.
And how does such 2-bit sentiment analysis fare in the political realm? Surprisingly well. Below are scattergrams of sentiment scores for leading candidates in the US and Canadian races, comparing the 2-bit sentiment metric with textMap’s NLP-derived scores (textMap doesn’t provide scores for Sarah Palin or Elizabeth May, two relatively recent figures in the political realm).
We can see that both metrics agree quite well on relative sentiment. In fact, it would be difficult to decide which scale is more appropriate. On the other hand, a case-by-case examination of the 2-bit sentiment metric is not for the faint of heart, as the words "good" and "bad" are very often used in completely irrelevant contexts (a candidate is interviewed on "Good Morning America"), or with diametrically opposed sentiments (a candidate was "not good").
But let’s not worry about that for the moment, as the first order of business is to establish the general reliability of www-based metrics. If we can establish that they are reliable (not to worry, they can be), it will certainly be interesting to explore the validity of "automated n-bit sentiment metrics" 🙂
I had written that I would do my best to update visibility charts for leading political candidates. So here they are: results for the Canadian race, and results for the US presidential race. You may subscribe to daily updates by clicking on the xml tag at the bottom of these pages.
News visibility scores for Canadian candidates put Harper ahead of Dion at 41.8% vs 26.7%. Striking similarity with the latest Canadian Press Harris-Decima survey published this morning, which puts Harper/Dion at 41%/26%.
On the US scene, scores suggest that the Republican ticket is ahead, largely because of the Palin sensation who dominates the "social" vector (a composite of such things as backtype, twitter, delicious, digg, etc.) where McCain does not do well. Here also, results are matched by polls.
Still very "alpha", so expect several changes. In particular, results are currently nothing more than averages from several distinct sources. Still, no doubt we can learn from this.
Google "baseball", "basketball", "cricket", "football", "hockey" using advanced search preferences to obtain the number of results for pages located in Canada, India, the UK and the USA. Here’s what I got on September 3rd 2008.
Not so bad. Hockey is Canada’s favorite. Cricket is India’s. Football in the UK and the USA. Makes a lot of sense. Visibility metrics appear to be useful.
But wait! Football doesn’t mean the same thing in North America and elsewhere: North Americans say "soccer" where everybody else says "football". So what do we have here? And then look at this: the first page of results for cricket in Canada. Oh yes, cricket also has several meanings (there is the sport, there is the insect, there is the Disney character, etc.). But these are validity issues. With hard and intelligent work it is conceivable that we could weed out irrelevant results and zero in on our target concepts. And with more hard work it is conceivable that we could identify the underlying relationship between "www visibility" and "popularity", whatever we mean by these terms.
But wait some more! Is it really possible that Canadians prefer cricket to football (26 000 to 16 400)?! Even accounting for the Jiminy effect, something must have gone wrong somewhere. These figures may not be reliable. (Remember that reliable measures are a prerequisite for valid inferences, and that reliability means that measures are consistently reproducible.)
There are important reliability concerns:
1) The numbers are estimates. Fair enough. It would be next to impossible to return exact counts, as the web is in constant flux, as queries are handled by different servers, and so on. (I will illustrate later how much of a problem this can be; in a nutshell, it is significant when big news stories erupt.)
2) Repeating the exact same query several times in a row will often return different results. In general the differences are small, but they can be surprisingly large (a repeat can return less than half of the estimate obtained 2 minutes earlier), and sometimes totally off the scale. Visibility metrics are not very robust.
3) Pages do not materialize. Try a Google search for cricket (no quotes) in the USA, updated within the past 24 hours. I got 67k results, but only three pages of results (!?). Clicking to go to the last page returned results 201-211 (fair enough), but also a revised estimate of the total results to match (bummer!). I am now told that there are 211 web pages about "cricket" from the USA updated within the past 24 hours, down from 67 700. This is not a trivial difference: the revised figure, no longer an estimate, is a paltry 0.3% of the initial figure. We can guess at various explanations, but the bottom line is that relying on initial estimates will dramatically inflate the visibility score of more visible concepts (whose scores are never revised downward by paging through the results). To put it differently, if Google says that there are 67k pages yet can show no more than 211, should I use the 67k figure? Or should I use estimates for some concepts and hard results for others, knowing that hard results may be a tiny fraction of estimates?
4) Different engines give wildly different results. AlltheWeb (now part of Yahoo!) returns no less than 1.4 million pages for a search for "football" in Canada, updated during the past 24 hours, compared to 16 400 pages from Google. A hundred-fold difference. Is this a difference caused by the engine (i.e. does ATW systematically report much larger estimates than Google?), a sign that estimates are unreliable (i.e. for other queries, Google could return estimates that are much larger than ATW’s), or just bad luck (i.e. for most queries, ATW and Google roughly agree)?
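A crude way to put a number on concern 2 is to repeat the same query a few times and look at the dispersion of the estimates. A minimal Python sketch, where the repeated counts are hypothetical stand-ins for real engine returns:

```python
import statistics

def estimate_cv(estimates: list[int]) -> float:
    """Coefficient of variation of repeated estimates for one query.
    0 means perfectly consistent; larger values mean less reliable."""
    return statistics.stdev(estimates) / statistics.mean(estimates)

# Hypothetical repeated estimates for the same query, minutes apart
runs = [16_400, 15_900, 7_800, 16_100]
cv = estimate_cv(runs)  # roughly 0.3: one run in four was way off
```

A perfectly stable engine would score 0; anything sizable flags the query as one whose count should not be trusted at face value.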
What to make of this?
Even though the number of results returned by a query contains "some" information as we can see in the sports-by-country example, these numbers vary considerably and sometimes unexpectedly. In the absence of a reliability gauge, reaching conclusions based on visibility scores is a risky business.
In my next post I will explain how I am trying to build a reliable and robust measurement scale of… let’s call this "visibility" for the moment, even though the term is misleading.
We would say that visibility metrics are valid if they helped us better understand something, as in "Based on visibility scores, we can see that ‘cricket’ is much more popular in India than in the USA." Establishing validity is a multifaceted, complex endeavor.
The first step is usually to establish that the metric is reliable as there would be no point in trying to learn from random events. Like "Based on 100 flips of a coin, we can see that ‘cricket’ is much more popular in India than in the USA." A reliable measure is one that we can consistently reproduce. Establishing reliability is fairly straightforward. Let me illustrate.
Monday morning, you climb on a scale to weigh yourself. It reads 70 kilos. Are you surprised? Well, if you generally read 60kg, you’ll wonder if the scale is working properly. Down, up again on the scale. This time it reads 25kg. Inconsistent. Unreliable. The scale is not working properly.
Monday morning, your 5 year-old climbs on the scale. It reads 70kg. Are you surprised? Probably. Down, up again on the scale. Still 70kg. Consistent yet fishy — somehow, you know that it doesn’t make sense. So now you climb. It still reads 70kg. Consistent. Too consistent. Try with the cat in your arms. Still 70. Try with the kid in your arms, the cat in your kid’s arms… 70kg. It is just as if someone had painted 70kg. Consistent yet useless. Unreliable.
Monday morning, you climb on the scale. You read 60 kg. You go to the doctor’s office. Up on his scale: you read 64 kg. Are you surprised? Probably not. No down-and-up-again routine. Clothes, breakfast, scale adjustments… Your 5 year-old climbs on. 20 kg. Looks OK.
So, the basic idea behind a reliable measure is that different instruments measuring the same set of objects will give highly correlated results. Fairly easy to understand and implement.
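The bathroom-scale idea translates directly into a correlation check. Below is a minimal Python sketch; the readings are invented to mirror the home-vs-doctor example, and the correlation is written out by hand to keep the block self-contained:

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two instruments' readings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical readings of the same four people on two different scales
home   = [60.0, 20.0, 82.0, 45.0]
doctor = [64.0, 21.0, 85.0, 47.0]
r = pearson(home, doctor)  # close to 1: consistent, hence reliable
```

Note that the two scales need not agree on the absolute values (the doctor’s runs a few kilos high throughout); reliability only asks that they rank and space the objects consistently.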
In my next post I will briefly show that search engines’ scales are not as reliable as we might have thought.
This Fall I’ll post a series on the metric properties of visibility scores culled from various search queries. The basic idea is utterly simple: you Google "Tennis" and find out that there are 209 million pages about this sport, then you search for boulingrin and find out that there are 50 thousand pages about that sport. What conclusion can you draw? Let’s make it more interesting. Compare "football, baseball, basketball and hockey", searching for pages located in the US; in Canada; in the UK. Might the *results* tell us something about the relative popularity of these sports? (Maybe add "cricket" and "India" to the list…)
As more and more data becomes available through the web, there is an increasing awareness that the web may provide insights into how the world turns. More specifically, into how election results will turn out. More generally, into how concepts, including brands, fare in society.
The "field of webometrics" — ugly name derived from better known applications such as econometrics, psychometrics or bibliometrics — is gaining momentum. Just to give you an idea of what this is about, you may want to look at Jack Yan’s page where he uses Google returns to assess politicians’ profile. Or at Howsociable where 20-something "metrics" are used to measure brands’ visibility in various corners of the web. Or at TorrentFreak where search data is correlated with actual product usage.
Visibility data were shown to be closely related to the political fortunes of the French presidential candidates. Several consultants now provide various forms of tracking. But be careful, because the reliability and validity of these metrics have not been carefully examined, if at all. Consider this post, where Elliott Back wrote that Ron Paul had a shot at becoming the next President of the US, based on the not-so-easily-gamed Google Trends data. Indeed, Ron Paul was way ahead of the Republican pack on most sub-indexes of web visibility just prior to the Iowa caucuses at the beginning of 2008, yet collapsed in a matter of weeks.
Just to give an idea of what a "reasonable" visibility indicator may look like, below are figures showing the visibility shares of Barack Obama (US presidential candidate for the Democrats) and Stephen Harper (Canada’s Prime Minister and leader of the Conservatives). Obama’s share is calculated within the Obama-McCain dyad. Harper’s share is calculated within the four leading contenders (Harper, Dion, Layton and Duceppe).
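The share computation itself is trivial; what matters is fixing the comparison group up front. A sketch, with made-up result counts for the four Canadian contenders (the figures are illustrative, not real data):

```python
def visibility_share(counts: dict[str, int], name: str) -> float:
    """One contender's share of the group's total result count."""
    return counts[name] / sum(counts.values())

# Hypothetical result counts for the four leading Canadian contenders
canada = {"Harper": 418_000, "Dion": 267_000,
          "Layton": 180_000, "Duceppe": 135_000}
harper = visibility_share(canada, "Harper")  # 0.418, i.e. 41.8%
```

Shares within a fixed group are more stable than raw counts, since engine-wide fluctuations (different servers, index churn) tend to affect all members of the group in the same direction.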
The figures are unreliable averages of returns from various search engines, yet they are not uninteresting. I’ll update the data on an irregular basis, ideally once a week.
What next? First, a post on the concepts of reliability and validity. Then, for each subscale, I’ll look at the basic components, the apparent reliability of the metric, and the design of a robust estimator of the "true" score. I would also like to share several observations on peculiarities of search engines.
Needless to say, I would appreciate any comment/suggestion, either on this blog or privately.
The noise is almost deafening — recession is all but upon us. Look at the trends.google.com chart below. The "buzz" (number of search queries) is way up. The first jitter occurred about a year ago; the news headline of the day said that, according to economists, a recession was unlikely. Click on the chart to access the most recent version and mull over the morphing newscape.
And then, consider this: actual data on US GDP, available here from the Bureau of Economic Analysis. For convenience I have computed the average annual GDP growth rate in constant (2000) USD. (Warning: the time scales of the two charts are different; the BEA data covers the 1947-2007 period.)
Is the buzz picking up early signs of a weakening economy? (such as the credit crunch). Is the buzz feeding on itself to ultimately yield a self-fulfilling prophecy? (by affecting consumer/investor confidence). Could the buzz be just that — buzz?
The signal emanating from the digital sphere can be closely related to "real" public opinion. Based on data from the French presidential race of 2007, a visibility index designed by Swammer correlates reasonably well with the results of opinion polls and with the actual outcome of the electoral race. (Read further down in the www series category for the relevant posts.)
The situation might be different elsewhere. While there is considerable buzz surrounding Ron Paul, a Republican candidate, his bid fails to translate into significant numbers in tracking polls.
Visibility data as captured by Swammer place Paul a distant fourth, with roughly half of the leader’s (McCain’s) score. While the ranking matches tracking polls, the distance is much smaller: tracking polls give McCain a 5-to-1 lead, compared to a 2-to-1 lead in visibility. More striking, perhaps, is Google Trends, which shows Ron Paul ahead of McCain in the number of search queries.
Understanding when/how/why Internet-initiated buzz becomes commonplace opinion may have more to do with the breadth of the buzz than with its intensity.