Title: Google Scholar Beta
Tested: November 18-27, 2004
Google Scholar has enormous gaps in its coverage of publishers' archives, and implicitly in the direct links to the full-text documents therein. The citedness scores of documents displayed in the results lists have great potential for choosing the most promising articles and books on a subject, but they are often inflated. The prominent display of the citedness scores could help the scholars and practitioners whose libraries don't have access to the best citation-based systems, such as Web of Science and Scopus, or to the smartest implementations of citation-enhanced abstracting/indexing databases, like some on CSA and EBSCO. Google should take a page from the best open-access services and repositories, such as CiteBase, Research Index and RePEc/LogEc, which handle citing and cited references and citedness scores much better than Google Scholar.
Google's crawlers, which many scholarly publishers and preprint servers let in to their archives for this project, picked up information for many redundant and irrelevant pages and ignored a few million full-text scholarly papers and/or their citation/abstract records.
With the exception of the authors' name field, Google treated the items in the huge archives like any of the zillions of unstructured pages on the Web. Google Scholar needs much refinement in collecting, filtering, processing and presenting this valuable data.
Amid the universal, ritualistic adulation, it was no surprise that Google's latest service received publicity that was as wide as it was shallow. The blogorrhea and avalanche of e-mail were as if a free, magical cure for cancer had been announced by the National Institutes of Health. I like and use Google a lot, but not with the "nothing-but-Google" zealotry of its fans.
Google Scholar is a follow-up to the CrossRef Search Pilot project, which was launched in April with the help of CrossRef, the Digital Object Identifier (DOI) registration agency for scholarly and professional publishers. CrossRef was the matchmaker between Google and the nine original participating publishers. My review of the first version of that project praised the agency for its work and criticized Google for the careless implementation. At the annual conference of the Society for Scholarly Publishing in mid-2004, I moderated a session about "Searching Proprietary Content" that was graced by learned panelists, including systems developers from Google and Elsevier. I presented some of my disheartening findings about the massive omissions of documents from the nine publishers' archives in Google CrossRef Search. My testing of Google Scholar eerily reminded me of the same symptoms.
Amid the many myths, one is that Google could penetrate the invisible Web. It couldn't and it didn't. The publishers who cooperated in the Google Scholar project opened the doors of their document stores (normally invisible to Web-wide search engines) to allow Google's special crawlers to collect data and to show some of it free to anyone. This would then steer users to their libraries' subscription-based digital journal archives.
Undoubtedly, Google significantly enlarged the scope of Google Scholar by crawling and gathering data from the sites of many additional publishers and/or their digital facilitators, as well as from open access abstracting/indexing databases and from the largest archives of preprint and reprint servers.
Google Scholar offers free access for anyone to the bibliographic records and often to the abstracts of millions of articles. It may also lead users to full-text documents that can be displayed (if they qualify for free access) or (if they don't) to a document-delivery company.
Elsevier has been doing this with the Scirus service for years (although on a smaller scale), offering far better search and results display options. Yes, I did criticize Scirus at its launch for including in its database zillions of not merely worthless, but sometimes inane and vulgar Web pages (mostly created by undergraduate students with an .edu account). However, some time ago this subset was significantly reduced. By the way, Google Scholar also includes a large number of pages that are not scholarly by any stretch of the imagination.
Content is the most obscure part of Google Scholar. Apart from the generic statement on the About page that "Google Scholar enables you to search specifically for scholarly literature, including peer-reviewed papers, theses, books, preprints, abstracts and technical reports from all broad areas of research. Use Google Scholar to find articles from a wide variety of academic publishers, professional societies, preprint repositories and universities, as well as scholarly articles available across the web," there is no specific information about the publisher archives or the (p)reprint servers covered, nor about the types of documents processed (such as major articles versus all the content, including reviews and letters to the editor) or the time span covered. Exploring the dimensions of the content base of this service is as difficult as deciphering the real meaning and implications of the credit card agreements penned by the lawyers in the banking industry, so consider this a beta review.
Just because a service is free doesn't mean that the producer is not expected to disclose substantial information about the content. Scirus, HighWire Press, Research Index and RePEc show the best examples of the professional attitude of enlightening users about their free information services. One implementation of the open access RePEc archive goes the farthest by providing substantial and very informative content details. The content disclosure of Google Scholar is not at all informative.
Furthermore, Google Scholar's FAQ page does not address the most substantial content issues. The questions included seem unlikely to be the really frequently asked ones. They sound more like the scripted questions in infomercials that let the inventor impress the carefully selected audience with the invention's capability to meet all the needs that the average customer will never have.
Scope and Size of the Database
Sample searches may shed light on the size of the content — sometimes. You will find hits from the archives of ACM, Blackwell, the Institute of Physics, the Nature Publishing Group, Wiley Interscience, Springer, IEEE and many others. But there is no list of publishers; preprint and reprint servers; or open access abstracting/indexing databases, like the largest e-print collection of the NASA Astrophysics Data System (ADS), the outstanding digital preprint and reprint collection of the RePEc repository or PubMed, among others.
Breadth of Archives' Coverage
More importantly, users would not have the faintest idea that only a small subset of the articles in many of these digital archives is known to Google Scholar. This is particularly painful in cases where open-access, full-text scholarly articles are ignored by Google Scholar. The RePEc archive, for example, has 292,416 items, of which 196,025 are full text. Google Scholar has information about and links to merely 43,800 of them.
To get some sense of the breadth of coverage and the journal and source base of Google Scholar, a somewhat experienced searcher may run test searches to find out whether a given publisher's archive is covered and to what extent, but the process is not exactly intuitive. Scholars may be good in their subject territories, but not necessarily in the syntax of Google's advanced search.
Even if they know that the search can be restricted to a domain with the "site:" parameter (though it is not documented), would they know that the correct site name for, say, Blackwell is blackwell-synergy and that it must be followed by .com, as in "site:blackwell-synergy.com"? Would they really know that there must be no space before or after the colon? There is not even an advanced search mode in Google Scholar that could make the syntax somewhat more transparent.
After making some simple searches, users would see various domain names in the results list and could figure out that if they want articles from, say, one of the 753 Blackwell journals to which their library has full access in digital format, the subject query must be limited like this: "site:blackwell-synergy.com dengue fever hawaii". It is not the most user-friendly solution, and the software gets even less intuitive at other tasks.
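For those who want to script such domain-restricted probes rather than type them by hand, here is a minimal sketch. It relies only on the undocumented "site:" operator described above; the scholar.google.com submission URL and its "q" parameter are my assumptions for illustration, not a documented interface.

```python
from urllib.parse import quote_plus

def scholar_query(domain, terms):
    """Build a domain-restricted Google Scholar query string.

    The "site:" operator is undocumented; the colon must not be surrounded
    by spaces, and the full domain (including .com, .org, etc.) must be
    spelled exactly as it appears in the results list.
    """
    query = f"site:{domain} {terms}"
    # Hypothetical submission URL -- the "q" parameter is an assumption.
    url = "http://scholar.google.com/scholar?q=" + quote_plus(query)
    return query, url

# The Blackwell example from the text: dengue fever in Hawaii.
print(scholar_query("blackwell-synergy.com", "dengue fever hawaii"))
```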
More unnervingly, my test searches by domain name clearly indicated that Google Scholar has gathered information for only a small fraction of the articles available on several publisher sites. For example, Blackwell claims that it has "437,451 records for articles published in 755 leading journals." Google Scholar finds 53,400 records in a domain search. In other words, nearly 90% of the records in Blackwell's archive are not retrieved through Google Scholar. This is not an extreme example, and it may have serious consequences, even if the records for some of the articles missed by Google Scholar show up in its results list from other databases, such as PubMed.
These records, however, offer only the descriptor-enhanced citation and/or abstract. They don't offer links to the subscription-based journal archives to which the user's library may subscribe. That's why the holes in Google Scholar's coverage of many scholarly journal archives are not merely an academic exercise and issue for this reviewer, but something that is important to most scholars and their libraries. That's why I elaborate on the coverage issue, reporting some additional test results here.
The superb search engine of HighWire Press, which hosts many publishers' journals, returns 29,044 hits for a test search of the top-ranked Proceedings of the National Academy of Sciences. Google Scholar retrieves only 12,900 records for the equivalent domain-restricted search.
One has to be careful with domain searches, as Google Scholar may show different domain names in the results lists of topical searches, or domain names that yield no results when used as a search parameter. This is the case, for example, with the Wiley archive. Its link appears as doi.wiley.com in all the results, but in a domain-restricted search, the string "site:doi.wiley.com" returns no results. It must be searched as "site:interscience.wiley.com" or "site:wiley.com". These two domain-name searches, by the way, bring up slightly different results. Indeed, it is possible that not all of the documents are stored under the same domain name. I tested several variants that I saw on results lists, as well as ones that I guessed as possible variants.
The native search software in the archive of the Institute of Physics found 187,678 records for journal articles (in an admittedly quick-and-dirty test search). Through Google Scholar's domain searching, the total number of records is 25,600 for "site:iop.org" and 24,400 for "site:www.iop.org". Sometimes the domain name with or without the www or other prefix makes no difference, as in the case of BioMed Central.
It is a no-brainer to sense that something is wrong when the query "site:ncbi.nlm.nih.gov" (the mouthful of a domain name for PubMed) brings up only 879,000 records, and the same number when using the "site:www.ncbi.nlm.nih.gov" domain. For a reality check, PubMed acknowledges that it has more than 15 million records.
And there are even larger gaps. Ingenta's native search engine reports having records for 17,343,034 articles, chapters, reports and other documents. Through Google Scholar, the total number of records was merely 128,000 for the query "site:ingenta.com" (the domain name that keeps coming up in the results lists). With due diligence I tried other domain name parameters, like ingentaconnect.com or catchword.com (acquired earlier by Ingenta), but Google Scholar did not find any records for these domains.
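To put these gaps in perspective, the following short calculation turns the figures reported above (native-archive record counts versus what Google Scholar's domain-restricted searches retrieve) into coverage percentages. The numbers are simply the ones cited in this review, a snapshot of the November 2004 tests, not authoritative archive statistics.

```python
# Native record counts versus Google Scholar domain-search hits,
# as reported in the tests above (November 2004 snapshot).
coverage = {
    "Blackwell":            (437_451, 53_400),
    "PNAS (HighWire)":      (29_044, 12_900),
    "Institute of Physics": (187_678, 25_600),
    "PubMed":               (15_000_000, 879_000),   # "more than 15 million records"
    "Ingenta":              (17_343_034, 128_000),
}

for archive, (native, scholar) in coverage.items():
    pct = 100 * scholar / native
    print(f"{archive:22s} {pct:5.1f}% of the native records reachable via Google Scholar")
```

Even the best case in this table leaves the majority of the archive invisible to Google Scholar.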
Casual users may not care too much about these problems, as long as they can find a few good records for scholarly articles from any journal of any academic publisher for their research papers. Real scholars, however, are concerned with finding as many relevant, and as few irrelevant or redundant, items as possible on a specific topic, and with not paying for something that their college, research institute or corporation has already paid the journal publisher for. The combination of the total lack of information about source coverage and the shallowness of coverage can hit serious users and/or their employers hard.
I have run several topical test queries limited to the appropriate domain across a number of archives using Google Scholar and the native search engines, searching separately in the full-text and title fields. As a follow-up on one of my earlier tests for Google CrossRef, I searched for the exact phrase "maximum fractional energy loss" in the full text.
The native search engine of the archive of the Institute of Physics (IoP) found 24 articles (one more than in my April 2004 test). Google Scholar returned only 15 hits. The item-by-item comparison did not indicate any pattern for the omission of records by Google Scholar. Current items from 2004 were missing, as well as items from 1985. The format of the full text — PDF versus PostScript — was not a reason for the omission either.
Other topical tests have shown similarly large differences across several archives for three test queries. These are not surprising in light of the disappointing result of the broad, domain-only searches.
The full-text search for the eponym Karman retrieved 430 records with the native engine and 271 through Google Scholar from the IoP archive. The ratio for Nature was 37-to-5. For the keyword "vortex," the ratios were similarly disheartening: 521-to-371 for Annual Reviews; 372-to-215 for Blackwell; 1,333-to-839 for IoP; and 195-to-15 for the Nature Publishing Group. The search for the phrase "energy loss" showed similarly bad ratios for Google Scholar: 700-to-521 in the archive of Annual Reviews; 677-to-400 for Blackwell; 7,899-to-3,730 for IoP; and 383-to-23 for the Nature Group.
Wiley's native search engine, which offers a combined abstract/full-text index, consistently underperformed Google Scholar in the full-text searches, suggesting possible problems with its implementation. These full-text searches yielded sets that were too large for item-by-item comparison.
However, the same searches limited to the title field made it easy to quickly spot the glaring omissions in Google Scholar. I posted a new polysearch engine on the Web so anyone can run test searches in the full-text and title fields using the native search engines and Google Scholar (with a predefined domain restriction) for five major publishers' archives.
After typing in the query and selecting the archives, the search is run and the results are displayed side-by-side in separate window panes. In this example, three articles are retrieved by the native search engine, and only one by Google Scholar. A 32-year-old article is the common hit, but the two more current ones were not found by Google Scholar, which has the author name oddly misspelled as DW INMAN. Oddly, because it appears correctly as D WEINMAN in the archive.
Scrolling down the 12 matching hits for the search about "vorticity" illustrates that the native search engine retrieved six times as many hits as Google Scholar because it is smarter and lemmatizes the query word "vorticity" so as to also retrieve "vortex," "vortices" and "vortical." This still does not explain why Google Scholar did not retrieve the record for the paper on vorticity dynamics.
Lemmatization, stemming and automatic pluralization could explain some differences between the number of hits in some other results lists, but this does not much change the disheartening ratios mentioned above. Most of them are inexplicable (and unacceptable) omissions, such as the fourth item for the Karman query where Google Scholar also shows a weird change in the order of the title words, suggesting an article describing "how von Kármán flows swirling," when it is about the swirling flows in the noted scholar's vorticity theory. Once again, in the archive the title is correct. The second and third matches may have been omitted because of the correct accents in Kármán's name, which Google apparently could not handle.
The retrieval of some articles with the plural form of the search phrase "energy loss" cannot alone explain why the ratio between the results of the native search engine and Google Scholar is 25-to-11 in the title-only test.
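To make the lemmatization point concrete, here is a minimal sketch of the kind of query expansion a lemmatizing engine performs before matching. The expansion table is a hypothetical, hand-built stand-in for whatever morphological resources the native engines actually use.

```python
# Hypothetical morphological variants grouped under a lemma; a real engine
# would derive these from a morphological analyzer, not a hand-built table.
VARIANTS = {
    "vortex": {"vortex", "vortices", "vortical", "vorticity"},
}

def expand(term):
    """Return every indexed form that a lemmatizing engine would match."""
    for forms in VARIANTS.values():
        if term in forms:
            return forms
    return {term}

print(expand("vorticity"))  # matches vortex, vortices, vortical and vorticity
print({"vorticity"})        # a literal-string engine matches only the exact form
```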
Large in a Bad Way
Searching by topical words alone would yield an impressively large number of hits from Google Scholar, as it seems to be a very large database, but it is large in a bad way. Here is a typical example of how inflated the hit counts of Google Scholar can be: it presents three entries (counting them as three hits), with 14 links, for a single article in Computer, a journal of the IEEE.
In this case, the inflated hit count is partly due to crawling a variety of sites whose scholarly nature is not immediately apparent from the funky names of their mirror sites, such as crazyboy.com and nigilist.ru (the transliteration of the Russian word for nihilist). My learned colleagues may not exactly feel lucky being steered to some of these sites whose entries may be graced by prominent journal names in the results list of Google Scholar.
Then again, discovering such sites with possibly unauthorized copies of articles may have been an argument in persuading scholarly publishers to let Google's special crawlers into their archives. I am the greatest fan(atic) of self-archiving by authors, but these and hundreds of thousands of other hits may not be cases of such self-archiving.
The third "hit" for the above query shows one of the many examples of Google's problem extracting the correct names of the authors. It misses authors and mistakes first names and initials for last names, even though on the page of the linked second site (which was the one working and sporting the PDF file in all its glory) they appear correctly.
The content of the results lists is rather enigmatic and badly needs an illustrated help file and some cleaning up.
The mass adulation for Google's search engine is largely due to its simple user interface and smart relevance ranking, which usually brings some of the most relevant hits to the top in a no-brainer format. Understandably, users often think, say and click "I'm feeling lucky." Google smartly indoctrinated, sloganized and "buttonized" this apothegm, just as AOL made grandparents happily hum the "You've Got Mail" ditty. As with its popular counterpart, searching Google Scholar is easy; finding the gems is difficult.
Content and Ranking of Hits
The display of the citedness scores would definitely make me feel lucky, but those scores are often much inflated (more about that later). I doubt that most users feel lucky looking at Google Scholar's results list. I bet many feel discombobulated by the enhanced entries, specifically by the labels preceding them.
Google has added new labels like CITATION, which identifies items extracted from the reference footnotes of other documents, bibliographies, curricula vitae, etc.; these items have no further information and therefore are not clickable. There is a link to launch a Web search using the standard Google search engine with a well-formulated query, which in turn retrieves pages that include the query term, but the user still may not get a link to the primary document.
The items with the PS label, identifying PostScript documents that are particularly popular for physics and computer science articles, may be unfamiliar to scholars in other fields. Therefore, they may be discouraged from clicking on such items as they would be required to download and install the PostScript plug-in. Users may also not understand why certain PDFs are offered for viewing in HTML format while others are not.
Few would understand why the no. 1 article appears with the same title 10 times in various formats scattered throughout the results list — showing up among other places as item no. 34, 48, 52, 54, 59, 64, 73, 89, 113, 117 and 119. It helps if they realize that this paper appeared in full and abbreviated versions in different sources. Scholars (who are not necessarily intimately familiar with information technology) may feel more confused than lucky and wonder how these records relate to the six others, four of which have cryptic hyperlinks as part of the no. 1 entry.
Clicking on the link to show a list of all six links in a separate window may not alleviate their confusion as it has a duplicate pair, which reduces the number to five. This is just a prelude to the really daunting task of understanding what the links mean, when and why they are selected, and where they take the user.
If they figure these out, they may believe that they understand the ranking of the results as they see the decreasing citedness scores, until they get to item no. 5, which was cited more than four times as often as item no. 4. They may guess that records that matched the query term in the title field are ranked ahead of ones with higher citedness scores, but this does not seem to hold true when looking at items no. 8 and no. 9.
If that's not enough, they may question why some items have a cached version while others don't. Then comes possibly the most discombobulating issue: the links listed in the records.
Links, Links, Links
The first time an eyebrow may really rise is when two links appear with the same name in the same entry — one hotlinked, the other not — such as in this entry from Blackwell. The first occurrence is not clickable because it is linked through the title field of the record. The second is hotlinked, but it takes you to the very same location within the archive as the title link. Although the names of the links suggest that you will be taken to the homepage of the publisher, they are just shorthand. Right-clicking a link and selecting the Properties option will reveal the full URL.
Many scholarly users would be even more puzzled as to why ingenta.com is hotlinked (because it hosts Blackwell journals) next to blackwell-synergy.com which is not hotlinked (because clicking on the hotlinked title takes one to the publisher's site in this case).
Furthermore, why is there a link to ncbi.nlm.nih.gov for the same record? Because MEDLINE also has a record for the article. It is only an abstracting/indexing record, but with MeSH terms as a bonus. But why does ingenta.com appear twice, with both occurrences hotlinked? Why does the second ingenta link take the user to the record of an unrelated article? Because the seemingly unrelated article does have a relationship to the main article about dengue fever.
Alas, this relationship is a very indirect one: the article to which the second link takes the user was published in the same issue of the journal Heredity. "So what?" you may ask. Well, the table of contents page on Ingenta includes both of them. That's it. And this is only the tip of the iceberg as the results screen shows more cryptic notations.
The CITATION Hits
The biggest confusion overall may be caused by listing the primary documents or their indexing/abstracting records intertwined with records for other documents that list the primary document in their references.
Results retrieved for my search on the problems of intractability and computers illustrate the possible extent of this problem and the inflated nature of the hit counts and citedness scores. The search yielded 8,130 hits. I looked at the first 100 "hits," and 92% of them were about the book "Computers and Intractability" by Garey and Johnson, with as many errors and inconsistencies in the title, subtitle, author names, publishers' names, locations and years as one can imagine. Only eight of the first 100 hits were for items other than this book, scattered around the results list as items 14, 27, 36, 37, 55, 89, 99 and 100. Ninety-one of the "hits" were labeled as CITATIONS, meaning that the "hit" was extracted from the references of other records in one of the other archives crawled by Google Scholar.
It is not that so many references were given incorrectly in the source documents. Most of them came from the cited reference list of the ACM Guide. It is a lovely archive, but it has a prominent note in red type in every record that "OCR errors may be found in this Reference List extracted from the full text article." Well, OCR errors are found in most reference lists as the technology is not yet perfect.
The problem is that the crawlers of Google Scholar take and deliver the references as they are, and Google Scholar then seems to create a record for each of them. Consequently, it counted and listed every variant that matched my two-word query. I don't know how many hits on the entire results list were for variants of this book, but I do know that no scholar would scroll down the 8,130-item results list of a topical search in the hope of finding the full documents, or at least abstracts, of other items relevant to this topic.
Google's approach is like mixing in a gigantic bowl the appetizer, soup, entree, salad, dessert and coffee. It is not exactly a mouth-watering potpourri, even though there are many delicious ingredients in the bowl.
All the other citation-enhanced systems (including the best free ones, like CiteBase and Research Index) handle these two hit categories separately; try to consolidate the format differences; filter the "citing" sources to avoid course listings and other materials of tertiary importance for a topical search, let alone for citation counting; and offer clearly explained options to look up cited and citing references.
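To illustrate what consolidating the format differences involves, here is a minimal sketch that reduces reference strings to a crude matching key before counting them. The normalization rules are my own simplified stand-ins, not the actual algorithms of CiteBase, Research Index or any other service.

```python
import re

def citation_key(reference):
    """Crude matching key: lowercase, keep only letters, digits and spaces.
    A real service would also match on authors, year and page numbers."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", reference.lower())
    return " ".join(cleaned.split())

# Three OCR-mangled variants of the Garey and Johnson reference collapse
# to a single key, so they would be listed and counted once, not three times.
variants = [
    "Computers and Intractability: A Guide to the Theory of NP-Completeness",
    "Computers  and intractability -- a guide to the theory of NP completeness",
    "COMPUTERS AND INTRACTABILITY; A GUIDE TO THE THEORY OF NP-COMPLETENESS.",
]
print(len({citation_key(v) for v in variants}))  # -> 1
```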
From the results returned as CITATIONS, you may launch a Web search in the generic Google service, or get to the cluster of the citing records for each in Google Scholar. The first hit on the original results list had 8,397 "cited by" sources; the second had 1,736. If you add up the citedness scores for every variant of this book, the total is well over 10,000. I don't know how many of these are double and triple listed and counted in calculating the citedness scores due to postings on mirror sites, and I wonder if anyone would want to find out.
I do know that books are more cited than other items in many disciplines. I do know that this is one of the most-cited books in computer science. I do know that a score above 5,000 unique citing references would make a computer science book, article or conference paper an all-time citation classic superstar (to borrow Eugene Garfield's terminology). I do know that both the hit counts for searches without domain restriction and the citedness scores are often inflated. Paradoxically, I also know that millions of citations from scholarly journals and books are not counted, let alone listed, such as the ones from most of the 1,700 Elsevier publications that are not covered at all by Google Scholar, let alone analyzed for citations.
Google, Inc. has the intellectual and financial resources (and the largest group of cheerleaders) to create a superb resource discovery tool for scholarly publications. It needs to disclose its content sources and their coverage, fill the enormous gaps in its crawling of the publishers' archives, consolidate the duplicate and citation-only records instead of letting them inflate the hit counts and citedness scores, and clean up its extraction of author names, titles and accented characters.
I promise that I will write a hagiographic review about Google Scholar when it is done, and done well.