Péter's Digital Reference Shelf - June

Google Scholar (Redux)

Publisher: Google, Inc.
URL: http://scholar.google.com
Cost: Free
Tested: Continuously

The Context

There have been only two multidisciplinary, citation-based or citation-enhanced mega-databases on the market: the classic Web of Science (WoS) from ISI and Scopus from Elsevier, which was launched officially in early Fall 2004. Both of them are expensive but they provide unique options for finding information about 35 million (WoS) and 25 million (Scopus) source documents along with hundreds of millions of cited documents (many of which are themselves source documents in WoS and Scopus).

Such mega-projects require mega-million dollar investments in order to subscribe to and process the thousands of expensive journals for traditional abstracting and indexing database purposes. Add to this the cost of processing close to 30 million citations every year and we're talking about big money for even a single year. The largest version of WoS (the Century Database) goes back more than 100 years with cited references. Scopus has cited references for articles published in the past decade.

There are some remarkable, although much smaller, discipline-oriented citation-based/citation-enhanced databases and services. PsycINFO has nearly 350,000 records enhanced with about 14 million cited references. Two of its many hosts stand out for their powerful and comprehensive handling of cited references: ISI through its Web of Knowledge platform and Illumina of CSA, which in April enhanced some of its own social science databases with 1.6 million cited references.

Then there are the free and awesome CiteSeer and eBizSearch databases, and the various impressive information services based on the RePEc metadata collection and the OpenCite project. All these use autonomous citation indexing, proving that it can be done far better than in Google Scholar. There is also the masterfully implemented super-archive of HighWire Press (HWP) with more than 800,000 open access scholarly articles. (See the HWP companion review this month.)

As I was re-testing HighWire Press and Google, I kept thinking about where we would be if publishers had provided HWP with the same unfettered access to their digital archives as they did for Google. Publishers should recognize that Google is not the only game in town and that there are others who are just as smart and more committed.

Google Scholar has to be looked at with this background: Even in its disappointing incarnation it is an asset for those scholars whose university or research institute cannot afford WoS or Scopus. Those who just need a few good papers and Web sites might as well be satisfied with the regular Google or Yahoo search engines. Those who need a comprehensive set of papers that includes the most respected (and hence most-cited) articles, books and conference papers are advised to treat the hits — and citedness scores — in Google Scholar with much reservation.

The "Googlemania" fueled by the enormous media publicity and laymen's ecstasy rubs off on Google Scholar and makes otherwise learned people disregard reality. It is a telling example that Jan Velterop, who deserves much credit for directing Biomed Central's impressive open access journal project, was quoted by Information World Review as calling Google Scholar "... a threat to Scopus, it is better and it is free." Is it really better? I don't think so. And luckily, there are other sober voices.

Gary Price gave a summary of the new developments, as well as the relentless and unappealing secrecy about the sources covered by Google Scholar. It enhanced his initial review, in which he and Shirl Kennedy raised several concerns about the scholarly nature of some of the source genres discovered and the flimsy implementation of some software options, like date searching. Roy Tennant is among the few who do not jump for joy at a new announcement from Google. His critique of Google Print (Google Out Of Print) also suggests what I feel: Google is going to become a jack of all trades and a master of none by keeping too many irons in the fire. In the meantime, I was laboring to find improvements in the areas that I criticized when Google Scholar debuted half a year ago. Most of these were not addressed by the software update, which added some badly needed advanced search options. Improvements were few and far between.

The Software

I have decided to discuss the software issues before the content in order to explain how difficult it is to make even elementary searches efficiently (let alone special searches for database evaluation) for lack of adequate software tools for browsing, searching and saving results in structured output formats.

Browsing

There are no browsing options at all in Google Scholar. These are essential to look up variants of author and journal names. Even in high-quality databases there are many instances of inaccurate and correct but inconsistent spellings of author and journal names. The best systems offer users a chance to look up at least simple variations by browsing field-specific indexes or a special database, such as Dialog's Journal Name Finder.

With hundreds of millions of cited references, WoS makes not only the author and journal name index browsable, but also the indexes of the cited author and cited work (book title, journal name, etc.). Google Scholar has not created any browsable indexes from the publishers' archives in spite of their metadata-rich records that identify the data elements. Beyond a reminder that journal names may be variably spelled, the searcher is left completely in the dark as to how to enter a journal name, such as the many variants of the Proceedings of the National Academy of Sciences (PNAS).

Searching

The advanced search page was added earlier this year, but in many regards it is like a movie prop — for show only. Beyond the standard Google advanced search options in Google Scholar, seemingly, you may also restrict the search to publication name, author name and publication year or year range. But don't applaud before you read the caveats about these options.

Searching for journal name

You may search one variant name at a time in Google Scholar's journal name cell, but which one do you choose: PNAS, Proc Nat Acad Sci, Proc Natl Acad Sci? To put things into perspective, in the PASCAL database of the French National Scientific Research Center, there are 64 differently abbreviated variants for PNAS. True, PASCAL has had the sloppiest entries among the active databases, but it is unnerving to think about all of the possible name variations in Google Scholar.
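What a journal-name lookup aid does can be sketched in a few lines. This is only an illustration, not a description of how Dialog's Journal Name Finder or any other real service works: the variant table is hypothetical and holds just the three PNAS spellings mentioned above, and the `normalize` and `resolve` helpers are invented for the example.

```python
import re

CANONICAL = "Proceedings of the National Academy of Sciences"

# Hypothetical variant table. Only these three spellings come from the text;
# a real index would hold many more (PASCAL alone has 64 for this journal).
VARIANTS = {
    "pnas": CANONICAL,
    "proc nat acad sci": CANONICAL,
    "proc natl acad sci": CANONICAL,
}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(cleaned.split())


def resolve(name: str):
    """Return the canonical journal name, or None if the variant is unknown."""
    return VARIANTS.get(normalize(name))


resolved = [resolve(v) for v in ("PNAS", "Proc. Natl. Acad. Sci.", "Proc Nat Acad Sci")]
unknown = resolve("Science")
```

With such a table behind the search box, the user would not have to guess which spelling the database happens to use.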

At least your worrying is over as soon as you realize that the software has nothing to offer. You can't even truncate the words in the query, let alone use proximity and positional operators, which would ease the pain of guessing.

At the opposite extreme, the simplest journal names are even more of a headache in Google Scholar with its lack of browsable indexes. It is hopeless to make a pure search for articles published in, say, Science magazine on a given topic. The result list is full of journals whose title includes the word science along with other terms such as Information Science & Research or Cognitive Science.

This is the same problem you will find when querying for a topical term and a single-word journal name like Blood, Brain, Gut, Cancer or Cell. You will be flooded with articles from journals whose name includes your word along with other words. You may exclude some, but this option is very limited. Even a more specific two-word journal name, Current Science for example, is hopeless to search, because articles published in journals irrelevant to the query, such as Current Protein and Peptide Science, Current Trends in Psychological Science and many others, hog the first few pages of the result list. It adds insult to injury that Google Scholar mistakes Current Science, Inc., the publisher of the Current Reports series, for the journal Current Science a few thousand times.

Wouldn't it help to use "Current Science" as an exact phrase for a journal name? It certainly would, but Google Scholar does not allow it for journal names. It removes the quotation marks around the journal name and searches for any journal whose name includes the words.
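The difference between the two behaviors is easy to show with a toy example. The sketch below is not Google Scholar's actual matching code, which is not public; it merely contrasts "any journal whose name includes the words" with a true exact-name match, using the journal titles named above. Both function names are made up for the illustration.

```python
def words(name: str) -> list:
    """Lowercase a journal name and split it into words."""
    return name.lower().replace(",", " ").split()


def matches_any_order(query: str, journal: str) -> bool:
    """True if every query word occurs somewhere in the journal name."""
    journal_words = set(words(journal))
    return all(w in journal_words for w in words(query))


def matches_exact_name(query: str, journal: str) -> bool:
    """True only if the journal name consists of exactly the query words, in order."""
    return words(query) == words(journal)


JOURNALS = [
    "Current Science",
    "Current Protein and Peptide Science",
    "Current Trends in Psychological Science",
]

loose = [j for j in JOURNALS if matches_any_order("Current Science", j)]
strict = [j for j in JOURNALS if matches_exact_name("Current Science", j)]
```

Here `loose` contains all three journals, while `strict` contains only Current Science, which is what a searcher putting quotation marks around a journal name presumably expects.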

Searching for author name

In the author name field, exact searching is theoretically possible by enclosing the initial(s) and the last name of the author in quotation marks. Searching for MJ Bates, especially when combined with a distinguishing subject word such as online, is pretty efficient (although not perfect) in telling apart the information scientist Marcia J. Bates from the physicist MJ Bates. The problem arises when searching for authors who have only one initial, such as "E Garfield". The search will retrieve many articles from RE Garfield, and even EF Garfield, in spite of the exact phrase search syntax. Adding a subject word would make the result list better, but there are times when you want to search by the author's name alone and then pick items from the result list, or limit the search to a given range of publication years.

The result of searches limited to a publication year or year range will make you so discombobulated that before dealing with it you may wish to take a look at a sample of the most productive authors in Google Scholar such as I. Introduction (13,500), A. Professor (3,780), I. Part (2,360) or P. Emeritus (582). As a symbol of the still existing glass ceiling in academia, P. Emerita checks in with 262 hits. The most colorful author name belonged to a certain Mr. A Agoraphobia. Although he had only one hit, he seems to be part of a dynasty, at least based on his qualifier: A. Agoraphobia IV. These examples suggest that Google needs to work on the algorithm for author name extraction a bit more.

Searching by publication year

Quite often the quickest way to reduce the number of hits is to restrict a topical search to a year or a current time span. It can be a bit dangerous because records that omitted the publication year will be automatically excluded from the result, drastically reducing the number of hits. Can limiting a search to a year (or year range) increase the result? Nothing is impossible for Google Scholar — it offers such an option, but you may not want to use it.

The search for protein limited to documents published between 1970-2005 yields 942,000 hits. Limiting the search to 1971-2005 should return fewer hits, shouldn't it? In old-fashioned systems it would, but in Google Scholar the result increases to 993,000. And what happens if the same search is limited to 1972-2005? It surpasses the 1 million mark, returning 1,080,000 records. Yes, the searches were done in one fell swoop. I took snapshots.
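The rule being violated can be stated as a small sanity check: a year range that lies inside another can never hold more records than the wider one. The sketch below applies that check to the three hit counts reported above; the helper names are invented for the illustration.

```python
# Hit counts reported in the text for the query "protein".
reported = {
    (1970, 2005): 942_000,
    (1971, 2005): 993_000,
    (1972, 2005): 1_080_000,
}


def contains(outer, inner) -> bool:
    """True if the inner year range lies entirely within the outer one."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def monotonicity_violations(counts):
    """List (wider, narrower) range pairs where the narrower range has more hits."""
    violations = []
    for outer, outer_hits in counts.items():
        for inner, inner_hits in counts.items():
            if outer != inner and contains(outer, inner) and inner_hits > outer_hits:
                violations.append((outer, inner))
    return violations


violations = monotonicity_violations(reported)
```

Every one of the three possible pairs fails the check, so `violations` has three entries.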

Next, Google Scholar interprets another traditional search option in an unorthodox fashion, and it is not an exotic one but one usually taken for granted: Boolean OR. If "protein" retrieves 7,390,000 hits and the plural form "proteins" retrieves 3,790,000 hits, what will be the minimum number of hits when using "protein OR proteins" (directly or indirectly through the search template)? An OR search can never return fewer hits than its larger operand, so the minimum should be 7,390,000. Forget what you learned in kindergarten; the answer from Google Scholar is 1,280,000. All nice round numbers, but librarians will have to explain why they misled LIS 100 students with their atavistic concepts about Boolean logic. I saw this liberal handling of the Boolean OR in early April in the regular Google.
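For the record, the arithmetic behind the expected result: the set of records matching A OR B includes every record matching A and every record matching B, so its size lies between the larger of the two counts and their sum. A minimal check with the figures quoted above (the function name is made up for the illustration):

```python
def or_bounds(hits_a: int, hits_b: int):
    """Lower and upper bounds for the number of hits of 'A OR B'.

    Lower bound: one result set fully contains the other.
    Upper bound: the two result sets do not overlap at all.
    """
    return max(hits_a, hits_b), hits_a + hits_b


lower, upper = or_bounds(7_390_000, 3_790_000)  # "protein", "proteins"
reported_or = 1_280_000                          # "protein OR proteins"
consistent = lower <= reported_or <= upper
```

The reported OR count falls far below the lower bound of 7,390,000, so `consistent` is `False`.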

Output options

There is a single output format; the searcher has no options for its content or layout. The results cannot be sorted by any user criteria. The only sort criterion used to be the citedness score, displaying the results in decreasing order of citedness. This may be good most of the time, but not all of the time. The searcher may need to sort by author. Forget about it: author names, like the many other words acting as author initials and last names, are in the format of initials first, then last name (such as ME Koenig). Perfect for sorting a bibliography by first initial. And don't even think about sorting by date, as Google Scholar does not know how to determine the correct publication year. There are also no options to save (download) or e-mail the short result list records of scholarly papers in a structured format, such as XML, RIS or CSV, all common in professional systems. The precious metadata tags, which identify the data elements that make up the records and are characteristic of most scholarly archives, are lost.
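Why the initials-first format gets in the way of author sorting can be seen in a short sketch. It assumes, purely for illustration, that every name has the shape "initials, one space, last name"; multi-word last names or suffixes would need more careful parsing, and `surname_key` is a hypothetical helper, not part of any system discussed here.

```python
def surname_key(display_name: str):
    """Sort key for names shaped like 'ME Koenig': last name first, then initials."""
    initials, _, surname = display_name.partition(" ")
    return (surname.lower(), initials.lower())


names = ["ME Koenig", "E Garfield", "MJ Bates"]

by_string = sorted(names)                    # plain string sort, led by the initials
by_surname = sorted(names, key=surname_key)  # the order a bibliography needs
```

The plain sort yields E Garfield, ME Koenig, MJ Bates, while the surname key yields MJ Bates, E Garfield, ME Koenig. Where structured author metadata is kept, no such guesswork is needed.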

The problem may be that Google developers have been working smart and hard to make heavy duty software excavators that dig up useful data from the unstructured masses of zillions of Web pages. They were great for discovering the Web landfill, but not for digging in scholarly archives, just as heavy duty excavators are inappropriate for the archeological digging of Mayan tombs, which calls for tiny pick axes, chisels, shovels, trowels and brushes to extract, clean, bag and label the finds. The result of using inappropriate technology is clearly visible on the result list. The first item correctly indicates that it is a record for a paper published in PNAS in 2001. It is followed by a snippet that is rather confusing for a non-librarian, with information that the user doesn't need to know at this stage (pagination, volume, issue) or does not need to see (the DOI, which is ugly, but very useful for linking to the source document).

Unfortunately, the second item is the far more typical entry type. Where was this article published? In PNAS? Then why is the abbreviated name of the Journal of Biochemistry, its place of publication and date of publication displayed as opposed to item #1 before it? How can an article published in 2004 already be cited 13,791 times? Why is the title repeated instead of the abstract? What are those links following the citedness score? Well, this is the Voice of Esau, Hand of Jacob situation. J Biochem is the name of the journal that has one of the hundreds of articles in the stable of PNAS that cites the DNA sequencing article published 28 years ago. Why is that one picked for the snippet? I have no idea. It is about the 40th in a very long list of citing articles. It is the PNAS article that has been cited so often and the link takes you to that article. The other links take you to PubMed Central, PubMed and the Astrophysics Data System at Harvard. I am afraid this is not immediately clear to every user. It may be a good quiz for the online (asynchronous distance education) Ph.D. program, which is a fiercely discussed issue on the JESSE listserv these days.

I wonder which of the above software options prompted Mr. Velterop to say that Google Scholar is better than Scopus, which displays this hit in a perfect format. Well, it may have been the content.

The Content

Google Scholar's content was finally updated in mid-April after a six-month dormancy. I wish I could link you to a site where Google lists at least its publisher partners and their journals, as HighWire Press, WoS, Scopus and most professional services do, whether their service is free or subscription-based. There is no quantitative information about the sources in Google Scholar. The simple paragraph in the About file is pretty non-committal, and so is the statement from one of the developers of Google Scholar in the above quoted IWR article: "We do cover most major publishers from broad areas of research."

It reminds me of the informativeness of the communiqués dispensed by the Korean Central News Agency about the nutritional condition in the country. The test searches done for my earlier review, as well as this one suggest that Google has a reason to be so tight-lipped.

Whether my tests used author names, journal names, subject terms or title word combinations, the Google search results compared poorly with results received from the native search engines of the publishers' archives, from WoS, Scopus and some traditional indexing/abstracting services, such as PsycINFO as implemented by CSA and Web of Knowledge.

It will be no surprise that I can't provide my usual overview about the content, composition and dimensions of the database. No one can with software like this, and with the secrecy of Google. Suffice it to say that in Google Scholar, information about scholarly and not-so-scholarly journals, books, PowerPoint presentations, course materials and many other source "documents" is mixed. The professionally designed systems, like Scirus, Scopus, WoS and CSA, handle the two sets separately for several reasons.

Google Scholar is muddying the water and makes its hit numbers and citedness scores even less reliable by this mixing. It is as if the bartender put not only tonic in your gin, but also the peanuts. You may ask your bartender to do it again without mixing, but in Google Scholar this is the mandatory way.

I have to admit that after hundreds of hours using Google Scholar I have no idea of the size of the database; how far it goes back; how many journal articles, book chapters, conference papers, conference proceedings, dissertations, manuscripts, preprints, postprints, reprints, e-prints, appendixes and you-name-its are hoarded in the database. I do know, however, that even after the update, millions of scholarly articles were left behind by the crawlers that were sent in to the precious archives of many scholarly publishers. You can do your own sampling by using my special polysearch engine for some publishers' archives.

My usual techniques to get some details about databases don't work here because of the serious limitations of the software and of its disregard for common practices and conventions. It is quite easy to tell the number of items in a database just by doing a sweeping search by the widest feasible publication year range. Of course, if not all the records have publication years then the number is not accurate. But in professional systems this can be quickly corroborated by complementary searches.

In Google Scholar the search for the year range 1455-2005 (in case Google already scanned the original Gutenberg Bible) yields 1,388,000 hits. Considering that the term "protein" returned 7,390,000 hits, this number is as unreliable as any other, even if in this search only the source items were counted. By now you will not be surprised, but will still feel confused, to see that the number of hits keeps increasing as you narrow the year range: there are 1,430,000 records for 1905-2005. The searches were done consecutively in the same minute, so the addition of new records is not a good explanation.

The only test that can give some insight is to do title-only known-item searches and then compare the number of records and their citedness with those returned from Web of Science and Scopus. Be forewarned: Google Scholar inflates its hit counts by counting every misspelled variant as a hit, as shown in the screenshot for its top-cited entry, with a citedness score of 31,955, about the landmark article by Laemmli, et al. There are additional entries with the name spelled as Laemli, Leamli and the title also spelled erroneously and/or incompletely. It is quite telling about the sloppiness of us authors that 250-300 citations use the same misspelled name and title variants. You may be impressed by the high citedness score in Google Scholar (even without adding up the citedness scores of all the variants). Again, don't applaud yet.

Scopus brings up one record, but before you dismiss it look at its citedness score: 61,833. Even if you would scrape together all of the erroneous and inconsistent variants in Google Scholar, the difference is still 50-60%. I checked the top 100 records in Google Scholar with the highest citedness scores and compared them with the citedness scores in Scopus. The pattern was very similar throughout.

Web of Science had the highest citedness scores as it goes back the furthest in including cited references. Scopus makes it clear that it includes references from 1996 onward (and delivers better, with a significant number of records enhanced with cited references from 1995). Google also takes the 5th Amendment in this regard. The pattern seen in the PNAS test is true also for social science articles, and for interdisciplinary articles, such as the most cited psychiatry paper in Google Scholar with a score of 5,829. Scopus reports 12,197 citations received by this article. I have always found its numbers trustworthy on sampling. Such tests provide a feel for the breadth of coverage and the time span of citation databases, and hopefully make you think hard.

I don't have the delusion that most users will care about any of the above. But librarians, other information professionals and scholars should be aware of the limitations of Google Scholar and not join the Google-for-president crowd by dispensing careless endorsements and hollow sound-bites for the press, faculty, students and staff.

There are certainly many journals of many publishers covered to keep casual users, high-school and undergrad students, TV talking heads and shallow journalists happy, but for scholarly research the breadth of coverage is not sufficient, the implementation is sloppy and the software options are inferior.

I dread the moment when ill-informed administrators at mediocre universities pushing asynchronous distance education and other ideas to cut costs will quote Mr. Velterop and others when canceling Scopus, WoS and other citation-enhanced databases because they believe that Google Scholar is better. Then they will cut library jobs and eliminate those costly libraries. And then ... you figure out the rest.
