Péter's Digital Reference Shelf
June 2004

Title: CrossRef Search Pilot
Publisher: Google
URL: http://www.google.com/cobrand?restrict=crossref&cof=AWPID%3Abbd6d01e9a530922

Cost: Free (bibliographic records and abstracts)
Tested: May 20-25, 2004

The Context

I am fully aware of the fact that writing something less than a hagiographic commentary about Google is not merely a felony, but a despicable act in the eyes of many. I never received as much negative reaction to my opinion as when I wrote a column about the hype of Google's PageRank soon after its launch five years ago. Yet here I come again with something less than adulatory.

Make no mistake, I like Google. I have the Google toolbar in my browser and I use it 25 to 30 times a day. I appreciate how smart and nimble it is with 3.5 billion Web pages of mostly unstructured text with no metadata, no tagged and marked fields to identify author, publication date, subject and the like. But I am not impressed by its silver-spoon-kid attitude and its modest ability in handling a collection of about 2.5 million scholarly articles that are endowed with consistently used rich metadata. These articles were presented to Google (whose spider could not crawl these pages) on a silver platter by nine well-known publishers for the CrossRef Search Pilot project. CrossRef, a collaborative reference linking service, coordinated the project and deserves kudos for facilitating the smooth and empowering digital linking of scholarly articles from publishers who have not been known to be exactly friendly.

A Resource for All Seasons?

It is well-known that the majority of Web users turn first (and often only) to Google to find information. This February we learned something else from the keynote address of John Regazzi at the annual conference of the National Federation of Abstracting and Information Services (NFAIS). According to his survey, in which librarians and scientists were asked to name the top scientific and medical search resources that they use or are aware of, "scientists named Google, Yahoo! and PubMed" as the top three resources. That's quite a surprise, knowing that scientists of the developed world, especially in the hard sciences, are very well-served by gigantic full-text searchable, interlinked digital journal archives of scholarly publications. (I wish there had been more information about the population surveyed.)

If that isn't bad enough, one slide from Regazzi's presentation shows that Google ranks first with 39%, Yahoo! is second with 10% and PubMed is third with 9% in terms of popularity among the scientists. Three of the largest and most sophisticated journal archives and citation indexing databases checked in with a vapid 2% each, even though they are also free for the scientists if they work for a research institute or university library that subscribes to those archives.

Luckily, according to the survey, the librarians surveyed are not that blinded by Google and Yahoo!, knowing very well that the expensive, but very valuable, scholarly databases are simply invisible to Google and Yahoo! (and hence to the uninformed end-users). Many libraries do a good job of letting their patrons know about the importance of using professional and scholarly database supermarkets and publishers' digital archives, but some just keep licensing databases and archives without adding user-friendly resource discovery and access tools to help their patrons find the most appropriate licensed resources.

Content & Software

In light of Google-mania, it was an excellent idea for CrossRef to persuade some of its largest members to deliver a couple of million articles to Google, making them searchable and making the metadata (bibliographic citation and abstract) freely available. This seemed to be a case of being good for the goose and good for the gander. Those users whose libraries subscribe to the digital versions of the journals would also be able to display, print and save the articles; others would need to pay a fee for the source document (a fee that is always less than those charged by document delivery services). Any user could navigate further into the publishers' archives and benefit from their special research options, such as the superb citation tracing services offered by Annual Reviews, Inc. and the Institute of Physics, to list two of the nine participants in the project. (Not all of the publishers offer free citation tracing.)

This is a pilot project because it guides you to the journal articles and because it is guaranteed to be available only until the end of 2004, although I very much hope that it will be kept going. I am sure some abstracting/indexing services that have been providing the same old service for more than 30 years will not share my hope.

Spoon-Feeding Google

The idea was excellent because users who turn to Google for any information could use that beloved interface and be gently guided to many of the best rivers that have not only potable water, but a special mix from nine tributaries with high nutritional value. This should increase the chances that they will drink the good water.

While it is good that the full text of the millions of scholarly articles can be searched in one fell swoop, instead of hopping from publisher site to publisher site, other aspects of the implementation by Google are disappointing and muddy the water. In my tests, I found many times that the native search engines at publishers' sites had far more articles that correctly matched the query than Google, and very often Google could not find the articles even though they were present in the special collection.

For example, the Blackwell site found 660 matching records for the query about "energy loss," while Google CrossRef found only 242. The native search engine on the site of Annual Reviews found 13 correct hits for the exact phrase query "fractional energy loss," while Google found only one and listed it twice. The differences were very substantial in most of my tests. It was frustrating that the Google results kept going up and down during the week that I tested the search engine, although by not more than 10%-15%.

The nine publishers who came to Google with gifts in hand include the American Physical Society (APS), Annual Reviews, Inc. (AR), Association for Computing Machinery (ACM), Blackwell Publishing, the Institute of Physics (IoP), the International Union of Crystallography (IUCr), Nature Publishing Group, Oxford University Press (OUP) and John Wiley & Sons (Wiley).

Their gifts are precious in more ways than one. Obviously the content is precious, but the fact that spiders of Web-wide search engines cannot normally access them makes this special collection even more precious. Their contents are stored in databases and are displayed in HTML format only when triggered to do so by a query. Spiders (including Google's spiders) don't ask questions; they just crawl Web sites and gather all the HTML, PDF, Excel, Word and PowerPoint files that they find lying around on their designated route. To solve this problem, the spiders must be fed the pages of the scholarly articles, so that they can collect them and deliver them to the indexing components of the search engine.

Google accepted the gift, loaded and indexed the pages, and apparently declared the case closed. It would have been essential (and it still is if it's not too late) to provide at least a checkbox on Google's search page to limit the search to this special collection, as well as a link to a page with some background information. After all, the purpose of feeding Google this information (readily available with free bibliographic citations and abstracts, and with far superior software on the publishers' site) was exactly to make it available through Google as well.

Publishers provide on their sites a simple query form with a logo heralding that it is powered by Google, as are 2,240,000 other sites. CrossRef has spread the word about the project through press releases and on its own well-designed site since the end of April.

The content of more than 1,100 journals is indexed. This number by itself may not be informative without saying that many of these journals are top-ranked in their fields and are widely held in digital and print format by university, college and special libraries. Many of them have digitized their entire run.

I wish I could tell you how many articles are indexed, but the number would be misleading. Sure, you can do the simplest search, using the definite article that must appear in every English language article, and you will get 2,520,000 hits. The problem is that this is not equal to the total number of articles in the CrossRef collection. As I tested seven of the nine domains and searched the publishers' sites using their native search engine to run comparative queries, I grew increasingly suspicious of how Google handled this project.

Limiting the search to one or more publisher domain(s) is important, because a search term may have many different meanings, yielding a lot of noise in a search across all nine publishers' sites. If the query can be easily limited to the search about, say, "energy loss" to domains of APS/AIP, IoP and IUCr, it can spare the user from being swamped by hundreds of non-pertinent records, such as the ones from health and nutritional publications that would be irrelevant for a physicist. It is not a perfect solution because some publishers have journals both in physics and nutrition, and not accidentally, they also offer options to limit the search to one or more disciplines.

True, more experienced users can figure out that using the "site:" prefix would do this domain limiting function, but they may not know the exact domain names that should be used (without a space in front of them). For instance, is it "site:interscience.wiley.com", "site:doi.wiley.com" or "site:wiley.com"? In CrossRef, the first search yields 3,590 pages, the second yields 46,100 pages and the third one 49,500. Would the latter also find materials that are, say, news items rather than scholarly publications? When combined with a topical search, like "energy loss," the first and the third domain options yield the same 127 pages; the second one, however, finds none. The user who sees the "doi.wiley.com" domain in the Google result list and tries to use it in a limited domain search will be surprised.
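To make the domain-limiting idea concrete, here is a minimal sketch of how a search form could assemble such a query on the user's behalf. The function name is hypothetical, and joining several "site:" terms with OR in parentheses is an assumption about the query syntax rather than something the pilot interface documents.

```python
def build_site_query(phrase, domains):
    """Combine an exact-phrase query with optional site: restrictions.

    Hypothetical helper: it only assembles a query string and sends nothing.
    """
    quoted = f'"{phrase}"'
    # The domain must follow the site: prefix with no space in between.
    restrictions = " OR ".join(f"site:{d.strip()}" for d in domains)
    if len(domains) > 1:
        # Assumed syntax: group alternative domains in parentheses.
        restrictions = f"({restrictions})"
    return f"{quoted} {restrictions}".strip()


single = build_site_query("energy loss", ["interscience.wiley.com"])
several = build_site_query("energy loss", ["iop.org", "interscience.wiley.com"])
```

A form built this way would spare the user from guessing which of several plausible domain names actually works, which is exactly the trap described above.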

Choking on the Silver Spoon

I warn against taking the number of results reported by Google at face value because the same record may appear two, three or even four times in the result lists, and Google counts each one for its hit report. This quadruple listing happened, for example, for one of my test queries in the IUCr domain of CrossRef. You can see that the URLs of these entries refer to different segments of the same document (header, body, etc). There is no difference between the author, the header, the body and the count segments.
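The effect of such multiple listings on the hit count can be illustrated with a small sketch. The URL layout below (an article identifier followed by a segment name) and the example.org addresses are invented for illustration; the real publisher URLs are not reproduced here.

```python
from urllib.parse import urlsplit


def document_key(url):
    """Reduce a segment URL to a key identifying the whole document.

    Assumes a hypothetical layout of /<article-id>/<segment>.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    base, _, _segment = path.rpartition("/")
    return (parts.netloc.lower(), base)


def count_unique(urls):
    """Count distinct documents rather than distinct result entries."""
    return len({document_key(u) for u in urls})


# Invented result list: four segments of one article plus one other article.
hits = [
    "http://journals.example.org/paper123/header",
    "http://journals.example.org/paper123/body",
    "http://journals.example.org/paper123/author",
    "http://journals.example.org/paper123/count",
    "http://journals.example.org/paper456/body",
]
reported = len(hits)
distinct = count_unique(hits)
```

In this made-up case the reported count is five while only two documents were actually found, which is why the raw hit numbers deserve suspicion.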

It is even more confusing that at the same time Google missed two records when searching for "energy loss" as a phrase (either with or without a hyphen) while they were retrieved by the native search engine. It is quite unnerving that two out of three records could not be found by the obviously correct subject search. Both of them are open access articles, so their absence is even more regrettable.

IoP's native search software, Verity, found 23 hits for the query "maximum fractional energy loss." Google found six and temporarily omitted one of them, considering it (incorrectly) to be a duplicate for one of the other five records. I summarized in a chart which records were found for the query by Google in the CrossRef Search Pilot collection and which records were retrieved by the native search engine on IoP's own site. For good measure, I also tested to see what the mainstream version of Google found for the above query if limited to the iop.org domain. The results are posted in a tabular form for convenience.
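The kind of side-by-side comparison behind such a chart amounts to simple set arithmetic. The sketch below uses placeholder record identifiers, not the actual records from my tests, and the function name is hypothetical.

```python
def compare_hits(native_hits, google_hits):
    """Split two result lists into shared and engine-specific records."""
    native, google = set(native_hits), set(google_hits)
    return {
        "both": sorted(native & google),
        "native_only": sorted(native - google),
        "google_only": sorted(google - native),
    }


# Placeholder identifiers standing in for real article records.
native = ["rec01", "rec02", "rec03", "rec04"]
google = ["rec02", "rec04", "rec05"]
report = compare_hits(native, google)
```

The "native_only" list is the interesting one here: those are the records a user relying on one engine alone would never see.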

One record that Google could not find but IoP could was item no. six in the IoP column in the table. It is a 15-page article from 2002. When searching for the article by words in its title, Google yielded two results. The first hit is "our" article (the other was an article published in 2003), but as you can see from the URL, it is a link to the table of contents site of the IoP journals. The lack of a small label in front of the entry in the result list informs you that it does not have a PDF. Or does it?

It turns out that Google misinforms you. If you click on the item you are indeed taken to the table of contents page that has a link to the abstract and to the PDF file. So why was the presence of the PDF not indicated in Google's result list? More importantly, the abstract has the exact query term so Google should have found it, even though it did not find the PDF itself. It is an enigma as to why this happens.

Somewhat similar is the situation for the next item, which IoP's native search engine found but Google did not. The article was published in 2000 and searching by words in its title, Google found a match. Again, the URL indicates that the link is to an abstract that does not include the query term. There is, however, a PDF version of the article that does include the exact query term "maximum fractional energy loss." Why then was it not found for the original subject query? I don't know for sure, but I can speculate before further researching the issue (which would make me miss my deadline badly).

It is possible that the traditional, but little-known, limitation of Google — in which it indexes HTML files only up to the first 100 KB and PDFs up to 120 KB — backfires. This could be quite bad, as scholarly articles tend to be far longer than that limit. If the query term first occurs after that limit, the article will not be found. Out of curiosity, I checked the size of the 23 hits from the IoP list. Only one PDF was below this limit at 104 KB. The other articles were well beyond — two of them had PDFs of more than one MB. Has anyone thought about the implications of this?
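If this speculation is right, the mechanism is easy to sketch. The code below is a toy model, not a description of how Google's indexer actually works: it assumes that only the first part of each file is searchable, uses the 100 KB and 120 KB figures quoted above, and treats a KB as 1,024 bytes.

```python
# Limits as quoted in the text; 1 KB = 1024 bytes is an assumption.
HTML_LIMIT = 100 * 1024
PDF_LIMIT = 120 * 1024


def findable(content: bytes, phrase: bytes, limit: int) -> bool:
    """Return True only if the phrase lies wholly within the indexed prefix."""
    return phrase in content[:limit]


phrase = b"maximum fractional energy loss"

# Synthetic documents: filler bytes with the phrase placed early or late.
early = b"abstract: " + phrase + b"." + b"x" * (300 * 1024)
late = b"x" * (130 * 1024) + phrase

found_early = findable(early, phrase, PDF_LIMIT)
found_late = findable(late, phrase, PDF_LIMIT)
```

Under these assumptions the second document contains the phrase but would never be retrieved for it, which matches the pattern of a known-item search succeeding where the subject search fails.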

The third example (item no. seven in the table) shows that Google found two entries for the known-item type search, which it could not find for the subject search in the full text. It shows both the table of contents page of the issue and the abstract in the result list. Once again, it is enigmatic, unless the file size limit is the culprit, which is not a consolation for the end users.

When I did the test searches using the regular Google search engine, it brought up other anomalies. For the query about "maximum fractional energy loss" it could find two of the articles missed by the special CrossRef Google version. However, it missed item no. 13 on the table, which was found by the CrossRef version of Google. In the Annual Reviews domain, CrossRef Google did not find a match for items with "energy loss" in the title. AR's splendid native search software from HighWire Press found three review chapters and one correction record. I found many other such anomalies even without trying hard.

The anomalies mentioned above may be software and/or content related. The ones mentioned here are clearly software deficiencies — at least in light of what the regular Google version offers. There is no cached version of articles, nor is there an automatic dictionary definition for your search word. There is no advanced search mode. There is no field-specific search option. Sure, you can use the "intitle:" search prefix, but my test showed that to be of little use. Why? Because not all of the HTML pages of the articles have a title in the HTML title tag, so you cannot rely on it.

The beauty of these precious scholarly collections is that the documents have clearly tagged structures, identifying the various data elements, like title, abstract, cited references, publication year, author, author affiliation, full text, etc. That's what the native search engines of these publishers exploit when they allow the search to be limited to the data elements listed above, either alone, as in limiting to title, or in combination, as in limiting to title or abstract. Google offers none of these options that are essential for focusing a search.
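A minimal sketch of what field-limited searching looks like on tagged records follows. The three records and the field names are invented for illustration, and the function stands in for whatever the publishers' native engines really do.

```python
def search_fields(records, phrase, fields):
    """Return records in which the phrase occurs in any of the named fields."""
    needle = phrase.lower()
    return [
        r for r in records
        if any(needle in str(r.get(f, "")).lower() for f in fields)
    ]


# Invented sample records with separate, clearly identified fields.
records = [
    {"title": "Energy loss in thin films",
     "abstract": "We measure stopping power.", "year": 2002},
    {"title": "Crystal growth kinetics",
     "abstract": "Energy loss is discussed briefly.", "year": 2003},
    {"title": "Dietary fibre intake",
     "abstract": "A nutrition survey.", "year": 2001},
]

title_only = search_fields(records, "energy loss", ["title"])
title_or_abstract = search_fields(records, "energy loss", ["title", "abstract"])
```

Limiting to the title returns one record; widening to title or abstract returns two. That difference is the focusing power that is lost when every page is treated as undifferentiated text.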

Google has been great in handling the billions of unstructured pages of the visible Web, but it totally ignored the fine record-level architecture of this scholarly subset, giving it the same assembly line treatment as the not-so-worthy content flooding the World Wide Web.

How CrossRef Records Appear in the Traditional Google Search

A month after the launch of the service there is absolutely no information about the CrossRef collection on Google. If you think it is not needed because the results for searches on scholarly topics will be brought to the top in the regular Google version, think again.

A simple search about "energy loss" yielded 340,000 hits from the traditional Google search. I checked only the first 100; none of the first 20 hits were from the CrossRef collection. Finally, the 21st item was about a book to be published by Wiley. It had a good abstract and even a table of contents page. The 25th was the same as the 21st, except that it was from the Canadian wing of Wiley. The 29th item was a good hit from AIP's Physics Finder, which lists the most recently published articles related to electron energy loss. Hit no. 38 on the traditional Google list was the same book record as nos. 21 and 25, this time from the European wing of Wiley, so the URL was slightly different. Hit no. 67 was the same as no. 29 and easy to spot because of its pink link (indicating that I had already visited the site).

Finally, hit no. 99 was an article from New Journal of Physics with a good abstract at the IoP site with links to mouth-watering options offered by the publisher, including a free HTML version of the article as well as a PDF. If that were not enough, there is a separate linked version of the list of 16 references cited in the article. (Earlier articles also offer the list of citing articles, but for this one there were none yet.) Out of the 16 references, 15 are linked, some of them to a variety of alternative sources. There are 11 items with CrossRef links. Now you know why I really want to see this CrossRef Pilot Search implemented with software that gives it the treatment it deserves.

What's Next?

The full CrossRef Search collection could have metadata for more than 10 million scholarly papers, uniquely identified by Digital Object Identifiers (DOI), a very valuable resource for researchers, even if they may not have free access to some documents. It is worth far more than dozens of average or mediocre indexing/abstracting databases combined. There are some very well-structured text retrieval programs, such as Verity, and companies that are not (yet) spoiled rotten and are highly motivated to bring out the best of the CrossRef collection, such as HighWire Press, which has created partially free archives from the collections of many of its partners, almost all of them CrossRef members.

I can envision a really powerful tool from the metadata alone in the hands of those specialists who have worked with not only unstructured bulk data, but also bibliographic records with separate, clearly identified fields for data elements and thus potential field-specific index terms for data elements most often used in formulating queries. The same is true for the group of specialists at the University of Southampton in the UK, who have developed and deployed a suite of powerful e-print tools for creating and processing open archives, including some awesome citation analysis programs like CiteBase and ParaCite that can provide high-quality autonomous citation indexing, leading the users through related items. These programs deployed on the cited reference segments of articles in more than 11,000 journals could be wonderful, even if the full-text articles were still available only to subscribers or pay-per-view users. Ranking search results based on the number of times they were cited has been the exclusive territory of ISI.

If you add to these the fact that specialists of HighWire Press really know how to bring out the utmost synergy from a collection by integrating Vivisimo in its software, you have a perfect candidate for another CrossRef project. The search process can be made a simple and intuitive one using a template, such as Scirus has made using a modified version of the AllTheWeb software. (I criticized Scirus at its start, but it has come a very long way in both content and software features. It is another question that it still has not filtered out the junk sites full of four letter words spidered from non-scholarly sites in the past. It still has more than 200,000 pages in its "Other" Web sites sub-domain, which include the two most common four letter words in English. Luckily, you can easily exclude that domain from the search of its collection of Web pages.)

Whether or not Google improves the processing of the CrossRef collection and does something to alert its users to its availability, additional alternatives need to be explored to provide access to this exceptionally worthy, ever-increasing collection of metadata. I know how valuable such tools can be, as I have spent a lot of time trying to create my genre- and discipline-specific polysearch engines, including one focusing on energy science and technology, covering the related part of archives of some of the publishers active in the CrossRef project. Although they are very simple and do not even try to run sophisticated searches, let alone merge the results from the sources queried simultaneously, they have saved me a lot of time in my research. There are far more capable teams and tools that CrossRef could use to further its laudable mission. Just because Google is the most popular Web-wide search engine does not mean that it is the only tool in town. The tail should not wag the dog, and scientists and researchers will hopefully discover that they have better alternatives for finding the scholarly journal articles they need for their research.
