There are hundreds of search engines, but only a few focus on research publications. Some of them are of the finest quality, but their range is somewhat narrow. This is the case with the splendid CiteSeer (which I reviewed a year ago in this column; it focuses on computer and information science), or with the excellent EconPapers (also reviewed here in detail; it specializes in economics). They are relatively small: the former has 750,000 documents, the latter close to half a million. Then again, they represent a most exclusive club, targeting strictly academic papers published in journals and conference proceedings. The format of otherwise qualifying papers is another limiting factor, as only papers in PDF or PostScript format are processed. On the other hand, the majority of the source documents are freely downloadable.
Google Scholar appeared on the scene two years ago (and is still in beta mode), and has never revealed any specific, tangible information about its size, sources and content. If you believe that its software would help in disclosing some of these vital statistics, give up all hope. Google Scholar hit counts too often come from thin air and are highly inflated, and the citation counts (meant to indicate how often the source documents have been cited) are based on a mix of real and phantom citations, as I discuss here and here. Microsoft developers kept us waiting for a very long time, and finally delivered Windows Live Academic earlier in 2006. It was served up way too late, and offers way too little both in content and in search capabilities, as I have illustrated in this column.
Elsevier’s most recent press release about the latest addition to the collections of Scirus in November indicates that Scirus has information about more than 250 million Web pages. In my tests I found more than 305.6 million items. There are three major collections in Scirus: the Journal Sources, the Preferred Web and the Other Web. The first has about 24.2 million, the second about 19.2 million, and the third a whopping 262.3 million items. Of course, the Journal Sources are also Web sources, and there are tens of thousands of journal article type items in the Preferred Web section from the many institutional repositories. The proportion of journal articles in the Other Web section is not known, simply because many of the sites crawled by the robots of Scirus may act as repositories of research papers posted by faculty members and researchers. These items are not enhanced by the authors with bibliographic metadata (such as document type) in a standard way, as the items in the other two sections are.
There are 9 collections within this category, ranging from the relatively small collection of papers from 13 journals of SIAM (Society for Industrial and Applied Mathematics) published in the past 10 years, to the huge PubMed database of 16.3 million records, mostly for journal articles. In between are the 6.7 million free indexing/abstracting records of Elsevier’s ScienceDirect, covering close to 2,000 of the publisher’s journals; the more than half a million item open access full-text collection of PubMed Central; and the BioMed Central collection, one order of magnitude smaller.
I tested the coverage of these digital collections (when it was possible) by running test searches using the native search software of the sites, then using Scirus. The numbers were very close. When there are differences in the number of hits, they can be attributed to two factors.
One is that the Scirus crawlers have not yet picked up the latest postings; the other is the difference between the search programs. For example, sites that use Verity software as the native engine may pick up more items than Scirus for a query, because Verity applies a powerful stemming algorithm that goes beyond simple truncation, thus retrieving more variants. My broadest search in the journal collection of the Institute of Physics (IoP) found 227,557 records using the native search engine; through Scirus the number of hits was 222,286, a difference of only about 2.3%, most likely to be closed with the next round of crawling or the next update cycle.
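The gaps between native and Scirus hit counts can be expressed as a simple relative difference. Here is a minimal sketch using the figures cited in this review (the function name is mine, invented for illustration):

```python
def relative_diff(native_hits, scirus_hits):
    """Percentage by which the Scirus hit count falls short of the native engine's."""
    return (native_hits - scirus_hits) / native_hits * 100

# Figures cited in this review:
print(round(relative_diff(227557, 222286), 1))  # IoP journal collection -> 2.3
print(round(relative_diff(2844, 2726), 1))      # CogPrints -> 4.1
```

Differences of a few percent like these are consistent with ordinary crawling lag rather than any systematic coverage gap.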
Much more tellingly, Google Scholar picks up only 65,900 hits from the IoP site, using what is arguably the most comprehensive search for this target: the site:iop.org query. Even that hit count must be taken with a grain of salt, because Google Scholar plays fast and loose with its hit numbers, and no spot check can be made beyond the first 1,000 hits. Even Microsoft’s Windows Live Academic search engine can find more hits: 84,067. This is the exception rather than the rule, as Google Scholar almost always finds more items for the functionally and semantically identical query.
Bear in mind that in both GS and WLA the hit numbers vary even within a few minutes, although the databases were not updated. These numbers are really very liberal guesstimates, as there is no way to audit them beyond 1,000 items.
Google Scholar and Windows Live Academic may have additional hits for some IoP journal articles from other sources, such as the CSA High Technology indexing/abstracting database. Google Scholar may also have a record derived from the citations in the reference lists of other articles. These records (identified by the label CITATION), however, are very minimalistic compared with the high-quality, information-rich abstracting/indexing records in the IoP archive, which in turn also offers open access articles, such as the feature article from Physics World.
For this item Google Scholar has only a CITATION-type record. Actually, there are at least two such paltry variants (diluting the hit counts). One is reported to be cited seven times, the other has no citedness score – whatever that score is worth in Google Scholar. Windows Live Academic has no records for that paper.
This section lists 17 institutional and/or disciplinary depositories. By far the largest of them is the combination of several of the most important patent depositories, with about 18,000,000 items. Closest to it are the University of Michigan Digital Archives, with more than half a million items, and the Networked Digital Library of Theses and Dissertations, with close to 234,000 items. Other repositories are currently small, either because they serve a highly specific disciplinary area, or because the repository is fairly new and institutions are still in the process of converting and posting the publications of their faculty members.
CogPrints, an example of the former, has 2,726 items in Scirus. The native engine of the repository reports 2,844 items – again, the difference is minimal, and probably due to different update and crawling cycles.
The Curator repository of Chiba University in Japan is an example of the latter. It has close to 2,000 items as I write this, but when I e-mailed the Webmaster of Curator on another matter (discussed in the software section), I learned that within a week 10,000 items would be added (probably soon after this review is posted), quintupling its size.
This section is the dominant component of Scirus, although some of its targets would definitely qualify for inclusion in the Preferred Web Sources section as well. As mentioned earlier, it also has millions of journal article records, but most of them are not in formal repositories or depositories, and are not enhanced with a minimum set of metadata by those who post (some of) their publications.
For example, I have posted dozens of my manuscripts and reprints of journal articles (in compliance with the agreements with the publishers) on one of the servers of our Department of Information and Computer Sciences, but did not add metadata to them. (Yes, I know, I should have.) Obviously, it is not possible to find any of these when searching the author index of the Other Web Sources section of Scirus, or when searching this section’s journal name index for the title of the journal.
Google Scholar tries to figure out the authors based on their position on the Web page, typography, and patterns. It is less than successful, as you can see from the 42,400 items attributed to “authors” whose names start with an “I”, with “Introduction” being by far the most productive, at 29,000 items.
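To see how such layout-based guessing goes wrong, consider a toy sketch of a heuristic author extractor. This is my own invention for illustration, not Google Scholar’s actual algorithm; the function and the sample pages are hypothetical:

```python
import re

def guess_author(page_lines):
    """Toy heuristic: take the first 'name-like' line near the top of the
    page (one or two capitalized words, no digits), as a layout-based
    extractor might."""
    for line in page_lines[:5]:
        line = line.strip()
        if re.fullmatch(r"[A-Z][a-z]+( [A-Z][a-z]+)?", line):
            return line
    return None

# A title page where the heuristic happens to work:
good = ["Scholarly Search Engines Compared", "Peter Jacso", "University of Hawaii"]
# A page whose first section heading fools the heuristic:
bad = ["1.", "Introduction", "This paper examines ..."]

print(guess_author(good))  # Peter Jacso
print(guess_author(bad))   # Introduction
```

A capitalized section heading like “Introduction” satisfies exactly the same surface pattern as a personal name, which is how thousands of papers end up credited to that prolific “author.”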
There are still pages in this section of Scirus that have nothing to do with science. These are mostly pages by students whose only qualification for being called students is that they have space on a site with an edu domain name. I understand that these are impossible to exclude without also excluding the informative sites under the edu domain. Still, the number of these pathetic sites has significantly decreased, and, more importantly, the number of scientific collections has sharply increased since my first review.
I have had more understanding for the difficulty of separating the wheat from the chaff, the gem from the trash, irrespective of the domain, ever since I had the misfortune to endure the otherwise talented Martin Scorsese’s latest movie, The Departed, which seemed as if Scorsese had “Grand Theft Auto” envy and made a film for the PlayStation 3 to please the departed Uday Hussein in his grave. Luckily, there are now much better chances that the typical user will not see in Scirus those pages created by “The Depraved”.
The software component is a licensed version of the excellent FAST software developed in Norway. It was the engine behind the outstanding AllTheWeb search service that was acquired by Yahoo (along with AltaVista and a number of other assets). Their content was purportedly consolidated with the then-existing Yahoo content, and both pieces of software were lobotomized, as Yahoo apparently could not bear the sight of them. They were too smart for Yahoo, like Snow White was too fair for the Queen.
Luckily, the versions licensed to third parties by FAST were not affected by the acquisition. The version used by Scirus went through a series of improvements, and is now a refined tool. For searching the Journal Sources and Preferred Web Sources sections it makes excellent use of the metatags and metadata. It offers field-specific indexes for titles, authors, journal names, author affiliation, keywords, URLs, and even for ISSN (which appears in 97% of the Journal Sources section!).
There is also a field-specific index for words in the abstracts. It is not documented, even though it can make topical searches (and the resource discovery process) much more efficient and satisfying. In my tests, 55% of the records in the Journal Sources section have an abstract (remember, this is a ratio very similar to PubMed’s), and 93% in the Preferred Web section.
Once again, there are millions of items with abstracts in the Other Web Sources section as well, for your viewing pleasure, but not for searching, because the creators of the documents did not identify them with metatags. If author affiliation (which is indeed important for many scholarly searches) is offered as an index (it is available in 41.7% of the items in the Journal Sources section, and 77% of the Preferred Web Sources section), then the pull-down menu should offer the abstract option, too, even if both would exclude records from the Other Web Sources section. After all, so does the use of most of the other field-specific indexes (journal title, ISSN, even author name) for Web sources that do not use metatags to identify these data elements.
For comparison, Google Scholar has field-specific indexes for title, author name, publication name and publication date. However, the latter three are not nearly as reliable as in Scirus, even for journal articles, simply because of Google Scholar’s philosophy of “we don’t need no metadata, we don’t need no mind control” for its uber-smart software. Why? Because the Google Scholar developers build software that supposedly can figure out who the author is, what the publication title is, and when the paper was published – or can it? Not really, so don’t bet the farm or your scholarly reputation on the information provided by GS.
I have shown above an example (or rather 42,400 of them) of authors very obviously mis-identified by Google Scholar. Here is one example for the date. After the Scorsese movie I looked up hypermasculinity AND police to learn something about the subject from some credible sources. I was overjoyed to see a very current article in Google Scholar, which had already been cited 5 times. It had to be very important. Some guys have all the luck, or do they? Well, not this time: it is an article from 2001. I was so pleased with my search that I forgot for a moment about the serious innumeracy problems of Google Scholar that I demonstrated in my keynote speech at the annual conference of the Japan Society for Information Science a year ago. Some of these have since been fixed, but plenty remain.
Windows Live Academic does not show such problems, simply because it has no field-specific indexes for author, publication, author affiliation, etc. It cannot even recognize that there is an abstract right under its nose, and a very substantial, structured abstract at that. Neither can it make an index of the words in the abstract. To put it mildly, WLA is a very simple-minded search engine for scholarly materials.
Searches in Scirus can be limited by 20 major subject categories, ranging from agriculture to sociology. On average, 1.7 subject categories are assigned to a document, but not all items have subject headings. Not surprisingly, life sciences and medicine represent the two largest subject categories, while language & linguistics is the smallest.
The Scirus software is used by some of the repositories for their own collections either as a primary or secondary option, simply because the native search engine is not capable of the same functions as Scirus.
For example, the VTLS software is a very good program for an online public access catalog, but not for searching the large Networked Digital Library of Theses and Dissertations. The same is true for the repository of the Hong Kong University of Science and Technology, which uses the Scirus software for its digital archive.
The case of Chiba University shows in tangible form the advantages that the Scirus service offers for the Curator archive as an alternative to the original native search engine. During my tests I found that the native search engine finds far fewer records for my test query than I found through Scirus. I had a hunch that the native search engine does not search the full text, so I inquired at Chiba University, and their reply confirmed my assumption. The native software does not search the full text; the Scirus service does.
Results can be sorted by date and ranked by relevance, and can be e-mailed.

Scirus has come a long way since its debut. It has a rich, layered content built from a variety of primary document genres from a variety of journal archives, depositories and repositories. It is by far the most capable and reliable of the three scholarly search engines in terms of software functions. It is smaller than Google Scholar but much larger than Windows Live Academic. Mike O’Leary was right when he said earlier this year in his column in Information Today that “Compared to Scirus, Google Scholar looks ill-planned and unfinished”. I can only add that Google Scholar may have more appeal, but it is more like that of the homecoming queen, due more to looks than to brains.
Opinions expressed in this review do not necessarily reflect the opinions of Thomson Gale, its employees or affiliates. We cannot guarantee the accuracy of information contained in non-Thomson Gale sites.
— Péter Jacsó