Title: CiteSeer
Publisher: Penn State School of Information Sciences and Technology
URL: http://citeseer.ist.psu.edu/
Cost: Free
Tested: October 20-25, 2005
Too often, interesting pilot projects fade away after the initial grant money runs out. Luckily, earlier this summer the National Science Foundation awarded a $1.2 million grant to the Penn State University School of Information Sciences and Technology (IST) and the University of Kansas to enhance and improve the original CiteSeer project. The funding is much deserved in light of the direct utility and the inspirational value of CiteSeer. It is a minor sum compared to the millions that the ill-conceived and poorly implemented PubScience project received a few years ago in the form of a congressional appropriation. PubScience produced hardly anything novel and left absolutely nothing usable after its few years in existence.
Beyond the huge multidisciplinary commercial citation indexes, Web of Science and Scopus (which I have previously reviewed), there are a few other information retrieval projects which offer novel, powerful open-access services related to scientific literature in specific disciplines or groups of disciplines. These include arXiv.org, operated by Cornell University and covering astronomy, physics, mathematics, computer science and quantitative biology; the Astrophysics Data System (ADS), sponsored by NASA and run by the Harvard-Smithsonian Center for Astrophysics; and the Research Papers in Economics (RePEc) archive in its various flavors operated by various not-for-profit organizations. Faculty at Southampton University have made invaluable contributions in developing the essential software for self-archiving, collection analysis and citation analysis, not to mention their relentless and effective evangelizing to promote the self-archiving movement.
Then there is the multidisciplinary Google Scholar service. Behind its pretty (inter)face, many of its hits are misses. This isn't a surprise since its software is currently unable to carry out even the most elementary Boolean OR operation. For example, searching on the term "deception" results in 86,000 hits; for the term "deceptive", the hit count is 47,700; and for the search "deception OR deceptive", it returns 68,600 hits, fewer than for "deception" alone, although an OR search should return at least as many hits as either of its terms. The situation is worse when it comes to reporting the citedness scores. Before you include Google Scholar in your Thanksgiving blessings, do your own testing to find out what is behind the appealing façade, or check out my recent illustrated story book for some examples I did for an interview with The Scientist about counting on Google Scholar's hit and citation counts.
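The arithmetic behind that complaint can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and it treats the reported hit counts as exact, which search engines do not promise. The union of two result sets can be no smaller than the larger set and no larger than the sum of both.

```python
def or_count_is_plausible(count_a: int, count_b: int, count_a_or_b: int) -> bool:
    """Return True if an OR hit count lies within the bounds set theory allows."""
    lower = max(count_a, count_b)  # the union contains every hit of each operand
    upper = count_a + count_b      # reached only when the two sets do not overlap
    return lower <= count_a_or_b <= upper


# Hit counts quoted in this review for Google Scholar.
deception, deceptive, either = 86_000, 47_700, 68_600
print(or_count_is_plausible(deception, deceptive, either))  # prints False
```

Any value between 86,000 and 133,700 would have passed this check; 68,600 does not.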
CiteSeer (which was likely the model for Google Scholar) started out in 1997 with this good name, then switched to ResearchIndex, then switched back to the original. It currently offers its services, including sophisticated citation searching options, based on nearly one million documents.
The documents were collected and processed from the open-access Web. They are self-archived papers, in their preprint and/or reprint versions. CiteSeer stands out by offering the full text of (almost) all of the documents. The size of the database in and by itself is impressive, and the instant access to the source documents makes it immensely useful. This instant access concept has certainly limited the scope of the database, but it is already huge and has grown at an impressive rate during the past eight years.
Beyond the instant access, there was another filter applied to collecting the computer science-related papers: Only papers in PDF and PostScript formats have been collected. This also reduced the scope of the collection, but certainly increased its quality. These two formats are the most common in computer science, so this is not as restrictive as it may sound. The inclusion of papers in HTML and Word formats could have increased the size of the collection, but it would have lowered its quality by picking up from the open Web far less-relevant papers posted by undergraduate students in introductory distance education computer science courses offered by one of the online universities.
The vast majority of the source documents in CiteSeer are conference papers, considered by many computer scientists to be the most precious type of information sources, primarily by virtue of currency and accounts of novel, experimental techniques which are less favored by editors of scholarly journals.
The content of this database can be best described by following a simple search on the topic of citation indexing. The space between the query words implies exact phrase searching and finds 57 articles. The items on the result list are sorted by decreasing citedness order. Clicking on the title of the paper brings up a much enhanced bibliographic record. Beyond the traditional content of author, title, source name and other publication data, it offers many (a little too many) additional links to the full text of the document from a variety of locations and different file formats. It also offers informative excerpts from a variety of lists about the cited, citing and otherwise related papers and their citedness indicator before making the complete lists available.
This is an awesomely information-rich, but very dense, page and it needs some illustrated help information to enlighten and guide novice users. Some labels and snippets are self-explanatory, but others are enigmatic. I can't even begin to explain all of the features in this space, but the article is only a click away (especially if you choose the PDF version) and provides a good, detailed background for those who don't want to click on the links and explore them on their own unprepared.
The references cited by this article appear in their citedness order, not in the order as presented in the original. I find this very useful (and very rare) as this ranking immediately provides a hint as to which of them may be the most relevant for the topic. It would be very helpful if the citing articles were also listed in their decreasing citedness order.
You must approach this, of course, with a grain of salt as citedness frequency also depends on the age of the documents. An older document has a longer time and higher chance to acquire citations than a recent one. Then again, these talented researchers (the authors of the source article) could easily use (maybe behind the scenes) a relative citedness score for the citing and cited documents as I have described before.
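To make the age effect concrete, here is a minimal Python sketch of one possible relative citedness measure: citations per year since publication. This particular formula is an assumption chosen for illustration, not necessarily the score referred to above, and the two papers and their citation counts are invented.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    title: str
    year: int
    citations: int


def citations_per_year(paper: Paper, current_year: int) -> float:
    """Divide the raw citation count by the paper's age in years."""
    # The publication year counts as year one, so the age is never zero.
    age = max(current_year - paper.year + 1, 1)
    return paper.citations / age


def rank_by_relative_citedness(papers, current_year):
    """Sort papers by citations per year, highest first."""
    return sorted(papers, key=lambda p: citations_per_year(p, current_year), reverse=True)


# Hypothetical records, for illustration only.
papers = [Paper("Older paper", 1995, 110), Paper("Recent paper", 2003, 45)]
for p in rank_by_relative_citedness(papers, current_year=2005):
    print(p.title, citations_per_year(p, 2005))
```

With these invented numbers the older paper has more than twice the raw citations (110 against 45), yet the recent one ranks first at 15 citations per year against 10.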
The citedness scores may not be as high as in Web of Science and Scopus because CiteSeer analyzes "only" the nearly one million papers it has collected, whereas the two commercial citation indexing databases have citation enhanced records for about 37 million and 27.5 million source documents, respectively. (For Web of Science, the number refers to the 1945-2005 edition). Then again, I did not find any phantom citing papers in CiteSeer, or grossly deflated hit counts with often misidentified cited documents for topical searches as I did in Google Scholar.
Let me emphasize one quintessential advantage of CiteSeer: you receive access to the source documents (with some exceptions) with no fuss and no muss — even if your library doesn't have a link resolver — because CiteSeer has a copy of the source document. This is partially true for Google Scholar, but to a far lesser extent.
CiteSeer has ultra high-brow software, way beyond what end users will see directly. Actually, what the end users see may not be as tender an interface as you see in most Web-wide search engines, and it has no help file (which is a sin). This may make it look user-unfriendly.
What it lacks in user friendliness it makes up in smartness, especially in selecting high-quality sources, and in normalizing/standardizing the terribly inconsistent, incomplete and inaccurate citations prevalent in every scholarly field.
You can see the latter directly if you click on the Check button as you look up the page of the citing references. It displays the many variant and incomplete citation formats which CiteSeer correctly identifies as the source document. If you want to see the full list, just click. I have been using CiteSeer for a long time, but I have never seen a misidentified source document. In preparing the test for this review, however, I spotted one citing reference which was not collocated with the other ones that cited the same document. I missed the opportunity to capture it then and could not reconstruct the search.
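For readers curious about what collocating variant citations involves, the toy Python sketch below groups raw citation strings that share a publication year and most of their longer words. It is emphatically not CiteSeer's algorithm, whose details are not described here. The function names, the 0.5 threshold and the sample citations are all invented for illustration.

```python
import re


def fingerprint(citation: str):
    """Return (year, set of lowercase words with 4+ letters) for a raw citation."""
    year_match = re.search(r"\b(19|20)\d{2}\b", citation)
    year = year_match.group(0) if year_match else None
    words = set(re.findall(r"[a-z]{4,}", citation.lower()))
    return year, words


def same_work(a: str, b: str, threshold: float = 0.5) -> bool:
    """Guess whether two citation strings refer to the same document."""
    year_a, words_a = fingerprint(a)
    year_b, words_b = fingerprint(b)
    if year_a != year_b or not words_a or not words_b:
        return False
    # Jaccard overlap of the word sets: shared words divided by all words.
    return len(words_a & words_b) / len(words_a | words_b) >= threshold


def group_citations(citations):
    """Greedily place each citation in the first group whose first member matches."""
    groups = []
    for citation in citations:
        for group in groups:
            if same_work(citation, group[0]):
                group.append(citation)
                break
        else:
            groups.append([citation])
    return groups


# Made-up citation variants, not real references.
variants = [
    "Smith, J. and Doe, A. (1999). Parsing noisy references. Proc. Example Conf., 12-20.",
    "J. Smith, A. Doe. Parsing noisy references, 1999.",
    "Doe, A. (2001). Ranking documents by links. Example Journal 4(2).",
]
groups = group_citations(variants)
print(len(groups))  # prints 2: the first two variants are grouped together
```

A heuristic this crude breaks down quickly on real data (misspelled names, missing years, abbreviated titles), which gives some idea of how much work lies behind the results shown by the Check button.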
CiteSeer has perfected (within reasonable limits) the process of recognizing and consolidating matching records for incomplete and/or partially erroneous citations. It can also locate the references in the full text (not merely in the footnotes) for many of the documents, in about 60-65% of the cases in my test. You can also see this highly sophisticated feature if you click on the Context button of the record of the cited document. The reference to the cited document appears in boldface with parts of the paragraph of the preceding and following text. If the document is referred to more than once in the citing document, it is repeated in the Context (also known as Details) format. There are many other gems in CiteSeer that are worth polishing, so the latest news about the NSF funding is especially encouraging.
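The idea behind the Context display can be illustrated with a short Python sketch. It assumes that the citing paper marks its references with bracketed numbers such as [7], which is only one of several conventions, and it says nothing about how CiteSeer itself does the job. The function name, the window size and the sample text are hypothetical.

```python
import re


def citation_contexts(text: str, marker: str, window: int = 60):
    """Return one snippet per occurrence of marker, with window characters on each side."""
    snippets = []
    for match in re.finditer(re.escape(marker), text):
        start = max(match.start() - window, 0)
        end = min(match.end() + window, len(text))
        # Surround the marker with asterisks to stand in for boldface.
        snippets.append(text[start:match.start()] + "**" + marker + "**" + text[match.end():end])
    return snippets


# Invented full text of a citing paper.
full_text = (
    "Earlier systems relied on manual indexing. Autonomous indexing [7] removes "
    "that step. We compare our results with [3] and again with [7] in Section 4."
)
for snippet in citation_contexts(full_text, "[7]", window=30):
    print(snippet)
```

Because the sample text refers to [7] twice, two snippets come back, mirroring the way a repeatedly cited document is repeated in the Context format.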
The original project was developed at the NEC Research Institute, which deserves credit for it. At that time, all three of its researchers worked for NEC. Since then, Steve Lawrence has left for Google and Lee Giles has gone to the School of IST at Penn State University; I could not trace Bollacker. I believe the School of IST is the youngest of the library and information science and technology programs in the country. But young does not mean immature. Actually, at least three of the most mature of the researchers specializing in the analysis of Web-wide search engines now work for IST. Beyond Giles, Amanda Spink and Jim Jansen are members of the IST School. They have published (along with the outstanding, long-time Rutgers professor Tefko Saracevic) the most insightful, fact-laden articles about users' search strategies and tactics based on several exceptionally large projects in terms of user population.
IST is one of the recipients of the grant. In light of past performance, that group is a guarantee that the funding for the project known as Next Generation CiteSeer will be well used.
This is why I regret even more that only a relatively small amount was awarded for this project. CiteSeer showed a working example of the revolutionary new method of autonomous citation indexing, which is done without human indexing, does not require the enormously expensive journal subscription and processing investments, and can be ported to other disciplines. It needs powerful brains and time to do the demanding system analysis, programming, implementation and monitoring tasks. I am sure that the substantial research and development will create a superb tool for the next generation of researchers beyond computer science, and complement the commercial indexing services that have much stronger journal coverage for much longer periods of time and far wider scope.
As for Google Scholar, I hope it is not becoming a Jack of all trades, master of none, and that it will become as good at handling the finely structured data served to it by many publishers, and at understanding the essence and nuances of citation indexing, as the generic Google software has been at handling the gigantic, unstructured hodge-podge of the World Wide Web. This may require the 20% free time of Steve Lawrence, one of the developers of CiteSeer who is now an employee of Google, Inc.