Péter's Digital Reference Shelf

September 2008

Title: OAIster
Publisher: University of Michigan
Cost: free
URL: http://www.oaister.org
Reviewed: September–October, 2008

Web users who are not affiliated with a higher education institution, and who are not members of an academic, special, public or school library that provides digital access to tens of millions of primary documents in all disciplines, can still enjoy many of the benefits of the educational and research domains of the digital world for free.

They have been getting the best guides and directories of the highest quality open access repositories for many years. The pioneer and most outstanding service in this genre has been the Registry of Open Access Repositories (ROAR), developed by Tim Brody and his colleagues at Southampton University, which has been one of the key institutions in supporting the open access movement.

ROAR has not merely grown from a few hundred items to 2,000 (as of October 2008); its functions and information content have also been constantly enhanced and enriched, making it a visually appealing and highly informative reference source about open access repositories around the world. The latest functional enhancement was the ability to search the content of the collections covered by ROAR.

The Directory of Open Access Repositories (OpenDOAR), developed by the University of Nottingham in the UK and Lund University in Sweden, followed the model of ROAR, offering similar information about 1,120 open access repositories. Recently it widened the door to open access documents by making the full-text content of many repositories searchable through a trial service using Google Custom Search Engine, just as ROAR did. I will review this service if and when it becomes a regular service.

The Directory of Open Access Journals (DOAJ) is a closely related project developed by Lund University, focusing on open access journals. DOAJ currently covers more than 3,700 journals, and for more than a third of them the full text is also searchable, so it is much more than a directory.

The picture would not be complete without mentioning the excellent idea of Stuart Lewis at Aberystwyth University. He mashed up the information from ROAR and OpenDOAR with the Google Earth map, providing a dynamic atlas of the distribution of 1,100 open access repositories holding 10.5 million digital documents, and of the software platforms hosting those repositories. It is begging for a cartogram that would show the size of the countries on the world map in proportion to the number or, even better, the size of their repositories.

THE CONTENT

OAIster provides information about the repositories of 1,035 contributors from all parts of the world. Some of the repositories are really tiny, such as the British Library Research Archive with about 50 items from 2000 to 2008, so it is more important to know that the total number of documents in the repositories registered by OAIster—as of the end of October, 2008—is 18,362,804, and growing, literally, every minute. As I will explain later, the number of documents does not mean unique documents, as there are duplicates, triplicates and quadruplicates.

It is somewhat ironic that the PR information on the homepage of the University of Michigan Digital Library Production Services (DLPS), the unit in charge of producing and maintaining OAIster, does not keep up with the impressive growth of OAIster, indicating that it “provides access to over 12 million resources from over 800 different repositories.” OAIster really deserves current PR information and statistics on the pages that illustrate the growth of the repositories. The number of records is charted only until the end of 2006, so the impressive rate of growth in the past two years is not shown.

As for the document types, they include digital text, pictures, moving images, audio files and data sets. Text documents represent about three quarters of the documents, followed by images at nearly 20%. Although video and audio files represent a tiny percentage of the collection, their absolute numbers, at 30,000 and 20,000 respectively, are impressive. The fifth format, the data set, is little known but can be very useful, as this category includes a wide variety of documents that are rarely available in other collections, such as annotated bibliographies, time series of indicators and details of test measurements that cannot be included in the print versions of articles or conference papers.

As is well explained in the Collection Development Policy segment of the homepage, it was a great dilemma whether to include only universally free (open access for anyone) repositories or to extend coverage to sources that are openly accessible to students of the University of Michigan. The latter option was chosen.

OAIster is described as a union catalog of digital resources, but it is much more than that. It is more like an indexing/abstracting database of open access documents, often with a link to the open access version of the document. It also provides succinct but informative (and often mouthwatering) information about the repositories, which is equivalent to a book catalog including information about the publishers, not only about the items. The information about the repositories covered by OAIster, however, may not always be current.

This is understandable, as the blurbs are likely to be updated manually. However, at least for the top resources, it would be essential, and more inviting to users, to state the current size of the repository. For example, the HighWire Press collection, which is the largest contributor to OAIster and has been my favorite partially open access collection, had nearly 5 million articles from many of the best scholarly journals at the end of October 2008. More importantly, about 40% of them are free for anyone, although most become free only after a moratorium of 6 to 12 months or longer, i.e. they are delayed open access documents. It is like the nucleation time for pearl oysters. The delay is seldom relevant because by the time users find the oysters, the shells are already open for the majority of the delayed open access documents on the subject searched, and thus the pearls are accessible in HighWire Press.

The record in OAIster about HighWire Press indicates that there are 1,155,246 records from that collection, and a test search confirmed that. This might suggest that the HighWire Press collection has not been harvested recently. However, OAIster does have records for recently published papers from the HighWire Press collection, so stale harvesting is not the reason why far fewer documents are reported than are available.

In assessing the size of the collection, it must be realized that for many documents there are multiple records. This happens because the same paper may have been deposited in several preprint and reprint archives covered by OAIster. For example, a paper about academic rankings was retrieved both from the RePEc collection and from the author’s institutional repository. In some cases, a paper may be present in four or five repositories (which is good because in a distributed system it widens access for users who use different repositories on the same subject), but in OAIster the records will also appear four or five times. They could be collapsed under one entry, but that is easier said than done correctly.
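To make the idea of collapsing such records concrete, here is a minimal sketch in Python. It is not how OAIster works; it assumes that records carry `title` and `creator` fields and that two records describe the same paper when both fields agree after lowercasing and stripping punctuation. The sample records, repository names and URLs are invented for illustration.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, drop punctuation and squeeze whitespace so that
    trivially different strings compare equal."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def collapse_duplicates(records):
    """Group records that share a normalized (title, creator) key.

    Returns a list of groups; each group is a list of records that
    would be shown under one entry.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec["title"]), normalize(rec["creator"]))
        groups[key].append(rec)
    return list(groups.values())


# Invented sample data: the same paper harvested from two repositories.
records = [
    {"title": "Academic Rankings: A Survey", "creator": "Doe, J.",
     "repository": "Repository A", "url": "http://example.org/a/123"},
    {"title": "Academic rankings - a survey", "creator": "Doe, J",
     "repository": "Repository B", "url": "http://example.org/b/paper.pdf"},
    {"title": "Tsunami Early Warning", "creator": "Roe, R.",
     "repository": "Repository A", "url": "http://example.org/a/456"},
]

groups = collapse_duplicates(records)
for group in groups:
    sources = ", ".join(rec["repository"] for rec in group)
    print(f"{group[0]['title']} ({len(group)} record(s): {sources})")
```

The hard part, and the reason it is easier said than done, is the matching key: variant spellings of author names, translated titles and different versions of the same paper all defeat a simple normalization like this one.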

Other duplicates are more puzzling at first, but a closer look at the records may suggest a technical reason for them. This is the case when a duplicate pair is retrieved from the same collection, as shown in the screen shot. It turns out that the last element of the URL is different, probably after some reorganization of the directory path and the naming convention. For the OAIster crawler, understandably, these are not the same documents, even if to the naked eye they obviously are. The content of the records depends on the standards used in the target collections, and OAIster nicely retains the bibliographic details and layouts. Some records have substantial abstracts, summaries or notes (which union catalogs do not have either); others are only bare bibliographic records, just as the depositor created them.

Most importantly, in spite of its limitations, OAIster brings you information about more genuinely scholarly materials (and often the free full text) than Google and Google Scholar, which still harvest far fewer OAI sites than OAIster, as reported in the July/August 2008 issue of D-Lib Magazine by Hagedorn and Santelli in the paper “Google Still Not Indexing Hidden Web URLs”.

My test searches in OAIster regularly provided information about high-quality primary documents, including not only published journal articles but also conference papers, book chapters, books, dissertations and government reports, all in one fell swoop. More importantly, many of the OAIster records also had a link to the open access full-text documents.

For example, the search for the exact phrase “digital libraries” yielded 4,665 records from more than 250 repositories hosted in a range of institutions from South Africa to Australia. The query “tsunami AND warning” (as two words in AND relation) yielded 135 hits, again from a wide variety of 50 repositories that most of us would not even have thought of searching.

THE SOFTWARE

The software allows browsing the list of repositories, which is a good idea, as their great number and variety may whet the appetite of users. However, the names of the repositories are not always descriptive enough to give a good clue. It would be a good idea to offer this browsing also by country and document type, in case users are interested in identifying, say, repositories in Australia and New Zealand with significant image collections. True, after a few searches users would develop some knowledge about the most pertinent repositories because, as shown in the earlier screen shot, next to the result list there is a cluster list of the repositories along with the number of hits found for the query.

There is only one search mode, and considering the set of searchable data elements, this is reasonable. That said, searching by language and publication year range would be a useful option, perhaps as filters, as the resource types are. The search template is a regular one, offering a pull-down list of the index fields that can be searched. These include title, creator/author, subject, language and entire record. As for the latter, entire record means the entire bibliographic record, not the full text of the document. The search can be limited to five resource types, but be careful with this option, as my tests indicated that about a quarter of the records may not have a resource type designation. The otherwise informative, clean and well-structured help page should include a warning about this.

Sorting the results is possible by date (in either ascending or descending order), title, author/creator, frequency of hits and weighted frequency of hits. The latter two are based on the frequency of the search terms in the OAIster records (not in the source documents). The weighted hit frequency takes this sort algorithm further by giving more weight to records where the search term occurs in certain fields. Common sense suggests that these are the title and abstract (note) fields. In either case, it is a more modest and sincere name for what most other software would pompously refer to as relevance ranking, even when it does the same. (In fairness, some software does apply more complex ranking/sorting algorithms, considering other traits, such as the overall frequency of the search terms in the database.)
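The difference between the two sort options can be sketched in a few lines of Python. This is an illustration only, not OAIster's algorithm: the field names, the sample records and the weights (3 for the title, 2 for the note, 1 for everything else) are my assumptions, chosen to follow the common-sense guess above about which fields count for more.

```python
import re

# Assumed weights; the actual fields and weights used by OAIster are not documented here.
FIELD_WEIGHTS = {"title": 3, "note": 2}
DEFAULT_WEIGHT = 1


def count_hits(text, term):
    """Count whole-word, case-insensitive occurrences of term in text."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


def hit_frequency(record, term):
    """Plain frequency: every occurrence counts once, in any field."""
    return sum(count_hits(value, term) for value in record.values())


def weighted_hit_frequency(record, term):
    """Weighted frequency: occurrences in some fields count for more."""
    return sum(
        FIELD_WEIGHTS.get(field, DEFAULT_WEIGHT) * count_hits(value, term)
        for field, value in record.items()
    )


# Invented sample records.
records = [
    {"id": "a", "title": "Coastal hazards",
     "note": "Regional tsunami data.", "subject": "tsunami; tsunami hazards"},
    {"id": "b", "title": "Tsunami warning systems",
     "note": "A survey of tsunami alerts.", "subject": "oceanography"},
]

by_frequency = sorted(records, key=lambda r: hit_frequency(r, "tsunami"), reverse=True)
by_weighted = sorted(records, key=lambda r: weighted_hit_frequency(r, "tsunami"), reverse=True)

print([r["id"] for r in by_frequency])  # record "a" first: 3 hits against 2
print([r["id"] for r in by_weighted])   # record "b" first: weighted 5 against 4
```

Record "a" mentions the term more often, but only in the subject and note fields; record "b" has it in the title, so the weighted sort moves it to the top.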

By default, the software interprets multiple words as a search for an exact phrase. I strongly feel that in full-text databases this is the right choice, but not in abstracting/indexing databases, catalogs and directories. When users enter the search string “tsunami warning” in OAIster, they hardly mean that these two words must occur next to each other in the given order to qualify as a hit.

The default interpretation of a space as a phrase operator in OAIster means that items with the words “warning for tsunami,” “tsunami early warning system,” etc. would not be retrieved. Actually, in my test the search “tsunami* warning*” retrieved only 63 records, while the query “tsunami* AND warning*” retrieved 135 records. Scanning the 72 other records, which were not retrieved because of the default phrase interpretation of the space between two words, showed that they had practically the same level of topical relevance as the 63 hits of the phrase search. This is important because most search engines treat a space between words in the query as a Boolean AND, so users would not suspect that OAIster behaves differently. True, a Boolean AND interpretation of “Big Island, New England” would not be appropriate, but users are more likely to know, or to realize, that they must enclose such terms in quotation marks, exactly as in Google, Yahoo, etc., to reduce false hits than to use AND to reduce missed hits.
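The two interpretations of the space can be contrasted with a small Python sketch. It is a simplified illustration under my own assumptions: it matches whole words in a handful of invented titles and ignores the truncation (`*`) used in the test queries above.

```python
import re


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def matches_and(text, words):
    """Boolean AND: every query word occurs somewhere in the text."""
    present = set(tokens(text))
    return all(word.lower() in present for word in words)


def matches_phrase(text, words):
    """Exact phrase: the query words occur adjacent and in the given order."""
    toks = tokens(text)
    target = [word.lower() for word in words]
    n = len(target)
    return any(toks[i:i + n] == target for i in range(len(toks) - n + 1))


# Invented sample titles.
titles = [
    "Tsunami warning centers",
    "Warning for tsunami",
    "Tsunami early warning system",
    "Earthquake catalog",
]
query = ["tsunami", "warning"]

phrase_hits = [t for t in titles if matches_phrase(t, query)]
and_hits = [t for t in titles if matches_and(t, query)]
missed = [t for t in and_hits if t not in phrase_hits]

print(phrase_hits)  # only the title with the two words side by side
print(missed)       # relevant titles lost to the phrase interpretation
```

Every phrase hit is also an AND hit, so the phrase default can only shrink the result set; in the test above it shrank it from 135 to 63, leaving the 72 records (135 − 63) that users would never see.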

Results are presented in one format that lists all the data elements. Offering a short format with title, publication year and abstract would be helpful, as the more compact format would encourage users to scroll down the result list. The number of hits per page is automatically limited to 10 items, although users who look at the query URL in the address line can increase the number of items per page by changing the size=10 parameter to, say, 20 or 50. Offering options to display 20 or 50 records per page, combined with a compact record format, would encourage users to go beyond the first 10 items and appreciate the records ranked lower than 10. This is particularly important because non-English language documents with abstracts in both the original language and in English may get ranked higher than documents in English simply because certain international words and acronyms, such as tsunami, software, mobile, digital, AIDS, NATO, HCI, ACM, etc., may appear twice as often as in the records for English language source documents. Then again, relevant documents in German, Spanish, Portuguese or French may float up to the level that their content justifies.
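For readers who would rather not edit the address line by hand, the same change can be sketched in Python with the standard library. Only the `size` parameter name comes from the observation above; the host, path and the other parameter in the example URL are placeholders I made up, not OAIster's actual query syntax.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_page_size(url, size):
    """Return url with its size query parameter set to the given value.

    Other query parameters are kept in their original order; size is
    added at the end if it was not present.
    """
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "size"
    ]
    params.append(("size", str(size)))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Placeholder URL; only the size parameter is taken from the review.
example = "http://example.org/search?q1=tsunami&size=10"
print(set_page_size(example, 50))
```

Whether the server honors values other than 10, 20 or 50 is something I have not tested.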

In spite of some deficiencies, OAIster is a useful one-stop ready-reference source for discovering open access scholarly documents in all disciplines. OAIster operates with a much better and much more reliable set of metadata elements than Google Scholar could come up with, even though the largest science publishers presented its developers with very good metadata (in addition to the full text) for tens of millions of documents. The best move for OAIster would be to add full-text searching of the source documents while maintaining and expanding its huge collection of metadata records.
