Title: Scopus (2008 Winter Release)
Publisher: Elsevier
URL: http://www.scopus.com
Cost: to be negotiated
Tested: November 3-15, 2007
It was 18 months ago that I wrote about Scopus. In the meantime, the system was periodically enhanced by smart new software features, additional journals and conference proceedings. It is quite telling about the pace of development that the previous release was in the middle of August, and it would have deserved a review for its novel options. The same is true for the previous two releases.
The fierce competition with the Web of Science (WoS) of Thomson Scientific, which I reviewed earlier this year and will cover again early next year, triggers many innovations at both companies, to the benefit of the users. These are the two databases that can be used not only for citation-based searching but also for citation analysis in almost any discipline of the sciences and social sciences, although with some reservations.
Sure there is Google Scholar, but not for the bibliometric and scientometric purposes that pop up left and right as if taking fool's gold to the bank.
Undoubtedly, Google Scholar puts the most competitive pressure on both Scopus and WoS because it is free and definitely multidisciplinary. However, its abysmal software lacks essential features for scholarly research, such as sorting, item numbering, set creation, and truncation. The software features it does offer malfunction brutally. The simplest OR operation, for example, reduces the (purportedly) 159,000 item set for the term dumb to (purportedly) 92,000 items when the query is broadened to dumb OR dumber. When you limit the query by year range (using 1458-2008 to accommodate Gutenberg's digitized private notes and the scholarly papers in press, and to check how many records in this latter set have no publication year data), the results drop to 38,200 items.
In fairness, Google Scholar advises you about papers to be published in 2010 and in the following decade and, to show its prowess, it lists papers that have cited these documents, but they are not coming soon to you. Why? Because they were published many years ago, but the dumb software that did the crawling, data gathering and indexing did not recognize the metadata for the creation date. It used as publication year the number it fancied the most, irrespective of whether it is a page number, part of an ISSN, or a phone number. This does not affect all records, only a few million. It was just dumb, not lazy, because in the case shown in the screenshot above it added 20 to the issue number 10, hence the 2010 publication year (as I figured out after studying a few hundred source code pages to decipher the incomprehensible patterns Google Scholar uses to determine the publication year).
The hit counts and citation counts are often created by similarly dumb logic (using the root of "logiciel", the word the French prefer for what is called software from English to Hungarian to Swahili), and all is topped off by the loosest citation matching software I have ever seen.
All these matter only in cases where these hit counts and citation counts are taken seriously and compared to those of Scopus and/or WoS. As a peer reviewer of manuscripts for a few scholarly journals, just in the past 10 months I have seen several papers which relied on massively inflated hit counts and citation counts of Google Scholar in comparative evaluations. Often, there are phantom links, and tracing them leads to a wild goose chase. The inflated citation counts please most authors and the users. I see the danger of this when it comes to promotion, tenure, grant applications and the other forms of recognition for and measurement of research and publication productivity and impact.
PsycINFO, with close to 2.5 million records (as of mid-November), is a strong contender, but only in psychology and to a much lesser extent in psychiatry, and for a limited base period, as records have been extended with cited references systematically only from 2001. It is notable, however, that as of October, 2007 there are 663,000 records enhanced by a total of about 26 million cited references in PsycINFO. Scopus has about 563,000 records for psychology, and, according to my test, about 291,000 of those are enhanced by cited references, in my estimate to the tune of close to 6 million references. The reason for the remarkable per record difference is that in PsycINFO there are about 61,500 book records enhanced by cited references (out of 265,000 book records). Books often have thousands of cited references, yielding an average of 40 per PsycINFO item, whereas in Scopus, there are no records for psychology books, let alone with cited references.
About 80% of the psychology items are articles, which in turn have about 20 cited references on the average. It is another question that in many implementations of PsycINFO, the cited references are barely used by the host software (beyond displaying them). Only CSA, Ebsco and Ovid offer good search options based on the cited references, and only the former two display automatically the number of times the paper was cited by other documents in PsycINFO.
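The per-record difference can be checked with simple division. The following is a minimal sketch using only the rounded figures quoted above; the function name is made up for illustration.

```python
def avg_refs_per_record(total_refs: float, enhanced_records: float) -> float:
    """Average number of cited references per record enhanced by references."""
    return total_refs / enhanced_records


# Rounded figures quoted in the text (estimates, not exact counts).
psycinfo_avg = avg_refs_per_record(26_000_000, 663_000)
scopus_psych_avg = avg_refs_per_record(6_000_000, 291_000)

print(f"PsycINFO: {psycinfo_avg:.1f} references per enhanced record")
print(f"Scopus (psychology): {scopus_psych_avg:.1f} references per enhanced record")
```

The PsycINFO average comes out just under 40, and the Scopus average close to the 20 references typical of a journal article.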
CINAHL, which has about 1.7 million records, has similar limitations. It covers only nursing and allied health subject areas. It has 302,000 records enhanced by cited references. Scopus has 322,000 records for nursing related journal articles, and about 128,000 have been enhanced by cited references. According to a recent analysis by Taiwanese researchers, the average number of cited references in nursing journals is between 27 and 30, so the total number of cited references in CINAHL is, in my estimate, between 8 and 9 million, and in Scopus between 3.5 and 3.8 million.
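The estimate is a straightforward multiplication of the number of enhanced records by the low and high ends of the average reference count. A minimal sketch, with a function name made up for illustration:

```python
from typing import Tuple


def estimate_total_refs(enhanced_records: int, avg_low: int, avg_high: int) -> Tuple[int, int]:
    """Return the (low, high) estimate of the total number of cited references."""
    return enhanced_records * avg_low, enhanced_records * avg_high


# Record counts and the 27-30 average are the figures quoted in the text.
cinahl_low, cinahl_high = estimate_total_refs(302_000, 27, 30)
scopus_low, scopus_high = estimate_total_refs(128_000, 27, 30)

print(f"CINAHL: {cinahl_low:,} to {cinahl_high:,} cited references")
print(f"Scopus (nursing): {scopus_low:,} to {scopus_high:,} cited references")
```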
There are some very good discipline-specific digital archives, especially in astrophysics, physics, economics, and computer science, but by definition their scope is limited, and so is the time span for which records have been enhanced and unambiguously tagged for citation-based searching, citation analysis, and scientometric projects.
Several scholarly publishers also offer citation-based searching and citation counts for the articles—within the domain of the journals published by them, and in rare cases (such as the one of Annual Reviews, Inc.) by other scholarly publishers who are members of CrossRef. Similarly, Highwire Press leads you to any of the journals—independent of their publishers—which are powered by its splendid software.
The Scopus system has always been much more than just the Scopus master database itself. The system has had from the beginning much (but not all) of the content of the free Scirus service, and a group of patent databases. The former is a formidable component of the scientific part of the Web (as defined by Elsevier), with 365 million records of and links to Web sites, including those of the Old Medline database, and the content of nearly 100 open access preprint and reprint archives ranging from ArXiv, to Cogprints to RePEc with many institutional repositories in between. The patent component itself holds information about 22 million patents, including the content of the US Patent and Trademark Office, the almost equally large content of the Japan Patent Office, and the much smaller holdings of the UK and European patent offices and that of the World Intellectual Property Organization (which—combined— make up 20% of the patent subset).
I don't deal with these components here, but focus on the Scopus master database, and a brand new database (labeled on the result tab simply as "More"...), and, indeed it is much more. True, these are simple records for individual references which may miss one or more of the bibliographic data elements, or may get one or more of them wrong, like the volume number or the starting page number. I refer to it as the Orphan References database, because they cannot join the family of other references that are attributed to master records. The entire Scopus system now has about 470 million records.
Elsevier offers an informative Facts and Figures Web page about the size and content of the service here. Instead of parroting their data, I tell you what I found, as my numbers and their numbers do not match exactly. Sometimes mine are lower, sometimes theirs, but the differences are not significant except for the number of records enhanced with cited references.
As this aspect is very important for many users, I will elaborate on this issue. After all, these days indexing and abstracting records are freely available in the range of hundreds of millions. It is the set of records enhanced with cited references that makes Scopus and Web of Science so precious (and expensive). Their creation requires a huge investment in terms of subscription costs, wages to human indexers, etc. to the tune of millions per year, because Elsevier and Thomson were not offered, as Google, Inc. was, unfettered access to the huge full text digital warehouses of hundreds of scholarly publishers.
It is the best of both worlds if a citation database offers subject indexing terms and abstracts. But just as with the most expensive wedding rings, it is the size, clarity, shape and color of the diamond (not the setting for it) that costs the most, and deserves and requires close attention. I don't wax philosophical about this here, as a paper on it was just published in my Savvy Searching column in Online Information Review (Emerald).
Scopus itself has about 32 million detailed bibliographic records of the indexing/abstracting type. In my test, a little more than 12 million records have cited references in Scopus, not 15 million. Keywords are available for 27.2 million records, and abstracts for 22.5 million items according to my tests.
But in this latest version released in early November, there is a useful additional component of Scopus, that of the Orphan References. This component has about 51.5 million records, which are references extracted from the source documents processed by Scopus, but which could not be associated with a master record in the full bibliographic database of 32 million items.
I discuss some of its details in the section about the software, as it is the integration and automation of this function in the main search process, and the bringing of extra results to the user directly with extra software options, that makes this component so useful. It is to be noted that WoS has had for many years a similar index available through the cited reference search module, but users, unfortunately, rarely use that powerful feature because it requires a separate step. In addition, WoS shows the title of the cited paper if a master record was created for it by Thomson.
Journals form the core of the database. 90% of the records are for scholarly journal content, 7% for conference proceedings, 2% for trade publications, and about 155,000 records for books and technical reports.
Within these broad source type categories journal articles (67.5%), conference papers (9.3%), review articles (5.4%), letters to the editor (2%), short notes (1.6%), and editorials (1.2%) make up the bulk of the database. The rest of the records are for short surveys, business articles, errata, etc. It is unusual but much appreciated that Scopus reports the number of records (about 11% of the records) which have not been assigned a document type, so take it as a warning for using document types as filtering criteria. The journal base is very wide, covering nearly 15,000 serial sources. But it is to be noted that the retrospective coverage varies widely, as is the case in many other databases. The Source List provides a fair look at the coverage of these serials. I tested the coverage of library and information science sources, not only browsing the list but searching by their variant titles. I applaud Scopus that it filled some of the most bothersome gaps I mentioned in my earlier review, at least back to 1996-97, such as the Bulletin of the American Society for Information Science and Technology and its predecessor. The coverage of the Proceedings of the same society also improved (but there is a gap for 2005). The coverage of the journal of the society also improved, at least back to 1999. The coverage of the earlier years is still too selective for this top ranked journal...
However, I remain very unhappy about the pre-1996 coverage of the Annual Review of Information Science and Technology, with absolutely no records for 1988-1995, 1983-1986, 1975-1980, and the pre-1974 era. This is a quintessential publication of our profession (at an incredibly good price), and what the best experts wrote in the volumes absent from Scopus is still used in teaching and for research. Ignoring so many volumes of this high ranking publication is not in good taste, and is not a smart idea for information professionals.
Although I definitely don't restrict myself to reading (and sometimes writing for) scholarly journals, the sorry coverage of the Journal of Scholarly Publishing and its predecessor, Scholarly Publishing, pains me. It is one thing that letters to the editor and other editorial materials are not covered by Scopus in some journals (although they are in the Journal of Scholarly Publishing); it is another thing to miss important articles, and even whole years, such as 2002-2004, and everything except for one item before 1996 when the journal is formally covered. Scopus has records for only about 10% of this journal. Of course, quality not quantity is the primary issue.
How was it possible to skip the paper about Trends in scientific scholarly journal publishing in the United States by no less than Carol Tenopir and Donald King, who know the most about this issue, or the ones about Ranking Journals, The Electronic Journal and Its Relatives, and Preserving the Integrity of Peer Review? These are everyday issues not only in the ivory towers of academia, but in the board rooms of the largest scholarly publishers, including the largest of them all, Elsevier.
Although 82% of the records in Scopus are for English language documents (which I agree with), there are more than 1.5 million records for German, around 800,000 each for French and Russian, somewhat fewer each for Japanese and Chinese, and around 300,000 each for Spanish and Italian language papers, not to mention the very high numbers for Portuguese, Dutch, Czech, Danish, Swedish and Hungarian documents, each between 40,000 and 100,000 items. The number of Polish documents is exceptionally high at 183,000. Obviously, the diversity of the languages also reflects the geographic diversity of the sources. Sometimes the geographic diversity is not obvious, as in the case of India, where many of the scholarly publications are in English.
Having information about articles in press is an excellent feature. Although there are currently fewer than 1,000 records for such papers, I am sure that it will be much better in a few months, as the manuscripts for the 2,000 journals of Elsevier will be rolled into Scopus several months before they get published in print. Obviously, Elsevier has a huge advantage in this regard as by far the largest scholarly publisher.
There have been several new features introduced since my last review. I will focus here only on the most innovative major features: the basic design concept, the cited reference list with citedness scores (which in the new release can be sorted by citedness score), and the handling of the 51 million record Orphan Citations subset.
There are other smart novelties, such as the automatic inclusion in the result list of records from CSA databases which match the query. This applies only to those users who are subscribers of the qualifying CSA databases, primarily those in the social sciences and humanities, which is by far the smallest subject area within Scopus. Imagine when it will be expanded to, say, Sociological Abstracts, and the ProQuest databases (after the merger with CSA), such as ABI/INFORM, Dissertation Abstracts, Criminal Justice Abstracts, and the ProQuest Research Library.
I can't cover here the Author identification module, which is too important and complex a topic to squeeze in (but I discuss it in the last 2007 issue of Online Information Review). I do explain why I disagree so strongly with one of the two algorithms for determining the ever more popular h-index.
I fell for the software at first sight, when I saw it showing in its debut version the grid-format presentation of the result list, offering the option with a single click to sort by author, source, publication year, and citation counts, but not by title (it still does not offer that option, but I hope it will, as it is pretty important for scientometric research purposes).
The clusters by sources, authors, publication years, document types, and major subject categories are also in a grid format, and offer the flexibility to see cluster elements with fewer occurrences. Such grids provide the kind of "at-a-glance" overview which helps to get a feel for the result: to see who are the most productive authors on a topic, which are the most productive journals, to what extent the topic was discussed year by year, and how important the non-journal publications are for the topic. Then, clicking on the Times Cited button shows which are the most cited papers. All these, of course, are within the domain of the Scopus database.
It could be only better if the seasonally adjusted citedness score would be also squeezed in the grid, to see how many citations were received by the papers, to provide a more level field for the fairly current papers.
The overall design of the debut version was so smart and pretty that it did not need much improvement, but there were some. For example, now you can remove cluster elements from the top grid, and/or replace one or two (such as the source type and the broad subject categories) with specific document type categories and keywords or language names, respectively.
The major subject categories have been refined which is a good move. It could be better if the user had the option to choose the display of the one level deeper subcategories. For example, within the Social Sciences categories, there is a subcategory for library and information science. I would like to see that kind of depth as an option. I also would like to use that subclass term or code as a search element in the search template.
Although this is not a new feature, it has a nifty new option that I have been bugging the Scopus developers for since I first met them at a conference in Warwick 18 months ago.
Showing the citedness score of the items in the list of cited references has been a wonderful idea (also implemented by CSA and Ebsco on some of their databases). It can immediately indicate which are the most cited (and thus presumably most important) papers among those cited by the author.
Those items that have the small button indicate that there is a master record in Scopus for that document. (The button does not indicate, in spite of its name, that the record has an abstract and references. With all my adoration for the design, it would be better to have a small logo for Abstract and another for Refs. These would be greyed out if there is no abstract or there are no references.) By clicking on the "cited by" link, Scopus will show the records of documents that cite that item.
However, there are cited references for which there is no master record, and thus Scopus cannot hang its "citation hat" onto one, or increase its citation score. From day one Scopus created—in the background—a shadow record, to accumulate the citation count for that work, and to show it when it appears in a cited reference list.
That's the reason that even when a cited document has no master record in Scopus, it can display its citation score if it was already cited. You can see this happen with the 2nd cited reference in the list. As opposed to the 1st and 4th references, it does not have a master record in Scopus because it is an encyclopedia, and encyclopedias are not source documents for Scopus. Still, it can show that the encyclopedia has already been cited seven times. Embedding the citation score in the list of cited references for documents which have a master record is a great idea. Doing that with cited references to documents which have no master record is an even greater idea, and then comes the brilliant one.
When I met the developers at that Spring, 2006 conference in Warwick, I wondered if they could sort the list of cited references by their citedness score if the user wants to do that. I certainly wanted it, as very often there are hundreds of cited references, for example for almost all of the chapters in Annual Reviews. For the sample that I showed there are "only" 75 cited references. All I wanted was to have a button to list the references in decreasing citedness score order. What Scopus did was to show the results not in the old, traditional bibliographic format, but in (guess what) a grid format. In spite of my obsession with "gridding" anything possible, I did not think of it. But the developers obviously did, and built this into the latest release. There is a Table button to display the cited references in a tabular (grid) format. This makes a lot of sense even for the users who are not obsessed with representing information in a grid format, simply because the basic result list is displayed in that format. Here is what the list of cited references looks like in the alternative format. The 2nd cited reference in the original list shows up as item #37 in the list rank ordered by citedness scores.
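The effect of the sort can be illustrated with a few lines of code. This is only a sketch of the idea, not the way Scopus implements it: the class, the function and all sample values are hypothetical, except for the encyclopedia in the 2nd position that has no master record and was cited seven times.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CitedReference:
    original_position: int   # position in the author's own reference list
    label: str               # short description of the cited work
    times_cited: int         # citedness score
    has_master_record: bool  # False for shadow records (e.g. an encyclopedia)


def rank_by_citedness(refs: List[CitedReference]) -> List[Tuple[int, CitedReference]]:
    """Order references by decreasing citedness; ties keep their original order."""
    ordered = sorted(refs, key=lambda r: (-r.times_cited, r.original_position))
    return list(enumerate(ordered, start=1))


# Hypothetical four-item reference list.
refs = [
    CitedReference(1, "journal article A", 15, True),
    CitedReference(2, "encyclopedia", 7, False),
    CitedReference(3, "conference paper", 0, True),
    CitedReference(4, "journal article B", 42, True),
]

ranked = rank_by_citedness(refs)
for rank, ref in ranked:
    print(rank, ref.label, ref.times_cited, "(was #%d)" % ref.original_position)
```

Because the shadow record carries a citation count of its own, the encyclopedia takes part in the ranking like any other reference.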
There are many reasons for being an orphan record. The most common one is that a journal is covered now, but its coverage does not extend to the year cited. A variant is that the journal is covered but the particular issue is not, or the issue is covered but the paper was skipped. In the context of citation searching, gaps in the coverage of journals of any type are more of a concern than in a plain I/A database, because they have reverberating consequences.
The second most common reason is that the referenced item is a book or a book chapter, or a paper in a journal or conference proceedings which is not among the 15,000 sources processed by Scopus. It may be an orphan because the reference is erroneous, giving the wrong metadata (author, publication year, volume number, title), and it does not match the master record in Scopus. (It is also possible that the master record is wrong and the cited reference is correct.) It is, however, also possible that the journal, or the particular issue or paper, was not processed by Scopus, or that the reference is to a Web site. Web sites, like the archive of this column, are not covered by Scopus, therefore any citation that refers to one of the Digital Reference Shelf columns will be an orphan reference even if it was consistently and correctly specified and entered.
Web of Science has had for a long time a similar index that you can search by cited author name, cited source name and cited publication year, but not by title words of cited journal articles from sources not covered by WoS. Scopus makes this also possible (for citations that include the title of the articles, which is not the case in many science journals where the citation style requires only the author name, the journal name in an abbreviated format, and the publication year, volume and issue number).
This is a very important feature. Here is an example to illustrate it. When I searched for my name as a cited author, Scopus came up with 188 items before this change. Now it shows an additional 122 items from the Orphan Citations component, citations it could not assign to any master record created from the sources it processes. If I combine my name in an OR relation with its most common misspelling, then the results are 217 and 147 records, respectively. It is to be noted that many of these are references to the same document with slightly different content, which prevents them from being matched to a master record. Differences in page numbers, years, title words, and errors of commission or omission in any part of the records make references orphaned.
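Why a small discrepancy is enough to orphan a reference can be shown with a deliberately strict matcher. This is a sketch under my own assumptions, not the matching algorithm of Scopus: the key fields, the function names, the sample author and the record identifier are all hypothetical.

```python
from typing import Dict, NamedTuple, Optional


class RefKey(NamedTuple):
    first_author: str
    year: int
    volume: str
    start_page: str


def make_key(first_author: str, year: int, volume: str, start_page: str) -> RefKey:
    """Build a lightly normalized key (case and surrounding spaces only)."""
    return RefKey(first_author.strip().lower(), int(year), volume.strip(), start_page.strip())


def match_reference(key: RefKey, master_index: Dict[RefKey, str]) -> Optional[str]:
    """Return the master record id for an exact key match, or None (an orphan)."""
    return master_index.get(key)


# One hypothetical master record.
master_index = {make_key("Doe J", 1999, "12", "101"): "master-0001"}

correct = make_key("Doe J", 1999, "12", "101")
wrong_page = make_key("Doe J", 1999, "12", "110")   # transposed page digits
misspelled = make_key("Do J", 1999, "12", "101")    # misspelled author name

for key in (correct, wrong_page, misspelled):
    print(key, "->", match_reference(key, master_index) or "orphan")
```

With exact matching, one wrong digit or one dropped letter produces a separate orphan entry for what is really the same document.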
The idea of the h-index for measuring the lifetime scholarly publishing activity and impact of researchers was developed by Jorge Hirsch, a physicist at the University of California at San Diego. Its theoretical soundness and practical utility made it a very popular scholarly evaluation tool in record time. Its essence is to do a search for the name of a scholar in a database that shows the number of times each of the scholar's papers was cited, then sort the result list in decreasing order of citedness. The h-index is the last rank order number in that list for which the citedness score of the paper is still larger than or equal to the rank order number. For example, the h-index of Professor Jane Doe is 12 if 12 of her papers were each cited at least 12 times, and none of her other papers was cited more than 12 times. I discussed the concept and illustrated its excellent implementation in Web of Science (along with other important scientometric indicators) in my in-depth review of Web of Science in the January, 2007 column. My tests have shown that, indeed, it is a very useful indicator because it does not have the disadvantages of the other metrics that have been used for the same purpose (which are very well summarized by Hirsch in his paper). Hirsch mentions that the database must have a broad enough coverage (he used the version which covers the period from 1955 to the current date), and that he skipped some names on his target list because their names were not distinct enough to tell apart one from the other based on the cited works alone.
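The manual look-up procedure translates into a few lines of code. The following is a minimal sketch of the standard definition; the function name and the sample citation counts are made up for illustration.

```python
from typing import Iterable


def h_index(citation_counts: Iterable[int]) -> int:
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)  # most cited paper first
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the paper at this rank still has enough citations
        else:
            break      # citedness fell below the rank order number
    return h


# Hypothetical author with six papers.
sample = [25, 8, 5, 3, 3, 1]
print(h_index(sample))  # 3: three papers have at least 3 citations each
```

The fourth paper has only 3 citations, fewer than its rank of 4, so the scroll-down stops at rank 3.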
However, it is not merely the breadth of coverage, its depth, retrospectivity and broadness (in terms of documents) that matters, but also the extent of enhancement by cited references. Scopus makes it very clear that it added cited references since 1996 (actually there are a few thousand records already from 1995).
This makes it the most favorable tool for those researchers who started (and continued) publishing from 1996. Those who started publishing in the early 1990s are much less favored, and those who did so before 1990 are strongly disfavored by Scopus (or by those versions of WoS which are licensed, according to the decision and budget of the library, from 1996 or later).
This is because, typically, papers get the most citations in the second and third year following their publication, then the citations start decreasing. For WoS it is up to the library which edition to license (back to 1900), but for Scopus there is one edition, which still could be good or very good depending on the circumstances I mentioned above.
There have been many ranking lists published in the past two years based on the h-index. All of them, except for one, used Web of Science. The one exception used Scopus, right after the new version of the software, with two built-in options for automatic generation of the h-index, was released in early June. The test by the editor-in-chief of the open access, highly rated Retrovirology was done in mid-June; my test followed within a few days. (The number of papers and the number of citations have increased since that time, but the h-index is robust, not sensitive to such changes in the case of widely published and cited authors.) The editor-in-chief checked and reported the h-index of 45 members of the editorial board of Retrovirology.
Unfortunately, the short paper by the renowned, widely published and highly cited virologist of the National Institutes of Health, Kuan-Teh Jeang, perfectly demonstrates what a difference it can make if one chooses the option of Scopus that I find very inappropriate and unfair to senior researchers like himself and the members of his advisory board.
I tested the manual look-up method and the good option for the automatic generation of the h-index for one third of his sample. Here is the manual scroll-down example for one of the researchers, Jean Luc Darlix, with an h-index of 42. He published well over 200 papers in the past 40 years (Scopus has information about 188 of his papers), and he is highly active. In 2007 alone he (co-)authored eight papers. His total citation count (from 1996 onward in Scopus) is above 5,100, so since 1996 he received more than 500 citations per year from publications covered by Scopus for all of his then 185 publications that Scopus knew about (this figure has increased to 188 since).
His h-index is high even in the field of biomedicine, where h-indexes are by far the highest, and in spite of the fact that Scopus is not aware of any citations he received before 1995 (of which there are well over 2,500). This makes his h-value about 10% lower than he deserves. (I am so smart only because after my test I looked up his h-index in WoS, which is 48, based on 218 papers from 1967 and citations received from 1969.) I can live with this difference, knowing that not even WoS can have the total picture.
The good option for automatic generation —as expected— produced the same result. My screenshot shows the ten most cited works of Darlix (the system lets you scroll down all the 188 items—correctly—attributed to the author), and the citedness count of each year by year (from 1996 to 2007). Notice that eight of the ten papers were published before 1996, and certainly received many citations within a few years after publishing.
He has a rather unique name (at least within the circle of scientists) and the double initials make it perfectly distinct. (Checking the h-index for people with common and/or easy to misspell names, and/or accented character(s) in their name, requires careful and time-consuming preparation.) I then tested the built-in option, and found that the editor-in-chief used the method that I strongly disagree with. The reason is that it strongly or very strongly discriminates against those who started publishing before 1995. (I am not affected by this policy; my low h-index changes by only a point or two whether I use the good or the bad method.)
The alternative, ill-conceived automatic h-index generator is invoked by doing an author disambiguation search, which displays the potential candidates. In this case this is not a problem because of his distinct name. But the problem came when his details, including his h-index, were displayed. It is only 27, a very significant drop from 42, based on a very unfair concept and algorithm.
In this option, only the post-1995 publications and the post-1995 citations are considered. This is a double whammy for productivity and impact. He is deprived of thousands of citations received from 1996 onward for his pre-1996 publications. There is a note about this on the Author Details page, but in seven point font size. Jeang's paper does not mention this restriction anywhere. In his table, most of the editorial advisory members are short-changed, to different extents. This is a global problem. The h-index was decreased by using the ill-conceived alternative of the h-index generator for all 15 editorial board members, to a different but considerable extent.
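The difference between the two options can be reproduced on a toy publication list. This is a sketch under stated assumptions, not the Scopus code: the function names and the sample data are hypothetical, and every citation count stands for citations received from 1996 onward, since both options disregard the earlier ones.

```python
from typing import List, Tuple


def h_index(counts: List[int]) -> int:
    """Count the ranks at which the citation count still reaches the rank."""
    ranked = sorted(counts, reverse=True)
    return sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)


def h_all_publications(papers: List[Tuple[int, int]]) -> int:
    """Good option: every publication counts, whatever its publication year."""
    return h_index([cites for _, cites in papers])


def h_post_1995_publications(papers: List[Tuple[int, int]]) -> int:
    """Restricted option: pre-1996 publications are dropped altogether."""
    return h_index([cites for year, cites in papers if year >= 1996])


# Hypothetical senior author: (publication year, citations received since 1996).
papers = [(1985, 40), (1990, 30), (1993, 12), (1994, 6),
          (1998, 5), (2001, 4), (2003, 2), (2005, 1)]

h_all = h_all_publications(papers)
h_restricted = h_post_1995_publications(papers)
print(h_all, h_restricted)
```

In this toy case the index drops from 5 to 2, the same kind of loss as the drop from 42 to 27 reported above: the most cited papers of a senior researcher are often the pre-1996 ones.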
There are many features in Scopus which are top-notch and exemplary. The h-index generator triggered by the author identification process is not one of them. The h-index is a career measure. For its convenience, many will use it without realizing its unfair presumptions and implications. A couple of weeks ago Elsevier sent out a mass mail to authors (including myself) offering free 30-day access to Scopus (which is a laudable idea and smart policy), urging them to check their h-index through author search. I suggest that you make use of this splendid offer to explore the smarts of Scopus, and to verify how discriminatory this h-index generator is for experienced researchers compared with the other built-in one or the scroll-down look-up alternative, which still disregard the pre-1996 citations received, but at least don't ignore the ones earned by pre-1996 publications.