Cost: to be negotiated
Scopus has been continually enhanced since its debut in November 2004, both in terms of content and software. It now offers more than 38 million records, nearly 15 million of them with cited references. The massive efforts to fill in the gaps in the coverage of many journals are to be applauded, but there are still serial publications with significant gaps in coverage, even in the most precious 1996-2009 segment of the database, which should have been given top priority.
The lack of country affiliation information in more than 13 million records should be seriously considered before engaging in bibliometric analysis at the country level based on Scopus data, as the OECD is planning to do in the future. The software has been excellent and a pleasure to use from the beginning (except for one feature that reflects an ill-conceived idea, not a software error, in the automatic calculation of the h-index), and only a few improvements are needed for the additional convenience of some users.
There are only three cited reference enhanced databases in the multidisciplinary league: Web of Science (WoS), Scopus and Google Scholar. I also analyzed Web of Science in May 2009 and will present my detailed evaluation next month. Suffice it to say here that it has about 10% more records than Scopus and, much more importantly, more than twice as many records enhanced by cited references. The reason for this is that Web of Science has included cited references for every record whose source document had such information, whereas Scopus has done so only for source documents published since 1996 (plus for about 7,500 pre-1996 works).
When I use the term Scopus, I refer to the Scopus database itself, not to all the components of the Scopus service, which includes a patent database and a database of Web resources (based on the free Scirus service of Elsevier that I reviewed in this column in 2006 and a year ago). I applied similar restrictions to the three traditional citation indexes of Science, Social Sciences and Arts & Humanities of Thomson in evaluating the Web of Science database.
Google Scholar is possibly the largest of the three databases, but it is not possible to make a reasonable estimate of its size (or of any of its quantifiable characteristics), as you can in most other databases, because its software still has essential problems even with the simplest Boolean operations, as I discussed and demonstrated in this column and elsewhere when Google Scholar was launched, then a few months later, and most recently in 2008.
As good as Google Scholar is at finding information about potentially relevant publications on any subject, in any language, in any document type, media and file format, it is not appropriate for scientometric analysis because its hit counts and citation counts are unreliable, irreproducible and very often simply nonsense. It flies in the face of not only Boolean logic but also elementary common sense that searching for Aussies finds 11,700 records and Kiwis finds 20,100, but Aussies OR Kiwis finds only 10,100. The same is true for the absurd search results when Google Scholar produces an increasingly larger set of hits for an increasingly narrower time span, as shown in these screenshots made for searches covering 1975-2009, 1980-2009, 1990-2009 and 2002-2009, with hit numbers growing from 183,000 to 192,000 to 210,000 and then to 232,000. Google Scholar also reports implausibly high citedness counts for most items, which becomes quite obvious when tracing the purportedly citing papers and finding that the first few cited the paper in question 10, 5, 4 and 6 years before it was published.
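These failures can be expressed as simple sanity checks that any correct retrieval system must pass: the hit count for A OR B can never be smaller than the count for A or for B alone, and narrowing the publication-year range can never increase the hit count. A minimal sketch of the two checks (the function names are mine; the hard-coded figures are the Google Scholar counts quoted above):

```python
def or_count_is_sane(count_a, count_b, count_a_or_b):
    """Boolean logic requires |A OR B| >= max(|A|, |B|)."""
    return count_a_or_b >= max(count_a, count_b)

def span_counts_are_sane(hits_widest_to_narrowest):
    """Narrowing the year range must never increase the hit count."""
    hits = hits_widest_to_narrowest
    return all(wider >= narrower for wider, narrower in zip(hits, hits[1:]))

# Google Scholar's reported counts, as quoted above
print(or_count_is_sane(11_700, 20_100, 10_100))                    # False: impossible result
print(span_counts_are_sane([183_000, 192_000, 210_000, 232_000]))  # False: impossible result
```

Both checks fail on the reported figures, which is exactly why such counts cannot serve as the raw material for scientometric analysis.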
For some disciplines, such as psychology or economics, the discipline-oriented cited reference enhanced databases, such as PsycINFO and RePEc, can be reasonable alternatives to Scopus, especially if topical searches and the productivity and impact evaluation of journals for recent years are the primary purpose of searching.
My latest actual tests to evaluate Scopus (and Web of Science) were made in May 2009 (unless otherwise indicated) in preparation for a lecture tour in Australia and New Zealand.
I looked at the widely touted figures in the promotional materials, including the narrative and numeric information on the Quick Facts & Figures and other Web pages of Scopus, but I verified them, as they should not be taken for granted. Many of these are incorrect and exaggerated. Their compilation has been fast and loose, sometimes making them fiction rather than fact. I have complained about this PR problem for years, to no avail. Unfortunately, many bloggers, tweeters and even professional reviewers accept the data of the promotional materials and Web pages as fact, and search engines happily spread the word to naïve users who take them at face value. I try to reduce this impact by pointing out the PR misinformation.
There is much to like in the content coverage of Scopus in the natural and applied sciences areas (but not in the social sciences and especially not in the arts & humanities), in the coverage of a large number of science journals and conference proceedings, as well as in the presence of abstracts in nearly 70% of the records (much higher than in Web of Science, which has abstracts in 40% of its records). Although this does not matter from the bibliometric/scientometric perspective, it can significantly improve the recall rate of relevant information in the general topical search process.
It is another matter that the PR materials of Scopus cannot refrain from gilding the lily, claiming that Scopus has 33 million abstracts. This false information is believed and then spread by hundreds of librarians and bloggers. Even at the very end of May 2009 (a year after this false claim popped up) Scopus had 27,079,989 abstracts. This is more than a good enough number by itself.
Ironically, the most current Facts & Figures page underestimates the content in one regard: the size of the database. The May 2009 edition of this PR material claims that there are 37 million records in Scopus, whereas my search clearly indicates that there are 38.1 million records: more than 19 million for works published from 1996 onward and 19.1 million for works published before 1996. Not very smartly, the precious 1996-2009 segment of Scopus is under-reported by 1 million records.
On the other hand, the new promotional material boldly claims that "Scopus is the largest abstract and citation database". The original claim was that "Scopus is a large abstract and citation database", which is entirely true. Then, at the launch of the service in November 2004, it was announced as the world's largest scientific abstract database, which is also true. But then the tagline changed to largest abstract and citation database, which is certainly not true for the citation aspect. It is very far from being the largest citation database.
There are close to 15 million records with cited references in Scopus, and this is less than half of what is offered by Web of Science in its more than 31 million cited reference enhanced records. The total number of cited references in those records drives home the point: the difference between the two databases amounts to several hundred million references, the exact figure depending on how precisely the average number of references per record is calculated per discipline, on the proportion of records by discipline, and on how the average number of references has increased by year in most disciplines.
This is an essential issue because both Scopus and WoS are licensed at quite a steep rate for the significant extra value represented by the cited reference enhanced records, not for the bibliographic records and the abstracts. The chart of the cumulative number of records with and without cited references in Scopus and Web of Science illustrates this essential difference best.
Traditional indexing and abstracting databases are readily available as open access (free) sources totaling more than a hundred million records. All of the largest publishers of scientific journals and other serial publications also offer free access to the bibliographic records and abstracts of scholarly papers. Elsevier is one of them, and it deserves credit not only for ScienceDirect but also for the free Scirus database that provides an excellent one-stop searching experience across many important publishers' sites and institutional repositories.
The Scopus database combines records for publications in the sciences, social sciences and arts & humanities. There is a good reason for this aggregation. While the applied and natural science part of Scopus is impressively large (making up more than 95% of the database), actually larger than the Science Citation Index segment in Web of Science, the coverage of the social sciences is puny (slightly below 4% of all the records), although its share gets better if we include the group of Economics & Finance; Business, Management and Accounting; Decision Science; and Psychology (which also covers psychiatry, more of a science than a social science field).
The coverage of arts & humanities is extremely poor (representing barely 1% of the database). It should be realized that more than one subject area may be assigned to a journal; the average turns out to be 1.4, which is of course an odd indicator, like the 1.7 children per family.
Elsevier promised last November to double the number of arts & humanities journals, but as of the end of May this had not been delivered. For a reality check, Web of Science has about four times as many records for social science works as Scopus, and 10 times as many for arts & humanities.
In light of the above, the PR statement that Scopus has the "broadest coverage available of Scientific, Technical, Medical and Social Sciences literature, including Arts & Humanities" is the most irritatingly misleading PR statement about Scopus. I have met many of the excellent developers of Scopus, and I have never experienced this baseless bragging attitude from them. These false statements should be stopped, as they undermine the credibility of the valid and true statements about the really excellent features of Scopus.
Adding 1,600 arts & humanities journals to the source base may not improve the coverage of these disciplines much, because if fewer than 330,000 records were produced from the existing 1,600 arts & humanities journals (about 230,000 from the most recent 10 years), then doubling the journal base is likely to merely double the number of records, which would still be about one-sixth of what Web of Science has for these disciplines. I think arts & humanities is about as close to the heart and mind of Elsevier as ballroom dancing is to train-spotting teenagers in Edinburgh.
The number of journals covered (16,500) looks very impressive. I use Scopus regularly for the benefit of the broader coverage in my primary field of interest. It allows me to find information about papers in journals and conference proceedings, as well as scientometric information about the journals not covered by Web of Science, such as the proceedings of various conferences on digital libraries, First Monday, D-Lib Magazine, Internet Reference Services Quarterly, Internet Research, Journal of Medical Internet Research, Journal of Digital Information, Cybermetrics and Libres.
Of course, I miss in both databases the coverage of some other journals, such as Issues in Science and Technology Librarianship, a very good open access journal with many great reviews, such as the excellent, objective, comprehensive and still practical review of Scopus by Howard M. Dess and the comparative evaluation of Scopus and Web of Science by Susan Fingerman, both from 2006.
It is also a good aspect of the source coverage of Scopus that it includes many journals published in Europe and the Asia-Pacific region. The inclusion of records for forthcoming publications is the best unique feature of Scopus, even though these records do not include cited references. Currently, there are about 140,000 such records, and this is often the first filter that I use to learn about upcoming papers (often being able to read their abstracts). Considering that Elsevier is a large publisher of science books, the number of records for this document type is surprisingly low for recent years, and only 27 are enhanced by cited references, half of them records for chapters in the Annual Review of Materials Science, a serial publication usually considered a journal even though its format is indeed that of a book.
In spite of the positive aspects of the source coverage, I am critical of the often gappy and shallow coverage of many important journals in Scopus. I readily acknowledge that the gaps I complained about earlier in journals such as Reference and User Services Quarterly, Interlending & Document Supply, Journal of Documentation and some other LIS journals have been filled, at least back to 1996 and in many cases much earlier. The large-scale fill-in-the-gaps project is reflected by the histogram that I created to show the yearly distribution of records in Scopus between 1975 and 2009 as of the middle of May. However, I am still displeased with the spotty coverage of some core periodicals in Library and Information Science, which, being the most important in my research domain, I am still analyzing.
I also found shallow coverage in many journals of other disciplines, which is the primary reason why the Scopus record counts and h-index are smaller for many journals (and researchers) than those in Web of Science. While the number of sources in Scopus is 50% larger than in Web of Science, and this is a much touted feature of Scopus, it is not reflected in essential indicators of the depth of coverage.
Much less known is the number of records per source, which happens to be 2,540 in Scopus (38.1 million records from 15,000 periodicals), while this indicator in Web of Science is 4,210 (42.1 million unique records from 10,000 journals). The difference would be even greater if I had used the latest figure from the Quick Facts & Figures file, which refers to 16,500 peer-reviewed journals.
To some extent, the same applies to the time span of coverage of Scopus. It is a very impressive 187.5 years, going back to 1823, but don't get too impressed, because the records for the first 100 years represent merely 1.1% of the database content. Quite tellingly, the number of records per year in Scopus is 204,290, while in Web of Science this indicator is 384,745.
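The two depth-of-coverage indicators quoted above are simple ratios: total records divided by the number of sources, and total records divided by the number of years covered. A minimal sketch reproducing the records-per-source figures from the counts given in the text (function and variable names are mine):

```python
def records_per_source(total_records, source_count):
    """Average depth of coverage: how many records each source contributes."""
    return round(total_records / source_count)

# Database sizes and source counts as quoted above (May 2009)
scopus = records_per_source(38_100_000, 15_000)
wos = records_per_source(42_100_000, 10_000)
print(scopus, wos)  # 2540 4210
```

The records-per-year indicator is the same kind of ratio computed over the time span of coverage; the point is that a 50% larger source base clearly does not translate into deeper coverage per source.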
Filling the gaps for the pre-1996 period would still be important, at least back to 1975 or 1980, because the most recent 30-35 years are critical for the lifetime evaluation of living and working contemporary scientists who started publishing their research results in the mid-1970s.
When comparing the number of records in Scopus and Web of Science for the same author, journal, author affiliation organization or country, it is often brought up that Scopus does not include records for book reviews. This is a questionable decision, and it does not help to limit the search only to articles, literature review papers, conference papers and research notes, for three reasons.
One is that the two systems do not use identical categories. The second is that both are inconsistent in assigning their document types. The third is that Scopus has no document type assigned to 3.3 million records, so any filtering by document type would handicap it in the comparison, because Web of Science has a document type assigned to all of its records.
It is some consolation that Scopus now specifically lists the undefined category for document types, subject areas and languages as the last entry in the excellent distribution matrix optionally displayed at the top of the screen, before the result list, so users get some warning about the missing data elements. I wish the country field were added to the matrix, to warn users of the 13.3 million records missing this information.
Coming back to book reviews: Scopus does, in fact, include book reviews, but it does not have a document type for them. There is at least one covered journal that contains nothing but book reviews: Contemporary Psychology, the former book review journal of the APA. In the absence of a book review document type, these records are designated as review or article, but they are book reviews.
I happen to know this because a decade ago I criticized Thomson for including this periodical, which has nothing but "non-citable" book reviews (a rather loaded and inappropriate term), in the Journal Citation Reports, thereby asking for trouble. The trouble hit in a big way when the journal became the #1 journal by impact factor in the entire social science domain. The reason was that the book reviews did get cited 30 times, while only three records for book reviews were erroneously assigned the article document type, so the journal had an impact factor of 10 based on its "citable" documents, a very high one in the social sciences.
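The arithmetic behind that anomaly is the impact factor formula itself: citations received in the JCR year to items published in the two preceding years, divided by the number of "citable" items from those years. A minimal sketch with the Contemporary Psychology figures quoted above (the function name is mine):

```python
def impact_factor(citations_to_recent_items, citable_item_count):
    """Journal impact factor: citations in the JCR year to items from the
    two preceding years, divided by the 'citable' items from those years."""
    return citations_to_recent_items / citable_item_count

# 30 citations landed on the journal's book reviews, but only the three
# records mistyped as articles counted as 'citable', inflating the ratio.
print(impact_factor(30, 3))  # 10.0
```

A tiny, misclassified denominator is all it takes to catapult a book review journal to the top of an entire domain.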
In Scopus, one cannot know how many book reviews were assigned the review or article category, nor how many of the 3.3 million records without any document type are indeed book reviews. This is not some baggage from the past; even a casual search finds book reviews, like this record for a soon-to-be-published book review of the Pursuit of Unhappiness.
The other metadata element that can be important for scientometric studies is author affiliation. We cannot expect 100% presence of the related data elements (affiliation country, city and organization), because if these are not present in the source documents, neither Scopus nor Web of Science will go out on a mission to determine them for all the authors. I don't blame them.
Given this limitation in the source documents, it is understandable that these data elements are missing. What is surprising is the huge number of such records. For technical reasons, I could check "only" the 1975-2009 time period for the presence of the affiliation country and affiliation organization of authors, but the difference is very notable.
In Scopus, only 66% of the records for 1975-2009 have a country affiliation for the author(s); in Web of Science the rate is 85%. This is quite a difference, especially if we consider that in many cases there are multiple authors for a paper. The impact of the 13.3 million Scopus records with no country affiliation is even greater than this stunningly high number would suggest, given the sharply increasing rate of multiple authorship. One additional important consideration is that Scopus has included the affiliations of all authors only since 2003. This is not my finding but is reported by Gary Horrocks in his comparison of Web of Science and Scopus.
As for the author affiliation organization data element, the difference is not that large, but still considerable. Scopus has this data element available in 76% of its records, and Web of Science has it in 85% of its records. Once again, the omission rate hits harder when multiple authors are involved, whether from the same country and organization or not.
There is a relatively small but strange omission in Scopus records: in about 656,000 records, there is no subject area indicated. The oddness of this is that the omission could be eliminated very quickly, because the subject area(s) are assigned at the journal, rather than at the article, level, and in most of the 150-160 cases that I looked at, the assignment of one or more of the 27 broad subject areas would not be an intellectually demanding task for such journals as BMJ Clinical Research, Physica, Journal of Experimental Psychology, Nursing Times, Journal of Inorganic and Nuclear Chemistry and many of the other journals where the subject area is very obvious from the title. This is not a critical issue, except when a search term, such as depression, needs to be qualified by a subject area name so that irrelevant papers from geology, materials science or physics do not come up.
The Scopus software was love at first sight for me. It was incredibly fast, light and limitless for such a huge and complex database. Instead of repeating my reasons for being so much into it, I suggest you read the software sections of my two previous (p)reviews of Scopus in September 2004 and November 2007.
It is a joy to use, with one exception, which is not a software issue but the implementation of an ill-conceived idea. It relates to the h-index, which was developed by a physicist, Jorge Hirsch, and has quickly become the widely accepted single-number indicator for expressing the productivity and impact of the scholarly publishing activity of researchers.
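Hirsch's definition is simple to compute: a researcher has index h if h of his or her papers have been cited at least h times each. A minimal sketch of the calculation (the function name is mine):

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have >= h citations each."""
    h = 0
    for rank, cites in enumerate(sorted(citation_counts, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Three papers cited at least three times each give h = 3, no matter
# how heavily the top paper is cited.
print(h_index([120, 45, 3]))  # 3
```

As the next paragraphs show, the problem in Scopus is not the formula but its input: silently dropping the pre-1996 papers from the citation list mechanically deflates h for senior researchers.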
Scopus offers an option to show the automatically calculated h-index through the author search, as part of a well-designed, informative table. My problem is with the very inappropriate decision to ignore all papers that were published before 1996, even if they received citations in the past 13.5 years.
It is one thing that Scopus has no cited references in records for papers published before 1996, but it adds insult to injury that the pre-1996 papers themselves are ignored. This results in an absurdly low h-index for many senior teaching and research faculty members and independent researchers who published papers well before 1996 that have been widely cited in the past 25-35 years and are still being cited in 2009, such as the works of Eugene Garfield and F. Wilfrid Lancaster in library and information science.
Still, such researchers may come up with a low h-index through the author search of Scopus, as W.F. Lancaster did, with an h-index of 3, indicating that he had three papers that were cited at least three times each. This is quite an insult for a person like him.
Lazy administrators and bureaucrats stop here and pass him over for some lifetime award. If they were at least to do a search by his name and then scroll down the hit list sorted by decreasing citedness of the master records for his publications, they would come up with a still low, but less insulting, h-index of 7.
Smart users would also look up the ill-named More index, which should have been called the Index of Orphan and Stray Citations, as these citations either have no matching master records in Scopus or are erroneous citations that prevent the smart software from finding a match. They would then manually calculate a more reasonable h-index by combing and meshing the data from the above information pages.
This is a complex project, but some of the issues are discussed and demonstrated in my research paper written for the special Festschrift edition of Library Trends celebrating the 75th birthday of F.W. Lancaster, and in two other research papers, one about The Plausibility of Computing the H-index of Scholarly Productivity and Impact Using Reference-Enhanced Databases and one about The Pros and Cons of Computing the h-index Using Scopus. Blame me for self-citing if you wish.
This is where I see a great chance to further develop the Scopus software, realizing that many users will just choose the pre-calculated h-index for convenience. Scopus actually shoots itself in the foot with the ill-conceived restrictions imposed on the automatically calculated h-index.
There are some minor software issues, such as the need for optional inclusion of country names in the matrix, or for asking the user's approval before including the special subject area Multidisciplinary, which is now done unsolicited whenever you use a subject area limit (like adding extra mayo to your burger without asking first). Beyond facilitating the appropriate accrual of the correct, orphan and stray references for a reasonable h-index calculation, the only significant improvement that I would like to see is the saving of a set of queries, instead of the current technique of saving each query individually; that would make it much easier to rerun a series of queries with additional or different filters for, say, a publication year range.
In spite of the content limitations and the annoyingly inaccurate and misleading promotional materials, Scopus is an appealing system. The gaps and the absent data elements have implications not only for topical searches but also for productivity and impact counts. These will be discussed in my review of Web of Science in the next column, contrasting its results with those of Scopus for a test suite.