Title: Microsoft Academic Search
Tested: December 26, 2009 – January 7, 2010
This second coming of a free academic database from Microsoft is much smaller than the earlier (very poor and withdrawn) version was, but it is far better in terms of both content and software, focusing on computer science and – to a limited extent – on information science. It is a promising start by the Microsoft Research Asia group, and a good base for extending the service to many other disciplines.
Microsoft was surprisingly late in coming out with a free academic database. It let Google Scholar (GS) have a jump-start. When Microsoft did release Windows Live Academic (WLA) in 2006, it was too little, too late. I discussed the problems in the April, 2006 issue of this column, which got hacked, along with a dozen other reviews.
However, I reconstructed much of it and posted it on my site to show the substandard quality of the service launched by the traditional leader of the microcomputer software industry.
Microsoft claimed that it covered 4,300 journals and 2,000 conference proceedings. Whoever came up with those numbers apparently did not realize that many of the same periodicals appeared in the source list several times, under a great variety of typos and abbreviations.
The joy of being able to check the source list in WLA (something Google has never released for Google Scholar) evaporated very quickly. The list was emblematic of the sloppiness of the whole project, with absurd attributions of journals to arch-rival publishers, such as listing Nature Publishing Group as a co-publisher of Science.
As for the actual journal sources, it was quite telling that two key research journals of IBM, the IBM Systems Journal and the IBM Journal of Research and Development, were not covered at all. The new service does cover them, although only from 1971 up to 2008; the cutoff reflects the fact that the former was merged into the latter journal in 2008. It was particularly insensitive, if not arrogant, of Microsoft to ignore these two classic journals (which were influential outlets for substantial computer science research well before the 1980s, not just for modest hardware and software developments).
WLA got worse and worse during the following years as Microsoft tried to enhance this sorry database, changing its name to Live Search Academic (LSA) and doing more harm than good in the process. It ventured into the arena of citation analysis, counting the citations received by papers, and failed miserably, grossly under-reporting the citedness of very influential research articles, as I illustrated in my April, 2008 review using papers authored by Jorge E. Hirsch.
Many of his works in condensed matter physics have been highly cited, but LSA apparently could not count beyond one in many cases. Neither was it encouraging that LSA could find only 37 articles by the very productive Carol Tenopir, who has written a few hundred papers. It added insult to injury that, according to LSA, only one of the 37 was cited, and according to the citation-matching module, even that one only three times.
In May, 2008 Microsoft put this embarrassing service out of its misery. It is unlikely that anyone mourned it, as Google Scholar has been offering its much larger database and has made it very convenient to find a few good scholarly papers on any topic in free full-text format very fast. Searching it by metadata, i.e. by author, publication year, subject category code, and journal, and using its reported citation counts to measure the impact of authors and journals, is a different story.
No wonder that professor Nunberg recently referred to Google Book Search (GBS) as a metadata train-wreck. The inferior metadata of GBS has a lot to do with GS: in my testing in September, 2009, all the GBS records (10.2 million at that time) seemed to have been incorporated into GS. This further deepened GS's own metadata mega-mess, which I have described and demonstrated repeatedly since its debut; I recently explained in Library Journal some of the grave consequences of its basic illiteracy and innumeracy. A more detailed version, published in the online edition of Library Journal, illustrates for users how phantom authors are promoted to co-authors, and how real authors become ghost authors and lost authors, in millions of records in Google Scholar.
In spite of its serious software deficiencies, GS is, in terms of size, in the heavyweight multidisciplinary league, while Microsoft Academic Search (MAS) is in the lightweight category. In spite of the functionally similar names of the two services, MAS is not an alternative to GS – except for computer science.
There are other databases dedicated to the discipline of computer science, and from that perspective, MAS is a heavyweight. For example, CiteSeerX, the new version of the original CiteSeer – the mother of all databases based on autonomous citation indexing – has nearly 1.5 million records, as after a long lull the project received the funding to double the size of the original 1997 database.
The CSB (Computer Science Bibliography), which has been compiled since the early 1990s from 1,500 bibliographies, has more than 2 million unique records. The DBLP database has more than 1.3 million records about computer science papers. The ever-growing and improving free Scirus database of Elsevier has information about 600,000 computer science journal articles and conference papers collected from scholarly publishers' web sites and digital preprint archives (along with 140,000 patents in computer science).
There are more than 2 million other computer science-related entries in Scirus from a large variety of Web sites, but this partition of Scirus is too much of a mixed bag of document types to include in the comparison of scholarly articles and conference papers. In the subscription-based arena, Web of Science (WoS) and Scopus have about 1.5 million records in the computer science category, out of the nearly 46 million (WoS) and 40 million (Scopus) records of these mega-databases.
Much smaller are the digital archives of the two most important computer societies, the ACM Digital Library (nearly 300,000 items) and the IEEE Computer Society Digital Library (308,000 items), but they cover their own journals and proceedings completely, and the much broader ACM Guide has almost 1.41 million records, with sophisticated search options and bibliometric data.
MAS was developed by the Microsoft Research Asia group in Beijing, and in many regards it is a far better service than the previous ones discussed above. MAS is accessible through a U.S. URL, but there is also a mirror site at http://libra.msra.cn/Default.aspx?searchtype=1. This seems somewhat strange – until you realize where MAS was developed. The same group also has been operating the LIBRA service, which had about 25% fewer records than LSA but was a far better database. It is no surprise that the current incarnation of MAS runs circles around the earlier versions, in spite of its own idiosyncrasies.
Microsoft claims to focus on computer science, and indeed only a few records for my test queries were seemingly unrelated to computer science and its related area of information science. For example, there were 63 hits for toxoplasmosis, two for furlough, 646 for recession, one for "facial symmetry", nine for Islamic, seven for Catholic, and 32 for terrorism. Actually, judging from the titles of the papers, most of these hits were obviously related to computer science. This means that the scope of the database has indeed been narrowed compared to the earlier incarnation (probably because of the primary interest of the Chinese developers).
The size of the database (as claimed on the tagline next to the logo) was immediately questionable, given the narrow subject focus. It claims to have information about 4,471,627 papers in this computer science database. This is an implausibly high number for the discipline; there are nowhere near that many scholarly journal articles and conference papers in computer science.
For a reality check: at the end of December, 2009, Scopus had 1,541,100 records in computer science, including – among others – 730,300 journal articles and 678,360 conference papers, and WoS Century of Science, along with the Conference Proceedings database, had 1,470,555 records, including nearly 518,000 journal articles and 870,000 conference papers. These figures are somewhat under-reported because of the incomplete coverage of computer science journals and proceedings in Scopus and WoS. MAS covers more sources, but its retrospectivity and currency are not perfect either. In Google Scholar, the computer science category is grouped together with engineering and mathematics, so no realistic figure can be estimated for its coverage of computer science.
On the surface, it seems as if MAS had records "only" for journal articles and conference papers. Its main menu has search tabs for Papers, Conferences, Journals and Authors. Still, a search for the term "information retrieval" brings up widely known books at the top of the results.
However, these are not source items but target items: the most cited books on the topic. They do not have master records; they are pseudo records, created from cited references to gather the citations given to the books. This is a good idea. After doing it manually for a Festschrift paper celebrating F.W. Lancaster's 75th birthday, in order to calculate a realistic h-index for his remarkable oeuvre, I proposed using pseudo records to attach to them the tens of thousands of citations that would otherwise be lost for lack of coverage of books as source documents.
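To make the idea concrete, here is a minimal sketch of how such pseudo records could be synthesized from cited references. The record layout, the crude title-plus-year normalization, and the field names are my own illustrative assumptions, not a description of how MAS actually implements this internally.

```python
# A sketch of the pseudo-record idea: books that are not source items get a
# surrogate record built from the cited references that point to them, so
# their citations are not lost. All field names here are assumptions.
from collections import defaultdict

def build_pseudo_records(cited_references):
    """Aggregate citations to non-source items (e.g. books) into pseudo records."""
    pseudo = defaultdict(lambda: {"citations": 0, "citing_papers": []})
    for ref in cited_references:
        # Normalize crudely on title + year; a real system would need far
        # more robust matching to merge the many variant forms of citations.
        key = (ref["title"].strip().lower(), ref["year"])
        pseudo[key]["citations"] += 1
        pseudo[key]["citing_papers"].append(ref["citing_id"])
    return pseudo

refs = [  # hypothetical cited references extracted from source papers
    {"title": "Introduction to Information Retrieval", "year": 2008, "citing_id": "p1"},
    {"title": "Introduction to information retrieval", "year": 2008, "citing_id": "p2"},
    {"title": "Modern Information Retrieval", "year": 1999, "citing_id": "p1"},
]
for (title, year), rec in build_pseudo_records(refs).items():
    print(title, year, "cited", rec["citations"], "times")
```

The two variant citations of the first book collapse into one pseudo record with two citations, which is precisely what makes the technique valuable for citation counting.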
Unfortunately, MAS does not offer a list of journal and conference titles to gauge the dimensions of the source coverage, so one is restricted to doing spot checks and comparisons to get a sense of the breadth and depth of the coverage. IEEE Micro brings up 1,262 hits, and MAS volunteers the information that this is for the period 1988-2008, i.e. it is missing records for the first six volumes, from 1981 to 1987. I did not limit the search to any time period; that is why I say the information was volunteered.
This is an excellent idea, as it automatically informs users of the time span of coverage of the source. WoS brings up 2,036 hits for the journal, and when the search is limited to 1988-2008, the hit count goes down to 1,688, i.e. still significantly higher than in MAS. Scopus brings up 1,177 hits, and applying the 1988-2008 limit decreases the set to 931 hits, significantly fewer than MAS retrieves. The publisher covers the entire run of the journal, but its displayed hit count is limited to 100, so the total number of records for the journal cannot be determined there.
It is disappointing that MAS does not have a browsable list of the journals and proceedings that are used to build the database. True, the list in the first release in 2006 had many embarrassing warts, and it was removed from the second release. But the new, restricted domain would have made it feasible to make the apparently cleaned-up list of computer science sources browsable. Spot checks for journals, transactions and conference proceedings of the ACM and the IEEE Computer Society gave a good impression of the much more consolidated source names.
The same is true of the breadth of coverage of the sources – except for the time span of coverage in several cases. For example, ACM SIGMETRICS is covered only up to 2008, even though this well-cited scholarly newsletter of the eponymous Special Interest Group is alive and kicking; the most recent issue on the ACM Web site was the September, 2009 issue. It remains to be checked whether this is just a currency issue or whether the source was dropped by MAS. It should not be dropped, because its citedness rate as reported by MAS is impressive: 1,646 papers were cited 22,666 times. It is good that Library Trends is among the sources covered in MAS, but it must be realized that its coverage is limited to 1992-2003.
As for testing the breadth of coverage of another journal (one which definitely belongs to information science, if not to computer science), it was quite telling that MAS reported 2,174 records for the journal Information Processing & Management from 1963-2009 (i.e. including records for papers published under its former title, Information Storage & Retrieval, from 1963-1974). At the site of the publisher, Elsevier, the number of hits was 3,086 for IP&M and 468 for IS&R. WoS produced 2,743 plus 372 hits for the current and former titles. Surprisingly, Scopus had only 1,925 hits for IP&M and 277 for IS&R. I expected that Scopus would have records for all the papers published in journals owned by Elsevier, the producer of Scopus.
For IEEE Micro, MAS has 2,647 hits for the 1984-2009 period, Scopus has 1,913 for the same time period, and WoS has 3,820. As for computer science conference proceedings, MAS is better than either Scopus or Web of Science (in the version enhanced by conference papers). Neither of the subscription-based databases can claim consistent coverage of many journals, and especially of proceedings.
Overall, the breadth of MAS's coverage of the field of computer science is very good, but the time span and currency of coverage need improvement. Beyond article-related metadata, it also has a growing segment of information-rich and appealingly presented author profiles, which I discuss in the software section of the review.
The new software, developed by the Microsoft Research group in China, makes a huge difference compared to the substandard earlier efforts of Microsoft. It still has annoying limitations in the search module and some bad output choices, but overall MAS now has several smart features and more reasonable hit counts and citation counts than the previous incarnation.
Let me start with the disappointing software aspects. There does not seem to be an option for truncation. The term computers retrieves 69,121 hits, computer finds 286,319 items, and computing 183,002. You can build a query in the Advanced Search mode, but as you add more terms they are evaluated in a Boolean AND relationship. You don't get an error message if you insert a Boolean OR operator, but it reduces the results instead of expanding them. It would make searching much more efficient if truncation, such as comput*, were possible.
The advanced search does not offer a tab for searching the Papers category (as the Basic Search mode does). The same problem plagues author searching, where two variants should be searched to accommodate the full first name and the initial-only form. There is no way to OR together two or more variants of author names or journal names; the only recourse is to run the variant searches separately and merge the results yourself, as the sketch below illustrates.
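A minimal sketch of that client-side workaround follows: one query per term variant, with the results unioned and deduplicated. MAS offered no public API at the time of this review, so search_fn is a placeholder for whatever retrieval call is available; the toy corpus exists only to make the sketch runnable.

```python
# Emulating the missing OR and truncation operators on the client side by
# running one query per variant and merging the deduplicated results.

def union_search(search_fn, variants):
    """Run one search per variant and union the results on a stable key."""
    seen = {}
    for term in variants:
        for record in search_fn(term):
            # Deduplicate on (title, year); assumed here since the records
            # expose no public identifier.
            key = (record["title"].lower(), record["year"])
            seen.setdefault(key, record)
    return list(seen.values())

# Toy corpus standing in for the database, just to make the sketch runnable.
CORPUS = [
    {"title": "Parallel computing models", "year": 1998},
    {"title": "The computer as a communication device", "year": 1968},
    {"title": "Computers and intractability", "year": 1979},
]

def toy_search(term):
    return [r for r in CORPUS if term.lower() in r["title"].lower()]

# Emulates the truncation search comput* by enumerating the variants.
hits = union_search(toy_search, ["computer", "computers", "computing"])
print(len(hits), "unique records")
```

The same pattern serves for author-name variants ("J. Smith" and "John Smith"), though of course a proper OR operator in the service itself would make this unnecessary.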
MAS has some strange author names, but to a negligible extent compared to GS, which attributed millions of records to authors named Introduction, Methodology, Background, Password, or Login – depending on which section of the article, or which menu option on the search template, it fancied as an author name.
There are no options for marking, saving or forwarding selected items or ranges of items – regular options in the professional online information services. The help file is rather hidden; it should be placed much more prominently, with a link button close to the top of the search menu. On the positive side, the search results automatically display in a side panel the most productive authors, journals and conference proceedings associated with the query term. It is another issue that if your display is not set to full screen, you may miss that important panel.
Many other services implement such a panel better by also showing the number of items associated with each author, journal, etc.; MAS does this only on the author summary page, for the co-authors.
What makes this clustering really attractive is that it leads the user to a very informative summary page about the author. It shows a chart of the author's productivity (through the number of research publications) and influence (through the number of citations received). The chart for Gary Marchionini, for example, provides an instant view of his essential research performance indicators, as well as his h-index and g-index.
It would be handy to see similar scorecards for journals and conference proceedings, as these performance indicators are becoming very popular measures for gauging the research productivity and influence of individuals, groups, institutions, journals and even countries. Using two Y-axes, one for the number of papers published and one for the number of citations received, would make the chart even better.
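To show what such a dual-axis scorecard might look like, here is a short matplotlib sketch. The publication and citation counts are invented sample data, not figures taken from MAS or from any real author profile.

```python
# A sketch of the suggested dual-axis scorecard: papers published per year
# on one Y-axis, citations received per year on the other.
import matplotlib.pyplot as plt

years = list(range(2000, 2010))
papers_per_year = [3, 4, 2, 5, 6, 4, 7, 5, 6, 3]                   # hypothetical
citations_per_year = [10, 25, 40, 55, 80, 95, 120, 150, 170, 140]  # hypothetical

fig, ax_papers = plt.subplots()
ax_papers.bar(years, papers_per_year, color="lightgray")
ax_papers.set_xlabel("Year")
ax_papers.set_ylabel("Papers published")

# A second Y-axis keeps the two very differently scaled measures readable.
ax_citations = ax_papers.twinx()
ax_citations.plot(years, citations_per_year, marker="o")
ax_citations.set_ylabel("Citations received")

plt.title("Author scorecard (hypothetical data)")
plt.show()
```

Without the second axis, the citation curve would dwarf the paper counts into an unreadable line along the bottom of the chart, which is why the two-axis layout is worth the extra effort.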
The values of the performance indicators were often much higher for computer scientists than in WoS and Scopus because of the much broader coverage of conference papers in MAS. For information scientists, such as Carol Tenopir, who published many well-cited papers in library and information science journals, the opposite is true, because these journals are covered in MAS only to a limited extent (although much better than in the earlier incarnations of the software).
In addition to these two essential indicators, the h-index and g-index are also displayed. The MAS help file words its definition of the h-index (h papers, each of which has been cited by others at least h times) as if the index excluded self-citations – but it does not. The very same summary charts and scorecards would be very useful for journals and proceedings as well.
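For readers unfamiliar with how these two indicators are derived, here is a minimal sketch computing both from a list of per-paper citation counts, using the standard definitions (h papers with at least h citations each; the largest g such that the top g papers together have at least g squared citations). The sample counts are hypothetical.

```python
# Computing the h-index and g-index from per-paper citation counts.

def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations."""
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g

paper_citations = [42, 18, 18, 9, 7, 6, 6, 2, 1, 0]  # hypothetical
print(h_index(paper_citations))  # 6: six papers cited at least 6 times each
print(g_index(paper_citations))  # 10: the top 10 papers have 109 >= 100 citations
```

Note that neither computation knows anything about self-citations: excluding them would require filtering the underlying citation links before counting, which is exactly what MAS does not do despite the wording of its help file.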
The h-index and its derivatives are getting very popular, and more and more administrators and tenure and promotion committee members are using these scores without realizing their dependence – among other things – on the dimensions of the underlying databases used to calculate the indices. It is essential to add notes that these indices are based on citations from articles that are covered by MAS. This would make publishing a digital list of all the sources covered even more important.
The output list is well-organized, compact and easy to scan. It adds to its value that the results can be sorted by year and by number of citations. There is also an option for sorting by rank, but that produces the same list as sorting by citations. Good as MAS is, it is begging for an additional indicator at the article level: the number of citations per year that the paper received. This would provide a more level playing field for comparing the influence of authors who are in the same positions – say, associate or full professors – but started publishing many years apart.
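The proposed indicator is simple arithmetic, sketched below: total citations divided by the number of years since publication (counting the publication year itself, one common convention). The function name and the sample values are illustrative only.

```python
# The proposed article-level indicator: average citations per year.
from datetime import date

def citations_per_year(total_citations, publication_year, current_year=None):
    """Average yearly citation rate, counting the publication year itself."""
    current_year = current_year or date.today().year
    years_since = max(current_year - publication_year + 1, 1)
    return total_citations / years_since

# A 1995 paper with 300 citations vs. a 2005 paper with 120 (as of 2009):
print(round(citations_per_year(300, 1995, 2009), 1))  # 20.0 citations/year
print(round(citations_per_year(120, 2005, 2009), 1))  # 24.0 citations/year
```

The example shows the point of the indicator: the younger paper has fewer total citations but a higher yearly rate, so the raw citation count alone would understate its influence.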
MAS is a good restart, and its scope deserves significant extension through the inclusion of many other disciplines. It is quite clear that citation-based evaluation of publishing productivity and impact is being used on a rapidly widening scale, and by developing a multidisciplinary database Microsoft can become an important player in this arena.
As I retested my queries before sending in my final draft in January, I saw a very significant increase in the size of the database, which rose to 5 million records. The number of hits for my test words not primarily associated with computer science, such as toxoplasmosis, Islamic and terrorism, also increased dramatically (apparently unrelated to the botched Christmas Day bombing), which suggests that the extension of MAS to other disciplines may have started just in the past few days.