Title: Live Search Academic
Publisher: Microsoft
URL: http://search.live.com/academic/
Cost: free
Tested: April 1-20, 2008
I reviewed the database exactly two years ago in this column. Although that review was hacked along with a couple of other reviews, I could recreate most of them. It was a negative review (according to medical librarian and intense blogger Dean Giustini, it was a scathing review). Perhaps so, amidst the naively positive press announcements based on the PR materials rather than actual tests.
For products and services released by one of the largest and most famous software companies, expectations are obviously higher than for an elementary school’s computer club, even when the service is free. Microsoft fell far short of expectations then, and does so once again now.
The original name of the database/service was Windows Live Academic (WLA) when it was launched in April 2006. The preferred name now seems to be Live Search Academic (LSA). The URL changed from http://academic.live.com to http://search.live.com/academic; luckily, the old address is redirected to the new one. Otherwise there has not been much change in the past two years. There are now some new software features in its advanced mode, but many of these are ill-implemented or plain nonsense.
As a free, multidisciplinary, cited-reference-enhanced database it is, theoretically, in the same league as Google Scholar. However, the comparison is meaningless because Google Scholar’s hit counts and citation counts are absurdly high, while the same counts in LSA are absurdly low.
Both LSA and Google Scholar have similarly serious software problems, but most of them are less obvious in the latter, unless one does plausibility searches to get a feel for how trustworthy Google Scholar data are. At best, they are as true as Baron Münchhausen’s tales of his adventures. Often, they are so absurd that they get funny, just like Leonardo DiCaprio in the movie Catch Me If You Can, posing as a fake pilot, lawyer and doctor, quacking around and providing good company for the spectators, but not for the ones who are fooled.
I must make a detour here to explain what makes Google Scholar so appealing, and Live Search Academic so appalling, in spite of many common errors. The key is human behavior. The reality is that even scholars and fine editors may prefer the mirror on the wall that tells them that their articles and their journals are more cited than they are in reality. Google Scholar perfected this, while Live Search Academic grossly underreports the real citation counts of researchers and journals.
In exchange for quickly and simply finding free abstracts and full-text documents, Google Scholar’s quackery may be tolerated, as long as users don’t take its hit counts and citation counts at face value, and don’t derive various measures from the seemingly impressive data it reports. Google Scholar purportedly has about 19,500 documents to be published in the rest of the century. The 999 records that it is willing to show are for documents to be published between 2009 and 2050, and these have allegedly been cited nearly 117,000 times already, indicating that Google Scholar is not just a seer, but a citeseer.
Google Scholar is genetically programmed to inflate its hit counts and citation counts; the inflation is just not as obvious as in the nonsense hit counts and citation counts above. It doesn’t stop this flimflammery, because it has been able to get away with its shenanigans, even in the hands of real scholars. Most of them use the hit counts and citation counts for determining the productivity, citedness and h-index of authors, journals and institutions without corroborating them. I understand this attitude because tracing the citing documents is very cumbersome, but spot checks for a few of them would be quite enlightening, to get a sense of the extent of the distortion.
In my review of Google Scholar earlier this year I illustrated some of the absurdities of Google Scholar hit counts by showing that while the search for Vietnam in the title yielded 135,000 hits, the broader search Vietnam OR Vietnamese in the title produced only 46,100 items, which flies in the face of common sense. By mid-April 2008 the two numbers were 753,000 and 79,200, an even more absurd outcome of a Boolean OR operation. The real number of records that Google Scholar has that meet the criteria may be much lower, say, only 7,920, or 1,792; one would never know, as Google Scholar shows only 1,000 items at most. But just like a quack, it reveals its bluffing when grilled a bit. It easily gets confused, producing obviously nonsense hit counts and citation counts for plausibility test searches, and even for regular ones. It can get away with it as the court jester does with his pranks. Scholars, however, should draw the line at using the counts without an emphatic warning about their flimsiness.
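The plausibility test behind these numbers rests on a simple set-theory bound: a query A OR B can never match fewer records than A alone. A minimal Python sketch of that check follows. The function name and the labels are my own illustration, not anything offered by Google Scholar, and the only data are the title-search counts quoted above.

```python
def violates_or_lower_bound(single_term_count: int, or_count: int) -> bool:
    """Return True if 'A OR B' is reported with fewer hits than 'A' alone.

    The union of two result sets contains every record of either set,
    so such a report is logically impossible.
    """
    return or_count < single_term_count


# (hits for Vietnam, hits for Vietnam OR Vietnamese), title searches
reported = {
    "earlier review": (135_000, 46_100),
    "mid-April 2008": (753_000, 79_200),
}

for label, (vietnam, vietnam_or_vietnamese) in reported.items():
    print(label, violates_or_lower_bound(vietnam, vietnam_or_vietnamese))
```

Both pairs fail the test. The check needs only one of the single-term counts, which is convenient because the count for Vietnamese alone is not reported here.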
While the regular, pedestrian Google can do a magnificent job of making sense of billions of mostly unstructured and untagged web pages, Google Scholar, with the mortarboard on its head and with a special pass to the well-structured and well-tagged digital collections of the largest publishers, makes a fool of its users by taking menu items of its host systems, as well as chapter headings of the documents, and reporting them as authors. To spice all this up, it takes page numbers, volume numbers and issue numbers as publication years, even when they don’t look like publication years at all because they are not four-digit numbers. No problem, Google Scholar makes them four-digit numbers. How it does so is not transparent, but what is in Google Scholar? You may struggle to figure out where it got the publication year 2019 for a document that was published in 2002. The number 2019 does not appear anywhere in the entire document.
For the human eye, it is obvious that the publication year is 2002, but it caught the fancy of Google Scholar’s crawler that the number 19 occurs twice, i.e. twice as prominently as the real publication year, so it added the leading characters 20 to the volume number 19 to make the publication year 2019, as that is a cool publication year. It conveniently ignored the copyright symbol and year, which would have given enough of a clue about the publication year to a mildly literate non-scholar. If you doubt this, look at a few dozen of the purportedly post-2008 publications, and see the pattern.
This happens with issue numbers and page numbers in Google Scholar as well, although not in all the records, just in a few million. It is obvious only for publication years beyond 2008, but it is pervasive, and it gets very interesting when someone calculates the citations-received-per-year measure. Even if the citation counts are close to the real ones, the relative measure gets grossly distorted. It certainly should make one pause.
Thinking a bit more about it, you may have a Maalox moment about the strictness of Google Scholar’s citation matching algorithm, and about your claims regarding your papers’ citedness and your h-index. Still, you will have made the first step toward understanding why so many citation counts in Google Scholar are too good to be true, and how millions of phantom citations are born by ignoring volume numbers, issue numbers and page numbers in citation matching.
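To get a sense of how much a misparsed year can skew a per-year measure, here is a small Python sketch. The citation count of 70 and the wrongly parsed year 2007 are hypothetical values chosen for illustration; only the 2002 and 2019 years come from the example above. Counting the publication year itself as a full year is also my assumption, and other conventions exist.

```python
def citations_per_year(citations: int, pub_year: int, as_of_year: int):
    """Average citations per year in print; None if the year lies in the future."""
    years_in_print = as_of_year - pub_year + 1  # count the publication year itself
    if years_in_print <= 0:
        return None  # a "future" publication year makes the measure meaningless
    return citations / years_in_print


# Hypothetical paper with 70 citations, evaluated in 2008
print(citations_per_year(70, 2002, 2008))  # real year: 10.0
print(citations_per_year(70, 2019, 2008))  # year taken from a volume number: None
print(citations_per_year(70, 2007, 2008))  # wrong but plausible-looking year: 35.0
```

The third case is the dangerous one: a wrong year that still falls before 2008 raises no alarm, yet it more than triples the per-year figure in this example.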
Google Scholar gets more confused, and makes its bluffs easier to call, when the search involves publication years. For example, it claims that the word deception occurs in 46,500 documents published between 1958 and 2008, in 48,700 documents published between 1968 and 2008, in 48,900 documents published between 1978 and 2008, and in 52,100 documents published between 1988 and 2008. Even when you realize that the crawlers of Google Scholar fancy page numbers, volume numbers, and many other numeric strings as publication years when parsing data, the increase of reported hit counts for narrower year ranges flies in the face of plain common logic. One should not even ask Google Scholar the time of day, or else be ready to learn that it is 48:75 a.m.
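The same kind of sanity check can be written for year ranges: when all ranges end in the same year, a later start year means a narrower range, which can never contain more records. The Python sketch below applies this to the four counts quoted above. The function name is hypothetical, and the check assumes the ranges are nested, as they are here.

```python
def find_range_anomalies(counts_by_start_year):
    """Return (wider_start, narrower_start) pairs where the narrower range has more hits.

    counts_by_start_year maps the first year of a range to the reported hit
    count; every range is assumed to end in the same year.
    """
    ordered = sorted(counts_by_start_year.items())  # widest range first
    anomalies = []
    for (wide_start, wide_hits), (narrow_start, narrow_hits) in zip(ordered, ordered[1:]):
        if narrow_hits > wide_hits:
            anomalies.append((wide_start, narrow_start))
    return anomalies


# Reported hits for the word deception, all ranges ending in 2008
deception_hits = {1958: 46_500, 1968: 48_700, 1978: 48_900, 1988: 52_100}
print(find_range_anomalies(deception_hits))
# [(1958, 1968), (1968, 1978), (1978, 1988)]
```

Every adjacent pair is flagged, so not a single step in the series is consistent with the one before it.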
Microsoft’s LSA does not play fast and loose with numbers as Google Scholar does, except for its nonsense Boolean OR operator, but it has enough natural unintelligence (in addition to shallow coverage) to report absurdly low hit counts and citation counts, which are obviously disliked by authors, editors and publishers. It also has its own fatal and crippling deficiencies if one wants to use it for more than just finding a few good items. I will point out the most annoying ones after the content section.
THE CONTENT
At the debut in 2006, Microsoft claimed that there were 6 million records in its database. My tests at that time indicated that there were fewer than 4 million records. Current tests for the most common definite and indefinite articles and prepositions suggest that the size of the database has not grown much.
Simple searches yielded 2,020,000 hits for a, 3,480,000 for an, 3,490,000 for from, 2,050,000 for in, 3,880,000 for to, and 2,010,000 for the. Records have been added and removed on a large scale since 2006 in order to get rid of the large number of duplicates and to fix other oddities, such as this record for a paper published in Science. As I pointed out then, the WLA record claimed that Science is published by Nature Publishing Group American Association for the Advancement of Science AAAS (Science).
Even if we disregard the odd lack of punctuation, this is not only a very odd couple, but as morbid as claiming that Romeo and Juliet was published by Montague-Capulet Partners, Ltd. I can’t fathom how WLA could come up with records that looked like merging under the influence. By now this has been fixed by replacing the original record with one from MEDLINE.
Indeed, very often MEDLINE and several of the repositories are the preferred sources rather than the original publishers’ digital collections, which loses the advantage of being taken directly to the best publishers’ archives, such as those of Elsevier and Oxford University Press (hosted by HighWire Press). In addition, the result list identifies NLM as the publisher of thousands of journals, instead of as the source of the bibliographic records. Librarians know that Lancet is not a publication of NLM, but this confuses readers in a big way.
While the list of publishers may seem impressive, the coverage of their journals is very unimpressive. For example, the keyword search for nanobacteria finds 22 hits in the journals of Nature Publishing Group on its own web site, including 7 from Nature magazine. LSA finds a single record from Nature with this word, and it is retrieved from PubMed. There are only 145,000 records in LSA for items published in Nature magazine, while the native search engine at the NPG web site finds 359,515 records for documents published in NPG publications. This is not a PubMed problem, as PubMed cannot be expected to have records for all of Nature’s articles that deal with topics unrelated to health and medicine. Some of the papers that made it to Nature may have a preprint in one of the repositories, but the coverage of the NPG journals is still very disappointing in LSA.
Even with the large-scale de-duplication, it is strange that the database seems to have barely grown. Tests by author keywords, author names, and journal titles, however, strongly suggest that Live Search Academic is indeed way too small a database to serve as a resource for multidisciplinary searches. No matter how many publishers’ archives and preprint servers may have been crawled by LSA, the process must have been superficial, ignoring tens of millions of records. Elsevier alone has more than 7 million records, HighWire Press (the digital facilitator for many large publishers) has about 5 million items (1.5 million of them free), Springer about 3.2 million, and Wiley-Blackwell 1.5 million. The full text of these can be searched and the bibliographic records with abstracts can be displayed for free by anyone in.
Take the example of searching by author Carol Tenopir. (Searching by author, journal, and publication year range are new software features in LSA.) She has been my mentor, editor (beyond two of my books where she appears in this capacity), and my academic role model (I can’t avoid this term) for more reasons than one. I keep reading her papers, so I am familiar with her works. Knowing the persons, subjects and journals is essential when testing the coverage of a database empirically, and especially when judging hit counts and citation counts. She needs no introduction for readers of this column, either. Suffice it to say here that, in addition to her many coveted scholarly awards, she is the most productive faculty member among the 800+ faculty members at ALA-accredited library and information science programs. She has published at least 450 articles, conference papers, books, book chapters and technical reports, including 240+ columns in Library Journal.
LSA finds 37 hits matching the last name Tenopir. Two of them are from other Tenopirs, demonstrating the additional bonus in searching for her: she has an almost unique and hard-to-misspell last name among researchers, and a single initial, so you don’t need to worry about undiscovered items because of inconsistencies and misspellings. Searching for Tenopir C, i.e. with the initial, retrieves 34 items, correctly omitting the two records mentioned above, and one where the software of LSA chose not to pick up her first name (and the name of her co-author), a very common and very bad practice in a purportedly academic database such as LSA.
This barely 7% coverage of her works is quite miserable; only their citedness (as reported by LSA) is worse, and more insulting. For comparison, the excellent open access LISTA (Library, Information Science & Technology Abstracts) database of EBSCO has 364 records for her books, book chapters and articles. Someone would certainly bring up the excuse that LSA focuses on computer science, physics and electrical engineering. This may have been a poor excuse when it was launched, and it is a very lame excuse two years later, especially when it becomes obvious that the coverage of physics is not good in LSA either.
This is surprising, because physics is the discipline where self-archiving and repositing started. The digital collections of the most important publishers are readily available for data mining, and the best ones serve as models for those who want to implement services based on cited-reference-enhanced databases, such as the relatively small but top-notch PROLA database of the American Physical Society, or the Astrophysics Data System (ADS) of the Smithsonian Astrophysical Observatory with nearly 7 million records. For computer science the CiteSeer database has been available for a long time, demonstrating the feasibility of autonomous citation indexing and, on the side, leading searchers to 750,000 open access computer science papers. All this happened well before Microsoft woke up to the sound and sight of users searching scholarly mega-databases, so Windows Live Academic was not much to write home about.
As you can see from some sample searches, it has remained that way, and its content and new software features, especially the citation matching feature, are pathetic. Searching for the works of Jorge E. Hirsch, a physics professor at the University of California, San Diego, is a case in point. He has published nearly 200 papers, the majority of them in the top-ranking physics journals of the American Physical Society, some in journals of the American Institute of Physics and the Institute of Physics, and some in two Elsevier journals. These are not grey literature by any stretch of the term; they should be covered in any database that claims to cover physics literature.
This search was very difficult because LSA allows the use of only a single initial beyond the last name (a senseless limitation) when using the new author search cell. In the case of Jorge E. Hirsch this is a problem because there are 1,350 items with Hirsch J as the last name and first initial, and LSA displays only the first 250 hits, a very serious limitation, even though I did not see LSA inflate its hit counts as Google Scholar does to impress users. One of the few good features of LSA is that the sets can be sorted/grouped by several criteria (but this is ill-implemented for some of them, such as author name, as I discuss in THE SOFTWARE section). Suffice it to say here that I could find records for about half of Hirsch’s papers in LSA, and their citation counts were incredibly low, considering that about a dozen of his papers have each been cited nearly 200 times.
Journal coverage is also very poor in LSA. The list of journals covered is not available any more, and the software does not offer any browsing (playing the mystery game like Google Scholar). One is left in the dark as to the journals covered. Because the journal name field is only phrase indexed, the journal name used as a search criterion must match the name exactly, including all spaces and punctuation. It is a guessing game. Even simple journal names are hard to guess. Searching for the Journal of Medical Library Association yields no result. The same is true for the standard abbreviation format of the journal, or for its common acronym format. It turns out that the journal appears in its full name followed by a space, a colon, two spaces and the acronym, like this. You can’t use truncation to ease the pain of guessing endlessly. Your joy at finally finding it quickly ends when you learn that there are only 249 records in LSA for this excellent journal, as opposed to 459 in PubMed. There are 66 records for 2002, 66 for 2003, 68 for 2004, 49 for 2005, and the rest is silence. So much for currency. You may start wondering “did I shave my legs for this?”. Under the prior title, Bulletin of the Medical Library Association, there are 1,920 records in LSA, much less than half of the 5,003 records retrieved directly from PubMed.
The most disappointing content element relates to the new feature of LSA, the citation count, which indicates how many times the document was cited by other documents in LSA. While the feature of sorting the result list by citedness in LSA is a good idea (and unavailable in Google Scholar, which switched to ranking by relevance with an explanation straight from the vocabulary of snake-oil sales associates), the results are incredibly bad in LSA. In the case of Carol Tenopir, LSA reports that only one of the papers of hers that it is aware of was cited, 3 times. When you hover above the second entry in the list sorted by citedness count, you will see its citedness as 0, and the same is claimed for the rest of her records. The shallow coverage of many of the journals alone does not explain this symptom. It is probably a lethal combination of poor content and sorry citation matching software.
THE SOFTWARE
At first glance the advanced search template is good news. It finally offers options to search by journal title and author name, and to limit the search to a publication year or year range. Or so it seems.
At least the last option is working, which cannot be said about Google Scholar, which often increases the size of the result as you reduce the time frame for the otherwise identical query as was shown earlier.
As for the other software features, there are problems. Searching by journal name is very difficult, because it is a phrase-indexed field, and there is no truncation symbol that would alleviate the problem of guessing the exact spelling of a journal, or its spelling in LSA, as was shown earlier. No user is likely to be willing to waste her time trying several different versions. If truncation were available, it would help to reduce the donkey work of trying Online Review, Online Rev*, Onl* Review, Onl* Rev* when you don’t want to believe that there is not a single record for articles in this journal. The same is true for Journal of Documentation, one of the top five library and information science journals. It brings up 18 records for the full title, so one would try to see if other variants like J Doc, J of Doc, Jnl Documentation might bring up a few thousand more hits. Browsing by journal title could also help in this regard, as chances are good that collecting records from a variety of different sources yields the same variety of spellings that we could see in the list of journals, which was withdrawn by Microsoft (and has never been available in Google Scholar).
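The difference between a phrase-indexed field and one that supports truncation can be illustrated with a few lines of Python. This is only a sketch of the two matching behaviors, not a description of how Microsoft implemented the journal field. The list of indexed spellings is hypothetical, and the standard library's fnmatch module stands in for a truncation operator.

```python
from fnmatch import fnmatchcase


def phrase_match(query: str, indexed_title: str) -> bool:
    """Exact phrase matching: every character, space and punctuation mark must agree."""
    return query == indexed_title


def truncated_match(pattern: str, indexed_title: str) -> bool:
    """Match with '*' as a truncation symbol, ignoring case."""
    return fnmatchcase(indexed_title.lower(), pattern.lower())


# Hypothetical spelling variants, as might arrive from different sources
indexed = ["Online Review", "Online Rev.", "Onl. Rev.", "Journal of Documentation"]

print([t for t in indexed if phrase_match("Online Review", t)])
print([t for t in indexed if truncated_match("Onl* Rev*", t)])
```

With exact matching the searcher finds only the one spelling she happened to guess. A single truncated query picks up all three variants of the title at once.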
Author name searching is made very inconvenient. Although there seems to be space reserved on the advanced search template for four first and middle initials, only one can be used. The developers ignored the reason why so many scientists have a middle initial: to differentiate themselves. In the test example of Jorge E. Hirsch it would have been easy to search for Hirsch, JE (or Hirsch, J. E.) if the middle initial could have been used to tell him apart from the dozen other researchers with Hirsch as the last name and J as the first initial.
In a multidisciplinary database the distinction becomes more important, and LSA is supposed to be such a database. There are authors with Hirsch as the last name and J as the first initial in physics, audiology, surgery, etc., and there is no option to limit the search to a discipline in order to tell J. E. Hirsch apart from the many researchers with the last name Hirsch and first initial J for John, James, Jorge, or Janacek. You can use a keyword, as I tried with “physics” for the Hirsch J search. This assumes, of course, not only that the word physics appears in all of his papers, but also that it is indexed in LSA.
You can imagine the impossibility of distinguishing authors with the same last name and first initial, who work in the same discipline, even within the same sub-disciplinary area as Herbert S White and Howard D White have worked in the information retrieval research area.
It is senseless and insensitive of the Microsoft developers to limit the search to the first initial when the second is usually available in the source record that LSA is using. LSA often leaves behind not only the additional initials but also the additional authors after the first one.
Then again, at least Microsoft does not deprive authors of their authorship as Google Scholar does when it assigns the authorship to F Password, FY Password, V Cart, I View, M Profile, and to a variety of other menu option names that appear on the host systems’ main page, such as P Reminder (for Password reminder), or in the text as chapter or section headings, such as I Introduction, II Introduction, II Methodology, V Conclusion. Google Scholar treats these as authors, often replacing the real authors’ names with the pseudo names, and its mortarboard with the court jester’s hat. In fairness, this does not affect the entire Google Scholar, just a few million records. The actual number depends on the extent of inflation of the hit counts and citation counts of these records with obviously phantom names, and we remain unaware of the records where. You may claim that it is not a problem until the joke is on you.
Sorting/grouping has been available, but it did not work for journal title. Now there are more options, and more problems. Sorting/grouping by name would have been a reasonable way to find Hirsch, J. E. among the many other authors with Hirsch, J as their last name and first initial, but I could find no logic in the grouping.
The entry headings used for grouping are created for each author and co-author, instead of the traditional author main entry and co-authors’ added entries with cross-references to the main entry, as in print indexes. Although it quickly consumes the 250-item display limit, in the digital context the approach of LSA is feasible. In this case it was supposed to allow me to zoom in on the Hirsch J. E. section in between the Hirsch J. A. and Hirsch J. L. entries.
The grouped list, however, made no sense to me. The first entry is Kim K, surprisingly followed by Kim J; they are co-authors of a certain Hirsch, J. and a Hirsch, P. J. The list gets weirder when Wischmeyer’s entry is followed by that of Bensoussan. At first, it seemed that there was no problem with date sorting. Actually, there is, because a surprisingly high number of records have no publication year (in LSA, that is), and these are left out from sorting. That’s the reason for the enigmatic display that no result is shown when sorting the 18 records retrieved for the Journal of Documentation, as shown above. None of them has a publication year in LSA (all of them have one in the publisher’s archive).
There are other data elements that are left behind by the crawlers of LSA, such as the journal name field in this record for an article about the importance of metadata quality in Library Review. I have seen this with many records, but there is no way to estimate the extent of this omission. In addition to the inconvenience of adding items one by one for a bibliography (instead of marking records and exporting them in one fell swoop), it will annoy the user to find out that many records have no journal name, making the bibliography of modest use. Of course, such records would not be retrieved by journal name searches either.
The most irritating problem is the grossly under-reported citation counts for authors. I mentioned that LSA credits Carol Tenopir with 3 citations for one of her papers, and with none for any of the other 33 it was able to find. Jorge Hirsch fares somewhat better. His seminal paper that introduced the h-index measure in 2005 is credited with 4 citations, and a few other articles with one citation each, even though many of them were cited in more than 200 papers.
It is likely that these absurdly low citation counts, which show up for every author, are caused by the primitive citation matching algorithm of LSA. You should not be surprised by that, because LSA does not even know how to handle a Boolean operator. Maybe the developers thought that bad Boolean contributes to Google Scholar’s success, so they tried to outperform it by apparently making Boolean OR function like a Boolean AND. I use the same example that appears in the LSA help file for Boolean OR to illustrate this nonsense, careless operation. The search term car finds 70,100 records, truck retrieves 8,390 hits, so how many hits will be retrieved by the query car OR truck? If you think that the hit count would be in the mid-70,000 range, you would be right, but LSA reports 1,790 hits, that is, one thousand seven hundred and ninety. This is the same number of hits that the query car AND truck produces. Congratulations, LSA.
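The mid-70,000 estimate follows from the inclusion-exclusion rule: the records matching car OR truck are those matching car, plus those matching truck, minus those matching both (which would otherwise be counted twice). A short Python sketch of the arithmetic follows. It assumes that the single-term counts and the AND count reported by the service are themselves accurate, which is far from certain.

```python
def expected_or_count(count_a: int, count_b: int, count_a_and_b: int) -> int:
    """Size of the union of two result sets by inclusion-exclusion."""
    return count_a + count_b - count_a_and_b


car, truck, car_and_truck = 70_100, 8_390, 1_790
print(expected_or_count(car, truck, car_and_truck))  # 76700
```

Under that assumption the OR query should return about 76,700 records, some 43 times the 1,790 that is actually reported.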
It was a poor debut two years ago when Microsoft launched Windows Live Academic, 18 months after Google launched Google Scholar. I don’t know what kept the developers busy for two years to come up with this sorry upgrade of LSA, which makes it worse, especially by academic measures. No tenure would be granted, let alone promotion, in academia for such performance. I don’t believe that it merely mirrors the incompetence of its developers. It is more likely the result of the lethal mix of gross incompetence and gross indifference. I do know that it casts a bad light on the excellent system developers at Microsoft whom I met several years ago at the Redmond headquarters to discuss the software and content issues of the Encarta service. They may have left, and could not be replaced. No wonder that Microsoft is so desperate to acquire Yahoo.