Publisher: Wolfram Research
This is my farewell review. Mary Kay Dodero offered me this platform 10 years ago, and many editors have helped me, Mary Claire Krzewinski for the longest time, in getting my message through every month. I greatly appreciate their support and the privilege of having had the opportunity to publish 227 database reviews. It was icing on the cake that my favorite reviewer and predecessor, Jim Rettig, recommended that I take over his position as Gale's reviewer for digital ready reference sources. While this is the omega of my column, it happens to end with a review of an innovative and promising factographic database called Wolfram|Alpha.
This unique "computational knowledge engine", the brainchild of one of the most talented contemporary mathematicians, Stephen Wolfram, is said to be based on more than 10 trillion data elements (a number comparable to the number of people who have ever lived, and more than three times the number of stars in our galaxy). I would not have known this, but I quickly learned it by looking up the term trillion in Wolfram|Alpha.
If this were not enough, it can serve much more data than that, because it also calculates new data from much of the raw data that appears in economic time series, factbooks, yearbooks, encyclopedias, almanacs, directories and a large variety of statistical compendia. It is meant for questions that can be answered mostly through numbers. It has great potential to become a widely used, important resource for situations when numeric data is needed rather than deep thoughts and verbalization, but it is not there yet; it is not a finished work that would only need updates with fresh, current data.
Currently, it is more like Wolfram|Beta. Actually, it should have been called Wolfram|Numeric, as its essence is numbers, and it shows that numbers are beautiful (although they can often be as overwhelming as the streams of consciousness in Ulysses). Then again, alpha has many subtle and not so subtle meanings beyond being the first or the leader, which Wolfram may have wanted to play on in choosing this name.
The mother of all quantitative questions may have come from Cleopatra when she learned that Mark Antony had not only been fooling around in Rome, but had also married Octavia. On his return to Egypt, as he reassured Cleopatra that he still loved her, she challenged him by asking, "If it be love indeed, tell me how much." Antony's reply, "There's beggary in the love that can be reckon'd", is the father of all evasive answers. Librarians can't get away with such answers. They often must answer ready-reference questions that beg for numeric answers instantly.
Wolfram|Alpha can be of great help, but it is not a typical search engine. There is a reason that its author (or I might as well say composer) calls it a computational knowledge engine. He wants to set it apart from the dozens of search engines. Still, many reviewers compared it to Google, which is like comparing apples and oranges. Google and the other search engines are actually pointers, sending you to Web sites, whereas Wolfram|Alpha is a direct ready-reference source itself.
Apparently Google got jealous over the pre-release buzz about Wolfram|Alpha and quickly made some good dynamic charts of publicly available demographic, economic and environmental time series (which still beg for a log-scalable y-axis, because the carbon dioxide emissions per capita of China seem minuscule compared to those of Qatar and Saudi Arabia). This new Google feature was, by chance, announced on the same day and at the same time as Wolfram gave his live demonstration of Wolfram|Alpha at Harvard University. Microsoft was not catty; it licensed Wolfram|Alpha to answer many of the numeric searches thrown at Bing. There was much blogging and tweet traffic, and many editorials using the term Google Killer, but Wolfram|Alpha is not meant to be a rival to the generic search engines. Neither is it a direct competitor of the traditional ready-reference sources mentioned below, because it has minimal text for explanation (as would be needed, for example, for the entry on the Richter scale). Wolfram|Alpha is quite in its own league, although still in the sparring-room phase.
There are many encyclopedias, almanacs, factbooks and statistical compendia that include plenty of numeric data embedded in text. Some have a well-designed structure to separate the essential demographic, economic and historic data from the narrative, descriptive parts, but relatively few of these are freely available online. The ones that stand out include the CIA World Factbook, Britannica World Data, the Britannica Concise Encyclopedia, the Encyclopaedia Britannica (partially free), the Columbia Encyclopedia and the ever-improving Wikipedia.
Wolfram|Alpha is probably closer to the outstanding Information Please Almanac, which is far more than the now-ceased print edition was, by virtue of incorporating many other ready-reference sources.
Another service that is closer to Wolfram|Alpha than Google, Bing or Yahoo is the answers.com site, which covers many of the above-mentioned ready-reference sources under one roof, so a single search can bring up answers to many questions in one fell swoop.
For example, the query "Haiti" brought up entries from the American Heritage Dictionary, the Britannica Concise Encyclopedia, Investopedia, the Columbia Encyclopedia, the New Dictionary of Cultural Literacy, the CIA World Factbook and a few additional sources with maps, dialing codes, etc.
The same is true for the encyclopedia.com site of HighBeam Research (now part of Gale, which in turn is part of Cengage Learning), which also shows side-by-side excerpts from a maximum of three user-selectable encyclopedias and dictionaries at a time. This is also very good for the essential task of corroborating reference data in different sources.
However, not even these multi-source ready-reference engines allow users to specify "Just the facts, ma'am" as a special filter. One needs to scroll through many paragraphs to get the essential economic, social, health and political facts about a country. In the emergency following the Jan. 12, 2010 earthquake, many people wanted and needed to learn the vital facts about the country immediately.
They needed them not just out of curiosity, but also for organizing emergency missions and for informing the public (as news agencies, radio and TV stations have done) about the plight of this country, which has suffered very much for very long, including under the generations of the Duvalier family, better known as Papa Doc and Baby Doc.
Beyond the dramatic and necessarily attention-getting images, interviews and video streams about the disaster, factual information was much needed about the country, about the capital and its vicinity (the epicenter of the quake), and about the devastation of earthquakes of similar magnitude in the past, to put the news, the actions and the pleas in context. This may be an extreme situation, but there are many other "normal" cases when users need factual information, numbers and statistics in context. It is one thing to know the magnitude of the quake; it is another to realize the magnitude of the death toll, which in this case is close to that of the 2004 tsunami.
Actually, a new variant of encyclopedia.com, the beta version of Smart Question and Answer, comes the closest to Wolfram|Alpha by answering natural language questions (including factographic questions) from hundreds of published encyclopedias, dictionaries, factbooks and other classic ready reference sources, and – for subscribers – from magazines and newspapers with links to the sources for further details.
You may wonder why I did not mention among the somewhat analogous databases and services the Guinness Book of World Records (GBWR), which is also focused on numbers. The reason is simple. GBWR was one of my favorite sources as a child, but in the past decade it went downhill. When its digital version finally came out, it was quite disastrous, not only because it was full of records that only people with severe "televisionitis" could find interesting, such as the farthest nasal ejection of spaghetti or the number of months a headless chicken lived after its beheading, but also because of the mindless and careless attitude of the compilers. The producers of the digital version have mangled facts and shown ignorance about basic measurement units and rates.
They had no idea about the measurement units for area versus length, or about the conversion rate between miles and kilometers, claiming, for example, that Lake Baikal "covers an area of 88,000 km (234,000 miles)", setting a Guinness World Record for editorial incompetence and recklessness. Now they can learn from Wolfram|Alpha what a square kilometer is, when to use it, and how to convert length and area measures from kilometers to miles, or from square kilometers to square miles, or whichever Imperial units they fancy.
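The conversion the GBWR editors botched is simple enough to sketch. The snippet below uses the 88,000 square kilometer figure quoted above and the exact 1.609344 kilometers-per-mile factor; everything else is just arithmetic.

```python
# Length converts linearly, but area converts by the SQUARE of the
# length factor -- the distinction the GBWR editors missed.
KM_PER_MILE = 1.609344  # exact by definition

def km_to_miles(km):
    """Convert a length in kilometers to miles."""
    return km / KM_PER_MILE

def sq_km_to_sq_miles(sq_km):
    """Convert an area in square kilometers to square miles."""
    return sq_km / (KM_PER_MILE ** 2)

area = 88_000  # the area (in square km) that GBWR quoted for Lake Baikal
print(round(sq_km_to_sq_miles(area)))  # 33977 -- about 34,000 sq miles
print(round(km_to_miles(area)))        # 54681 -- wrong: treats an area as a length
```

Neither conversion yields anything like GBWR's "234,000 miles", which is what happens when editors shuffle units without knowing what they measure.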
Wolfram|Alpha is composed of a huge number of mostly factographic sources in which numbers dominate. We cannot call them just numeric databases, because they have significant text components in addition to the numeric information. Think of the yearly editions of the World Factbook, the OECD Factbook and most of the almanacs. It also has traditional alphabetic information extracted from Princeton University's excellent WordNet database and from the British National Corpus.
Many of these sources are part of Wolfram|Alpha, but in a transparent mode. Not surprisingly, one of its primary sources is the open-access MathWorld encyclopedia, which was developed by Eric Weisstein and is hosted on the Web by Wolfram Research, Inc. There are also hundreds of very large data sets, published by the various agencies of the United Nations, the World Bank and many other non-governmental international organizations, that have been incorporated into Wolfram|Alpha.
Wolfram|Alpha provides a large number of impressive examples and a huge gallery of samples of the tables, graphs and charts that it can prepare from the raw data. Stephen Wolfram's video presentation offers an excellent overview of the most impressive features of Wolfram|Alpha. However, he speaks and clicks at the rate of an auctioneer at Sotheby's, and in his native British English, so it is worth going through the illuminating examples and the gallery to appreciate the scale and variety of the abilities of this service.
These very good PR materials provide a much better overview of what is available from the more than 10 trillion data elements than I ever could, so I chose to focus on what is not available in spite of the humongous, but well-structured and worthy, data silos of Wolfram|Alpha. I am not picking on casual omissions, such as the lack of information about the speed of sound, which would deserve as good a treatment as the entry on the speed of light, as these will obviously be added. Rather, I bring up some omissions of larger scale and higher importance for informing and educating users.
The absence of some essential statistics about education and human development (or rather their absence for a very large part of the world) is disappointing, especially the paucity of gender-specific and age-specific information. The paucity of data is odd because UNESCO, UNICEF, the World Health Organization (WHO) and the UNDP in general, the ChildInfo and GenderInfo databases in particular, and the superb GenderStats database of the World Bank have very comprehensive historic data sets about these two issues.
There are some entries with a clear message that development of the topic is under investigation, as is the case with famine, which is a deadly serious issue in many countries. I would give it much higher priority than many other statistics, and I trust that the query starvation or starving will eventually bring up not only a word definition (as it does now) or a movie (as hunger does), but eye-opening statistical details about the shocking dimensions of its real-life consequences. Data readily available from the FAO and from the UN Commission on Human Rights could give a good start for this topic in Wolfram|Alpha, along with its related existing entry on poverty.
Given the key importance, and the escalating fatalities and costs, of the wars in Iraq and Afghanistan, it is absurd that the development of the war topic is also only under investigation. I suggest they don't investigate, just do it. There are some war-related data in Wolfram|Alpha, but from 2002, when the yearly U.S. fatalities were around 40. I know that statistics are often published only years after the data was gathered, but that is no excuse in this case. Reliable and up-to-the-minute data are available showing that in 2009 the U.S. fatalities in Afghanistan alone were well above 300, that the figure for the first month of 2010 is already close to 40, and that the financial cost of the Afghan war is about 130 million dollars per day. Wolfram|Alpha could and should calculate a variety of crucial indicators from these two data points alone, and in the context of other expenditures on a wider scale, to illuminate users.
Education is one of the 30 main subject areas listed by Wolfram|Alpha, but the set of indicators is rather limited. While there are essential data about enrollments in elementary, secondary and tertiary schools, I could not find data broken down by gender, which could be quite a telling figure about the education and career chances of females in the richest Middle East oil-exporting countries, where secondary and tertiary education has much improved, for males at least, according to the statistics.
The much more comprehensive and regularly updated composite Human Development Index (HDI) is not available in Wolfram|Alpha, even though it has become the most important composite indicator for measuring the quality of life in 180 countries, an unusually high rate of worldwide coverage. HDI does have an entry in Wolfram|Alpha, but it is for the Hardwick Field Airport.
Knowing the elevation, temperature (with and without the wind chill factor) and relative humidity at this airport is certainly important for some people, but including the HDI of 180 countries would have deserved much higher priority, to let users know about the painfully low HDI of, say, Haiti (0.532), the worst in the Americas, where Canada (0.966), the U.S. (0.956) and Barbados (0.903) lead, and even Mexico shows a surprisingly high HDI of 0.854.
The fully spelled-out query Human Development Index is searched as human development, and reports the population growth of the world between 1990 and 2006. It is no consolation that the same chart is presented twice.
For information about the states of the U.S., data from the splendid open-access Measure of America database would have deserved to be licensed for incorporation into Wolfram|Alpha. Among others, the data and interactive charts in its 2010 Hunger and Health Report provide a stunning reality check about vital inequalities within the country.
Another poorly covered area is progress and achievements in science and technology, for which the National Science Foundation (NSF) has highly valuable, freely available historic and current statistics in PDF and Excel formats in a variety of well-structured digital publications. They include succinct and competent textual summaries, graphs and charts about hundreds of informative and comparative science and engineering indicators, ranging from the science labor force to R&D expenditures and scholarly papers published in academic journals.
NSF as an acronym is included with a good explanation, as are all the quantitative, ontological and chromosomal details about the NSF human gene and the Nsf mouse gene, but for some of my test questions beyond the natural science areas the details were below expectations. At the same time, there are entries that seem to be odd choices for Wolfram|Alpha (such as skinny records about not-so-important movies where the only numeric data were the release year and the run time).
Even for possibly relevant entries, the data was odd. For example, my query about innovation brought up as its first hit the periodical with that title, and showed its circulation and some bibliographic details. It was odd that the circulation figure was an estimate for 2008, as the magazine seems to have ceased publication by the winter of 2006, so whoever estimated its circulation at 20,000 copies for 2008 apparently miscalculated the future of this journal.
I also wonder why the sources included both Ulrich's Periodicals Directory, which is getting better and better with every new release, and the PubList.com Web site of Infotrieve, which is getting worse and worse every day, because the data (licensed a decade ago from Bowker, which was the compiler and publisher then) has never been updated by Infotrieve, so it is more than a tad stale. It is hard to understand why Wolfram|Alpha disgraces itself with misinformation from PubList, which, according to a warning message, "appears to host malware". Luckily, you can learn about this before proceeding to Infotrieve's site.
It is unclear what journals were selected for inclusion by Wolfram|Alpha, because there is no information about such journals as The Lancet, Ophthalmology, Pediatrics, Physical Therapy, Physics World, Chemistry World and the Journal of Clinical Psychiatry, which circulate in far more than 20,000 copies and are much more important journals than Innovation. The sources used to answer a question are meticulously listed at the end of the answer(s) most of the time, but sometimes no source appears at all, as was the case with the search about the NSF and Nsf genes mentioned above. There is no comprehensive list of the sources incorporated in Wolfram|Alpha.
The listing of sources does not mean that all of them were incorporated in Wolfram|Alpha. Many are used to corroborate data, and then presumably the most complete and authentic source is used, but there are exceptions, as was the case with the very stale PubList database. Many statistics, especially UN statistics, are incomplete in terms of the countries surveyed or the time period covered, and not even the multiple-source coverage policy can help in these cases. For example, the very high poverty ratio in many countries is at the root of many of the devastatingly high negative indicators, such as infant and maternal mortality rates and illiteracy. Still, no poverty ratios are reported for more than 130 countries in the primary source documents, and there is not much that Wolfram Research, Inc. can do in this regard.
Suffice it to say that this is an enormous data collection any way it is measured. Much more importantly, it is a curated data source collection, meaning that the data sources are selected by humans, who take care of getting the current versions of data tables, time series, directories, dictionaries, encyclopedias, factbooks, etc., and who consolidate their format and content. True, they occasionally make mistakes or miss an update.
This is in sharp contrast to how Google collects data, through its grossly undereducated crawlers and parsers that cripple Google Book Search and Google Scholar on a very large scale for users who want to use these sources for bibliographic and bibliometric analyses. (For just finding a few good books, journal articles and conference papers, these two Google sources are unbeatable, because full-text searching compensates for the metadata mess.)
The Google Public Data module is still very limited, and hopefully it will not suffer the metadata mega-mess of the above two Google services, which is finally getting corrected, although new messes of large scale keep popping up.
In Wolfram|Alpha there is no need to pre-select and commit to a source, because the sources are, in a special way, connected, related and correlated behind the scenes, and this allows Wolfram|Alpha to compute new indicators or convert measures and units of measure on the fly.
I must add that I had strange results in one of my tests. When looking for maternal mortality statistics for countries, Wolfram|Alpha offers the number of maternal deaths per year (which was OK), but the other indicator looked strange for two reasons. One was that it calculated the maternal death ratio per 100,000 persons; the other was that the ratios were far too low to be true. For the U.S. it reported the ratio as 0.2 per 100,000. The problem seems to be that the denominator for this indicator should be 100,000 live births per year, not 100,000 persons; this is not the run-of-the-mill per capita indicator. Accepting that the yearly number of maternal deaths in the U.S. is 483 and the yearly number of births is 4,290,000, the ratio should be 11.25, more than a 55-fold difference from what appears in Wolfram|Alpha. In addition, for the common alternative indicator, the maternal mortality rate (the term used on the Wolfram|Alpha result page), the denominator should be the number of females of reproductive age, so this indicator should be corrected.
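The corrected arithmetic is worth spelling out. The deaths and births figures are the ones cited above; the 307-million population figure is my own rough assumption, used only to illustrate how the wrong denominator produces the implausibly low number.

```python
maternal_deaths = 483        # yearly U.S. maternal deaths, cited above
live_births = 4_290_000      # yearly U.S. live births, cited above
persons = 307_000_000        # rough U.S. population (my assumption)

# Wrong denominator: deaths per 100,000 persons -- implausibly low
wrong = maternal_deaths / persons * 100_000
print(round(wrong, 2))       # 0.16 -- the 0.2-ish kind of figure reported

# Correct denominator: deaths per 100,000 live births
right = maternal_deaths / live_births * 100_000
print(round(right, 2))       # 11.26

print(round(right / 0.2, 1))  # 56.3 -- the more-than-55-fold difference
```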
In many regards this is very capable software. One of its apparent advantages is that queries can be entered in natural language. This is not entirely true, because sometimes the software pulls a blank for fairly simple natural language queries. For example, the query "how much more expensive is Honolulu than Phoenix" makes the software reply that "Wolfram|Alpha isn't sure how to compute an answer from your input". True, it offers three choices: the distance between the two cities, information about a city, and information about the Phoenix constellation in astronomy, but none of them is what was meant by the query.
However, the query "moving from Phoenix to Honolulu" does provide a very good answer, and gives more information in more formats than was asked for (but likely thought of) about how much higher the cost of living is in Honolulu than in Phoenix, broken down into six different categories, and giving the bottom line: Honolulu is 66% more expensive than Phoenix.
The good answer is not given even if the query includes the word "live", which should be a good hint, i.e., "how much more expensive is it to live in Honolulu than in Phoenix". The simple query "living Phoenix Honolulu" still does not return the good answer. However, "living costs Phoenix Honolulu" does provide the good answer above, and so does "moving Phoenix to Honolulu" (the preposition "from" can be omitted, but "to" must be included, which is understandable but unusual, as it is typically a stop word in most search systems).
To its credit, Wolfram|Alpha warns you not to use lengthy sentences as queries, but users may not read that help information, and the optimal query construction remains enigmatic. The term "female literacy" returned the message that Wolfram|Alpha isn't sure how to compute an answer. After some trials I figured out that "female literacy rate" made the software sure how to compute it, and that its preferred term is literacy fraction. The query "female literacy middle east", however, could also be computed without using the word rate or fraction. Once again, spending a little time with the examples section pays back handsomely: the sample searches give ideas. Nevertheless, it would very much help query formulation if the primary concept terms and their assigned subcategories were browsable when only the can't-compute message comes up.
On the positive side, the software often recognizes misspelled words (such as Phoenix for Phonix, although not Nobel Peace Prize winner Fredrik Bajer when searched as Fredrick Bajer). It recognizes spellings that changed only recently, such as that of the former international airport of Bangkok, from Don Muang to Don Mueang, and correctly uses its new IATA code, DMK, assigning its earlier code, BKK, to the new airport.
It stems search words well; for example, the search Nobel Hungarian finds the nine Nobel Prize winners who were born in Hungary and shows a nice timeline. It smartly assumes that the search term Paris alone means the French capital, also displays the second most likely alternative (because of the movie), Paris, Texas, and in a pull-down menu shows nine other city/state alternatives. The display style of graphs and charts is very well chosen, as illustrated by the microcharts of the past five years' weather indicators for the month of February, from average temperature to wind speed, as well as 70 years of historical temperature ranges.
It is disappointing that if no first name is entered, there are often no hits reported. It is enough to enter Belmondo, but Delon retrieves nothing unless Alain Delon is entered, and the information about both icons is poor. European actresses fare even worse: Sophia Loren, Liv Ullmann, Annie Girardot, Simone Signoret and Juliette Binoche have barely more information about them than Roseanne Barr.
The software smartly clusters and sorts results by reasonable criteria. The single term earthquake finds the earthquakes entry; shows a map of the recent ones of magnitude higher than 5 (a reasonable limit considering the frequency, magnitude and consequences of earthquakes); displays a timeline and a list of the earthquakes with key data; and allows changes to the filter parameters right in the output phase. The default parameters are smartly set according to the location (if specified), so "earthquakes Japan" shows the data for quakes of magnitude larger than 4 (probably considering the extreme population density and the possible damage of a magnitude > 4 quake), and offers the option to show 30 years of history. This is also one of the best examples for illustrating the sophisticated but still intuitive filtering options of Wolfram|Alpha.
One of the best output features is that the data of two persons, countries, products or brands can be displayed side by side for quick comparison. As there is enough white space for at least a third country, this feature could be made even better. The displayed data pages can be converted to a single PDF file with the click of a button. This is very useful, as the charts and tables are not always sharp enough for aging eyes, but the PDF printouts have much better contrast. In some cases only the first screen was converted into PDF format, even if there were four or five screens of information for an item. The software warns you that this utility is experimental.
The software is excellent at reporting data in different units and scales of measure and/or converting them on the fly. In many of the print ready reference sources this has been a problem, because reporting height, length, weight, temperature, etc. in both feet and centimeters, miles and kilometers, pounds and kilograms, gallons and liters, or Celsius and Fahrenheit would have increased the size and cost of the publication. In the digital format this would not be a problem, but many digital ready reference sources were created from the manuscripts prepared for the print version and included either the English measures or the metric measures. It could be relatively easy to change this by automatically calculating and adding the other measures (if they are well tagged internally), but this may have consequences for the layout of the articles in the digital version.
There was no such concern for Wolfram|Alpha, as its software is based on the very sophisticated Mathematica software, which has thousands of built-in conversion factors for all types and units of measure, and practically all measures could be converted when the data were imported into Wolfram|Alpha. Users can easily switch to and from the metric system with the click of a button, although in a very few cases this button was not displayed. It would be useful to allow users to set their default to metric or Imperial units to their liking.
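As a rough illustration of how table-driven conversion factors make such on-the-fly switching possible (this is my own minimal sketch, not Wolfram|Alpha's or Mathematica's actual mechanism; the unit names and factors are the standard ones):

```python
# Each unit maps to a base unit (meter for length, kilogram for mass)
# via a factor; any conversion goes through the base unit.
FACTORS = {
    "m": 1.0, "km": 1000.0, "mile": 1609.344, "ft": 0.3048,  # length
    "kg": 1.0, "lb": 0.45359237,                             # mass
}

def convert(value, src, dst):
    """Convert value from unit src to unit dst via the base unit.
    (A real system would also check that src and dst share a dimension.)"""
    return value * FACTORS[src] / FACTORS[dst]

print(round(convert(100, "km", "mile"), 2))  # 62.14
print(round(convert(10, "lb", "kg"), 3))     # 4.536
```

With thousands of such factors built in, converting every imported measure at load time, or at display time, becomes a mechanical step rather than an editorial burden.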
Wolfram|Alpha is a very interesting ready reference source, and there is no beggary in the answers that can be reckon'd. On the contrary, there is revelry in the answers if the key facts can be summed up compactly.
That's why good-quality abstracts have been appreciated, and why senseless ones are depreciated, such as those produced by Google Scholar: 307,000 records with the same "abstract" pondering "why this message is appearing", and 500,000 "abstracts" assuring the user that "the visual presentation will be degraded", both supposedly because of your browser.
Don't worry, it is not your browser's problem. The problem is with Google Scholar's crawlers, which triggered these error messages and then gathered them as "abstracts" from the Web sites of the most respected scholarly journals. Their publishers gave the key to their entire digital archives, and their precious metadata, to Google Scholar's developers. If you go to the publishers' sites you will find the real metadata, including the real abstracts, for free.
Of course, we would prefer to see the real abstracts directly in the result list of Google Scholar. Unfortunately, its designers ignored the existing, well-identified and fine abstracts, and sent their undereducated crawlers to figure out what the abstract is, who the authors are, and what the publication year is. This was not exactly smart, and the very large-scale consequences are obvious to realistic searchers.
If it is some consolation, the regular Google engine claims to have 9,900,000 hits for the former "abstract" and 10,200,000 for the latter. If Google and the other search engines come up with such "abstracts" in such numbers, it is not good, but Google Scholar's developers were offered the option to search the well-identified metadata set to identify the author, title, journal name, publication year and other data elements. They obviously did not. Instead, grossly undereducated crawlers were sent out, and they created journal titles and author names from menu options (like Feedback and P Login) and street addresses, and publication years from page numbers and parts of phone and fax numbers. For a good dozen screenshots illustrating how Google Scholar brutalizes the metadata-fetching process, see my most current review of Google Scholar in the 2010/1 issue of Online Information Review or in its pre-print version.
I did these odd "forensic" searches after I found many records in the Google Scholar result lists, especially from Science, BMJ and JAMA, with strange abstracts for my regular searches, and then an even higher prevalence of these oddities when I wanted to see how many records have abstracts in Google Scholar for items published by Wiley Interscience in the first month of 2010. Google Scholar reported 10,100 hits. Then I searched for a unique part of the error message parading as an abstract, and found the same number of hits. Luckily, on the native site the abstracts come as you would expect.
There may be a few million more similarly nonsensical "abstracts" in the generic search engines and Google Scholar, so this is just a taster. It may be palliative to remember that these hit counts are not estimated by a computational knowledge engine, but simply bluffed, so the real numbers are probably lower.
It is unlikely that Wolfram|Alpha would consider licensing Google Scholar results and using its very inflated hit counts and citedness counts. Wolfram|Alpha really knows its math, and then some (except as I noted earlier). If it does not know the answer, it offers to send the query to Google, Bing and Yahoo. This offers a powerful fusion between the familiar search engines and the novel features of Wolfram|Alpha. Go and enjoy them.