In my last blog entry, I briefly mentioned a project called “Watson.” Watson is a set of IBM Research technologies, running on POWER systems, that will compete on the “Jeopardy!” game show against two of its most celebrated champions. Since that post, a number of good articles have been written about it, including an excellent overview in Jon & Susan’s blog. If you're not familiar with Watson’s basics, you should probably start there, and then come back here.
Don’t worry. I’ll wait.
OK, now that you have the overview, I thought I’d point out one of the technical details of the project and then discuss the implications of such technology for computing in general.
Watson’s cluster of POWER 750s uses two 2 terabyte (TB) I/O nodes, for a total data repository of 4 TB. How impressive does that sound to you? In one sense, that amount of data can sound rather large; many businesses do not have 4 TB of data. On the other hand, is that enough to compete in a general-knowledge quiz show? Well, a rule of thumb I once learned is that 1 terabyte holds about as much text as 20 volumes of a printed encyclopedia. Now, the encyclopedia I bought more than 20 years ago had 24 volumes, and it added a new volume every year to cover the events of that year. So, by that analogy, 4 TB might be enough room to have a full encyclopedia’s worth of general knowledge at your fingertips (or, at the ends of your disk arms, for Watson) as well as roughly another 56 years’ worth of more specific, year-by-year information.
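For what it’s worth, here is that back-of-envelope arithmetic spelled out as a quick Python sketch. Every figure in it is just the rule of thumb and the volume counts mentioned above, nothing more authoritative:

    # Rough sizing of Watson's 4 TB repository using the rule of thumb above
    # (1 TB ~= 20 printed encyclopedia volumes). Estimates only.
    volumes_per_tb = 20           # the rule of thumb
    repository_tb = 4             # two 2 TB I/O nodes
    base_set_volumes = 24         # the full printed set described above

    total_volumes = volumes_per_tb * repository_tb            # 80 volumes of text
    extra_annual_volumes = total_volumes - base_set_volumes   # ~56 yearly updates

    print(f"Room for about {total_volumes} volumes: a {base_set_volumes}-volume set "
          f"plus roughly {extra_annual_volumes} annual volumes")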
Still, while that sounds like a lot of data, it might be less than what I suspect Ken Jennings and Brad Rutter have in their heads. They are veritable founts of knowledge.
And yet, if you listen to this Why Data Matters video on the IBM Watson site, 4 TB is still minuscule compared to the amount of data being produced. The estimate given in that video is that 15 petabytes of new data are generated throughout the world each day. That’s right: each day’s new data is nearly 4,000 times more than could be stored in Watson for this challenge. But, for the challenge, the data is specific to a purpose, and that purpose is quite impressive – competing with human beings at knowledge-based question answering.
By the way, one implication of all this is that Watson uses only the data stored on its own disks. Watson is not connected to the Internet, or to any network outside of the clustering network that connects its 750s and its storage. So, don’t assume that Watson is getting some advantage, such as access to search engines.
This is going to be very important as the DeepQA technology gets applied to real-world applications. Suppose you wanted to ask a computer about a fairly complex topic – taxes, for example – and you wanted a reasonably reliable answer. The federal tax code is pretty complex, but it could be contained on a single set of servers and storage, and you would want your answers to be consistent with the information on those servers. You most certainly would not want the QA system pulling in data from blogs on the Internet, written by people who are misinformed about the reality of tax laws.
For many of us in this industry, Watson represents another step toward the kinds of computers we saw on Star Trek. Computers that can understand our meaning, even if we ask a question in a strange way, and turn that question into a query will make computing more usable in ways we have only dreamed of before. Computers that have access to all of the pertinent information, and know how to search it, will extend our knowledge in ways that approach artificial intelligence. Computers that can not only give us an answer but also give an estimate of confidence in that answer will help us make decisions in more informed ways.
#ibmwatson
One terabyte is much larger than 20 volumes of an encyclopaedia. In fact, I've never heard that before and it would be a huge overestimation. Some simple calculations will show this point.
One page of printed text is 5,280 bytes (66 lines, 80 characters per line). Assuming that we have 900 pages per volume, that's 1,800 sides; 1,800 * 5,280 = 9,504,000 (or about 9 MB per volume). If we then take 26 volumes (one per letter of the English alphabet), we'd get about 235 MB.
Even if we were to double the size of each volume to 1,800 pages (or assume that it was 10 KB of text per page instead of 5 KB), we'd still have less than 1 GB. In fact, I think 1 GB for an encyclopaedia (text only) is a good upper estimate. (Images would greatly increase the size, but images aren't likely to help Watson do the job it needs to do.) This is why Encyclopaedia Britannica or World Book Encyclopaedia could fit on a CD-ROM (650 MB) with limited multimedia; a DVD version contains all the printed matter plus vast amounts of multimedia content - all in 4.7 GB.
So, assuming 1 GB per full set of a printed encyclopaedia, 1 TB would hold 1,024 years' worth of encyclopaedia sets - or every encyclopaedia set ever printed by Britannica and World Book combined (with room to spare for all the text of classic works).
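(For anyone who wants to check the arithmetic, here is the same back-of-envelope calculation as a small Python sketch; every figure in it is just an assumption stated above, not measured encyclopaedia data.)

    # Rough size of a text-only printed encyclopaedia, using the assumptions above.
    bytes_per_side = 66 * 80          # 5,280 bytes of plain text per printed side
    sides_per_volume = 1800           # 900 leaves, printed on both sides
    volumes_per_set = 26              # one volume per letter of the alphabet

    bytes_per_volume = bytes_per_side * sides_per_volume   # 9,504,000 (~9 MB)
    bytes_per_set = bytes_per_volume * volumes_per_set     # ~247 MB of raw text

    # Generous upper estimate: call a full text-only set 1 GiB.
    sets_per_tib = (1024 ** 4) // (1024 ** 3)               # 1,024 sets per TiB

    print(f"{bytes_per_volume:,} bytes per volume")
    print(f"{bytes_per_set:,} bytes per 26-volume set")
    print(f"{sets_per_tib:,} one-GiB sets fit in 1 TiB")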
Posted by: C. Speare | 01/11/2011 at 04:09 PM
When we announced the AS/400 in 1988, the largest Model B60 supported up to 27 gigabytes of storage. Since many of our customers still thought in terms of MEGAbytes, I think we explained that 27GB represented all the data stored in your public library.
(Just because I remember something, it isn't necessarily true.)
Posted by: Dan R. | 01/12/2011 at 05:34 PM
Your basic assumption of calculating a page of encyclopedia data at 66 lines by 80 characters is probably wrong. That is probably correct for a typewritten page. But encyclopedias use much smaller fonts and multiple columns and so would have a much greater data density.
Posted by: dale janus | 01/13/2011 at 09:04 AM
Even my Windows counterparts think this Watson story is cool. The big questions are how many processors and how much memory?
Posted by: John Techmeier | 01/13/2011 at 03:39 PM
@C. Speare & @dale -- It's a rough approximation based on knowledge I acquired years back, but I did get quick confirmation from a couple of sources -- one being http://www.jamesshuggins.com/h/tek1/how_big.htm . Anyway, the point is that the #ibmwatson storage space is, in some respects, much smaller than you'd expect, when considering it has to compete against general knowledge (trivia) experts, and has to store not just the knowledge, but representations of the knowledge in ways that can be acted upon by the DeepQA algorithms. Fascinating stuff. Now, I do wonder if photographic, and other graphic, information is accounted for in the estimates for encyclopedias, but I won't stress the details.
Posted by: Steve Will | 02/04/2011 at 12:47 PM
@John T - I just got more of the details yesterday. #ibmwatson has 2,880 cores and 16 terabytes of memory (across 90 connected POWER7 750 servers). That's big!
Posted by: Steve Will | 02/04/2011 at 12:50 PM
Not to beat a dead horse, but I'm getting back to the comments made by @C. Speare and @Dale - you have greatly underestimated the amount of data contained in 2 terabytes. Here are two quotes from the reference you mentioned:
"100 Megabytes: Yard of books on a shelf; two encyclopedia volumes"
- and -
"2 Terabytes: Academic research ligrary [sic]".
Don't get me wrong, I still think the Watson effort is a huge step forward in data analytics.
Posted by: Kevin P | 02/09/2011 at 06:48 PM
@Kevin, OK, you folks are convincing me that my old "rule of thumb" was probably inaccurate.
One other piece of news: Initially we were told that the storage on Watson was 4 TB, but as we get closer to the event we're getting more specific information. And for some reason, the details don't match what we were initially told. For the most recent storage details, you might want to see Tony Pearson's article (if you can, it's on developerWorks): http://ibm.co/hfqiXl
In it, he tells us that the total storage for the Watson cluster is 21.6 TB (so how did we initially get told it was 4 TB?) but that "The actual size of the data (analyzed and indexed text, knowledge bases, etc.) used for candidate answer generation and evidence evaluation is under 1TB."
So, we're all learning as the televised match gets closer. Thanks for the discussion, everyone.
Posted by: Steve Will | 02/10/2011 at 09:10 AM
Hey Steve, thanks for posting this - quite interesting. One question: you mention there could be a perceived advantage if Watson had access to the Internet and search engines. Would that actually have been an advantage in the Jeopardy scenario? Your tax example seems to imply that it would not... i.e., that a relatively smaller but more highly curated dataset is often preferable to a much larger but unstructured one where you need to filter through what is pertinent and what is not.
Posted by: ben allen | 02/17/2011 at 11:59 AM
@Ben, Thanks for the comment. I don't believe Watson would have had an advantage if it had been able to access the Internet. But when people first started hearing about a computer that could play "Jeopardy!", they almost exclusively said something like "Well, sure, it can search the Internet." I believe that including the ability to connect to the Internet would have been detrimental in at least a couple of ways, given the goal of Watson on the show. First, as I mentioned, there is a real possibility of getting incorrect information. However, given a weighting method to trust information from "known good" sources, I think that could have been resolved. The larger issue, I think, would have been how to consume all the data it would have had access to. A great deal of invention went into how to organize the data in Watson's storage so it could be operated on efficiently. Given that "Jeopardy!" has such strict time constraints, working with a known set of data was almost certainly better than sifting through ad hoc information.
Posted by: Steve Will | 02/17/2011 at 02:39 PM