You and i




One terabyte is much larger than 20 volumes of an encyclopaedia. I've never heard that comparison before, and it would be a huge overestimate. Some simple calculations show why.

One page of printed text is 5,280 bytes (66 lines, 80 characters per line). Assuming 900 two-sided pages per volume, that's 1,800 sides; 1,800 * 5,280 = 9,504,000 bytes (about 9.5 MB per volume). If we then take 26 volumes (one per letter of the English alphabet), we'd get about 235 MB.

Even if we were to double the size of each volume to 1,800 pages (or assume 10 KB of text per page instead of 5 KB), we'd still have less than 1 GB. In fact, I think 1 GB is a good upper estimate for the text of an entire encyclopaedia. (Images would greatly increase the size, but images aren't likely to help Watson do the job it needs to do.) This is why Encyclopaedia Britannica or World Book Encyclopaedia could fit on a CD-ROM (650 MB) with limited multimedia; a DVD version contains all the printed matter plus vast amounts of multimedia content - all in 4.7 GB.

So, assuming 1 GB per full set of a printed encyclopaedia, 1 TB would hold 1,024 years' worth of encyclopaedia sets - or every encyclopaedia set ever printed by Britannica and World Book combined (with room to spare for all the text of classic works).
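The back-of-the-envelope arithmetic above can be checked with a few lines of Python (all figures are the assumptions stated in the comment, not measured values):

```python
# Rough reproduction of the estimate above. Every constant is an
# assumption from the comment, not a measured value.

BYTES_PER_SIDE = 66 * 80          # 66 lines x 80 characters = 5,280 bytes
SIDES_PER_VOLUME = 900 * 2        # 900 two-sided pages per volume
VOLUMES = 26                      # one volume per letter of the alphabet

volume_bytes = SIDES_PER_VOLUME * BYTES_PER_SIDE
set_bytes = VOLUMES * volume_bytes

print(f"per volume: {volume_bytes / 1e6:.1f} MB")         # ~9.5 MB
print(f"per 26-volume set: {set_bytes / 2**20:.0f} MiB")  # ~236 MiB

# How many such text-only sets fit in one terabyte?
print(f"sets per TB: {2**40 // set_bytes}")               # thousands
```

The "about 235 MB" figure in the comment matches the result when counted in binary megabytes (mebibytes).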

When we announced the AS/400 in 1988, the largest model, the B60, supported up to 27 gigabytes of storage. Since many of our customers still thought in terms of MEGAbytes, I think we explained that 27 GB represented all the data stored in your public library.

(Just because I remember something, it isn't necessarily true.)

Your basic assumption of 66 lines by 80 characters per page of encyclopedia data is probably wrong. That is about right for a typewritten page, but encyclopedias use much smaller fonts and multiple columns, so each printed page has a much greater data density.
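The density objection is easy to quantify. The multipliers below are illustrative assumptions, not measurements; the point is that even generous density factors leave a text-only set in the low gigabytes:

```python
# Sensitivity check on the earlier estimate: how much does a full
# 26-volume set grow if each printed page holds several times the
# text of a typewritten page? The density factors are hypothetical.

TYPEWRITTEN_SIDE = 66 * 80        # 5,280 bytes per typewritten side
SIDES_PER_SET = 26 * 1800         # 26 volumes x 1,800 sides

for density in (1, 2, 3, 4):
    set_bytes = SIDES_PER_SET * TYPEWRITTEN_SIDE * density
    print(f"{density}x density: {set_bytes / 1e9:.2f} GB per set")
# Even at 4x the typewritten density, a full set is under 1 GB.
```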

Even my Windows counterparts think this Watson story is cool. The big questions are how many processors and how much memory?

@C. Speare & @dale -- It's a rough approximation based on knowledge I acquired years back, but I did get quick confirmation from a couple of sources. Anyway, the point is that the #ibmwatson storage space is, in some respects, much smaller than you'd expect, considering that it has to compete against general-knowledge (trivia) experts and has to store not just the knowledge, but representations of the knowledge in ways that can be acted upon by the DeepQA algorithms. Fascinating stuff. Now, I do wonder whether photographic and other graphic information is accounted for in the estimates for encyclopedias, but I won't stress the details.

@John T - I just got more of the details yesterday. #ibmwatson has 2,880 cores and 16 terabytes of memory (across 90 connected POWER7 750 servers). That's big!
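For scale, those quoted figures work out to a per-server configuration as follows (simple division of the numbers in the comment):

```python
# Per-server breakdown of the quoted Watson cluster specs:
# 2,880 cores and 16 TB of memory spread across 90 servers.

cores, memory_tb, servers = 2880, 16, 90

print(cores // servers, "cores per server")              # 32
print(memory_tb * 1024 / servers, "GiB per server")      # ~182 GiB
```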

Not to beat a dead horse, but I'm getting back to the comments made by @C. Speare and @Dale - you have greatly underestimated the amount of data contained in 2 terabytes. Here are two quotes from the reference you mentioned:

"100 Megabytes: Yard of books on a shelf; two encyclopedia volumes"
- and -
"2 Terabytes: Academic research ligrary [sic]".

Don't get me wrong, I still think the Watson effort is a huge step forward in data analytics.

@Kevin, OK, you folks are convincing me that my old "rule of thumb" was probably inaccurate.

One other piece of news: Initially we were told that the storage on Watson was 4 TB, but as we get closer to the event we're getting more specific information, and for some reason the details don't match what we were initially told. For the most recent storage details, you might want to see Tony Pearson's article (if you can; it's on developerWorks):

In it, he tells us that the total storage for the Watson cluster is 21.6 TB (so how did we initially get told it was 4 TB?) but that "The actual size of the data (analyzed and indexed text, knowledge bases, etc.) used for candidate answer generation and evidence evaluation is under 1TB."

So, we're all learning as the televised match gets closer. Thanks for the discussion, everyone.

Hey Steve, thanks for posting this, quite interesting. One question: you mention there could be a perceived advantage if Watson had access to the Internet and search engines. Would that actually have been an advantage in the Jeopardy scenario? Your tax example seems to imply that it would not... i.e., that a relatively small but highly curated dataset is often preferable to a much larger but unstructured one, where you need to filter what is pertinent from what is not.

@Ben, Thanks for the comment. I don't believe Watson would have had an advantage if it had been able to access the internet. But when people first started hearing about a computer that could play "Jeopardy!" they almost exclusively said something like "Well, sure, it can search the internet." I believe that including the ability to connect to the internet would have been detrimental in at least a couple of ways, given the goal of Watson on the show. First, as I mentioned, there is a real possibility of getting incorrect information. However, given a weighting method to trust information from "known good" sources, I think that could have been resolved. The larger issue, I think, would have been how to consume all the data it would have had access to. A great deal of invention went into how to organize the data in Watson's storage so it could be operated on efficiently. Given that "Jeopardy!" has such strict time constraints, working with a known set of data was almost certainly better than sifting through ad hoc information.

They have an interesting storage requirement. Given that every enterprise has unique storage needs, IBM will need a workable solution that addresses not just storage requirements but also security and reliability.
