IBM Watson’s Storage Requirements
In my last blog entry, I briefly mentioned a project called “Watson.” Watson is a set of IBM Research technologies, running on POWER hardware, which will compete on the “Jeopardy!” game show against two of its most celebrated champions. Since that entry, a number of good articles have been written about it, including an excellent overview in Jon & Susan’s blog. If you’re not familiar with Watson’s basics, you should probably start there, and then come back here.
Don’t worry. I’ll wait.
OK, now that you have the overview, I thought I’d point out one of the technical details of the project, and then discuss implications of such technology on computing in general.
Watson’s cluster of POWER 750s uses two 2-terabyte (TB) I/O nodes, for a total data repository of 4 TB. How impressive does that sound to you? In one sense, that amount of data can sound rather large. Many businesses do not have 4 TB of data. On the other hand, is it enough to compete in a general-knowledge quiz show? Well, a rule of thumb I once learned is that 1 TB holds about as much text as 20 volumes of a printed encyclopedia. Now, the encyclopedia I bought more than 20 years ago had 24 volumes, and it added a new volume every year to cover that year’s events. So, by that analogy, 4 TB (about 80 volumes) might be enough room to have a full encyclopedia’s worth of general knowledge at your fingertips (or, at the ends of your disk arms, for Watson) as well as another 56 years’ worth of more specific information.
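For anyone who wants to check the encyclopedia arithmetic, here is a minimal sketch. The function name and its parameters are made up for illustration, and the 20-volumes-per-terabyte figure is only the rule of thumb quoted above, not a measurement.

```python
def yearly_volumes_that_fit(capacity_tb, volumes_per_tb, base_volumes):
    """Return how many yearly supplement volumes fit once the base set is stored.

    All three inputs are assumptions taken from the analogy in the text.
    """
    total_volumes = capacity_tb * volumes_per_tb
    return total_volumes - base_volumes


# 4 TB at 20 volumes per TB, minus the 24-volume base encyclopedia
print(yearly_volumes_that_fit(4, 20, 24))  # 56

# The same calculation with a stricter rule of 10 volumes per TB
print(yearly_volumes_that_fit(4, 10, 24))  # 16
```

The answer is very sensitive to the rule of thumb: halving the volumes-per-terabyte figure drops the leftover room from 56 yearly volumes to 16.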
Somehow, while that sounds like a lot of data, it might still be less data than I think Ken Jennings and Brad Rutter have in their heads. They are veritable founts of knowledge.
And yet, if you listen to the Why Data Matters video on the IBM Watson site, 4 TB is still minuscule compared to the amount of data being produced. The estimate given in that video is that 15 petabytes (PB) of new data are generated throughout the world each day. That’s right, the new data each day is nearly 4,000 times larger than what could be stored in Watson for this challenge. But, for the challenge, the data is specific to a purpose, and that purpose is quite impressive: competing with human beings in knowledge-based answering.
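The comparison between daily data growth and Watson’s repository is a one-line division. This sketch assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes); the constant and function names are illustrative only.

```python
TB = 10**12  # bytes in a terabyte, decimal convention (an assumption)
PB = 10**15  # bytes in a petabyte, decimal convention (an assumption)


def daily_to_store_ratio(daily_new_bytes, store_bytes):
    """Return how many times the store would fit into one day's new data."""
    return daily_new_bytes / store_bytes


ratio = daily_to_store_ratio(15 * PB, 4 * TB)
print(ratio)  # 3750.0
```

With binary units (factors of 1,024) the ratio comes out slightly higher, but it stays just under 4,000 either way.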
By the way, one implication of this information is that Watson is only using data that’s stored on its disks. Watson is not connected to the Internet, or to any network other than the clustering network that connects its 750s and its storage. So, don’t assume that Watson is getting some advantage, such as having access to search engines.
This is going to be very important as the DeepQA technology gets applied to real-world applications. Suppose you wanted to ask a computer about a fairly complex topic (taxes, for example) and you wanted to get a reasonably reliable answer. The federal tax code is pretty complex, but it could be contained on a single set of servers and storage, and you would want your answers to be consistent with the information in those servers. You most certainly would not want to have the QA system pulling data in from blogs on the Internet, written by people who are misinformed about the reality of tax laws.
For many of us in this industry, Watson represents another step toward the kinds of computers we saw on Star Trek. Computers that can understand our meaning and turn a question into a query, even if we ask it in a strange way, will make computers more usable in ways we have only dreamed of before. Computers that have access to all of the pertinent information and know how to search it will provide extensions to our knowledge which approach artificial intelligence. Computers that can give us not only an answer but also an estimate of confidence in that answer will help us make decisions in more informed ways.