Are you a database (DB) professional? If so, you can probably skip the blog this time.
Are you an IT professional who doesn’t get into the details of DBs? Then this blog is probably for you.
Are you someone other than an IT professional? Then I think you might have the wrong blog. Or you’re a member of my family. In either of those cases, thanks for the thought, but you should probably go read something else.
OK, I think I’ve reduced the audience now to people who might care about today’s topic. See, if you are a non-DB IT professional, you know that DBs are important, and you probably wish you knew a little more about one of the hot DB topics being discussed in IT these days.
That’s kind of where I was a while back. I freely admit to most anyone that, while I have worked on many (many!) parts of the IBM i operating system, I have never been a developer in the DB2 area. So I regularly need to ask some of the smart people on the DB2 team to give me some semi-advanced lessons on databases. In particular, I recently needed to hear about why so much of the industry is talking about “columnar” databases. So, I went to chat with Mark Anderson. Mark, as many of you will know, is the Distinguished Engineer who is the Chief Architect of DB2 on i. Today, I thought I’d share some of what I learned.
Many database conversations center on performance, and that’s true with our topic today. People are talking about the performance of the newer column-oriented DBs, in comparison to the traditional row-oriented DBs.
As their names imply, there is a difference between how the data is stored in each of these DB architectures. In row-oriented DBs, each piece of data in a row is put on disk very close to each other piece in that row. Typically, you can think of it as if a row is a continuous set of bytes all sitting on disk. What’s stored in a row? Well, if you remember from DB 101 (you did all take the intro DB course, right?) a row typically has all the information about some entity. For purposes of this discussion, let’s say a row contains information about a customer. Then all of the attributes of a single customer are stored together on disk. Each of those attributes is a column in the virtual table (or array) we’re storing. I think this will go better with an example.
On disk, you have a large number of pages storing all this data. Let’s say that each row in the DB is large enough to take up one page of storage. In this case, that’s one page for each customer. OK, so given the example above, in a row-based DB, all of Alice’s information is stored in one page. If your application deals with getting information about Alice, or updating information about Alice, then all you need to have in memory at one time is that single page.
Though I haven’t said it yet, you can probably imagine that in a column-based DB, all the data in a column is stored together. Again, let’s say in our example, a column takes a page. So, for example, all of the ZIP codes would be on one page. And all the values in “2013 Total Order” would be on another page.
(And hey, all you DB experts are supposed to have stopped reading, so I don’t want any smart comments about the design of the specific customer table above. I’m not a DB architect! This is a teaching example.)
OK, so now we can get to the point.
All databases (all computing, in fact) work fastest when the data they are working with resides in memory when the work is being done. Of course, data doesn’t normally sit in memory. It’s stored somewhere, and brought into memory when it’s called by the DB. So, if your database is row-oriented, it’s going to have the best performance if you use it in a way that requires you to operate on a row at time. In our example, that brings one page into memory. (This is just a simple conceptual example, remember. In reality, operating systems bring more than one page at a time into memory. More about that later.)
On the other hand, if your DB is row-based, but you need to get all of the data stored in one column in order to do some operation, then you will have to go access each of the rows, and that means spending time bringing in each row to grab a small piece of data. If you had a column-oriented DB, you could get all of that data more quickly, because it’s all stored together. From the example, all that “2013 Total Order” data is in one column.
But – and this is key – if your DB is column-oriented and you wanted all the data about Alice, you’re going to have to bring in a lot of columns (pages) to get all of that data.
So there is the crux of why there is so much discussion about columnar DBs these days.
Oh, you still don’t quite get it? Well then, let’s go a bit further.
You see, until recently, almost all major databases were row-based. And, for most traditional business processing, this makes perfect sense. Most DB workloads have been what DB experts called OLTP workloads – On-Line Transaction Processing. OLTP workloads are the backbone of most existing business use of DBs. And these applications typically perform best with row-based DBs, because those workloads tend to need many attributes (values stored in many columns) about a single entity. They work with rows. Row-based operations bring in the minimum number of pages from disk when using a row-based DB.
But there are other workloads that perform better if the data in columns can be handled efficiently. Online Analytical Processing (OLAP) workloads frequently want to gather data from columns. For example, if you wanted to know the total of all the values in the column labeled “2013 Total Order” and you had a column-based DB, you would get that by bringing one column into memory and totaling the values, whereas in a simplistic row-based DB, you would need to get data from each row.
Of course, it’s not as simple as all that. Row-based DBs, such as DB2 for i, have added methods to make tasks such as the simple “sum a column” example (or much more complex OLAP analytics) perform well. For example, DB2 for i has couple of constructs called Encoded Vector Indexes (EVIs) and Materialized Query Tables (MQTs) that can be used to great effect. And DB2 for i has ways for customers to block (read many rows at a time) instead of reading them one at a time. In addition to that user-defined method for improving performance, the storage management component of IBM i is smart. Very smart! It has to be; it’s implementing single-level storage. Anyway, because it is so smart, it recognizes that a user is reading rows and then brings data in memory even before it’s requested. This becomes important because many OLTP workloads require a sequential “walk” through the rows, so bringing blocks of rows gets them in memory ahead of when they are needed.
On the other hand, you can’t just add an index to a column-oriented table to get good OLTP performance. As Mark says “It’s a bit like putting Humpty Dumpty together again. The pieces of the row are scattered on many pages of storage. Even if the entire database is in memory, a lot more CPU will be expended pulling all the columns of a row together.”
So, to wrap up this little lesson, the key to deciding which kind of DB to use is to ask yourself what kind of workload is most critical for your DB to support. Though you might want to do both kinds of operations, if your core business needs OLTP, then a row-based DB, with decades of performance optimization, might be the best choice. If your company doesn’t really need fast OLTP, but needs OLAP to be fast, then a columnar DB might be best.
And if you need both, then you will want to look at the mix of techniques available on each. You can decide to go with one, or you might even decide that a combination of the two architectures will suit you best.
Whatever you decide, just make sure that the solution is stable, because businesses really do run on their data.
And that’s it, readers. Database 102. Now, when you read those articles about columnar DBs, you’ll understand why people are talking about them.
Until next time, keep learning.