← Back to team overview

maria-discuss team mailing list archive

Re: legacy InnoDB compressed tables using row_format=compressed


Hi Lukas,

On Thu, Jun 29, 2017 at 10:18 AM, Lukas Lehner <weblehner@xxxxxxxxx> wrote:

> Hey
> Quote from https://mariadb.com/kb/en/mariadb/compression/
> This approach differs significantly from legacy InnoDB compressed tables
> using row_format=compressed, where both uncompressed and compressed pages
> can be in the buffer pool.
> Any details why this is legacy? Should not be used? With a search engine I
> found a very old blog posting "InnoDB compression woes" from Percona.

I am the principal author of the ROW_FORMAT=COMPRESSED in InnoDB. Actually,
when I was hired for Innobase Oy in September 2003, my first task was to
reduce the disk space usage of InnoDB tables. The first step of that was
ROW_FORMAT=COMPACT, which was introduced in MySQL 5.0.3.

I started the development of ROW_FORMAT=COMPRESSED shortly after Oracle
Corporation acquired my employer Innobase Oy in October 2005. The code was
originally published as the InnoDB Plugin for MySQL 5.1, in 2007 or 2008.

I authored all the revised ROW_FORMATs of InnoDB. ROW_FORMAT=DYNAMIC is a
minor variant of ROW_FORMAT=COMPACT: for columns that are stored off-page,
no prefix is stored in the clustered index leaf page records.
ROW_FORMAT=REDUNDANT (the original format developed by Heikki Tuuri) and
ROW_FORMAT=COMPACT store a prefix of 768 bytes locally, for easier
implementation of column prefix indexes (which before MySQL 5.5 were
limited to 767 bytes). With ROW_FORMAT=DYNAMIC, some column prefix
information is written to the undo log, so that the purge of history can
access it from there.

row_format=compressed is attractive when you pay your DBaaS per GB (storage
> based usage). Any insides?

The high-level idea of keeping both uncompressed and compressed pages in
the buffer pool came from the founder of Innobase Oy, Heikki Tuuri. The
low-level design was mine. There are 3 kinds of pages in

   - Fully uncompressed pages (page allocation bitmap, change buffer
   bitmap, other allocation data structures). These use user-specified the
   physical page size.
   - Compressed B-tree pages. Each page uncompresses in memory to a page
   that is equivalent to ROW_FORMAT=DYNAMIC. The compressed page consists of a
   compressed zlib stream, some uncompressed fields (for facilitating in-place
   updates without recompressing the page), and an uncompressed "page
   modification log" (to avoid updates without recompressing the page). This
   format was my invention.
   - Compressed BLOB pages. The contents of off-page columns is compressed
   as a single zlib stream, and there are a little fewer header fields than on
   uncompressed BLOB pages. Either way, BLOBs are stored as singly-linked
   lists of pages in InnoDB.

I implemented a rudimentary buffer pool interface and also the binary buddy
allocator (buf0buddy.cc) that uses the InnoDB buffer pool for allocating
smaller pages for the compressed page frames.
The buffer pool eviction mechanism (unzip_LRU) was jointly designed by
Heikki Tuuri and Inaam Rana. Oracle wanted them to file a patent for it,
and it was granted: United States Patent 8,375,178 (Memory page eviction
based on present system operation).

Back in 2005, there were no SSDs, and Heikki was worried about
fragmentation. So, a scheme like Stacker or Doublespace or NTFS compression
was out of the question. My feeling is that all these systems essentially
work like the PAGE_COMPRESSED=YES in MariaDB 10.1: allocating variable-size
compressed blocks for fixed-size uncompressed blocks. The property of
ROW_FORMAT=COMPRESSED that the compressed block size is fixed and specified
by the user was kind of a design constraint.

I think that is fair to say that ROW_FORMAT=COMPRESSED is complicating the
buffer pool management quite a bit. But on the other hand,
PAGE_COMPRESSED=YES depends on file system specifics and is causing
fragmentation there. Even on SSD, this could slow down some operations,
such as DROP TABLE.

BLOB storage is more efficient with ROW_FORMAT=COMPRESSED, because each
off-page column is compressed as a single zlib stream. With
PAGE_COMPRESSED=YES, each BLOB page snippet is compressed individually,
which could be much more wasteful in the worst case. Artificial example:
REPEAT(almost innodb_page_size uncompressible bytes',N) should compress to
1 or 2 pages with ROW_FORMAT=COMPRESSED (storing the string and then
compressing each repetition), but N pages in PAGE_COMPRESSED=YES, because
it would divide the BLOB into parts smaller than innodb_page_size and then
attempt to compress these pages individually.

Conclusion: ROW_FORMAT=COMPRESSED was a monolithic piece of development
that was not properly performance-tested before it was completed.
PAGE_COMPRESSED=YES was much simpler to implement and is easier to
understand. It is also easier to understand the buffer pool metrics when

Maybe some day we can improve the ROW_FORMAT=COMPRESSED. But I think that
LSM-trees such as MyRocks are difficult to beat when it comes to disk space
usage. This of course depends on the workload.

Best regards,

Marko Mäkelä, Lead Developer InnoDB
MariaDB Corporation