I quite often hear people being disappointed by how much space PostgreSQL wastes for each row it stores. I’ll try to show here some tricks to minimize this effect and allow more efficient storage.
What overhead?
If you don’t have tables with more than a few hundred million rows, it’s likely that you never had an issue with this.
For each row stored, PostgreSQL will store additional data for its own needs. This is
documented here.
The documentation says:
Field | Type | Length | Description |
---|---|---|---|
t_xmin | TransactionId | 4 bytes | insert XID stamp |
t_xmax | TransactionId | 4 bytes | delete XID stamp |
t_cid | CommandId | 4 bytes | insert and/or delete CID stamp (overlays with t_xvac) |
t_xvac | TransactionId | 4 bytes | XID for VACUUM operation moving a row version |
t_ctid | ItemPointerData | 6 bytes | current TID of this or newer row version |
t_infomask2 | uint16 | 2 bytes | number of attributes, plus various flag bits |
t_infomask | uint16 | 2 bytes | various flag bits |
t_hoff | uint8 | 1 byte | offset to user data |
Which is 23 bytes on most architectures (you have either t_cid or
t_xvac).
You can see some of these fields as hidden columns present on any table by
adding them in the SELECT part of a query, or by looking for negative attribute
numbers in the pg_attribute catalog:
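For instance, with any table (here a hypothetical my_table), something like:

```sql
-- the hidden system columns can be selected explicitly
SELECT xmin, xmax, cmin, cmax, ctid
FROM my_table
LIMIT 1;

-- they show up as negative attribute numbers in pg_attribute
SELECT attname, attnum, atttypid::regtype
FROM pg_attribute
WHERE attrelid = 'my_table'::regclass
  AND attnum < 0
ORDER BY attnum;
```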
If you compare with the previous table, you can see that not all of these columns
are stored on disk. Obviously, PostgreSQL doesn’t store the table’s oid in
each row: it’s added afterwards, while constructing a tuple.
If you want more technical details, you should take a look at
htup_details.h, starting with the
HeapTupleHeaderData struct.
How costly is it?
As the overhead is fixed, it’ll become more and more negligible as the row
size grows. If you only store a single int column (4 bytes), each row will
need:

23B of header + 4B of data = 27B per row

So, it’s around 85% overhead, pretty horrible.
On the other hand, if you store 5 integer, 3 bigint and 2 text columns (let’s
say ~80B each on average), you’ll have:

23B of header + 5 * 4B + 3 * 8B + 2 * 80B = 227B per row

That’s “only” around 10% overhead.
So, how to minimize this overhead?
The idea is to store the same data in fewer records. How to do that? By
aggregating the data in arrays. The more values you put in a single array, the
less overhead you have. And if you aggregate enough data, you can benefit
from transparent compression thanks to the TOAST
mechanism.
Let’s try with a table containing a single integer column and 10M rows:
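A minimal setup could look like this (table and column names are mine, not necessarily the ones used for the original benchmark):

```sql
CREATE TABLE t_plain (val integer);

-- 10M rows, one integer per row
INSERT INTO t_plain
SELECT i
FROM generate_series(1, 10000000) i;
```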
The user data should need 10M * 4B, i.e. around 38MB, while this table will
consume 348MB. Inserting the data takes around 23 seconds.
NOTE: If you do the maths, you’ll find out that the overhead is slightly
more than 32B per row, not 23B. This is because each block also has some
overhead, plus NULL handling and alignment padding. If you want more information
on this, I recommend this presentation.
Let’s compare with aggregated versions of the same data:
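For instance, something along these lines, grouping consecutive values into arrays (again an illustrative query, not necessarily the original one):

```sql
CREATE TABLE t_agg_5 (vals integer[]);

-- same 10M integers, stored 5 per array
INSERT INTO t_agg_5
SELECT array_agg(i ORDER BY i)
FROM generate_series(1, 10000000) i
GROUP BY (i - 1) / 5;
```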
This will insert 5 elements per row. I’ve done the same test with 20, 100, 200
and 1000 elements per row. Results are below:
NOTE: The size for 1000 elements per row is a little higher than for the lower
values. This is because it’s the only one that is big enough to be TOAST-ed, but
not big enough to be compressed, so we see a little TOAST overhead here.
So far so good, we can see quite good improvements, both in size and in INSERT
time, even for very small arrays. Let’s see the impact on retrieving rows. I’ll
try to retrieve all the rows, then only one row with an index scan (for these
tests I’ve used EXPLAIN ANALYZE to minimize the time spent displaying the data in
psql):
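For the non-aggregated table, the queries could be as simple as this (illustrative, using the hypothetical t_plain table from above):

```sql
-- btree index for the single-row lookup
CREATE INDEX ON t_plain (val);

-- retrieve all the rows
EXPLAIN ANALYZE SELECT val FROM t_plain;

-- retrieve a single row with an index scan
EXPLAIN ANALYZE SELECT val FROM t_plain WHERE val = 42;
```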
To properly index this array, we need a GIN index. To get the values from the
aggregated data, we need to unnest() the arrays, and to be a little more
creative to get a single record:
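Here’s what that could look like on the hypothetical t_agg_5 table (a sketch: the point is to let the GIN index find the right array, then filter the unnested values):

```sql
-- GIN index on the array column
CREATE INDEX ON t_agg_5 USING gin (vals);

-- retrieve all the values
EXPLAIN ANALYZE SELECT unnest(vals) FROM t_agg_5;

-- retrieve a single value: the @> predicate uses the GIN index to find
-- the containing array, the outer filter keeps only the wanted element
EXPLAIN ANALYZE
SELECT val
FROM (
    SELECT unnest(vals) AS val
    FROM t_agg_5
    WHERE vals @> ARRAY[42]
) s
WHERE val = 42;
```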
Here’s the chart comparing index creation time and index size:
The GIN index is a little more than twice the size of the btree index, and if I
add the table size, the total size is almost the same as without aggregation.
That’s not a big issue since this example is naive; we’ll see later how to avoid
using a GIN index to keep the total size low. The index is also way slower to
build, meaning that INSERTs will also be slower.
And here’s the chart comparing the time to retrieve all the rows and a single
row:
Getting all the rows is probably not the most interesting example, but as soon as
the arrays contain enough elements it starts to be faster than with the original
table. We can see that getting only one element is much faster than with the
btree index, thanks to GIN efficiency. It’s not tested here, but since only btree
indexes are sorted, if you need to retrieve a lot of sorted data, using a GIN
index will require an extra sort, which will be way slower than a btree index scan.
A more realistic example
Now that we’ve seen the basics, let’s see how to go further: aggregating more
than one column and avoiding the use of too much disk space with a GIN index. For
this, I’ll present how PoWA stores its data.
For each datasource collected, two tables are used: the historic and
aggregated one, and the current one. These tables store data in a custom
type instead of plain columns. Let’s see the tables related to
pg_stat_statements:
The custom type contains basically all the counters present in pg_stat_statements,
plus the timestamp associated with the record:
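A simplified sketch of such a type (the real PoWA type contains all of the pg_stat_statements counters; only a few are shown here):

```sql
CREATE TYPE powa_statements_history_record AS (
    ts timestamp with time zone,
    calls bigint,
    total_time double precision,
    rows bigint,
    shared_blks_hit bigint,
    shared_blks_read bigint
    -- ... and so on for the remaining pg_stat_statements counters
);
```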
The current table stores the pg_stat_statements unique identifier (queryid,
dbid, userid), and a single record of counters:
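A sketch of what this table looks like (loosely based on the PoWA schema):

```sql
CREATE TABLE powa_statements_history_current (
    queryid bigint NOT NULL,
    dbid oid NOT NULL,
    userid oid NOT NULL,
    record powa_statements_history_record NOT NULL
);
```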
The aggregated table contains the same unique identifier, an array of records
and some special fields:
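And a sketch of the aggregated table, with the array of records and the extra fields described below (again loosely based on the PoWA schema):

```sql
CREATE TABLE powa_statements_history (
    queryid bigint NOT NULL,
    dbid oid NOT NULL,
    userid oid NOT NULL,
    coalesce_range tstzrange NOT NULL,
    records powa_statements_history_record[] NOT NULL,
    mins_in_range powa_statements_history_record NOT NULL,
    maxs_in_range powa_statements_history_record NOT NULL
);
```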
We also store the timestamp range (coalesce_range) covering all the aggregated
counters in a row, and the minimum and maximum values of each counter in two
dedicated records. These extra fields don’t consume too much space, and they
allow very efficient indexing and computation, based on the access pattern of
the related application.
This table is used to know how many resources a query consumed in a given time
range. The GiST index won’t be too big since it only indexes one row per X
aggregated counters, and it will efficiently find the rows matching a given
queryid and time range.
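Such an index could be declared like this (a sketch; indexing the bigint queryid with GiST requires the btree_gist extension):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE INDEX powa_statements_history_query_ts
    ON powa_statements_history
    USING gist (queryid, coalesce_range);
```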
Then, computing the resources consumed can be done efficiently, since the
pg_stat_statements counters are strictly monotonic. The algorithm would be:
- if the row’s time range is entirely contained in the asked time range, we only
  need to compute the delta of the summary records:
  maxs_in_range.counter - mins_in_range.counter (see the sketch below)
- if not (meaning at most two rows for each queryid), we unnest the array, filter
  out the records that aren’t in the asked time range, keep the first and last
  values and compute, for each counter, the maximum minus the minimum. The unnest
  will therefore only concern at most two rows per queryid.
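As a sketch, the first case could look like this for a single counter (an illustrative query with a made-up queryid and time range, not the actual PoWA code):

```sql
-- rows whose range is entirely contained in the asked interval:
-- the per-row delta is simply maxs_in_range minus mins_in_range
SELECT queryid,
       sum((maxs_in_range).calls - (mins_in_range).calls) AS calls
FROM powa_statements_history
WHERE queryid = 123456789
  AND coalesce_range <@ tstzrange('2018-06-01 00:00:00+00',
                                  '2018-06-02 00:00:00+00')
GROUP BY queryid;
```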
NOTE: Actually, the PoWA interface always unnests all the records overlapping the
asked time interval, since the interface is designed to show the evolution of
these counters on a relatively small time range, but with a great precision.
Fortunately, unnesting the records is not that expensive, especially compared to
the disk space saved.
And here’s the size needed for the aggregated and non-aggregated values. For
this, I let PoWA generate 12,331,366 records (configuring a snapshot every 5
seconds for some hours, with the default aggregation of 100 records per row), and
used a btree index on (queryid, ((record).ts))
to simulate the index present on the aggregated table:
Pretty efficient, right?
Limitations
There are some limitations with aggregating records. If you do this, you can’t
enforce constraints such as foreign keys or unique constraints. This approach is
therefore meant for non-relational data, such as counters or metadata.
Bonus
Using custom types also allows some nice things, like defining custom
operators. For instance, the release 3.1.0 of PoWA will provide two
operators for each custom type defined:
- the - operator, to get the difference between two records
- the / operator, to get the difference per second
You’ll therefore be able to do this kind of query:
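For instance, a query like the following (illustrative; it assumes the operators are defined on the record type sketched earlier):

```sql
-- raw difference and per-second rate between the summary records of each row
SELECT queryid,
       maxs_in_range - mins_in_range AS counters_diff,
       maxs_in_range / mins_in_range AS counters_per_second
FROM powa_statements_history
LIMIT 10;
```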
If you’re interested in how to implement such operators, you can look at the
PoWA implementation.
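For reference, here’s a minimal sketch of the mechanism with a deliberately tiny type (this is not the PoWA code, just an illustration of how a custom operator is wired up):

```sql
CREATE TYPE counters_record AS (
    ts timestamp with time zone,
    calls bigint
);

-- subtracting two records keeps the newest timestamp and the counter delta
CREATE FUNCTION counters_record_minus(a counters_record, b counters_record)
RETURNS counters_record AS $$
    SELECT a.ts, a.calls - b.calls;
$$ LANGUAGE sql IMMUTABLE STRICT;

CREATE OPERATOR - (
    LEFTARG = counters_record,
    RIGHTARG = counters_record,
    PROCEDURE = counters_record_minus
);

-- usage
SELECT (ROW('2018-06-01 00:01:00+00', 150)::counters_record
        - ROW('2018-06-01 00:00:00+00', 100)::counters_record).calls;
```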
Conclusion
You now know the basics to work around the per-tuple overhead. Depending on
your needs and the specifics of your data, you should find a way to aggregate your
data and add some extra columns to keep good performance.