Quantcast
Channel: SAP – Cloud Data Architect
Viewing all articles
Browse latest Browse all 140

Minimizing tuple overhead

$
0
0

Feed: Planet PostgreSQL.

I hear quite often people being disappointed on how much space PostgreSQL is
wasting for each row it stores. I’ll try to show here some tricks to minimize
this effect, to allow more efficient storage.

What overhead?

If you don’t have tables with more than few hundred of million of rows, it’s
likely that you didn’t have an issue with this.

For each row stored, postgres will store aditionnal data for its own need.
This is
documented here.
The documentation says:

Field Type Length Description
t_xmin TransactionId 4 bytes insert XID stamp
t_xmax TransactionId 4 bytes delete XID stamp
t_cid CommandId 4 bytes insert and/or delete CID stamp (overlays with t_xvac)
t_xvac TransactionId 4 bytes XID for VACUUM operation moving a row version
t_ctid ItemPointerData 6 bytes current TID of this or newer row version
t_infomask2 uint16 2 bytes number of attributes, plus various flag bits
t_infomask uint16 2 bytes various flag bits
t_hoff uint8 1 byte offset to user data

Which is 23 bytes on most architectures (you have either t_cid or
t_xvac).

You can see part of these fields in hidden column present on any table by
adding them in the SELECT part of a query, or look for negative attribute
number in pg_attribute catalog:

# d test
     Table "public.test"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |

# SELECT xmin, xmax, id FROM test LIMIT 1;
 xmin | xmax | id
------+------+----
 1361 |    0 |  1

# SELECT attname, attnum, atttypid::regtype, attlen
FROM pg_class c
JOIN pg_attribute a ON a.attrelid = c.oid
WHERE relname = 'test'
ORDER BY attnum;
 attname  | attnum | atttypid | attlen
----------+--------+----------+--------
 tableoid |     -7 | oid      |      4
 cmax     |     -6 | cid      |      4
 xmax     |     -5 | xid      |      4
 cmin     |     -4 | cid      |      4
 xmin     |     -3 | xid      |      4
 ctid     |     -1 | tid      |      6
 id       |      1 | integer  |      4

If you compare to the previous table, you can see than not all of these columns
are not stored on disk. Obviously PostgreSQL doesn’t store the table’s oid in
each row. It’s added after, while constructing a tuple.

If you want more technical details, you should read take a look at
htup_detail.c, starting
with
TupleHeaderData struct.

How costly is it?

As the overhead is fixed, it’ll become more and more neglictable as the row
size grows. If you only store a single int column (4 bytes), each row will
need:

23B + 4B = 27B

So, it’s 85% overhead, pretty horrible.

On the other hand, if you store 5 integer, 3 bigint and 2 text columns (let’s
say ~80B average), you’ll have:

23B + 5*4B + 3*8B + 2*80B = 227B

That’s “only” 10% overhead.

So, how to minimize this overhead

The idea is to store the same data with less records. How to do that?
Aggregating data in arrays. The more records you put in a single array, the
less overhead you have. And if you aggregate enough data, you can benefit
from transparent compression thanks to the TOAST
mechanism

Let’s try with a single 1 integer column table containing 10M rows:

# CREATE TABLE raw_1 (id integer);

# INSERT INTO raw_1 SELECT generate_series(1,10000000);

# CREATE INDEX ON raw_1 (id);

The user data should need 10M * 4B, ie. around 38MB, while this table will
consume 348MB. Inserting the data takes around 23 seconds.

NOTE: If you do the maths, you’ll find out that the overhead is slighty
more than 32B, not 23B. This is because each block also has some
overhead, NULL handling and alignement issue. If you want more information
on this, I recommand to see
this presentation

Let’s compare with aggregated versions of the same data:

# CREATE TABLE agg_1 (id integer[]);

# INSERT INTO raw_1 SELECT array_agg(i)
FROM generate_series(1,10000000) i
GROUP BY i % 2000000;

# CREATE INDEX ON raw_1 (id);

This will insert 5 elements per row. I’ve done the same test with 20, 100, 200
and 1000 elements per row. Results are below:

NOTE: The size for 1000 element per row is a little higher than lower value.
This is because it’s the only one which is big enough to be TOAST-ed, but not
big enough to be compressed. We can see a little TOAST overhead here.

So far so good, we can see quite good improvements, both in size and INSERT
time even for very small arrays. Let’s see the impact to retrieve rows. I’ll
try to retrieve all the rows, then only one row with an index scan (for the
tests I’ve used EXPLAIN ANALYZE to minimize the time to represent the data in
psql):

# SELECT id FROM raw_1;

# CREATE INDEX ON raw_1 (id);

# SELECT * FROM raw_1 WHERE id = 500;

To properly index this array, we need a GIN index. To get a the values from
aggregated data, we need to unnest() the arrays, and to be a little more
creative to get a single record:

# SELECT unnest(id) AS id FROM agg_1;

# CREATE INDEX ON agg_1 USING gin (id);

# WITH s(id) AS (
    SELECT unnest(id)
    FROM agg_1
    WHERE id && array[500]
)
SELECT id FROM s WHERE id = 500;

Here’s the chart comparing index creation time and index size:

The GIN index is a little more than twice the btree index, if I add the table
size, total size is almost the same as without aggregation. That’s not a big
issue since this example is naive, we’ll see later how to avoid using GIN
index to keep total size low. Also index is way slower to build, meaning that
INSERT will also be slower.

Here’s the chart comparing index creation time and index size, time to retrieve
all rows and a single row:

Getting all the rows is probably not an interesting example, but as soon as
array contains enough elements it starts to be faster than original table. We
see that getting only one element is much more faster than with the btree
index, thanks to GIN efficiency. It’s not tested here, but since only btree
index are sorted, if you need to get a lot of data sorted, using a GIN index
will require an extra sort which will be way slower than a btree index scan.

A more realistic example

Now that we’ve seen the basics, let’s see how to go further: aggregating more
than one columns and avoid to use too much disk space with a GIN index. For
this, I’ll present how PoWA stores it’s data.

For each datasource collected, two tables are used: the historic and
aggregated
one, and the current one. These tables store data in a custom
type instead of plain columns. Let’s see the tables related to
pg_stat_statements:

The custom type, basically all the counters present in pg_stat_statements and
the timestamp associated to this record:

powa=# d powa_statements_history_record
   Composite type "public.powa_statements_history_record"
       Column        |           Type           | Modifiers
---------------------+--------------------------+-----------
 ts                  | timestamp with time zone |
 calls               | bigint                   |
 total_time          | double precision         |
 rows                | bigint                   |
 shared_blks_hit     | bigint                   |
 shared_blks_read    | bigint                   |
 shared_blks_dirtied | bigint                   |
 shared_blks_written | bigint                   |
 local_blks_hit      | bigint                   |
 local_blks_read     | bigint                   |
 local_blks_dirtied  | bigint                   |
 local_blks_written  | bigint                   |
 temp_blks_read      | bigint                   |
 temp_blks_written   | bigint                   |
 blk_read_time       | double precision         |
 blk_write_time      | double precision         |

The aggregated table store the pg_stat_statement unique identifier (queryid,
dbid, userid), and a record of counters:

powa=# d powa_statements_history_current
    Table "public.powa_statements_history_current"
 Column  |              Type              | Modifiers
---------+--------------------------------+-----------
 queryid | bigint                         | not null
 dbid    | oid                            | not null
 userid  | oid                            | not null
 record  | powa_statements_history_record | not null

The aggregated table contains the same unique identifier, an array of records
and some special fields:

powa=# d powa_statements_history
            Table "public.powa_statements_history"
     Column     |               Type               | Modifiers
----------------+----------------------------------+-----------
 queryid        | bigint                           | not null
 dbid           | oid                              | not null
 userid         | oid                              | not null
 coalesce_range | tstzrange                        | not null
 records        | powa_statements_history_record[] | not null
 mins_in_range  | powa_statements_history_record   | not null
 maxs_in_range  | powa_statements_history_record   | not null
Indexes:
    "powa_statements_history_query_ts" gist (queryid, coalesce_range)

We also store the timestamp range (coalesce_range) for all aggregated
counters in a row, and the minimum and maximum values of each counter in two
dedicated records. These extra fields doesn’t consume too much space, and
allows very efficient indexing and computation, based on the access pattern of
the related application.

This table is used to know how much ressource a query consumed on a given time
range. The GiST index won’t be too big since it only indexes one row per X
aggregated counters, and will find efficiently the rows matching a given
queryid and time range.

Then, computing the resources consumed can be done efficiently, since the
pg_stat_statements counters are strictly monotonic. The algorithm would be:

  • if the row time range is entirely contained in the asked time range, we only
    need to compute delta of summary record:
    maxs_in_range.counter – mins_in_range.counter
  • if not (meaning only two rows for each queryid) we unnest the array, filter
    out records that aren’t in the asked time range, keep first and last value
    and compute for each counter the maximum minus the minimum. The unnest will
    only

NOTE: Actually, PoWA interface always unnest all records overlapping the
asked time interval, since the interface is designed to show these counters
evolution on a relatively small time range, but with a great precision.
Hopefuly, unnesting the records is not that expensive, especially compared to
the disk space saved.

And here’s the size needed for the aggregated and non aggregated values. For
this I let PoWA generate 12.331.366 records (configuring a snapshot every 5
seconds for some hours, default aggregation of 100 records per row), and used a
btree index on (queryid, ((record).ts)
to simulate the index present on the aggregated table:

Pretty efficient, right?

Limitations

There are some limitations with aggregating records. If you do this, you can’t
enforce constraints such as foreign keys or unique constraints. The use is
therefore non-relationnal data, such as counters or metadata.

Bonus

Using custom types also allows some nice things, like defining custom
operators
. For instance, the release 3.1.0 of powa will provide two
operators for each custom type defined:

  • the operator, to get difference between two record
  • the / operator, to get the difference per second

You’ll therfore be able to do this kind of queries:

# SELECT (record - lag(record) over()).*
FROM from powa_statements_history_current
WHERE queryid = 3589441560 AND dbid = 16384;
      intvl      | calls  |    total_time    |  rows  | ...
-----------------+--------+------------------+--------+ ...
 <NULL>          | <NULL> |           <NULL> | <NULL> | ...
 00:00:05.004611 |   5753 | 20.5570000000005 |   5753 | ...
 00:00:05.004569 |   1879 | 6.40500000000047 |   1879 | ...
 00:00:05.00477  |  14369 | 48.9060000000006 |  14369 | ...
 00:00:05.00418  |      0 |                0 |      0 | ...

# SELECT (record / lag(record) over()).*
FROM powa_statements_history_current
WHERE queryid = 3589441560 AND dbid = 16384;

  sec   | calls_per_sec | runtime_per_sec  | rows_per_sec | ...
--------+---------------+------------------+--------------+ ...
 <NULL> |        <NULL> |           <NULL> |       <NULL> | ...
      5 |        1150.6 |  4.1114000000001 |       1150.6 | ...
      5 |         375.8 | 1.28100000000009 |        375.8 | ...
      5 |        2873.8 | 9.78120000000011 |       2873.8 | ...

If you’re interested on how to implement such operators, you can look at
PoWA implementation.

Conclusion

You now know the basics to work around the per tuple overhead. Depending on
your needs and your data specifities, you should find a way to aggregate your
data and add some extra column to keep nice performance.


Viewing all articles
Browse latest Browse all 140

Trending Articles