The latest version of PostgreSQL introduces BRIN indexes


After reading that PostgreSQL 9.5 will support BRIN indexes, it took me a couple of re-reads of the attached documentation to understand exactly what this index technique is about. Actually, it’s really simple, but for people like me who prefer things to be spelled out, here are some hopefully useful (and at least somewhat accurate) notes.

Tables

As a quick recap, table rows in PostgreSQL are stored in an on-disk structure known as the heap. The heap is an array that is logically partitioned into 8 KiB "pages", with each page containing one or more "tuples" (rows). To ease management, as the heap grows it is additionally split into 1 GiB-sized files on disk; however, the overall structure is still essentially just one big logical array.

When you ask PostgreSQL to insert a row into a table, it uses an auxiliary structure known as the free space map to locate the first available heap page for your relation ("table") that has sufficient space to store the data for your row. If your row is larger than a pre-set limit (2 KiB), large columns are split out of the row's data and stored as a series of rows in an internal table (the so-called TOAST tables).

The net result is that each data row exists entirely within one page, and that row lives at a particular logical index (the "item ID") within its page. If PostgreSQL must refer to a row, it can uniquely identify it using just its page number, and its index within the page. The combination of this pair of numbers is known as the row's ctid, or its tuple ID. Tuple IDs can thus be used as a small, efficient, unique locator for every row in a database, and they exist regardless of your schema design.

[Side note: that’s not entirely true! If a row has been updated since the database was last VACUUMed, multiple versions will exist, chained together using some special fields in each version’s on-disk data. For simplicity let’s just assume only one version exists.]

In the current PostgreSQL implementation, 32 bits are used for the page number and 16 bits for the item number (placing an absolute upper bound on the size of a single database table of around 32 TiB), allowing the ctid to fit comfortably in 64 bits.
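The arithmetic behind this bound can be checked directly (a back-of-the-envelope sketch, not PostgreSQL source code):

```python
# With 32-bit page numbers and 8 KiB heap pages, the largest
# addressable table is 2^32 pages of 2^13 bytes each.
PAGE_SIZE = 8 * 1024      # 8 KiB heap pages
MAX_PAGES = 2 ** 32       # 32-bit page number in the ctid

max_table_bytes = MAX_PAGES * PAGE_SIZE
print(max_table_bytes // 2 ** 40, "TiB")  # → 32 TiB
```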

Using just the name of a relation and a ctid, PG can first split the page number from the ctid and use that to efficiently locate the physical database file and offset where the page lives:

page_size = 8 KiB
pages_per_segment = 1 GiB / page_size
segment, index = divmod(page_number, pages_per_segment)
page_offset = page_size * index

Finally to locate the tuple within the page, a small, constant-sized lookup table exists at the start of each page that maps its item IDs to byte offsets within the page:

item_offset = page.lookup_table[item_id]
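Putting the two steps together, the lookup can be sketched in a few lines of Python (illustrative only; the `locate` function and the `lookup_table` dict are assumed names modelling the page's line-pointer array, not PostgreSQL internals):

```python
PAGE_SIZE = 8 * 1024                   # 8 KiB heap pages
SEGMENT_SIZE = 1024 ** 3               # 1 GiB segment files
PAGES_PER_SEGMENT = SEGMENT_SIZE // PAGE_SIZE  # 131072 pages per file

def locate(page_number, item_id, lookup_table):
    """Resolve a ctid (page_number, item_id) to a physical location:
    (segment file number, byte offset of page in file, byte offset of tuple in page)."""
    segment, index = divmod(page_number, PAGES_PER_SEGMENT)
    page_offset = PAGE_SIZE * index
    item_offset = lookup_table[item_id]  # the per-page lookup table
    return segment, page_offset, item_offset

# Page 131073 is the second page of the second 1 GiB segment file:
print(locate(131073, 0, {0: 64}))  # → (1, 8192, 64)
```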

Indexes

Without further help, answering a query such as SELECT * FROM person WHERE age BETWEEN 18 AND 23 would require PG to visit every page in the heap, decoding each row in turn and comparing its age column to the WHERE predicate. Naturally, for larger tables we prefer to avoid that, and an index is needed so that PostgreSQL can avoid scanning the full table.

Btree Indexes

The most common index type in PG is the btree, which maintains an efficient map from column value to ctid. Given the imaginary table:

Person table heap layout:

| Page Number | Item ID | ctid   | Name    | Age | Creation Date |
|-------------|---------|--------|---------|-----|---------------|
| 1           | 1       | (1, 1) | John    | 10  | 1998-01       |
| 1           | 2       | (1, 2) | Jack    | 99  | 1998-02       |
| 1           | 3       | (1, 3) | Jill    | 70  | 1998-03       |
| 1           | 4       | (1, 4) | Jemma   | 19  | 1998-04       |
| 2           | 1       | (2, 1) | George  | 60  | 1998-05       |
| 2           | 2       | (2, 2) | James   | 44  | 1998-05       |
| 2           | 3       | (2, 3) | Jocelyn | 55  | 1998-06       |
| 2           | 4       | (2, 4) | Jemima  | 22  | 1998-07       |
| 3           | 1       | (3, 1) | Jerry   | 60  | 1999-01       |
| 3           | 2       | (3, 2) | Jarvis  | 44  | 1999-02       |
| 3           | 3       | (3, 3) | Jasper  | 55  | 1999-03       |
| 3           | 4       | (3, 4) | Josh    | 24  | 1999-04       |
| 4           | 1       | (4, 1) | Jacob   | 60  | 2000-01       |
| 4           | 2       | (4, 2) | Jesse   | 44  | 2000-02       |
| 4           | 3       | (4, 3) | Janet   | 55  | 2000-03       |
| 4           | 4       | (4, 4) | Justine | 24  | 2000-04       |

A btree index created using CREATE INDEX person_age ON person(age) might resemble:

person(age) btree index layout:

| Age | ctid   |
|-----|--------|
| 10  | (1, 1) |
| 19  | (1, 4) |
| 22  | (2, 4) |
| 24  | (3, 4) |
| 24  | (4, 4) |
| 44  | (3, 2) |
| 44  | (4, 2) |
| 44  | (2, 2) |
| 55  | (3, 3) |
| 55  | (4, 3) |
| 55  | (2, 3) |
| 60  | (3, 1) |
| 60  | (4, 1) |
| 60  | (2, 1) |
| 70  | (1, 3) |
| 99  | (1, 2) |

This is getting too long already, so, cutting to the chase: PG can now efficiently locate an exact row given its associated indexed column value, and that value in turn is stored in a data structure that permits fast lookup.

For our SELECT query from above, PG can jump to the first btree key ≥ 18 and scan out ctids until it reaches a key larger than 23. In the demo table, this means PG must visit only 2 rows out of 16, and before accessing the row data, it already knows the row definitely matches the predicate.
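That scan can be modelled in a few lines of Python (a toy model over a sorted list, not PG's actual btree code; `range_scan` is an assumed name):

```python
import bisect

# The index as a sorted list of (age, ctid) pairs, as in the table above.
index = [(10, (1, 1)), (19, (1, 4)), (22, (2, 4)), (24, (3, 4)),
         (24, (4, 4)), (44, (3, 2)), (44, (4, 2)), (44, (2, 2))]

def range_scan(index, low, high):
    keys = [age for age, _ in index]
    start = bisect.bisect_left(keys, low)   # jump to the first key >= low
    out = []
    for age, ctid in index[start:]:
        if age > high:                      # stop at the first key past the range
            break
        out.append(ctid)
    return out

print(range_scan(index, 18, 23))  # → [(1, 4), (2, 4)]
```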

For some other queries, such as SELECT COUNT(*) FROM person WHERE age = 22, PG may not even need to visit the row data itself, since it can infer from index entries how many data rows exist. [Another MVCC caveat! This is not entirely true, since index entries may exist pointing to deleted rows, or rows created in later transactions]

The crucial point to note, though, is that one exact index entry is produced for every row. Usually this doesn't amount to much, perhaps 5-15% overhead relative to the source table. For a large table, however, that overhead may be the difference between a dataset that fits in RAM and one in which common queries end up hitting disk, or in which IO is doubled by index access because the dataset was already vastly larger than available RAM. It's easy to imagine indexes quickly adding up, such that perhaps half of an application's storage is wasted on them.

BRIN Indexes

Finally, enough verbiage has been spilled that we can reach the point: BRIN indexes introduce a neat tradeoff where instead of covering individual rows, index entries cover one or more heap pages:

person(age) BRIN index with group size 1:

| Page Number | Has NULL values? | Lowest Age | Highest Age |
|-------------|------------------|------------|-------------|
| 1           | No               | 10         | 99          |
| 2           | No               | 22         | 60          |
| 3           | No               | 24         | 60          |
| 4           | No               | 24         | 60          |

The structure is used like so: given a query such as SELECT * FROM person WHERE age BETWEEN 10 AND 15, PG will visit every index entry in turn, comparing its minimum/maximum values against the query predicate. If the index entry indicates that a range of pages contains at least one record matching the query, those pages will be scanned for matching rows. For this query, only one page contains rows whose age fields overlap the desired region, and so PG can avoid visiting 75% of the table.
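A sketch of that lookup (the `brin` dict and `pages_to_scan` are assumed names modelling the summary data, not PG internals):

```python
# One summary entry per page: page number -> (min age, max age),
# matching the group-size-1 BRIN table above.
brin = {1: (10, 99), 2: (22, 60), 3: (24, 60), 4: (24, 60)}

def pages_to_scan(brin, low, high):
    # Every entry is visited; a page is scanned only if its
    # [min, max] range overlaps the query predicate [low, high].
    return [page for page, (lo, hi) in brin.items()
            if lo <= high and hi >= low]

print(pages_to_scan(brin, 10, 15))  # → [1]
```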

Notice that in order to find just one row, PG must now scan a full page and compare each of its 4 rows against the query predicates. While index size is reduced, query time has increased! There is also little pattern in our age column: in fact, it is quite lucky that our index described only a single page covering the range 10..15. Had users signed up in a slightly different order, the distribution of ages across physical storage pages may have resulted in PG having to scan many more pages.

[Another side note: unlike our dummy table above, a typical PG heap page may contain over 200 rows, depending on how many columns are present, and how many of those are strings. Our dummy BRIN index above looks as if it contains just as much information as the original btree index, but that’s just because my example only has 16 rows instead of 800].

BRIN also permits configuring how many heap pages contribute to an index entry. For example, we can halve the size of our first index while also halving its precision:

person(age) BRIN index with group size 2:

| Page Numbers | Has NULL values? | Lowest Age | Highest Age |
|--------------|------------------|------------|-------------|
| 1-2          | No               | 10         | 99          |
| 3-4          | No               | 24         | 60          |
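Building such summaries can be sketched as follows (the `summarize` helper is an assumption for illustration, not PostgreSQL's summarization code), using the ages from the example heap:

```python
# Ages per heap page, as laid out in the example person table.
ages_by_page = {1: [10, 99, 70, 19], 2: [60, 44, 55, 22],
                3: [60, 44, 55, 24], 4: [60, 44, 55, 24]}

def summarize(ages_by_page, pages_per_range):
    """Group heap pages pages_per_range at a time and record the
    min/max column value per group: (first page, last page) -> (min, max)."""
    summaries = {}
    pages = sorted(ages_by_page)
    for i in range(0, len(pages), pages_per_range):
        group = pages[i:i + pages_per_range]
        ages = [a for p in group for a in ages_by_page[p]]
        summaries[(group[0], group[-1])] = (min(ages), max(ages))
    return summaries

print(summarize(ages_by_page, 2))  # → {(1, 2): (10, 99), (3, 4): (24, 60)}
```

Doubling the group size halves the number of summary entries, exactly as in the table above.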

Due to the work-increasing factor, and because every index entry must be visited (a potentially high fixed cost for any query), BRIN is probably never useful for "hot" queries against a table, or even of much use at all in a typical "hot" database. However, for auxiliary queries, such as producing once-per-month reports or bulk queries against archival data, where reduced runtime or IO is desired without the storage cost of an exact index, BRIN may be just the tool.

Finally, notice in the original table how as new records were inserted, their creation date roughly tracked which database page they ended up on. This is quite a natural outcome since as the table grows, newer items will occupy later pages in the array, and so there is quite a reliable correlation between page number and the creation date column value. A BRIN index over this column would work very well.
