The document provides an overview of indexes in Postgres, including B-Trees, GIN, and GiST indexes. It discusses:
1) What B-Tree indexes store key-pointer pairs to optimize queries. The keys are ordered and pages are linked in a balanced tree structure. GIN indexes split arrays into unique keys and store posting lists in leaves. GiST indexes allow overlapping key ranges and are not ordered.
2) How B-Tree pages contain high keys, pointers, and items. GIN indexes store pending entries in a list until vacuumed. GiST indexes use consistency functions to determine child page checks during searches.
3) The processes for searching, inserting, and deleting in
Constraints
!6
Some constraints transforminto indexes.
- PRIMARY KEY
- UNIQUE
- EXCLUDE USING
"crocodile_pkey" PRIMARY KEY, btree (id)
"crocodile_email_uq" UNIQUE CONSTRAINT, btree (email)
Indexes:
"appointment_pkey" PRIMARY KEY, btree (id)
"appointment_crocodile_id_schedule_excl" EXCLUDE USING gist
(crocodile_id WITH =, schedule WITH &&)
In the crocodile table
In the appointment table
7.
Query optimization
!7
Often themain reason why we create indexes
Why do indexes make queries faster
In an index, tuples (value, pointer) are stored.
Instead of reading the entire table for a value, you just go to the index (kind of like in an
encyclopedia)
Pages
!9
- PostgreSQL usespages to store data from indexes or tables
- A page has a ļ¬xed size of 8kB
- A page has a header and items
- In an index, each item is a tuple (value, pointer)
- Each item in a page is referenced to with a pointer called ctid
- The ctid consist of two numbers, the number of the page (the block number) and the offset
of the item.
The ctid of the item with value 4 would be (3, 2).
pageinspect, gevel anda bit of python
!11
Page inspect is an extension that allows you to explore a bit whatās inside the pages
Functions for BTree, GIN, BRIN and Hash indexes
Gevel adds functions to GiST, SP-Gist and GIN.
Used them to generate pictures for BTree and GiST
https://github.com/louiseGrandjonc/pageinspect_inspector
B-Trees internal datastructure - 1
!13
- A BTree in a balanced tree
- All the leaves are at equal distance from the root.
- A parent node can have multiple children minimizing the treeās depth
- Postgres implements the Lehman & Yao Btree
Letās say we would like to filter or order on the crocodileās number of teeth.
CREATE INDEX ON crocodile (number_of_teeth);
14.
B-Trees internal datastructure - 2
Metapage
!14
The metapage is always the ļ¬rst page of a BTree index. It contains:
- The block number of the root page
- The level of the root
- A block number for the fast root
- The level of the fast root
15.
B-Trees internal datastructure - 2
Metapage
!15
SELECT * FROM bt_metap('crocodile_number_of_teeth_idx');
magic | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
340322 | 2 | 290 | 2 | 290 | 2
(1 row)
Using page inspect, you can get the information on the metapage
16.
B-Trees internal datastructure - 3
Pages
!16
The root, the parents, and the leaves are all pages with the same structure.
Pages have:
- A block number, here the root block number is 290
- A high key
- A pointer to the next (right) and previous pages
- Items
17.
B-Trees internal datastructure - 4
Pages high key
!17
- High key is specific to Lehman & Yao BTrees
- Any item in the page will have a value lower or equal to the high key
- The root doesnāt have a high key
- The right-most page of a level doesnāt have a high key
And in page 575, there is no high key as itās the
rightmost page.
In page 3, I will find crocodiles with 3 or less teeth
In page 289, with 31 and less
18.
B-Trees internal datastructure - 5
Next and previous pages pointers
!18
- Specificity of the Yao and Lehmann BTree
- Pages in the same level are in a linked list
Very useful for ORDER BY
For example:
SELECT number_of_teeth
FROM crocodile ORDER BY number_of_teeth ASC
Postgres would start at the first leaf page and thanks to the next
page pointer, has directly all rows in the right order.
B-Trees internal datastructure - 7
Items
!20
- Items have a value and a pointer
- In the parents, the ctid points to the child page
- In the parents, the value is the value of the ļ¬rst item in the child page
21.
B-Trees internal datastructure - 8
Items
!21
- In the leaves, the ctid is to the heap tuple in the table
- In the leaves itās the value of the column(s) of the row
22.
B-Trees internal datastructure
To sum it up
!22
- A Btree is a balanced tree. PostgreSQL implements the Lehmann & Yao algorithm
- Metapage contains information on the root and fast root
- Root, parent, and leaves are pages.
- Each level is a linked list making it easier to move from one page to an other within the same level.
- Pages have a high key defining the biggest value in the page
- Pages have items pointing to an other page or the row.āØ
23.
B-Trees - Searchingin a BTree
!23
1. Scan keys are created
2. Starting from the root until a leaf page
⢠Is moving to the right page necessary?
⢠If the page is a leaf, return the ļ¬rst item with a value
higher or equal to the scan key
⢠Binary search to find the right path to follow
⢠Descend to the child page and lock it
SELECT email FROM crocodile WHERE number_of_teeth >= 20;
24.
B-Trees - Scankeys
!24
Postgres uses the query scan toĀ deļ¬ne scankeys.
If possible,Ā redundant keysĀ in your queryĀ are eliminatedĀ to keep only
theĀ tightest bounds.
The tightest bound is number_of_teeth > 5
SELECT email, number_of teeth FROM crocodile
WHERE number_of_teeth > 4 AND number_of_teeth > 5
ORDER BY number_of_teeth ASC;
email | number_of_teeth
----------------------------------------+-----------------
anne.chow222131@croco.com | 6
valentin.williams222154@croco.com | 6
pauline.lal222156@croco.com | 6
han.yadav232276@croco.com | 6
25.
B-Trees - Aboutread locks
!25
We put a read lock on the currently examined page.
Read locksĀ ensure that theĀ records on that page are not
modiļ¬ed while readingĀ it.
There could still be a concurrent insert on a child page causing
a page split.
26.
BTrees - Ismoving right necessary?
!26
Concurrent insert while visiting the root:
SELECT email FROM crocodile WHERE number_of_teeth >= 20;
27.
BTrees - Ismoving right necessary?
!27
The new high key of child page is 19
So we need to move right to the page 840
28.
B-Trees - Searchingin a BTree
!28
1. Scan keys are created
2. Starting from the root until a leaf page
⢠Is moving to the right page necessary?
⢠If the page is a leaf, return the ļ¬rst item with a value
higher or equal to the scan key
⢠Binary search to find the right path to follow
⢠Descend to the child page and lock it
SELECT email FROM crocodile WHERE number_of_teeth >= 20;
29.
BTrees - Inserting
!29
1.Find the right insert page
2. Lock the page
3. Check constraint
4. Split page if necessary and insert row
5. In case of page split, recursively insert a new
item in the parent level
30.
BTrees -Inserting
Finding theright page
!30
Auto-incremented values:
Primary keys with a sequence for example, like the index crocodile_pkey.
New values will always be inserted in the right-most leaf page.
To avoid using the search algorithm, Postgres caches this page.
Non auto-incremented values:
The search algorithm is used to find the right leaf page.
31.
BTrees -Inserting
Page split
!31
1.Is a split necessary?
If theĀ free space on the target page is lower than the itemās size, then a split is necessary.
2. Finding the split point
Postgres wants toĀ equalize the free space on each pageĀ to limit page splits in future inserts.
3. Splitting
32.
BTrees - Deleting
!32
-Items are marked as deletedĀ and will beĀ ignored in future index scans until VACUUM
- A page is deletedĀ only if all its items have been deleted.
- It is possible to end up with a tree with several levels with only one page.
- The fast root is used toĀ optimize the search.
GIN
!34
- GIN (GeneralizedInverted Index)Ā
- Used to indexĀ arrays, jsonb, and tsvectorĀ (forĀ fulltext search) columns.
- Efficient for <@, &&, @@@ operators
New column healed_teeth (integer[])
Ā
Here is how to create the GIN index for this column
croco=# SELECT email, number_of_teeth, healed_teeth FROM crocodile WHERE id =1;
-[ RECORD 1 ]---+--------------------------------------------------------
email | louise.grandjonc1@croco.com
number_of_teeth | 58
healed_teeth | {16,11,55,27,22,41,38,2,5,40,52,57,28,50,10,15,1,12,46}
CREATE INDEX ON crocodile USING GIN(healed_teeth);
35.
GIN
How is itdifferent from a BTree? - Keys
!35
- GIN indexes are binary trees
- Just like BTree, their ļ¬rst page is a metapage
First difference: the keys
BTree index on healed_teeth
The indexed values are arrays
Seq Scan on crocodile (cost=ā¦)
Filter: ('{1,2}'::integer[] <@ healed_teeth)
Rows Removed by Filter: 250728
Planning time: 0.157 ms
Execution time: 161.716 ms
(5 rows)
SELECT email FROM crocodile
WHERE ARRAY[1, 2] <@ healed_teeth;
36.
GIN
How is itdifferent from a BTree? - Keys
!36
- In a GIN index, the array is split andĀ each valueĀ is an entry
- The values are unique
37.
GIN
How is itdifferent from a BTree? - Keys
!37
Bitmap Heap Scan on crocodile
(cost=516.59..6613.42 rows=54786 width=29)
(actual time=15.960..38.197 rows=73275 loops=1)
Recheck Cond: ('{1,2}'::integer[] <@ healed_teeth)
Heap Blocks: exact=4218
-> Bitmap Index Scan on crocodile_healed_teeth_idx
(cost=0.00..502.90 rows=54786 width=0)
(actual time=15.302..15.302 rows=73275 loops=1)
Index Cond: ('{1,2}'::integer[] <@ healed_teeth)
Planning time: 0.124 ms
Execution time: 41.018 ms
(7 rows)
Seq Scan on crocodile (cost=ā¦)
Filter: ('{1,2}'::integer[] <@ healed_teeth)
Rows Removed by Filter: 250728
Planning time: 0.157 ms
Execution time: 161.716 ms
(5 rows)
38.
GIN
How is itdifferent from a BTree? Leaves
!38
- In a leaf page, the items contain a posting list of pointers to the rows in the table
- If the list canāt ļ¬t in the page, it becomes a posting tree
- In the leaf item remains a pointer to the posting tree
39.
GIN
How is itdifferent from a BTree? Pending list
!39
- ToĀ optimise inserts, we store theĀ new entriesĀ in aĀ pending listĀ (linear list of pages)
- Entries areĀ moved to the main tree on VACUUM or when the list is full
- You can disable the pending list byĀ settingĀ fastupdateĀ to falseĀ (on CREATE or ALTER INDEX)
SELECT * FROM gin_metapage_info(get_raw_page('crocodile_healed_teeth_idx', 0));
-[ RECORD 1 ]----+-----------
pending_head | 4294967295
pending_tail | 4294967295
tail_free_size | 0
n_pending_pages | 0
n_pending_tuples | 0
n_total_pages | 358
n_entry_pages | 1
n_data_pages | 356
n_entries | 47
version | 2
40.
GIN
To sum itup
!40
To sum up, a GIN index has:
- A metapage
- A BTree of key entries
- The values are unique in the main binary tree
- The leaves either contain a pointer to a posting tree, or a posting list of heap
pointers
- New rows go into a pending list until itās full or VACUUM, that list needs to be
scanned while searching the index
GiST - keys
!42
Differenceswith a BTree index
- Data isnāt ordered
- The key ranges can overlap
Which means that a same value can be inserted in different pages
43.
GiST - keys
!43
Differenceswith a BTree index
- Data isnāt ordered
- The key ranges can overlap
Which means that a same value can be inserted in different pages
Data isnāt ordered
44.
GiST - keys
!44
Anew appointment scheduled from
Ā August 14th 2014 7:30am to 8:30am
can be inserted in both pages.
CREATE INDEX ON appointment USING GIST(schedule)
Differences with a BTree index
- Data isnāt ordered
- The key ranges can overlap
Which means that a same value can be inserted in different pages
45.
GiST - keys
!45
Differenceswith a BTree index
- Data isnāt ordered
- The key ranges can overlap
Which means that a same value can be inserted in different pages
A new appointment scheduled from
Ā August 14th 2014 7:30am to 8:30am
can be inserted in both pages.
CREATE INDEX ON appointment USING GIST(schedule)
46.
GiST
key class functions
!46
GiSTallows the development of custom data types with the appropriate access methods.
These functions are key class functions:
Union: used while inserting, if the range changed
Distance: used for ORDER BY and nearest neighbor, calculates the distance to the scan
key
47.
GiST
key class functions- 2
!47
Consistent: returns MAYBE if the range contains the searched value, meaning that rows
could be in the page
Child pages could contain the appointments overlapping
[2018-05-17 08:00:00, 2018-05-17 13:00:00]
Consistent returns MAYBE
48.
GiST - Searching
!48
SELECTc.email, schedule, done, emergency_level
FROM appointment
INNER JOIN crocodile c ON (c.id=crocodile_id)
WHERE schedule && '[2018-05-17 08:00:00,
2018-05-17 13:00:00]'::tstzrange
AND done IS FALSE
ORDER BY schedule DESC LIMIT 3;
1. Create a search queue of pages to explore with the root in it
2. While the search queue isnāt empty, pops a page
1. If the page is a leaf: update the bitmap with CTIDs of rows
2. Else, adds to the search queue the items where Consistent
returned MAYBE
49.
GiST - Inserting
!49
Anew item can be inserted in any page.
Penalty: key class function (defined by user) gives a number representing
how bad it would be to insert the value in the child page.
About page split:
Picksplit: makes groups with little distance
Performance of search will depend a lot of Picksplit
50.
GiST - Inserting
!50
Anew item can be inserted in any page.
Penalty: key class function (defined by user) gives a number representing
how bad it would be to insert the value in the child page.
About page split:
Picksplit: makes groups with little distance
Performance of search will depend a lot of Picksplit
51.
To sum up
!51
-Useful for overlapping (geometries, array etc.)
- Nearest neighbor
- Can be used for full text search (tsvector, tsquery)
- Any data type can implement GiST as long as a few methods are available
SP-GiST
Internal data structure
!53
-Not a balanced tree
- A same page canāt have inner tuples and leaf tuples
- Keys are decomposed
- In an inner tuple, the value is the preļ¬x
- In a leaf tuple, the value is the rest (postļ¬x)
54.
P
L
A
Page blkno: 1
ABLO
UISE
RIAN
O
D
Pageblkno: 8 Page blkno: 4
SP-GiST
Pages
!54
SELECT tid, level, leaf_value FROM spgist_print('crocodile_first_name_idx3') as t
(tid tid, a bool, n int, level int, p tid, pr text, l smallint, leaf_value text) ;
tid | level | leaf_value
----------+-------+------------
ā¦
(4,36) | 2 | ablo
(4,57) | 2 | ustafa
(4,84) | 3 | rian
(4,153) | 3 | uise
ā¦
Here are how the pages are
organized if we look into gevelās
sp-gist functions for this index
55.
SP-GiST
!55
- Can beused for points
- For text to search for prefix
- For non balanced data structures (k-d trees)
- Like GiST: allows the development of custom data types
BRIN
Internal data structure
!57
-Block Range Index
- Not a binary tree
- Not even a tree
- Block range: group of pages physically adjacent
- For each block range: the range of values is stored
- BRIN indexes are very small
- Fast scanning on large tables
58.
BRIN
Internal data structure
!58
SELECT* FROM brin_page_items(get_raw_page('appointment_created_at_idx', 2), 'appointment_created_at_idx');
itemoffset | blknum | attnum | allnulls | hasnulls | placeholder | value
------------+--------+--------+----------+----------+-------------+---------------------------------------------------
1 | 0 | 1 | f | f | f | {2008-03-01 00:00:00-08 .. 2009-07-07 07:30:00-07}
2 | 128 | 1 | f | f | f | {2009-07-07 08:00:00-07 .. 2010-11-12 15:30:00-08}
3 | 256 | 1 | f | f | f | {2010-11-12 16:00:00-08 .. 2012-03-19 23:30:00-07}
4 | 384 | 1 | f | f | f | {2012-03-20 00:00:00-07 .. 2013-07-26 07:30:00-07}
5 | 512 | 1 | f | f | f | {2013-07-26 08:00:00-07 .. 2014-12-01 15:30:00-08}
SELECT id, created_at FROM appointment WHERE ctid='(0, 1)'::tid;
id | created_at
--------+------------------------
101375 | 2008-03-01 00:00:00-08
(1 row)
59.
BRIN
Internal data structure
!59
SELECT* FROM brin_page_items(get_raw_page('crocodile_birthday_idx', 2),
'crocodile_birthday_idx');
itemoffset | blknum | attnum | allnulls | hasnulls | placeholder | value
------------+--------+--------+----------+----------+-------------+----------------------------
1 | 0 | 1 | f | f | f | {1948-09-05 .. 2018-09-04}
2 | 128 | 1 | f | f | f | {1948-09-07 .. 2018-09-03}
3 | 256 | 1 | f | f | f | {1948-09-05 .. 2018-09-03}
4 | 384 | 1 | f | f | f | {1948-09-05 .. 2018-09-04}
5 | 512 | 1 | f | f | f | {1948-09-05 .. 2018-09-02}
6 | 640 | 1 | f | f | f | {1948-09-09 .. 2018-09-04}
ā¦
(14 rows)
In this case, the values in birthday has no correlation with the physical
location, the index would not speed up the search as all pages would have
to be visited.
BRIN is interesting for data where the value is correlated with the
physical location.
60.
BRIN
Warning on DELETEand INSERT
!60
SELECT * FROM brin_page_items(get_raw_page('appointment_created_at_idx', 2), 'appointment_created_at_idx');
itemoffset | blknum | attnum | allnulls | hasnulls | placeholder | value
------------+--------+--------+----------+----------+-------------+---------------------------------------------------
1 | 0 | 1 | f | f | f | {2008-03-01 00:00:00-08 .. 2018-07-01 07:30:00-07}
2 | 128 | 1 | f | f | f | {2009-07-07 08:00:00-07 .. 2018-07-01 23:30:00-07}
3 | 256 | 1 | f | f | f | {2010-11-12 16:00:00-08 .. 2012-03-19 23:30:00-07}
4 | 384 | 1 | f | f | f | {2012-03-20 00:00:00-07 .. 2018-07-06 23:30:00-07}
DELETE FROM appointment WHERE created_at >= '2009-07-07' AND created_at < ā2009-07-08';
DELETE FROM appointment WHERE created_at >= '2012-03-20' AND created_at < ā2012-03-25';
Deleted and then vacuum on the appointment table
New rows are inserted in the free space after VACUUM
BRIN index has some ranges with big data ranges.
Search will visit a lot of pages.
Hash
Internal data structure
!62
-Only useful if you have a data not fitting
into a page (8kb)
- Only operator is =
- If you use a PG version < 10, itās just awful
63.
Conclusion
!63
- B-Tree
- Greatfor <, >, =, >=, <=
- GIN
- Fulltext search, jsonb, arrays
- ADD OPERATORS
- Inserts can be slow because of unicity of the
keys
- BRIN
- Great for huge table with correlation between
value and physical location
- <, >, =, >=, <=
- GiST
- Great for overlapping
- Using key class functions
- Can be implemented for any data type
- SP-Gist
- Also using key class function
- Decomposed keys
- Can be used for non balanced data
structures (k-d trees)
- Hash
- If you have a value > 8kB
- Only for =
64.
Questions
!64
Thanks for yourattention
Go read the articles www.louisemeta.com
Now only the ones on BTrees are published,
but Iāll announce the rest on twitter
@louisemeta
Crocodiles by https://www.instagram.com/zimmoriarty/?hl=en