KEMBAR78
Using histograms to get better performance | PDF
Using histograms
to get better performance
Sergei Petrunia
Varun Gupta
Database performance
● Performance is a product of many
factors
● One of them is Query optimizer
● It produces query plans
– A “good” query plan only
reads rows that contribute to
the query result
– A “bad” query plan means
unnecessary work is done
Do my queries use bad query plans?
● Queries take a long time
● Some are just inherently hard to
compute
● Some look good but turn out bad
due to factors that were not
accounted for
Query plan cost depends on data statistics
select *
from
lineitem, orders
where
o_orderkey=l_orderkey and
o_orderdate between '1990-01-01' and '1998-12-06' and
l_extendedprice > 1000000
● orders->lineitem
vs
lineitem->orders
● Depends on
condition selectivity
Another choice optimizer has to make
select *
from
orders
where
o_orderstatus='F'
order by
order_date
limit 10
● Use index(order_date)
– Stop as soon as we find 10 matches
● Find rows with o_orderstatus='F'
– Sort by o_orderdate picking first 10
● Again, it depends on condition
selectivity.
Data statistics in MariaDB
● Table: #rows in the table
● Index
– cardinality: AVG(#lineitems per order)
– “range estimates” - #rows(t.key BETWEEN const1 and
const2)
● Non-index column? Histogram
Histogram
● Partition the value space into buckets
– Store bucket bounds and #values in the bucket
– Imprecise
– Very compact
Summary so far
● Good database performance requires good query plans
● To pick those, optimizer needs statistics about the data
– Condition selectivity is important
● Certain kinds of statistics are always available
– Indexes
– For non-indexed columns, histograms may be needed.
Do my query plans suffer
from bad statistics?
Will my queries benefit?
● Very complex question
● No definite answer
● Suggestions
– ANALYZE for statements, r_filtered.
– Slow query log
ANALYZE for statements and r_filtered
● filtered – % of rows left after applying condition (expectation)
– r_filtered - ... - the reality
● r_filtered << filtered – the optimizer didn’t know the condition is selective
– Happens on a non-first table? We are filtering out late!
●
Add histogram on the column (Check the cond in FORMAT=JSON)
analyze select *
from lineitem, orders
where o_orderkey=l_orderkey and
o_orderdate between '1990-01-01' and '1998-12-06' and
l_extendedprice > 1000000
+--+-----------+--------+----+-------------+-------+-------+-----------------+-------+-------+--------+----------+-----------+
|id|select_type|table |type|possible_keys|key |key_len|ref |rows |r_rows |filtered|r_filtered|Extra |
+--+-----------+--------+----+-------------+-------+-------+-----------------+-------+-------+--------+----------+-----------+
|1 |SIMPLE |orders |ALL |PRIMARY,i_...|NULL |NULL |NULL |1504278|1500000| 50.00 | 100.00 |Using where|
|1 |SIMPLE |lineitem|ref |PRIMARY,i_...|PRIMARY|4 |orders.o_orderkey|2 |4.00 | 100.00 | 0.00 |Using where|
+--+-----------+--------+----+-------------+-------+-------+-----------------+-------+-------+--------+----------+-----------+
# Query_time: 1.961549 Lock_time: 0.011164 Rows_sent: 1 Rows_examined: 11745000
# Rows_affected: 0 Bytes_sent: 73
# Full_scan: Yes Full_join: No Tmp_table: No Tmp_table_on_disk: No
# Filesort: No Filesort_on_disk: No Merge_passes: 0 Priority_queue: No
#
# explain: id select_type table type possible_keys key key_len ref rows r_rows
filtered r_filtered Extra
# explain: 1 SIMPLE inventory ALL NULL NULL NULL NULL 11837024
11745000.00 100.00 0.00 Using where
#
SET timestamp=1551155484;
select count(inv_date_sk) from inventory where inv_quantity_on_hand>10000;
Slow Query Log
slow-query-log
long-query-time=...
log-slow-verbosity=query_plan,explain
my.cnf
hostname-slow.log
● Rows_examined >> Rows_sent? Grouping,or a poor query plan
● log_slow_query=explain will shows ANALYZE output
Histograms in MariaDB
Histograms in MariaDB
● Available since MariaDB 10.0 (Yes)
● Used by advanced users
● Not enabled by default
● Have limitations, not user-friendly
● MariaDB 10.4
– Fixes some of the limitations
– Makes histograms easier to use
Collecting histograms
Configuration for collecting histograms
histogram_size=0
histogram_type=SINGLE_PREC_HB
histogram_size=254
histogram_type=DOUBLE_PREC_HB
● MariaDB before 10.4: change the default histogram size
● MariaDB 10.4 : enable automatic sampling
histogram_size=254
histogram_type=DOUBLE_PREC_HB
analyze_sample_percentage=100
analyze_sample_percentage=0
Histograms are [still] not collected by default
● “ANALYZE TABLE” will not collect a histogram
MariaDB> analyze table t1;
+---------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+---------+---------+----------+----------+
| test.t1 | analyze | status | OK |
+---------+---------+----------+----------+
● This will collect only
– Total #rows in table
– Index cardinalities (#different values)
ANALYZE ... PERSISTENT collects histograms
– Collect statistics for everything:
analyze table t1 persistent
for columns (col1,...) indexes (idx1,...);
+---------+---------+----------+-----------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------+---------+----------+-----------------------------------------+
| test.t1 | analyze | status | Engine-independent statistics collected |
| test.t1 | analyze | status | OK |
+---------+---------+----------+-----------------------------------------+
analyze table t1 persistent for all;
Can make histogram collection automatic
set use_stat_tables='preferably';
analyze table t1;
+---------+---------+----------+-----------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------+---------+----------+-----------------------------------------+
| test.t1 | analyze | status | Engine-independent statistics collected |
| test.t1 | analyze | status | OK |
+---------+---------+----------+-----------------------------------------+
● Beware: this may be *much* slower than ANALYZE TABLE
you’re used to
● Great for migrations
Histogram collection performance
● MariaDB 10.0: uses all data in the table to build histogram
– Precise, but expensive
– Particularly so for VARCHARs
● A test on a real table:
– Real table, 740M rows, 90GB
– CHECKSUM TABLE: 5 min
– ANALYZE TABLE ... PERSISTENT FOR ALL – 30 min
MariaDB 10.4: Bernoulli sampling
● Default: analyze_sample_percentage=100
– Uses the entire table, slow
● Suggested: analyze_sample_percentage=0
– “Roll the dice” sampling, size picked automatically
analyze table t1 persistent for columns (...) indexes();
analyze table t1 persistent for all;
– full table and secondary index scans
– does a full table scan
Further plans: genuine sampling
● Work on avoiding full table scans is in progress
● Will allow to make ANALYZE TABLE collect all histograms
Making the optimizer
use histograms
Make the optimizer use histograms
@@use_stat_tables=NEVER
@@optimizer_use_condition_selectivity=1
@@use_stat_tables=PREFERABLY // also affects ANALYZE!
@@optimizer_use_condition_selectivity=4
● MariaDB before 10.4: does not use histograms
● MariaDB 10.4 : uses histograms if they are collected
@@use_stat_tables=PREFERABLY_FOR_QUERIES
@@optimizer_use_condition_selectivity=4
– remember to re-collect!
Conclusions: how to start using histograms
● MariaDB before 10.4
analyze_sample_percentage=0
use_stat_tables=PREFERABLY # Changes optimizer
optimizer_use_condition_selectivity=4 # behavior
● MariaDB 10.4
● Both: ANALYZE TABLE ... PERSISTENT FOR ...
histogram_size=254 # No risk
histogram_type=DOUBLE_PREC_HB #
Can I just have histograms
for all columns?
A stored procedure to analyze every table
CREATE PROCEDURE analyze_persistent_for_all(db_name VARCHAR(64))
BEGIN
DECLARE done INT DEFAULT FALSE;
DECLARE x VARCHAR(64);
DECLARE cur1 CURSOR FOR
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_SCHEMA=db_name;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
OPEN cur1;
read_loop: LOOP
FETCH cur1 INTO x;
IF done THEN
LEAVE read_loop;
END IF;
SET @sql = CONCAT('analyze table ', x, ' persistent for all');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END LOOP;
CLOSE cur1;
END|
Should I ANALYZE ... PERSISTENT every table?
● New application
– Worth giving it a try
– Provision for periodic ANALYZE
– Column correlations?
● Existing application
– Performance fixes on a case-by-case basis.
Tests and benchmarks
TPC-DS benchmark
● scale=1
● The same dataset
– without histograms: ~20 min
– after ‘call analyze_persistent_for_all(‘tpcds’) from two slides
prior: 5 min.
TPC-DS benchmark run
A customer case with ORDER BY ... LIMIT
● table/column names replaced
CREATE TABLE cars (
type varchar(10),
company varchar(20),
model varchar(20),
quantity int,
KEY quantity (quantity),
KEY type (type)
);
select * from cars
where
type='electric' and
company='audi'
order by
quantity
limit 3;
● table/column names replaced
● quantity matches the ORDER BY, but need to match condition
● type is a Restrictive index
A customer case with ORDER BY ... LIMIT
● Uses ORDER-BY compatible index by default
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cars
type: index
possible_keys: type
key: quantity
key_len: 5
ref: const
rows: 994266
r_rows: 700706.00
filtered: 0.20
r_filtered: 0.00
Extra: Using where
1 row in set (2.098 sec)
select * from cars
where
type='electric' and
company='audi'
order by
quantity
limit 3;
A customer case with ORDER BY ... LIMIT
● Providing the optimizer with histogram
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cars
type: ref
possible_keys: type
key: type
key_len: 13
ref: const
rows: 2022
r_rows: 3.00
filtered: 100.00
r_filtered: 100.00
Extra: Using index condition; Using where; Using filesort
1 row in set (0.010 sec)
analyze table cars persistent for all;
select * from cars
where
type='electric' and
company='audi'
order by
quantity
limit 3;
Operations
Histograms are stored in a table
CREATE TABLE mysql.column_stats (
db_name varchar(64) NOT NULL,
table_name varchar(64) NOT NULL,
column_name varchar(64) NOT NULL,
min_value varbinary(255) DEFAULT NULL,
max_value varbinary(255) DEFAULT NULL,
nulls_ratio decimal(12,4) DEFAULT NULL,
avg_length decimal(12,4) DEFAULT NULL,
avg_frequency decimal(12,4) DEFAULT NULL,
hist_size tinyint unsigned,
hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
histogram varbinary(255),
PRIMARY KEY (db_name,table_name,column_name)
);
TPC-DS benchmark
● Can save/restore histograms
● Can set @@optimizer_use_condition_selectivity to disable
histogram use per-thread
Caveat: correlations
Problem with correlated conditions
● Possible selectivities
– MIN(1/n, 1/m)
– (1/n) * (1/m)
– 0
select ...
from order_items
where shipdate='2015-12-15' AND item_name='christmas light'
'swimsuit'
Problem with correlated conditions
● PostgreSQL: Multi-variate statistics
– Detects functional dependencies, col1=F(col2)
– Only used for equality predicates
– Also #DISTINCT(a,b)
● MariaDB: MDEV-11107: Use table check constraints in optimizer
– In development
select ...
from order_items
where shipdate='2015-12-15' AND item_name='christmas light'
'swimsuit'
Thanks!

Using histograms to get better performance

  • 1.
    Using histograms to getbetter performance Sergei Petrunia Varun Gupta
  • 2.
    Database performance ● Performanceis a product of many factors ● One of them is Query optimizer ● It produces query plans – A “good” query plan only reads rows that contribute to the query result – A “bad” query plan means unnecessary work is done
  • 3.
    Do my queriesuse bad query plans? ● Queries take a long time ● Some are just inherently hard to compute ● Some look good but turn out bad due to factors that were not accounted for
  • 4.
    Query plan costdepends on data statistics select * from lineitem, orders where o_orderkey=l_orderkey and o_orderdate between '1990-01-01' and '1998-12-06' and l_extendedprice > 1000000 ● orders->lineitem vs lineitem->orders ● Depends on condition selectivity
  • 5.
    Another choice optimizerhas to make select * from orders where o_orderstatus='F' order by order_date limit 10 ● Use index(order_date) – Stop as soon as we find 10 matches ● Find rows with o_orderstatus='F' – Sort by o_orderdate picking first 10 ● Again, it depends on condition selectivity.
  • 6.
    Data statistics inMariaDB ● Table: #rows in the table ● Index – cardinality: AVG(#lineitems per order) – “range estimates” - #rows(t.key BETWEEN const1 and const2) ● Non-index column? Histogram
  • 7.
    Histogram ● Partition thevalue space into buckets – Store bucket bounds and #values in the bucket – Imprecise – Very compact
  • 8.
    Summary so far ●Good database performance requires good query plans ● To pick those, optimizer needs statistics about the data – Condition selectivity is important ● Certain kinds of statistics are always available – Indexes – For non-indexed columns, histograms may be needed.
  • 9.
    Do my queryplans suffer from bad statistics?
  • 10.
    Will my queriesbenefit? ● Very complex question ● No definite answer ● Suggestions – ANALYZE for statements, r_filtered. – Slow query log
  • 11.
    ANALYZE for statementsand r_filtered ● filtered – % of rows left after applying condition (expectation) – r_filtered - ... - the reality ● r_filtered << filtered – the optimizer didn’t know the condition is selective – Happens on a non-first table? We are filtering out late! ● Add histogram on the column (Check the cond in FORMAT=JSON) analyze select * from lineitem, orders where o_orderkey=l_orderkey and o_orderdate between '1990-01-01' and '1998-12-06' and l_extendedprice > 1000000 +--+-----------+--------+----+-------------+-------+-------+-----------------+-------+-------+--------+----------+-----------+ |id|select_type|table |type|possible_keys|key |key_len|ref |rows |r_rows |filtered|r_filtered|Extra | +--+-----------+--------+----+-------------+-------+-------+-----------------+-------+-------+--------+----------+-----------+ |1 |SIMPLE |orders |ALL |PRIMARY,i_...|NULL |NULL |NULL |1504278|1500000| 50.00 | 100.00 |Using where| |1 |SIMPLE |lineitem|ref |PRIMARY,i_...|PRIMARY|4 |orders.o_orderkey|2 |4.00 | 100.00 | 0.00 |Using where| +--+-----------+--------+----+-------------+-------+-------+-----------------+-------+-------+--------+----------+-----------+
  • 12.
    # Query_time: 1.961549Lock_time: 0.011164 Rows_sent: 1 Rows_examined: 11745000 # Rows_affected: 0 Bytes_sent: 73 # Full_scan: Yes Full_join: No Tmp_table: No Tmp_table_on_disk: No # Filesort: No Filesort_on_disk: No Merge_passes: 0 Priority_queue: No # # explain: id select_type table type possible_keys key key_len ref rows r_rows filtered r_filtered Extra # explain: 1 SIMPLE inventory ALL NULL NULL NULL NULL 11837024 11745000.00 100.00 0.00 Using where # SET timestamp=1551155484; select count(inv_date_sk) from inventory where inv_quantity_on_hand>10000; Slow Query Log slow-query-log long-query-time=... log-slow-verbosity=query_plan,explain my.cnf hostname-slow.log ● Rows_examined >> Rows_sent? Grouping,or a poor query plan ● log_slow_query=explain will shows ANALYZE output
  • 13.
  • 14.
    Histograms in MariaDB ●Available since MariaDB 10.0 (Yes) ● Used by advanced users ● Not enabled by default ● Have limitations, not user-friendly ● MariaDB 10.4 – Fixes some of the limitations – Makes histograms easier to use
  • 15.
  • 16.
    Configuration for collectinghistograms histogram_size=0 histogram_type=SINGLE_PREC_HB histogram_size=254 histogram_type=DOUBLE_PREC_HB ● MariaDB before 10.4: change the default histogram size ● MariaDB 10.4 : enable automatic sampling histogram_size=254 histogram_type=DOUBLE_PREC_HB analyze_sample_percentage=100 analyze_sample_percentage=0
  • 17.
    Histograms are [still]not collected by default ● “ANALYZE TABLE” will not collect a histogram MariaDB> analyze table t1; +---------+---------+----------+----------+ | Table | Op | Msg_type | Msg_text | +---------+---------+----------+----------+ | test.t1 | analyze | status | OK | +---------+---------+----------+----------+ ● This will collect only – Total #rows in table – Index cardinalities (#different values)
  • 18.
    ANALYZE ... PERSISTENTcollects histograms – Collect statistics for everything: analyze table t1 persistent for columns (col1,...) indexes (idx1,...); +---------+---------+----------+-----------------------------------------+ | Table | Op | Msg_type | Msg_text | +---------+---------+----------+-----------------------------------------+ | test.t1 | analyze | status | Engine-independent statistics collected | | test.t1 | analyze | status | OK | +---------+---------+----------+-----------------------------------------+ analyze table t1 persistent for all;
  • 19.
    Can make histogramcollection automatic set use_stat_tables='preferably'; analyze table t1; +---------+---------+----------+-----------------------------------------+ | Table | Op | Msg_type | Msg_text | +---------+---------+----------+-----------------------------------------+ | test.t1 | analyze | status | Engine-independent statistics collected | | test.t1 | analyze | status | OK | +---------+---------+----------+-----------------------------------------+ ● Beware: this may be *much* slower than ANALYZE TABLE you’re used to ● Great for migrations
  • 20.
    Histogram collection performance ●MariaDB 10.0: uses all data in the table to build histogram – Precise, but expensive – Particularly so for VARCHARs ● A test on a real table: – Real table, 740M rows, 90GB – CHECKSUM TABLE: 5 min – ANALYZE TABLE ... PERSISTENT FOR ALL – 30 min
  • 21.
    MariaDB 10.4: Bernoullisampling ● Default: analyze_sample_percentage=100 – Uses the entire table, slow ● Suggested: analyze_sample_percentage=0 – “Roll the dice” sampling, size picked automatically analyze table t1 persistent for columns (...) indexes(); analyze table t1 persistent for all; – full table and secondary index scans – does a full table scan
  • 22.
    Further plans: genuinesampling ● Work on avoiding full table scans is in progress ● Will allow to make ANALYZE TABLE collect all histograms
  • 23.
  • 24.
    Make the optimizeruse histograms @@use_stat_tables=NEVER @@optimizer_use_condition_selectivity=1 @@use_stat_tables=PREFERABLY // also affects ANALYZE! @@optimizer_use_condition_selectivity=4 ● MariaDB before 10.4: does not use histograms ● MariaDB 10.4 : uses histograms if they are collected @@use_stat_tables=PREFERABLY_FOR_QUERIES @@optimizer_use_condition_selectivity=4 – remember to re-collect!
  • 25.
    Conclusions: how tostart using histograms ● MariaDB before 10.4 analyze_sample_percentage=0 use_stat_tables=PREFERABLY # Changes optimizer optimizer_use_condition_selectivity=4 # behavior ● MariaDB 10.4 ● Both: ANALYZE TABLE ... PERSISTENT FOR ... histogram_size=254 # No risk histogram_type=DOUBLE_PREC_HB #
  • 26.
    Can I justhave histograms for all columns?
  • 27.
    A stored procedureto analyze every table CREATE PROCEDURE analyze_persistent_for_all(db_name VARCHAR(64)) BEGIN DECLARE done INT DEFAULT FALSE; DECLARE x VARCHAR(64); DECLARE cur1 CURSOR FOR SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_SCHEMA=db_name; DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE; OPEN cur1; read_loop: LOOP FETCH cur1 INTO x; IF done THEN LEAVE read_loop; END IF; SET @sql = CONCAT('analyze table ', x, ' persistent for all'); PREPARE stmt FROM @sql; EXECUTE stmt; DEALLOCATE PREPARE stmt; END LOOP; CLOSE cur1; END|
  • 28.
    Should I ANALYZE... PERSISTENT every table? ● New application – Worth giving it a try – Provision for periodic ANALYZE – Column correlations? ● Existing application – Performance fixes on a case-by-case basis.
  • 29.
  • 30.
    TPC-DS benchmark ● scale=1 ●The same dataset – without histograms: ~20 min – after ‘call analyze_persistent_for_all(‘tpcds’) from two slides prior: 5 min.
  • 31.
  • 32.
    A customer casewith ORDER BY ... LIMIT ● table/column names replaced CREATE TABLE cars ( type varchar(10), company varchar(20), model varchar(20), quantity int, KEY quantity (quantity), KEY type (type) ); select * from cars where type='electric' and company='audi' order by quantity limit 3; ● table/column names replaced ● quantity matches the ORDER BY, but need to match condition ● type is a Restrictive index
  • 33.
    A customer casewith ORDER BY ... LIMIT ● Uses ORDER-BY compatible index by default *************************** 1. row *************************** id: 1 select_type: SIMPLE table: cars type: index possible_keys: type key: quantity key_len: 5 ref: const rows: 994266 r_rows: 700706.00 filtered: 0.20 r_filtered: 0.00 Extra: Using where 1 row in set (2.098 sec) select * from cars where type='electric' and company='audi' order by quantity limit 3;
  • 34.
    A customer casewith ORDER BY ... LIMIT ● Providing the optimizer with histogram *************************** 1. row *************************** id: 1 select_type: SIMPLE table: cars type: ref possible_keys: type key: type key_len: 13 ref: const rows: 2022 r_rows: 3.00 filtered: 100.00 r_filtered: 100.00 Extra: Using index condition; Using where; Using filesort 1 row in set (0.010 sec) analyze table cars persistent for all; select * from cars where type='electric' and company='audi' order by quantity limit 3;
  • 35.
  • 36.
    Histograms are storedin a table CREATE TABLE mysql.column_stats ( db_name varchar(64) NOT NULL, table_name varchar(64) NOT NULL, column_name varchar(64) NOT NULL, min_value varbinary(255) DEFAULT NULL, max_value varbinary(255) DEFAULT NULL, nulls_ratio decimal(12,4) DEFAULT NULL, avg_length decimal(12,4) DEFAULT NULL, avg_frequency decimal(12,4) DEFAULT NULL, hist_size tinyint unsigned, hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'), histogram varbinary(255), PRIMARY KEY (db_name,table_name,column_name) );
  • 37.
    TPC-DS benchmark ● Cansave/restore histograms ● Can set @@optimizer_use_condition_selectivity to disable histogram use per-thread
  • 38.
  • 39.
    Problem with correlatedconditions ● Possible selectivities – MIN(1/n, 1/m) – (1/n) * (1/m) – 0 select ... from order_items where shipdate='2015-12-15' AND item_name='christmas light' 'swimsuit'
  • 40.
    Problem with correlatedconditions ● PostgreSQL: Multi-variate statistics – Detects functional dependencies, col1=F(col2) – Only used for equality predicates – Also #DISTINCT(a,b) ● MariaDB: MDEV-11107: Use table check constraints in optimizer – In development select ... from order_items where shipdate='2015-12-15' AND item_name='christmas light' 'swimsuit'
  • 41.