PostgreSQL索引(10) – Bloom

2020-10-01 孟庆钟

本文翻译自Egor Rogov的英文博客，且已征得Egor Rogov的同意。译者做了少量修改。

审校：纪昀红，中国人民大学信息学院在读博士生。

在前面几篇文章中，我们讨论了通用索引引擎、AM的接口和六种AM（Hash索引、B-tree索引、GiST索引、SP-GiST索引、GIN索引、RUM索引、BRIN索引）。这篇文章将介绍Bloom索引。

1. 基本概念

一个经典的Bloom filter是一个数据结构，使用它可以快速的检查一个元素是否属于一个集合。过滤器被高度压缩，但是它可能会产生假阳性的结果：它可能会误认为一个元素在集合中存在，实际这个元素在集合中并不存在。如果它认为一个元素在集合中不存在，那么这个元素在集合中肯定不存在。

过滤器是一个由比特位组成的数组（也被称为签名），所有比特最初都被初始化成零。当向集合中添加一个元素时，我们会使用多个hash函数对这个元素计算hash值，把每个hash值对应的比特位置为1。所以，当一个元素对应的所有比特位都为1时，这个元素可能属于这个集合。但是当任意一位为0时，则这个元素肯定不属于这个集合。

通过改变签名的长度，我们可以在索引体积和假阳性的概率之间得到一个平衡。Bloom索引的应用非常广泛，相当多的宽表都在每个字段上使用过滤器。这种AM，和BRIN类似，也可以被看作是顺序扫描的加速器：索引返回的所有元组，都需要被recheck，但是有可能避免扫描大部分不满足条件的元组。

2. 结构

我们在GiST索引中讨论过签名。和这些树不同，Bloom filter是一个平坦的结构。它的第0页是元页面，后面所有的页面都是包含索引行的常规页面。每个索引行包含一个签名和一个指向表中行的TID，如下图所示：

2.1 参数

在创建Bloom索引时，可以指定每个签名的总长度，也可以指定每个字段需要将签名中多少位置1。

create index on ... using bloom(...) with (length=..., col1=..., col2=..., ...);

我们应该怎样选取合适的值呢？理论表明，给定一个假阳性的概率，签名的最优长度可以用$m=-n\log_2 p/\ln 2$估计，其中$n$是索引中字段的个数，签名中需要设置为1的位数$k$为$k=-\log_2 p$。

签名的长度会被向上取整到2字节的倍数。

需要注意几点：

假阳性的概率p与一个filter相关，我们期望在一次扫描中得到Np个假阳性的结果。例如一个表有100万行，概率为0.01，平均来讲，可以观察到Rows Removed by Index Recheck的数量为10000。
Bloom filter是一个概率结构。只有在大量数据的场景下，这种概率才有意义。
上面的估计基于一个理想的数学模型和一些假设。实际中，结果要糟糕一些。所以，不要过分依赖这个公式：它们仅仅提供了一种选取初始值的方法。
对每个字段来说，这个AM允许我们选择在签名中置1的位数。实际中最优的位数与数据的分布有关系。如果想深挖，请参考这篇文章。但是也别忘了上一条。

2.2 更新

当向表中插入一行数据时，签名被计算出来：对所有被索引的字段，与它们对应的位全部被置为1。理论上来说，我们需要k个不同的hash函数，实际中，只需要连续生成k个随机数即可。随机数的种子有hash值决定。

当一行数据被删除时，整个签名全部被删除。

更新操作由两个操作组合而成：删除旧版本，插入新版本。

2.3 扫描

因为Bloom只能用来检查某个元素是否属于某个集合，所以Bloom索引支持的操作只有等值检查，就像Hash索引一样。

上文已经提到，Bloom索引是一个平坦的结构，所以在访问它的时候，总需要把它按顺序完整的读出来。在读取的过程中，将位图构造出来，这个位图用来扫描基表。

读取Bloom索引时，实际就是对索引进行全部扫描。为了防止扫描过程把一些有用的缓存换出，在读取时会使用一个小的缓冲区环，就像对基表进行全表扫描一样。

我们需要注意到，Bloom索引的体积越大，优化器对它越不感兴趣。这个依赖是线性的，这和树状的索引不同。

3. 示例

3.1 表

我们继续使用上一篇文章中的表flights_bi。表的大小约4GB，总共包含大约3000万行。表的定义如下：

\d flights_bi
                          Table "bookings.flights_bi"
       Column       |           Type           | Collation | Nullable | Default 
--------------------+--------------------------+-----------+----------+---------
 airport_code       | character(3)             |           |          | 
 airport_coord      | point                    |           |          | 
 airport_utc_offset | interval                 |           |          | 
 flight_no          | character(6)             |           |          | 
 flight_type        | text                     |           |          | 
 scheduled_time     | timestamp with time zone |           |          | 
 actual_time        | timestamp with time zone |           |          | 
 aircraft_code      | character(3)             |           |          | 
 seat_no            | character varying(4)     |           |          | 
 fare_conditions    | character varying(10)    |           |          | 
 passenger_id       | character varying(20)    |           |          | 
 passenger_name     | text                     |           |          |

create extension bloom;

在BRIN索引中，我们使用了3个字段(scheduled_time, actual_time, airport_utc_offset)。因为Bloom索引不需要依赖数据在物理上有序，我们尝试在索引中使用表中几乎所有的字段。当然，我们也要排除时间字段（scheduled_time，actual_tim）：因为它只支持等值查询，在一个精确的时间上做等值查询基本没有意义。虽然可以创建表达式索引，但我们还是不费力了。我们排除机场的地理坐标airport_coord：point类型还不被支持。

为了计算参数的时，我们把假阳性的概率设置为0.01，带入上面的公式，可以计算出，签名需要96个比特，建议对每个字段把签名中7位置1。索引估计为515MB（大约为基表大小的八分之一）。

虽然使用16位签名，索引的体积更小，但是这会导致50%的假阳性，这太差了。

3.2 操作符类

首先创建索引：

create index flights_bi_bloom on flights_bi
using bloom(airport_code, airport_utc_offset, flight_no, flight_type, aircraft_code, seat_no, fare_conditions, passenger_id, passenger_name)
with (length=96, col1=7, col2=7, col3=7, col4=7, col5=7, col6=7, col7=7, col8=7, col9=7);

ERROR:  data type character has no default operator class for access method "bloom"
HINT:  You must specify an operator class for the index or define a default operator class for the data type.

不幸的是，它只支持两个操作符类：

select opcname, opcintype::regtype
from pg_opclass
where opcmethod = (select oid from pg_am where amname = 'bloom')
order by opcintype::regtype::text;

 opcname  | opcintype
----------+-----------
 int4_ops | integer
 text_ops | text
(2 rows)

幸运的是，为其它数据类型创建相似的操作符类非常简单。Bloom索引的每个操作符类都必须包含一个操作符（相等）和一个辅助函数（哈希函数）。寻找操作符和哈希函数最简单的方法是查找Hash索引的操作符类：

select distinct
       opc.opcintype::regtype::text,
       amop.amopopr::regoperator,
       ampr.amproc
from pg_am am, pg_opclass opc, pg_amop amop, pg_amproc ampr
where am.amname = 'hash'
   and opc.opcmethod = am.oid
   and amop.amopfamily = opc.opcfamily
   and amop.amoplefttype = opc.opcintype
   and amop.amoprighttype = opc.opcintype
   and ampr.amprocfamily = opc.opcfamily
   and ampr.amproclefttype = opc.opcintype
order by opc.opcintype::regtype::text;

 opcintype |       amopopr        |    amproc    
-----------+----------------------+--------------
 abstime   | =(abstime,abstime)   | hashint4
 aclitem   | =(aclitem,aclitem)   | hash_aclitem
 anyarray  | =(anyarray,anyarray) | hash_array
 anyenum   | =(anyenum,anyenum)   | hashenum
 anyrange  | =(anyrange,anyrange) | hash_range
 ...

我们创建两个操作符类：

CREATE OPERATOR CLASS character_ops
DEFAULT FOR TYPE character USING bloom AS
  OPERATOR  1  =(character,character),
  FUNCTION  1  hashbpchar;

CREATE OPERATOR CLASS interval_ops
DEFAULT FOR TYPE interval USING bloom AS
  OPERATOR  1  =(interval,interval),
  FUNCTION  1  interval_hash;

point类型没有对应的hash函数，索引不能为它创建Bloom索引，也不能对这种类型使用hash连接。

再次尝试一下：

create index flights_bi_bloom on flights_bi
using bloom(airport_code, airport_utc_offset, flight_no, flight_type, aircraft_code, seat_no, fare_conditions, passenger_id, passenger_name)
with (length=96, col1=7, col2=7, col3=7, col4=7, col5=7, col6=7, col7=7, col8=7, col9=7);

CREATE INDEX

索引的大小为526MB，比估计的稍大一些。这是因为我们在估算时没有考虑页面上的其它开销。

select pg_size_pretty(pg_total_relation_size('flights_bi_bloom'));

 pg_size_pretty
----------------
 526 MB
(1 row)

3.3 查询

我们使用各种标准，对索引进行查询。

就像已经提到的那样，Bloom filter是一个概率性的数据结构，因此它高度依赖具体的场景。例如，我们查看与两个乘客相关的行。Miroslav Sidorov：

explain(costs off,analyze)
select * from flights_bi where passenger_name='MIROSLAV SIDOROV';

                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=2639.010..3010.692 rows=2 loops=1)
   Recheck Cond: (passenger_name = 'MIROSLAV SIDOROV'::text)
   Rows Removed by Index Recheck: 38562
   Heap Blocks: exact=21726
   ->  Bitmap Index Scan on flights_bi_bloom (actual time=1065.191..1065.191 rows=38564 loops=1)
         Index Cond: (passenger_name = 'MIROSLAV SIDOROV'::text)
 Planning time: 0.109 ms
 Execution time: 3010.736 ms

Marfa Soloveva:

explain(costs off,analyze)
select * from flights_bi where passenger_name='MARFA SOLOVEVA';

                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=9980.884..10142.684 rows=2 loops=1)
   Recheck Cond: (passenger_name = 'MARFA SOLOVEVA'::text)
   Rows Removed by Index Recheck: 3950168
   Heap Blocks: exact=45757 lossy=67332
   ->  Bitmap Index Scan on flights_bi_bloom (actual time=1037.588..1037.588 rows=212972 loops=1)
         Index Cond: (passenger_name = 'MARFA SOLOVEVA'::text)
 Planning time: 0.157 ms
 Execution time: 10142.730 ms

在第一个查询中，只有40000个假阳性结果，但是在第二个查询中有4000000个。执行时间也不同。

下面使用乘客的id进行查询，而不是姓名：

explain(costs off,analyze)
select * from flights_bi where passenger_id='5864 006033';

                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=13747.305..16907.387 rows=2 loops=1)
   Recheck Cond: ((passenger_id)::text = '5864 006033'::text)
   Rows Removed by Index Recheck: 9620258
   Heap Blocks: exact=50510 lossy=165722
   ->  Bitmap Index Scan on flights_bi_bloom (actual time=937.202..937.202 rows=426474 loops=1)
         Index Cond: ((passenger_id)::text = '5864 006033'::text)
 Planning time: 0.110 ms
 Execution time: 16907.423 ms

Marfa:

explain(costs off,analyze)
select * from flights_bi where passenger_id='2461 559238';

                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=3881.615..3934.481 rows=2 loops=1)
   Recheck Cond: ((passenger_id)::text = '2461 559238'::text)
   Rows Removed by Index Recheck: 30669
   Heap Blocks: exact=27513
   ->  Bitmap Index Scan on flights_bi_bloom (actual time=1084.391..1084.391 rows=30671 loops=1)
         Index Cond: ((passenger_id)::text = '2461 559238'::text)
 Planning time: 0.120 ms
 Execution time: 3934.517 ms

这两个查询的效率也有很大差距。

注意到，如果可以同时使用两个索引，则假阳性的概率可以降低到万分之一。

explain(costs off,analyze)
select * from flights_bi
where passenger_name='MIROSLAV SIDOROV'
  and passenger_id='5864 006033';

                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on flights_bi (actual time=872.593..877.915 rows=2 loops=1)
   Recheck Cond: (((passenger_id)::text = '5864 006033'::text)
               AND (passenger_name = 'MIROSLAV SIDOROV'::text))
   Rows Removed by Index Recheck: 357
   Heap Blocks: exact=356
   ->  Bitmap Index Scan on flights_bi_bloom (actual time=832.041..832.041 rows=359 loops=1)
         Index Cond: (((passenger_id)::text = '5864 006033'::text)
                   AND (passenger_name = 'MIROSLAV SIDOROV'::text))
 Planning time: 0.524 ms
 Execution time: 877.967 ms

但是，Bloom索引根本不支持or。这是优化器的限制，而不是Bloom filter的限制。当然，其中一个选择就是读取两个索引，计算两个位图，然后求交集，但是代价很大。

4. 与BRIN和Hash索引的比较

很明显，Bloom索引与BRIN的应用场景有交集。它们都面对需要使用不同字段对大表进行扫描的场景，以损失精度换取空间上的压缩。

BRIN索引更加紧凑，可以快速排除一个区间上所有的元组，但是基于一个很强的假设：数据的顺序与物理位置相关。Bloom索引要大一些，但是它除了需要一个合适的Hash函数之外，没有任何假设。

和Bloom索引相似，Hash索引也只支持等值查询。Hash索引的精度比Bloom索引高，但是需要的空间也大。

5. 属性

和往常一样，看一下Bloom filter的属性，查询见第二篇文章。

AM层面的属性：

 amname |     name      | pg_indexam_has_property
--------+---------------+-------------------------
 bloom  | can_order     | f
 bloom  | can_unique    | f
 bloom  | can_multi_col | t
 bloom  | can_exclude   | f

它支持多列索引。

某个索引层面的属性：

     name      | pg_index_has_property
---------------+-----------------------
 clusterable   | f
 index_scan    | f
 bitmap_scan   | t
 backward_scan | f

它支持位图扫描。

列层面的属性：

        name        | pg_index_column_has_property 
--------------------+------------------------------
 asc                | f
 desc               | f
 nulls_first        | f
 nulls_last         | f
 orderable          | f
 distance_orderable | f
 returnable         | f
 search_array       | f
 search_nulls       | f

数据库内核研究

MengQingzhong.com