12.9. GiST and GIN Index
Types

There are two kinds of indexes that can be used to speed up
full text searches. Note that indexes are not mandatory for full
text searching, but in cases where a column is searched on a
regular basis, an index will usually be desirable.

       CREATE INDEX name ON table USING gist(column);

Creates a GiST (Generalized Search Tree)-based index.
The column can be of
tsvector or tsquery type.

       CREATE INDEX name ON table USING gin(column);

Creates a GIN (Generalized Inverted Index)-based index.
The column must be of
tsvector type.

There are substantial performance differences between the two
index types, so it is important to understand which to use.

A GiST index is lossy, meaning that
the index may produce false matches, and it is necessary to check
the actual table row to eliminate such false matches.
PostgreSQL does this
automatically; for example, in the query plan below, the
Filter: line indicates the index output
will be rechecked:

EXPLAIN SELECT * FROM apod WHERE textsearch @@ to_tsquery('supernovae');
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using textsearch_gidx on apod  (cost=0.00..12.29 rows=2 width=1469)
Index Cond: (textsearch @@ '''supernova'''::tsquery)
Filter: (textsearch @@ '''supernova'''::tsquery)

GiST indexes are lossy because each document is represented in
the index by a fixed-length signature. The signature is generated
by hashing each word into a random bit in an n-bit string, with
all these bits OR-ed together to produce an n-bit document
signature. When two words hash to the same bit position there
will be a false match. If all words in the query have matches
(real or false) then the table row must be retrieved to see if
the match is correct.

Lossiness causes performance degradation due to useless
fetches of table records that turn out to be false matches. Since
random access to table records is slow, this limits the
usefulness of GiST indexes. The likelihood of false matches
depends on several factors, in particular the number of unique
words, so using dictionaries to reduce this number is
recommended.

GIN indexes are not lossy but their performance depends
logarithmically on the number of unique words.

Actually, GIN indexes store only the words (lexemes) of
tsvector values, and not their weight
labels. Thus, while a GIN index can be considered non-lossy for a
query that does not specify weights, it is lossy for one that
does. Thus a table row recheck is needed when using a query that
involves weights. Unfortunately, in the current design of
PostgreSQL, whether a recheck is
needed is a static property of a particular operator, and not
something that can be enabled or disabled on-the-fly depending on
the values given to the operator. To deal with this situation
without imposing the overhead of rechecks on queries that do not
need them, the following approach has been adopted:

  • The standard text match operator @@ is marked as non-lossy for GIN indexes.

  • An additional match operator @@@
    is provided, and marked as lossy for GIN indexes. This
    operator behaves exactly like @@
    otherwise.

  • When a GIN index search is initiated with the @@ operator, the index support code will throw
    an error if the query specifies any weights. This protects
    against giving wrong answers due to failure to recheck the
    weights.

In short, you must use @@@ rather
than @@ to perform GIN index searches on
queries that involve weight restrictions. For queries that do not
have weight restrictions, either operator will work, but
@@ will be faster. This awkwardness will
probably be addressed in a future release of PostgreSQL.

In choosing which index type to use, GiST or GIN, consider
these performance differences:

  • GIN index lookups are about three times faster than
    GiST

  • GIN indexes take about three times longer to build than
    GiST

  • GIN indexes are about ten times slower to update than
    GiST

  • GIN indexes are two-to-three times larger than GiST

As a rule of thumb, GIN
indexes are best for static data because lookups are faster. For
dynamic data, GiST indexes are faster to update. Specifically,
GiST indexes are very good for
dynamic data and fast if the number of unique words (lexemes) is
under 100,000, while GIN
indexes will handle 100,000+ lexemes better but are slower to
update.

Note that GIN index build
time can often be improved by increasing maintenance_work_mem,
while GiST index build time is
not sensitive to that parameter.

Partitioning of big collections and the proper use of GiST and
GIN indexes allows the implementation of very fast searches with
online update. Partitioning can be done at the database level
using table inheritance and constraint_exclusion, or by distributing documents
over servers and collecting search results using the contrib/dblink extension module. The latter is
possible because ranking functions use only local
information.

声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理。