Mr Dk.'s BlogMr Dk.'s Blog
  • 🦆 About Me
  • ⛏️ Technology Stack
  • 🔗 Links
  • 🗒️ About Blog
  • Algorithm
  • C++
  • Compiler
  • Cryptography
  • DevOps
  • Docker
  • Git
  • Java
  • Linux
  • MS Office
  • MySQL
  • Network
  • Operating System
  • Performance
  • PostgreSQL
  • Productivity
  • Solidity
  • Vue.js
  • Web
  • Wireless
  • 🐧 How Linux Works (notes)
  • 🐧 Linux Kernel Comments (notes)
  • 🐧 Linux Kernel Development (notes)
  • 🐤 μc/OS-II Source Code (notes)
  • ☕ Understanding the JVM (notes)
  • ⛸️ Redis Implementation (notes)
  • 🗜️ Understanding Nginx (notes)
  • ⚙️ Netty in Action (notes)
  • ☁️ Spring Microservices (notes)
  • ⚒️ The Annotated STL Sources (notes)
  • ☕ Java Development Kit 8
GitHub
  • 🦆 About Me
  • ⛏️ Technology Stack
  • 🔗 Links
  • 🗒️ About Blog
  • Algorithm
  • C++
  • Compiler
  • Cryptography
  • DevOps
  • Docker
  • Git
  • Java
  • Linux
  • MS Office
  • MySQL
  • Network
  • Operating System
  • Performance
  • PostgreSQL
  • Productivity
  • Solidity
  • Vue.js
  • Web
  • Wireless
  • 🐧 How Linux Works (notes)
  • 🐧 Linux Kernel Comments (notes)
  • 🐧 Linux Kernel Development (notes)
  • 🐤 μc/OS-II Source Code (notes)
  • ☕ Understanding the JVM (notes)
  • ⛸️ Redis Implementation (notes)
  • 🗜️ Understanding Nginx (notes)
  • ⚙️ Netty in Action (notes)
  • ☁️ Spring Microservices (notes)
  • ⚒️ The Annotated STL Sources (notes)
  • ☕ Java Development Kit 8
GitHub
  • 📝 Notes
    • Algorithm
      • Algorithm - Bloom Filter
      • Algorithm - Disjoint Set
      • Algorithm - Fast Power
      • Algorithm - KMP
      • Algorithm - Monotonic Stack
      • Algorithm - RB-Tree
      • Algorithm - Regular Expression
      • Algorithm - Sliding Window
      • Online Judge - I/O
    • C++
      • C++ - Const
      • C++ File I/O
      • C++ - Object Layout
      • C++ - Operator Overload
      • C++ - Polymorphism
      • C++ STL algorithm
      • C++ STL map
      • C++ STL multimap
      • C++ STL priority_queue
      • C++ STL set
      • C++ STL string
      • C++ STL unordered_map
      • C++ STL vector
      • C++ - Smart Pointer
      • C++ - Template & Genericity
    • Compiler
      • ANTLR - Basic
      • Compiler - LLVM Architecture
      • Compiler - Multi-version GCC
    • Cryptography
      • Cryptography - Certbot
      • Cryptography - Digital Signature & PKCS #7
      • Cryptography - GPG
      • Cryptography - JWT
      • Cryptography - Keystore & Certificates
      • Cryptography - OAuth 2.0
      • Cryptography - Java 实现对称与非对称加密算法
      • Cryptography - TLS
    • DevOps
      • DevOps - Travis CI
    • Docker
      • Docker - Image & Storage Management
      • Docker - Image
      • Docker - Libcontainer
      • Docker - Multi-Arch Image
      • Docker - Multi-Stage Build
      • Docker - Network
      • Docker - Orchestration & Deployment
      • Docker - Overview
      • Docker - Service Building
      • Docker - Volume & Network Usage
      • Docker - Volume
      • Linux - Control Group
      • Linux - Namespace
    • Git
      • Git - Branch & Merge
      • Git - Cached
      • Git - Cherry Pick
      • Git - Commit
      • Git - Patch
      • Git - Proxy
      • Git - Rebase
      • Git - Reset
      • Git - Stash
      • Git - Theme for Git-Bash
    • Java
      • JVM - Synchronized
      • JVM - Volatile
      • Java - Annotation 注解
      • Java - BIO & NIO
      • Java - Class Path
      • Java - Condition and LockSupport
      • Java - Current Timestamp
      • Java - Deep Copy
      • Java - 运行环境配置
      • Java - Equals
      • Java - Exporting JAR
      • Java - Javadoc
      • Java - Lock
      • Java - Maven 项目构建工具
      • Java - References
      • Java - Reflection Mechanism
      • Java - String Split
      • Java - Thread Pool
      • Java - Thread
      • Tomcat - Class Loader
      • Tomcat - Container
    • Linux
      • addr2line
      • cut
      • df
      • du
      • fallocate
      • find
      • fio
      • grep
      • groupadd
      • gzip
      • head / tail
      • hexdump
      • iostat
      • iotop
      • kill
      • ldd
      • lsof
      • ltrace / strace
      • mpstat
      • netstat
      • nm
      • pidstat
      • pmap
      • readlink
      • readlink
      • rpm2cpio / rpm2archive
      • sort
      • tee
      • uniq
      • useradd
      • usermod
      • watch
      • wc
      • which
      • xargs
    • MS Office
      • MS Office - Add-in Dev
      • MS Office - Application
    • MySQL
      • InnoDB - Architecture
      • InnoDB - Backup
      • InnoDB - Checkpoint
      • InnoDB - Critical Features
      • InnoDB - Files
      • InnoDB - Index
      • InnoDB - Insert Buffer
      • InnoDB - Lock
      • InnoDB - Partition Table
      • InnoDB - Table Storage
      • MySQL - Server Configuration
      • MySQL - Storage Engine
    • Network
      • Network - ARP
      • Network - FTP
      • Network - GitHub Accelerating
      • HTTP - Message Format
      • HTTP - POST 提交表单的两种方式
      • Network - Proxy Server
      • Network - SCP
      • Network - SSH
      • Network - TCP Congestion Control
      • Network - TCP Connection Management
      • Network - TCP Flow Control
      • Network - TCP Retransmission
      • Network - Traceroute
      • Network - V2Ray
      • Network - WebSocket
      • Network - Windows 10 Mail APP
      • Network - frp
    • Operating System
      • Linux - Kernel Compilation
      • Linux - Multi-OS
      • Linux - Mutex & Condition
      • Linux - Operations
      • Linux: Package Manager
      • Linux - Process Manipulation
      • Linux - User ID
      • Linux - Execve
      • OS - Compile and Link
      • OS - Dynamic Linking
      • OS - ELF
      • Linux - Image
      • OS - Loading
      • OS - Shared Library Organization
      • OS - Static Linking
      • Syzkaller - Architecture
      • Syzkaller - Description Syntax
      • Syzkaller - Usage
      • Ubuntu - Desktop Recover (Python)
      • WSL: CentOS 8
    • Performance
      • Linux Performance - Perf Event
      • Linux Performance - Perf Record
      • Linux Performance - Perf Report
      • Linux Performance - Flame Graphs
      • Linux Performance - Off CPU Analyze
    • PostgreSQL
      • PostgreSQL - ANALYZE
      • PostgreSQL - Atomics
      • PostgreSQL - CREATE INDEX CONCURRENTLY
      • PostgreSQL - COPY FROM
      • PostgreSQL - COPY TO
      • PostgreSQL - Executor: Append
      • PostgreSQL - Executor: Group
      • PostgreSQL - Executor: Limit
      • PostgreSQL - Executor: Material
      • PostgreSQL - Executor: Nest Loop Join
      • PostgreSQL - Executor: Result
      • PostgreSQL - Executor: Sequential Scan
      • PostgreSQL - Executor: Sort
      • PostgreSQL - Executor: Unique
      • PostgreSQL - FDW Asynchronous Execution
      • PostgreSQL - GUC
      • PostgreSQL - Locking
      • PostgreSQL - LWLock
      • PostgreSQL - Multi Insert
      • PostgreSQL - Plan Hint GUC
      • PostgreSQL - Process Activity
      • PostgreSQL - Query Execution
      • PostgreSQL - Resource Owner
      • PostgreSQL - Spinlock
      • PostgreSQL - Storage Management
      • PostgreSQL - VFD
      • PostgreSQL - WAL Insert
      • PostgreSQL - WAL Prefetch
    • Productivity
      • LaTeX
      • Venn Diagram
      • VuePress
    • Solidity
      • Solidity - ABI Specification
      • Solidity - Contracts
      • Solidity - Expressions and Control Structures
      • Solidity - Layout and Structure
      • Solidity - Remix IDE
      • Solidity - Slither
      • Solidity - Types
      • Solidity - Units and Globally Available Variables
    • Vue.js
      • Vue.js - Environment Variable
    • Web
      • Web - CORS
      • Web - OpenAPI Specification
    • Wireless
      • Wireless - WEP Cracking by Aircrack-ng
      • Wireless - WPS Cracking by Reaver
      • Wireless - wifiphisher

PostgreSQL - CREATE INDEX CONCURRENTLY

Created by: Mr Dk.

2025 / 04 / 05 14:50

Ningbo, Zhejiang, China


Background

索引通常被用于加速数据库的查询性能。理论上,在一张表上构建索引需要至少把表的全量数据扫描一遍,对于比较大的表来说,可能需要持续较长的时间;在此期间,表的内容不能被修改,否则构建出的索引与表的内容将会不一致。

对于在线系统来说,长时间锁住一张表无法修改是不可接受的。PostgreSQL 支持 并发创建索引 (CREATE INDEX CONCURRENTLY, CIC)。并发创建索引将会以较低的锁等级对表上锁,不阻塞 DML 操作的进行。并发索引构建会持续多个阶段完成,其消耗的整体资源和时间远大于非并发的索引创建。本文基于 PostgreSQL 17 源代码简析这个过程。

Related: Heap Only Tuples

由于 CIC 与 PostgreSQL 的 HOT (Heap-Only Tuples) 有一定关联,所以需要先简析一下 HOT 的原理。PostgreSQL 使用 MVCC 来存储表中的数据。对于每行数据,物理文件中可能会同时保存多个版本,每个事务根据其获取的快照来决定可见的版本是哪一个。当一行数据发生更新时,被更新的旧数据保持不变,表中将会插入一行新版本的数据。如果表上有索引,那么这行数据也需要被加入到索引中;如果表上有多个索引,那么每个索引都需要被更新,一次更新操作代价将会很大。

索引元组中保存的内容为:

  • ScanKey:被索引列的值
  • CTID:被索引行的位置(块号 + 块内偏移量)

当非索引列值被更新,并且当前页面中可以同时存放新旧两个版本的元组时,新元组会被直接插入到同一个页面中,并在旧元组的 header 中保存新元组的位置,形成 HOT 链。这样索引就不再需要被更新了,CTID 依旧指向旧版本元组。旧版本元组被标记为 HEAP_HOT_UPDATED,新版本元组被标记为 HEAP_ONLY_TUPLE。

    Index points to 1
    lp [1]  [2]

    [111111111]->[2222222222]

在后续查询时,虽然新版本的元组没有被索引直接引用到,但依旧可以通过旧元组 + HOT 链被查找到。索引扫描回表时,只要发现元组被标记为 HEAP_HOT_UPDATED,就需要沿着 HOT 链继续搜索。HOT 仅可能发生在一个页面内,所以沿着 HOT 链的搜索不会有额外的 I/O 开销。

后续当旧版本的元组不再可见而被删除时,其页内空间将会被回收,但行指针空间依旧被保留,保证 HOT 链不断裂:

    Index points to 1
    lp [1]->[2]

    [2222222222]

HOT 链并不会无限延长下去。当这一行被再次更新时,原先的新版本元组也会被标记为 HEAP_HOT_UPDATED,形成:

    Index points to 1
    lp [1]->[2]  [3]

    [2222222222]->[3333333333]

当所有能够看到 [2] 版本的事务结束后,由于任何索引都没有直接引用这个版本,所以这个版本的数据和行指针都可以被安全回收;索引元组依旧不用更新。HOT 链变为:

    Index points to 1
    lp [1]------>[3]

    [3333333333]

为什么 HOT 和创建索引有关呢?正是因为是否能够 HOT 的决策条件中包含了当前更新列上是否有索引。存在一个索引之前,某行数据更新是可以 HOT 的;但有了索引之后可能就不行了。所以在并发创建索引时,需要控制好新建索引的过程中依旧能够维持 HOT 的定义。

CIC

创建索引属于 DDL,从下面的执行器的入口函数开始:

/*
 * The "Slow" variant of ProcessUtility should only receive statements
 * supported by the event triggers facility.  Therefore, we always
 * perform the trigger support calls if the context allows it.
 */
static void
ProcessUtilitySlow(ParseState *pstate,
                   PlannedStmt *pstmt,
                   const char *queryString,
                   ProcessUtilityContext context,
                   ParamListInfo params,
                   QueryEnvironment *queryEnv,
                   DestReceiver *dest,
                   QueryCompletion *qc)
{
    /* ... */

    /* PG_TRY block is to ensure we call EventTriggerEndCompleteQuery */
    PG_TRY();
    {
        /* ... */

        switch (nodeTag(parsetree))
        {
            /* ... */

            case T_IndexStmt:   /* CREATE INDEX */
                {
                    IndexStmt  *stmt = (IndexStmt *) parsetree;
                    Oid         relid;
                    LOCKMODE    lockmode;
                    int         nparts = -1;
                    bool        is_alter_table;

                    if (stmt->concurrent)
                        PreventInTransactionBlock(isTopLevel,
                                                  "CREATE INDEX CONCURRENTLY");

                    /*
                     * Look up the relation OID just once, right here at the
                     * beginning, so that we don't end up repeating the name
                     * lookup later and latching onto a different relation
                     * partway through.  To avoid lock upgrade hazards, it's
                     * important that we take the strongest lock that will
                     * eventually be needed here, so the lockmode calculation
                     * needs to match what DefineIndex() does.
                     */
                    lockmode = stmt->concurrent ? ShareUpdateExclusiveLock
                        : ShareLock;
                    relid =
                        RangeVarGetRelidExtended(stmt->relation, lockmode,
                                                 0,
                                                 RangeVarCallbackOwnsRelation,
                                                 NULL);

                    /* ... */

                    /* Run parse analysis ... */
                    stmt = transformIndexStmt(relid, stmt, queryString);

                    /* ... and do it */
                    EventTriggerAlterTableStart(parsetree);
                    address =
                        DefineIndex(relid,  /* OID of heap relation */
                                    stmt,
                                    InvalidOid, /* no predefined OID */
                                    InvalidOid, /* no parent index */
                                    InvalidOid, /* no parent constraint */
                                    nparts, /* # of partitions, or -1 */
                                    is_alter_table,
                                    true,   /* check_rights */
                                    true,   /* check_not_in_use */
                                    false,  /* skip_build */
                                    false); /* quiet */

                    /* ... */
                }
                break;

            /* ... */

            default:
                elog(ERROR, "unrecognized node type: %d",
                     (int) nodeTag(parsetree));
                break;
        }

        /* ... */
    }
    PG_FINALLY();
    {
        /* ... */
    }
    PG_END_TRY();
}

从上面的代码片段可以看出,由于并发索引创建是通过多个事务完成的,所以无法在事务块内执行。另外,传统索引创建与并发索引创建对表的上锁等级不同。根据 PostgreSQL 的 表锁兼容矩阵:

#define AccessShareLock         1   /* SELECT */
#define RowShareLock            2   /* SELECT FOR UPDATE/FOR SHARE */
#define RowExclusiveLock        3   /* INSERT, UPDATE, DELETE */
#define ShareUpdateExclusiveLock 4  /* VACUUM (non-FULL), ANALYZE, CREATE
                                     * INDEX CONCURRENTLY */
#define ShareLock               5   /* CREATE INDEX (WITHOUT CONCURRENTLY) */
#define ShareRowExclusiveLock   6   /* like EXCLUSIVE MODE, but allows ROW
                                     * SHARE */
#define ExclusiveLock           7   /* blocks ROW SHARE/SELECT...FOR UPDATE */
#define AccessExclusiveLock     8   /* ALTER TABLE, DROP TABLE, VACUUM FULL,
                                     * and unqualified LOCK TABLE */

传统索引创建的 ShareLock 锁阻塞了一切写操作;而 CIC 的 ShareUpdateExclusiveLock 与 DML 操作所需要的 RowExclusiveLock 不冲突,所以不阻塞 DML。

后续的索引创建逻辑全部在 DefineIndex 函数中完成。

Phase 1:创建空的无效索引,避免不一致的 HOT 更新

CIC 需要首先创建一个空的无效索引并将事务提交,使其能够被其它事务感知到。可见这个新索引的事务并不需要维护这个索引,也不能使用这个索引。创建这个空索引的作用是影响其它事务对 HOT 更新的决策:由于此时新索引已经存在,因此后续的更新操作如果修改了这个索引包含的列值,则不能再使用 HOT 更新了。这样可以避免后续索引创建完毕后,存在不符合定义的 HOT 链。

ObjectAddress
DefineIndex(Oid tableId,
            IndexStmt *stmt,
            Oid indexRelationId,
            Oid parentIndexId,
            Oid parentConstraintId,
            int total_parts,
            bool is_alter_table,
            bool check_rights,
            bool check_not_in_use,
            bool skip_build,
            bool quiet)
{
    /* ... */

    /*
     * Force non-concurrent build on temporary relations, even if CONCURRENTLY
     * was requested.  Other backends can't access a temporary relation, so
     * there's no harm in grabbing a stronger lock, and a non-concurrent DROP
     * is more efficient.  Do this before any use of the concurrent option is
     * done.
     */
    if (stmt->concurrent && get_rel_persistence(tableId) != RELPERSISTENCE_TEMP)
        concurrent = true;
    else
        concurrent = false;

    /* ... */

    /*
     * Make the catalog entries for the index, including constraints. This
     * step also actually builds the index, except if caller requested not to
     * or in concurrent mode, in which case it'll be done later, or doing a
     * partitioned index (because those don't have storage).
     */
    flags = constr_flags = 0;
    if (stmt->isconstraint)
        flags |= INDEX_CREATE_ADD_CONSTRAINT;
    if (skip_build || concurrent || partitioned)
        flags |= INDEX_CREATE_SKIP_BUILD;
    if (stmt->if_not_exists)
        flags |= INDEX_CREATE_IF_NOT_EXISTS;
    if (concurrent)
        flags |= INDEX_CREATE_CONCURRENT;
    /* ... */

    indexRelationId =
        index_create(rel, indexRelationName, indexRelationId, parentIndexId,
                     parentConstraintId,
                     stmt->oldNumber, indexInfo, indexColNames,
                     accessMethodId, tablespaceId,
                     collationIds, opclassIds, opclassOptions,
                     coloptions, NULL, reloptions,
                     flags, constr_flags,
                     allowSystemTableMods, !check_rights,
                     &createdConstraintId);

    /* ... */
}

上述代码首先完成索引创建前的各项准备,包含索引列确认、权限检查、索引名称确定、索引表达式编译等。然后根据索引创建的类型确认标志位。对于 CIC 来说,需要设置的标志位有 INDEX_CREATE_SKIP_BUILD 和 INDEX_CREATE_CONCURRENT。前者表示先不创建索引本身,仅更新系统表,使其它事务能够可见这个索引;后者表示当前要进行的是并发索引创建。将这个标志位传入 index_create 函数中:

Oid
index_create(Relation heapRelation,
			 const char *indexRelationName,
			 Oid indexRelationId,
			 Oid parentIndexRelid,
			 Oid parentConstraintId,
			 RelFileNumber relFileNumber,
			 IndexInfo *indexInfo,
			 const List *indexColNames,
			 Oid accessMethodId,
			 Oid tableSpaceId,
			 const Oid *collationIds,
			 const Oid *opclassIds,
			 const Datum *opclassOptions,
			 const int16 *coloptions,
			 const NullableDatum *stattargets,
			 Datum reloptions,
			 bits16 flags,
			 bits16 constr_flags,
			 bool allow_system_table_mods,
			 bool is_internal,
			 Oid *constraintId)
{
    /* ... */

	bool		concurrent = (flags & INDEX_CREATE_CONCURRENT) != 0;

    /* ... */

	/*
	 * create the index relation's relcache entry and, if necessary, the
	 * physical disk file. (If we fail further down, it's the smgr's
	 * responsibility to remove the disk file again, if any.)
	 */
	indexRelation = heap_create(indexRelationName,
								namespaceId,
								tableSpaceId,
								indexRelationId,
								relFileNumber,
								accessMethodId,
								indexTupDesc,
								relkind,
								relpersistence,
								shared_relation,
								mapped_relation,
								allow_system_table_mods,
								&relfrozenxid,
								&relminmxid,
								create_storage);

	Assert(relfrozenxid == InvalidTransactionId);
	Assert(relminmxid == InvalidMultiXactId);
	Assert(indexRelationId == RelationGetRelid(indexRelation));

	/*
	 * Obtain exclusive lock on it.  Although no other transactions can see it
	 * until we commit, this prevents deadlock-risk complaints from lock
	 * manager in cases such as CLUSTER.
	 */
	LockRelation(indexRelation, AccessExclusiveLock);

	/*
	 * Fill in fields of the index's pg_class entry that are not set correctly
	 * by heap_create.
	 *
	 * XXX should have a cleaner way to create cataloged indexes
	 */
	indexRelation->rd_rel->relowner = heapRelation->rd_rel->relowner;
	indexRelation->rd_rel->relam = accessMethodId;
	indexRelation->rd_rel->relispartition = OidIsValid(parentIndexRelid);

	/*
	 * store index's pg_class entry
	 */
	InsertPgClassTuple(pg_class, indexRelation,
					   RelationGetRelid(indexRelation),
					   (Datum) 0,
					   reloptions);

	/* done with pg_class */
	table_close(pg_class, RowExclusiveLock);

	/*
	 * now update the object id's of all the attribute tuple forms in the
	 * index relation's tuple descriptor
	 */
	InitializeAttributeOids(indexRelation,
							indexInfo->ii_NumIndexAttrs,
							indexRelationId);

	/*
	 * append ATTRIBUTE tuples for the index
	 */
	AppendAttributeTuples(indexRelation, opclassOptions, stattargets);

	/* ----------------
	 *	  update pg_index
	 *	  (append INDEX tuple)
	 *
	 *	  Note that this stows away a representation of "predicate".
	 *	  (Or, could define a rule to maintain the predicate) --Nels, Feb '92
	 * ----------------
	 */
	UpdateIndexRelation(indexRelationId, heapRelationId, parentIndexRelid,
						indexInfo,
						collationIds, opclassIds, coloptions,
						isprimary, is_exclusion,
						(constr_flags & INDEX_CONSTR_CREATE_DEFERRABLE) == 0,
						!concurrent && !invalid,
						!concurrent);

	/*
	 * Register relcache invalidation on the indexes' heap relation, to
	 * maintain consistency of its index list
	 */
	CacheInvalidateRelcache(heapRelation);

    /* ... */

	/*
	 * Advance the command counter so that we can see the newly-entered
	 * catalog tuples for the index.
	 */
	CommandCounterIncrement();

    /* ... */

	/*
	 * If this is bootstrap (initdb) time, then we don't actually fill in the
	 * index yet.  We'll be creating more indexes and classes later, so we
	 * delay filling them in until just before we're done with bootstrapping.
	 * Similarly, if the caller specified to skip the build then filling the
	 * index is delayed till later (ALTER TABLE can save work in some cases
	 * with this).  Otherwise, we call the AM routine that constructs the
	 * index.
	 */
	if (IsBootstrapProcessingMode())
	{
		index_register(heapRelationId, indexRelationId, indexInfo);
	}
	else if ((flags & INDEX_CREATE_SKIP_BUILD) != 0)
	{
		/*
		 * Caller is responsible for filling the index later on.  However,
		 * we'd better make sure that the heap relation is correctly marked as
		 * having an index.
		 */
		index_update_stats(heapRelation,
						   true,
						   -1.0);
		/* Make the above update visible */
		CommandCounterIncrement();
	}
	else
	{
		index_build(heapRelation, indexRelation, indexInfo, false, true);
	}

	/*
	 * Close the index; but we keep the lock that we acquired above until end
	 * of transaction.  Closing the heap is caller's responsibility.
	 */
	index_close(indexRelation, NoLock);

	return indexRelationId;
}

此处向 pg_class 和 pg_index 系统表中插入了新索引的元数据,根据标志位 INDEX_CREATE_SKIP_BUILD 跳过了填充索引内容的 index_build 并直接返回。在函数 UpdateIndexRelation 更新 pg_index 系统表时,CIC 将索引的 indisvalid 和 indisready 全部设置为 false,表示这个索引既不需要被 DML 维护,也不能被用于查询。

系统表更新完毕后将事务提交,并打开一个新事务进入下一阶段:

ObjectAddress
DefineIndex(Oid tableId,
			IndexStmt *stmt,
			Oid indexRelationId,
			Oid parentIndexId,
			Oid parentConstraintId,
			int total_parts,
			bool is_alter_table,
			bool check_rights,
			bool check_not_in_use,
			bool skip_build,
			bool quiet)
{
    /* ... */

	/* save lockrelid and locktag for below, then close rel */
	heaprelid = rel->rd_lockInfo.lockRelId;
	SET_LOCKTAG_RELATION(heaplocktag, heaprelid.dbId, heaprelid.relId);
	table_close(rel, NoLock);

	/*
	 * For a concurrent build, it's important to make the catalog entries
	 * visible to other transactions before we start to build the index. That
	 * will prevent them from making incompatible HOT updates.  The new index
	 * will be marked not indisready and not indisvalid, so that no one else
	 * tries to either insert into it or use it for queries.
	 *
	 * We must commit our current transaction so that the index becomes
	 * visible; then start another.  Note that all the data structures we just
	 * built are lost in the commit.  The only data we keep past here are the
	 * relation IDs.
	 *
	 * Before committing, get a session-level lock on the table, to ensure
	 * that neither it nor the index can be dropped before we finish. This
	 * cannot block, even if someone else is waiting for access, because we
	 * already have the same lock within our transaction.
	 *
	 * Note: we don't currently bother with a session lock on the index,
	 * because there are no operations that could change its state while we
	 * hold lock on the parent table.  This might need to change later.
	 */
	LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);

	PopActiveSnapshot();
	CommitTransactionCommand();
	StartTransactionCommand();

	/* Tell concurrent index builds to ignore us, if index qualifies */
	if (safe_index)
		set_indexsafe_procflags();

    /* ... */
}

Phase 2:填充索引主体数据

在上一步的事务提交后,可见索引的事务不再会产生非预期的 HOT 更新。但目前已在进行中的事务已经产生了非预期的 HOT。所以,需要首先等待这些事务全都结束,所有事务才都不再会产生非预期的 HOT。此时才可以开始填充新索引的数据。

在具体实现上,等待对该表持有 ShareLock 的旧事务结束:

{
    /* ... */

	/*
	 * The index is now visible, so we can report the OID.  While on it,
	 * include the report for the beginning of phase 2.
	 */
	{
		const int	progress_cols[] = {
			PROGRESS_CREATEIDX_INDEX_OID,
			PROGRESS_CREATEIDX_PHASE
		};
		const int64 progress_vals[] = {
			indexRelationId,
			PROGRESS_CREATEIDX_PHASE_WAIT_1
		};

		pgstat_progress_update_multi_param(2, progress_cols, progress_vals);
	}

	/*
	 * Phase 2 of concurrent index build (see comments for validate_index()
	 * for an overview of how this works)
	 *
	 * Now we must wait until no running transaction could have the table open
	 * with the old list of indexes.  Use ShareLock to consider running
	 * transactions that hold locks that permit writing to the table.  Note we
	 * do not need to worry about xacts that open the table for writing after
	 * this point; they will see the new index when they open it.
	 *
	 * Note: the reason we use actual lock acquisition here, rather than just
	 * checking the ProcArray and sleeping, is that deadlock is possible if
	 * one of the transactions in question is blocked trying to acquire an
	 * exclusive lock on our table.  The lock code will detect deadlock and
	 * error out properly.
	 */
	WaitForLockers(heaplocktag, ShareLock, true);

    /* ... */
}

接下来,获取新快照,并通过 index_concurrently_build 函数填充索引内容,然后提交事务:

{
    /* ... */

	/*
	 * At this moment we are sure that there are no transactions with the
	 * table open for write that don't have this new index in their list of
	 * indexes.  We have waited out all the existing transactions and any new
	 * transaction will have the new index in its list, but the index is still
	 * marked as "not-ready-for-inserts".  The index is consulted while
	 * deciding HOT-safety though.  This arrangement ensures that no new HOT
	 * chains can be created where the new tuple and the old tuple in the
	 * chain have different index keys.
	 *
	 * We now take a new snapshot, and build the index using all tuples that
	 * are visible in this snapshot.  We can be sure that any HOT updates to
	 * these tuples will be compatible with the index, since any updates made
	 * by transactions that didn't know about the index are now committed or
	 * rolled back.  Thus, each visible tuple is either the end of its
	 * HOT-chain or the extension of the chain is HOT-safe for this index.
	 */

	/* Set ActiveSnapshot since functions in the indexes may need it */
	PushActiveSnapshot(GetTransactionSnapshot());

	/* Perform concurrent build of index */
	index_concurrently_build(tableId, indexRelationId);

	/* we can do away with our snapshot */
	PopActiveSnapshot();

	/*
	 * Commit this transaction to make the indisready update visible.
	 */
	CommitTransactionCommand();
	StartTransactionCommand();

	/* Tell concurrent index builds to ignore us, if index qualifies */
	if (safe_index)
		set_indexsafe_procflags();

    /* ... */
}

这里构建索引元组时,ScanKey 使用的是当前事务快照内可见元组的列值,但 CTID 使用的是可见元组在 HOT 链中的 root CTID。这样可以保证所有索引中引用同一个元组的 CTID 都是相同的。可见元组在 HOT 链中的前序元组在当前事务中是不可见的,所以使用该索引进行回表时,前序元组也必然不可见。

在索引填充完毕,提交事务之前,index_concurrently_build 将索引的 indisready 设置为 true,表示自此所有事务都需要开始维护这个索引:

/*
 * index_concurrently_build
 *
 * Build index for a concurrent operation.  Low-level locks are taken when
 * this operation is performed to prevent only schema changes, but they need
 * to be kept until the end of the transaction performing this operation.
 * 'indexOid' refers to an index relation OID already created as part of
 * previous processing, and 'heapOid' refers to its parent heap relation.
 */
void
index_concurrently_build(Oid heapRelationId,
						 Oid indexRelationId)
{
	/* ... */

	/* Now build the index */
	index_build(heapRel, indexRelation, indexInfo, false, true);

	/* ... */

	/*
	 * Update the pg_index row to mark the index as ready for inserts. Once we
	 * commit this transaction, any new transactions that open the table must
	 * insert new entries into the index for insertions and non-HOT updates.
	 */
	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_READY);
}

Phase 3:补全填充索引期间缺失的数据

截止目前,phase 2 开始之前提交的所有数据都已经在索引中了,phase 2 提交之后所有事务中的 DML 也会开始维护这个索引了。所以现在还缺两部分数据:

  1. 在 phase 2 事务开始之后才提交的事务,因这些元组在 phase 2 的快照中不可见,而缺失对应的索引元组
  2. 在 phase 2 事务期间开始的事务,因 phase 2 期间索引属性 indisready 依旧为 false,因此没有维护索引,缺失索引元组

Phase 3 开始补全这两部分缺失的数据。理论上现在立刻新开始一个事务,就可以可见并补全第一部分缺失数据。但第二部分缺失的数据因事务未提交暂不可见,所以依旧无法补全。所以这里需要等待在 phase 2 事务期间开始的,对表有过修改的事务结束,然后再获取快照扫描一次全表。此时两部分缺失数据都已经对当前快照可见,因此能够补回数据。

{
    /* ... */

	/*
	 * Phase 3 of concurrent index build
	 *
	 * We once again wait until no transaction can have the table open with
	 * the index marked as read-only for updates.
	 */
	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
								 PROGRESS_CREATEIDX_PHASE_WAIT_2);
	WaitForLockers(heaplocktag, ShareLock, true);

	/*
	 * Now take the "reference snapshot" that will be used by validate_index()
	 * to filter candidate tuples.  Beware!  There might still be snapshots in
	 * use that treat some transaction as in-progress that our reference
	 * snapshot treats as committed.  If such a recently-committed transaction
	 * deleted tuples in the table, we will not include them in the index; yet
	 * those transactions which see the deleting one as still-in-progress will
	 * expect such tuples to be there once we mark the index as valid.
	 *
	 * We solve this by waiting for all endangered transactions to exit before
	 * we mark the index as valid.
	 *
	 * We also set ActiveSnapshot to this snap, since functions in indexes may
	 * need a snapshot.
	 */
	snapshot = RegisterSnapshot(GetTransactionSnapshot());
	PushActiveSnapshot(snapshot);

	/*
	 * Scan the index and the heap, insert any missing index entries.
	 */
	validate_index(tableId, indexRelationId, snapshot);

	/*
	 * Drop the reference snapshot.  We must do this before waiting out other
	 * snapshot holders, else we will deadlock against other processes also
	 * doing CREATE INDEX CONCURRENTLY, which would see our snapshot as one
	 * they must wait for.  But first, save the snapshot's xmin to use as
	 * limitXmin for GetCurrentVirtualXIDs().
	 */
	limitXmin = snapshot->xmin;

	PopActiveSnapshot();
	UnregisterSnapshot(snapshot);

	/*
	 * The snapshot subsystem could still contain registered snapshots that
	 * are holding back our process's advertised xmin; in particular, if
	 * default_transaction_isolation = serializable, there is a transaction
	 * snapshot that is still active.  The CatalogSnapshot is likewise a
	 * hazard.  To ensure no deadlocks, we must commit and start yet another
	 * transaction, and do our wait before any snapshot has been taken in it.
	 */
	CommitTransactionCommand();
	StartTransactionCommand();

	/* Tell concurrent index builds to ignore us, if index qualifies */
	if (safe_index)
		set_indexsafe_procflags();

    /* ... */
}

在通过 validate_index 函数补全索引内的缺失数据以后,再次提交当前事务。此后,其它事务理论上就可以看见数据完整的索引了,但此时索引暂时还不能被标记为可使用。因为索引中只包含 phase 2 获取的快照及之后的快照可见的数据,不包含更旧事务所使用的旧快照可见的数据。如果此时立刻标记索引可用,那么这些旧事务在使用这个索引时将会查不到预期可见的数据。所以必须等待这些旧事务全部结束,才可以标记索引的 indisvalid 为 true,开放使用:

{
    /* ... */

	/* We should now definitely not be advertising any xmin. */
	Assert(MyProc->xmin == InvalidTransactionId);

	/*
	 * The index is now valid in the sense that it contains all currently
	 * interesting tuples.  But since it might not contain tuples deleted just
	 * before the reference snap was taken, we have to wait out any
	 * transactions that might have older snapshots.
	 */
	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
	WaitForOlderSnapshots(limitXmin, true);

	/*
	 * Index can now be marked valid -- update its pg_index entry
	 */
	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);

	/*
	 * The pg_index update will cause backends (including this one) to update
	 * relcache entries for the index itself, but we should also send a
	 * relcache inval on the parent table to force replanning of cached plans.
	 * Otherwise existing sessions might fail to use the new index where it
	 * would be useful.  (Note that our earlier commits did not create reasons
	 * to replan; so relcache flush on the index itself was sufficient.)
	 */
	CacheInvalidateRelcacheByRelid(heaprelid.relId);

	/*
	 * Last thing to do is release the session-level lock on the parent table.
	 */
	UnlockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);

	pgstat_progress_end_command();

	return address;
}

此后,新索引开始在所有事务中可用。

Summary

综上,CIC 全程没有阻塞任何 DML,而是通过多个阶段,使新索引逐渐从中间状态过渡到可用状态。其间通过获取两次事务快照,辅以必要的等待,逐步补全索引数据,保证索引的完整性和一致性。CIC 全程对全表数据扫描了两次,相比于非并发索引创建,有更大的资源开销。此外,新索引何时开始可用,也取决于系统中 CIC 开始前的长事务何时结束:如果这些事务一直不结束,即使新索引的数据已经就绪,也将会一直不可用。

References

PostgreSQL: Documentation: 17: 65.7. Heap-Only Tuples (HOT)

PostgreSQL: Documentation: 17: CREATE INDEX

Explaining CREATE INDEX CONCURRENTLY

README.HOT

Edit this page on GitHub
Prev
PostgreSQL - Atomics
Next
PostgreSQL - COPY FROM