PostgreSQL - Process Activity

Created by : Mr Dk.

2023 / 03 / 06 00:07

Hangzhou, Zhejiang, China

Background

PostgreSQL 是一个多进程架构的数据库。在数据库运行过程中，PostgreSQL 提供了丰富的系统视图来展示目前系统的运行状况，涵盖了系统的方方面面。这些视图主要分为两类：

用于展示系统当前运行情况的视图
用于展示系统截至目前累积的统计信息的视图

前者展示的是某个瞬间的系统状态，后者展示的是截止目前的一个时间段内的系统状态。这些系统视图全部以 pg_stat_ 开头。本文将分析其中最为常用的 pg_stat_activity 系统视图的实现，该视图用于展示某一时刻 PostgreSQL 所有服务器进程的活动状态，可被用于查询实时连接数、慢 SQL 的执行状态等。

本文基于 PostgreSQL 15 稳定分支 REL_15_STABLE 的如下版本号作分析：

commit f61e60102f08305f3cb9e55a7958b8036a02fe39
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sat Mar 4 13:32:35 2023 -0500

    Avoid failure when altering state of partitioned foreign-key triggers.
    
    Beginning in v15, if you apply ALTER TABLE ENABLE/DISABLE TRIGGER to
    a partitioned table, it also affects the partitions' cloned versions
    of the affected trigger(s).  The initial implementation of this
    located the clones by name, but that fails on foreign-key triggers
    which have names incorporating their own OIDs.  We can fix that, and
    also make the behavior more bulletproof in the face of user-initiated
    trigger renames, by identifying the cloned triggers by tgparentid.
    
    Following the lead of earlier commits in this area, I took care not
    to break ABI in the v15 branch, even though I rather doubt there
    are any external callers of EnableDisableTrigger.
    
    While here, update the documentation, which was not touched when
    the semantics were changed.
    
    Per bug #17817 from Alan Hodgson.  Back-patch to v15; older versions
    do not have this behavior.
    
    Discussion: https://postgr.es/m/17817-31dfb7c2100d9f3d@postgresql.org

View Definition

pg_stat_activity 是一个系统视图，其视图定义如下：

CREATE VIEW pg_stat_activity AS
    SELECT
            S.datid AS datid,
            D.datname AS datname,
            S.pid,
            S.leader_pid,
            S.usesysid,
            U.rolname AS usename,
            S.application_name,
            S.client_addr,
            S.client_hostname,
            S.client_port,
            S.backend_start,
            S.xact_start,
            S.query_start,
            S.state_change,
            S.wait_event_type,
            S.wait_event,
            S.state,
            S.backend_xid,
            s.backend_xmin,
            S.query_id,
            S.query,
            S.backend_type
    FROM pg_stat_get_activity(NULL) AS S
        LEFT JOIN pg_database AS D ON (S.datid = D.oid)
        LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);

从视图定义中可以看出，视图中的主要信息来自 pg_stat_get_activity 函数，辅以将函数中输出的每个进程连接到的数据库 oid 与用户 oid 分别与 pg_database 和 pg_authid 两张系统表进行连接，从而得到每个进程连接到的数据库名和用户名。所以接下来简要分析 pg_stat_get_activity 函数的实现。

Implementation

Backend Status Array

pg_stat_get_activity 函数是一个 set returning function，返回的每一行都是一个进程的活动信息。其信息的来源是，PostgreSQL 的每一个进程（包括服务客户端连接的后端进程，以及其它后台辅助进程）都会在共享内存中维护一个 PgBackendStatus 结构体，其中记录了这个进程正在进行的活动，其定义如下：

/* ----------
 * PgBackendStatus
 *
 * Each live backend maintains a PgBackendStatus struct in shared memory
 * showing its current activity.  (The structs are allocated according to
 * BackendId, but that is not critical.)  Note that this is unrelated to the
 * cumulative stats system (i.e. pgstat.c et al).
 *
 * Each auxiliary process also maintains a PgBackendStatus struct in shared
 * memory.
 * ----------
 */
typedef struct PgBackendStatus
{
    /*
     * To avoid locking overhead, we use the following protocol: a backend
     * increments st_changecount before modifying its entry, and again after
     * finishing a modification.  A would-be reader should note the value of
     * st_changecount, copy the entry into private memory, then check
     * st_changecount again.  If the value hasn't changed, and if it's even,
     * the copy is valid; otherwise start over.  This makes updates cheap
     * while reads are potentially expensive, but that's the tradeoff we want.
     *
     * The above protocol needs memory barriers to ensure that the apparent
     * order of execution is as it desires.  Otherwise, for example, the CPU
     * might rearrange the code so that st_changecount is incremented twice
     * before the modification on a machine with weak memory ordering.  Hence,
     * use the macros defined below for manipulating st_changecount, rather
     * than touching it directly.
     */
    int         st_changecount;

    /* The entry is valid iff st_procpid > 0, unused if st_procpid == 0 */
    int         st_procpid;

    /* Type of backends */
    BackendType st_backendType;

    /* Times when current backend, transaction, and activity started */
    TimestampTz st_proc_start_timestamp;
    TimestampTz st_xact_start_timestamp;
    TimestampTz st_activity_start_timestamp;
    TimestampTz st_state_start_timestamp;

    /* Database OID, owning user's OID, connection client address */
    Oid         st_databaseid;
    Oid         st_userid;
    SockAddr    st_clientaddr;
    char       *st_clienthostname;  /* MUST be null-terminated */

    /* Information about SSL connection */
    bool        st_ssl;
    PgBackendSSLStatus *st_sslstatus;

    /* Information about GSSAPI connection */
    bool        st_gss;
    PgBackendGSSStatus *st_gssstatus;

    /* current state */
    BackendState st_state;

    /* application name; MUST be null-terminated */
    char       *st_appname;

    /*
     * Current command string; MUST be null-terminated. Note that this string
     * possibly is truncated in the middle of a multi-byte character. As
     * activity strings are stored more frequently than read, that allows to
     * move the cost of correct truncation to the display side. Use
     * pgstat_clip_activity() to truncate correctly.
     */
    char       *st_activity_raw;

    /*
     * Command progress reporting.  Any command which wishes can advertise
     * that it is running by setting st_progress_command,
     * st_progress_command_target, and st_progress_param[].
     * st_progress_command_target should be the OID of the relation which the
     * command targets (we assume there's just one, as this is meant for
     * utility commands), but the meaning of each element in the
     * st_progress_param array is command-specific.
     */
    ProgressCommandType st_progress_command;
    Oid         st_progress_command_target;
    int64       st_progress_param[PGSTAT_NUM_PROGRESS_PARAM];

    /* query identifier, optionally computed using post_parse_analyze_hook */
    uint64      st_query_id;
} PgBackendStatus;

可以看到其中记录了进程的 pid、进程的类型、进程正在进行的活动，以及客户端连接 / 事务 / 查询 / 状态开始的时间。在共享内存中，所有进程的这个结构被维护在一个数组中：

static PgBackendStatus *BackendStatusArray = NULL;

既然是存放在共享内存中，就可能同时被多个进程读写，所以需要一定的进程同步机制。这个结构体的进程同步机制是使用结构体中的 st_changecount 计数器实现的。当一个进程需要修改该结构体中的信息时，首先需要自增这个计数器；在修改完毕以后，也需要自增这个计数器。当进程想要读取这个结构时，首先需要记录下该结构的计数器数值，然后将共享内存中的结构拷贝到进程私有内存后，再次检查这个计数器的值。如果计数器的值没有变，并且是一个偶数，那么说明进程私有内存中的结构体副本是安全可用的；否则说明结构在拷贝过程中发生了并发更新，需要重新到共享内存中拷贝一个正确的副本。在这个并发协议中，更新这个结构的开销比读取这个结构的开销要低，这正是设计者想要获得的 trade off：每一个进程在进入不同的状态时都需要更新这个结构，频率极高，并且每个进程只更新自己的结构；而对该结构的访问只会出现在显式调用 pg_stat_get_activity 时，其频率相对来说是很低的。

Local Backend Status Table

在某个进程想要获取共享内存中 BackendStatusArray 数组的所有内容时，就需要把整个数组中的内容拷贝一份到该进程私有的内存中，再进行后续的操作。这个过程由 pgstat_read_current_status 函数实现：

/* ----------
 * pgstat_read_current_status() -
 *
 *  Copy the current contents of the PgBackendStatus array to local memory,
 *  if not already done in this transaction.
 * ----------
 */
static void
pgstat_read_current_status(void)
{
    volatile PgBackendStatus *beentry;
    LocalPgBackendStatus *localtable;
    LocalPgBackendStatus *localentry;
    char       *localappname,
               *localclienthostname,
               *localactivity;
#ifdef USE_SSL
    PgBackendSSLStatus *localsslstatus;
#endif
#ifdef ENABLE_GSS
    PgBackendGSSStatus *localgssstatus;
#endif
    int         i;

    if (localBackendStatusTable)
        return;                 /* already done */

    pgstat_setup_backend_status_context();

    /*
     * Allocate storage for local copy of state data.  We can presume that
     * none of these requests overflow size_t, because we already calculated
     * the same values using mul_size during shmem setup.  However, with
     * probably-silly values of pgstat_track_activity_query_size and
     * max_connections, the localactivity buffer could exceed 1GB, so use
     * "huge" allocation for that one.
     */
    localtable = (LocalPgBackendStatus *)
        MemoryContextAlloc(backendStatusSnapContext,
                           sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
    localappname = (char *)
        MemoryContextAlloc(backendStatusSnapContext,
                           NAMEDATALEN * NumBackendStatSlots);
    localclienthostname = (char *)
        MemoryContextAlloc(backendStatusSnapContext,
                           NAMEDATALEN * NumBackendStatSlots);
    localactivity = (char *)
        MemoryContextAllocHuge(backendStatusSnapContext,
                               pgstat_track_activity_query_size * NumBackendStatSlots);
#ifdef USE_SSL
    localsslstatus = (PgBackendSSLStatus *)
        MemoryContextAlloc(backendStatusSnapContext,
                           sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
#endif
#ifdef ENABLE_GSS
    localgssstatus = (PgBackendGSSStatus *)
        MemoryContextAlloc(backendStatusSnapContext,
                           sizeof(PgBackendGSSStatus) * NumBackendStatSlots);
#endif

    localNumBackends = 0;

    beentry = BackendStatusArray;
    localentry = localtable;
    for (i = 1; i <= NumBackendStatSlots; i++)
    {
        /*
         * Follow the protocol of retrying if st_changecount changes while we
         * copy the entry, or if it's odd.  (The check for odd is needed to
         * cover the case where we are able to completely copy the entry while
         * the source backend is between increment steps.)  We use a volatile
         * pointer here to ensure the compiler doesn't try to get cute.
         */

        /* ... */
    }

    /* Set the pointer only after completion of a valid table */
    localBackendStatusTable = localtable;
}

该函数会从共享内存的 BackendStatusArray 结构体中复制一份到进程私有内存中，保存在如下数组里：

/* Status for backends including auxiliary */
static LocalPgBackendStatus *localBackendStatusTable = NULL;

/* Total number of backends including auxiliary */
static int  localNumBackends = 0;

进程私有内存中的 LocalPgBackendStatus 结构体相比于共享内存中的 PgBackendStatus 结构体，除了原有的所有信息，还新增了事务相关的信息：

/* ----------
 * LocalPgBackendStatus
 *
 * When we build the backend status array, we use LocalPgBackendStatus to be
 * able to add new values to the struct when needed without adding new fields
 * to the shared memory. It contains the backend status as a first member.
 * ----------
 */
typedef struct LocalPgBackendStatus
{
    /*
     * Local version of the backend status entry.
     */
    PgBackendStatus backendStatus;

    /*
     * The xid of the current transaction if available, InvalidTransactionId
     * if not.
     */
    TransactionId backend_xid;

    /*
     * The xmin of the current session if available, InvalidTransactionId if
     * not.
     */
    TransactionId backend_xmin;
} LocalPgBackendStatus;

Get Activity

最终，pg_stat_get_activity 函数利用了上述 pgstat_read_current_status 函数的支持，从共享内存中拷贝所有进程的状态信息到进程私有内存中，然后依次把私有内存中每一个进程的状态信息进行进一步的加工，然后组装为元组返回。

/*
 * Returns activity of PG backends.
 */
Datum
pg_stat_get_activity(PG_FUNCTION_ARGS)
{
#define PG_STAT_GET_ACTIVITY_COLS   30
    int         num_backends = pgstat_fetch_stat_numbackends();
    int         curr_backend;
    int         pid = PG_ARGISNULL(0) ? -1 : PG_GETARG_INT32(0);
    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;

    InitMaterializedSRF(fcinfo, 0);

    /* 1-based index */
    for (curr_backend = 1; curr_backend <= num_backends; curr_backend++)
    {
        /* for each row */
        Datum       values[PG_STAT_GET_ACTIVITY_COLS];
        bool        nulls[PG_STAT_GET_ACTIVITY_COLS];

        /* ... */

        tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);

        /* If only a single backend was requested, and we found it, break. */
        if (pid != -1)
            break;
    }

    return (Datum) 0;
}

在当前版本（PostgreSQL 15）中，该函数返回的每一行都包含 30 个列，其中的信息为：

进程连接到的数据库 oid
进程的 pid
通过该进程登录的用户 ID
连接到该进程的应用名称
进程当前状态
进程当前正在执行的 SQL
进程当前的等待事件类型
进程当前的等待事件
进程进入当前事务的开始时间
进程进入当前活动的开始时间
当前进程启动的时间
进程进入当前状态的开始时间
客户端 IP 地址
客户端域名
客户端端口号
进程当前事务的事务 ID
进程当前的 xmin
后台进程类型（如果这个进程是一个后台进程）
是否启用了 SSL
SSL 版本
SSL 加密套件
SSL bits
SSL 客户端 DN
SSL 客户端序列号
SSL 发放方 DN
是否启用了 GSS
GSS Principle
GSS 是否启用加密
进程的并行组 leader 的 pid
查询标识符

其中，暴露到 pg_stat_activity 视图中的只有其中的部分列。另外，部分列做了权限控制，只有 superuser 或具有 pg_read_all_stats 角色的用户才可以看到所有进程这些列的信息。

References

PostgreSQL Documentation - 28.2. The Cumulative Statistics System