PostgreSQL - Executor: Sequential Scan

Created by : Mr Dk.

2021 / 06 / 27 23:42

Hangzhou, Zhejiang, China

对 PostgreSQL 执行器的执行流程进行分析后，看看最简单的顺序扫描算子是如何实现的。

Plan Node

优化器输出的查询计划树是二叉树的形式。二叉树的节点有着最基本的结构定义，并根据不同类型的操作实现继承关系。最基本的计划树节点的结构定义如下。其中，由 type 来指定计划节点的类型。

/* ----------------
 *      Plan node
 *
 * All plan nodes "derive" from the Plan structure by having the
 * Plan structure as the first field.  This ensures that everything works
 * when nodes are cast to Plan's.  (node pointers are frequently cast to Plan*
 * when passed around generically in the executor)
 *
 * We never actually instantiate any Plan nodes; this is just the common
 * abstract superclass for all Plan-type nodes.
 * ----------------
 */
typedef struct Plan
{
    NodeTag     type;

    /*
     * estimated execution costs for plan (see costsize.c for more info)
     */
    Cost        startup_cost;   /* cost expended before fetching any tuples */
    Cost        total_cost;     /* total cost (assuming all tuples fetched) */

    /*
     * planner's estimate of result size of this plan step
     */
    double      plan_rows;      /* number of rows plan is expected to emit */
    int         plan_width;     /* average row width in bytes */

    /*
     * information needed for parallel query
     */
    bool        parallel_aware; /* engage parallel-aware logic? */
    bool        parallel_safe;  /* OK to use as part of parallel plan? */

    /*
     * information needed for asynchronous execution
     */
    bool        async_capable;  /* engage asynchronous-capable logic? */

    /*
     * Common structural data for all Plan types.
     */
    int         plan_node_id;   /* unique across entire final plan tree */
    List       *targetlist;     /* target list to be computed at this node */
    List       *qual;           /* implicitly-ANDed qual conditions */
    struct Plan *lefttree;      /* input plan tree(s) */
    struct Plan *righttree;
    List       *initPlan;       /* Init Plan nodes (un-correlated expr
                                 * subselects) */

    /*
     * Information for management of parameter-change-driven rescanning
     *
     * extParam includes the paramIDs of all external PARAM_EXEC params
     * affecting this plan node or its children.  setParam params from the
     * node's initPlans are not included, but their extParams are.
     *
     * allParam includes all the extParam paramIDs, plus the IDs of local
     * params that affect the node (i.e., the setParams of its initplans).
     * These are _all_ the PARAM_EXEC params that affect this node.
     */
    Bitmapset  *extParam;
    Bitmapset  *allParam;
} Plan;

任何类型计划节点都要继承自这个最基本的 Plan 结构，并在其结构体中将 Plan 作为第一个成员变量，使得将子类型的计划节点指针强制转换为 Plan * 时能够直接引用到 Plan 结构体内的变量。这和 C++ 的对象继承行为一致。这里以扫描计划节点为例：

/*
 * ==========
 * Scan nodes
 * ==========
 */
typedef struct Scan
{
    Plan        plan;
    Index       scanrelid;      /* relid is index into the range table */
} Scan;

这也是所有扫描节点的父类。任何扫描计划节点都要继承自这个结构体，并将 Scan 作为第一个成员变量。以最简单的顺序扫描 (sequential scan) 节点为例：

/* ----------------
 *      sequential scan node
 * ----------------
 */
typedef Scan SeqScan;

/* ----------------
 *      table sample scan node
 * ----------------
 */
typedef struct SampleScan
{
    Scan        scan;
    /* use struct pointer to avoid including parsenodes.h here */
    struct TableSampleClause *tablesample;
} SampleScan;

好吧，由于顺序扫描过于简单，除了 Scan 的变量以外没有任何扩展变量了；对于 SampleScan，除了第一个成员变量 scan 外，还有完成其对应功能的其它成员变量。

Initialization

之前已经分析过，在执行器的 ExecInitNode() 函数中有一个 switch 语句，根据查询计划节点的类型 (NodeTag) 调用相应的 ExecInitXXX()。对于顺序扫描节点，自然是调用 ExecInitSeqScan()。传入参数为顺序扫描的计划节点，以及计划执行的全局状态：

/* ----------------------------------------------------------------
 *      ExecInitSeqScan
 * ----------------------------------------------------------------
 */
SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
    SeqScanState *scanstate;

    /*
     * Once upon a time it was possible to have an outerPlan of a SeqScan, but
     * not any more.
     */
    Assert(outerPlan(node) == NULL);
    Assert(innerPlan(node) == NULL);

    /*
     * create state structure
     */
    scanstate = makeNode(SeqScanState);
    scanstate->ss.ps.plan = (Plan *) node;
    scanstate->ss.ps.state = estate;
    scanstate->ss.ps.ExecProcNode = ExecSeqScan;

    /*
     * Miscellaneous initialization
     *
     * create expression context for node
     */
    ExecAssignExprContext(estate, &scanstate->ss.ps);

    /*
     * open the scan relation
     */
    scanstate->ss.ss_currentRelation =
        ExecOpenScanRelation(estate,
                             node->scanrelid,
                             eflags);

    /* and create slot with the appropriate rowtype */
    ExecInitScanTupleSlot(estate, &scanstate->ss,
                          RelationGetDescr(scanstate->ss.ss_currentRelation),
                          table_slot_callbacks(scanstate->ss.ss_currentRelation));

    /*
     * Initialize result type and projection.
     */
    ExecInitResultTypeTL(&scanstate->ss.ps);
    ExecAssignScanProjectionInfo(&scanstate->ss);

    /*
     * initialize child expressions
     */
    scanstate->ss.ps.qual =
        ExecInitQual(node->plan.qual, (PlanState *) scanstate);

    return scanstate;
}

可以看到，顺序扫描节点不再允许有任何左右孩子节点了，必须是计划树的叶子节点。然后根据当前的计划节点构造相对应的 PlanState 节点：SeqScanState。然后将计划执行阶段的节点回调函数 ExecProcNode 函数指针设置为函数 ExecSeqScan()。值得一提的是，PlanState 也满足类似 Plan 的继承关系：

/* ----------------------------------------------------------------
 *               Scan State Information
 * ----------------------------------------------------------------
 */

/* ----------------
 *   ScanState information
 *
 *      ScanState extends PlanState for node types that represent
 *      scans of an underlying relation.  It can also be used for nodes
 *      that scan the output of an underlying plan node --- in that case,
 *      only ScanTupleSlot is actually useful, and it refers to the tuple
 *      retrieved from the subplan.
 *
 *      currentRelation    relation being scanned (NULL if none)
 *      currentScanDesc    current scan descriptor for scan (NULL if none)
 *      ScanTupleSlot      pointer to slot in tuple table holding scan tuple
 * ----------------
 */
typedef struct ScanState
{
    PlanState   ps;             /* its first field is NodeTag */
    Relation    ss_currentRelation;
    struct TableScanDescData *ss_currentScanDesc;
    TupleTableSlot *ss_ScanTupleSlot;
} ScanState;

/* ----------------
 *   SeqScanState information
 * ----------------
 */
typedef struct SeqScanState
{
    ScanState   ss;             /* its first field is NodeTag */
    Size        pscan_len;      /* size of parallel heap scan descriptor */
} SeqScanState;

Sequential Scan Execution

如上所述，顺序扫描计划节点的执行回调函数被设置为 ExecSeqScan()：

/* ----------------------------------------------------------------
 *      ExecSeqScan(node)
 *
 *      Scans the relation sequentially and returns the next qualifying
 *      tuple.
 *      We call the ExecScan() routine and pass it the appropriate
 *      access method functions.
 * ----------------------------------------------------------------
 */
static TupleTableSlot *
ExecSeqScan(PlanState *pstate)
{
    SeqScanState *node = castNode(SeqScanState, pstate);

    return ExecScan(&node->ss,
                    (ExecScanAccessMtd) SeqNext,
                    (ExecScanRecheckMtd) SeqRecheck);
}

这里，所有与扫描相关的操作又被抽象到了 ExecScan() 函数中，被各种扫描操作复用。每种扫描操作需要提供 accessMtd 和 recheckMtd 两个回调函数指针。

/*
 * prototypes from functions in execScan.c
 */
typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);

其中，accessMtd 用于使用特定的扫描方式 返回下一个元组。对于顺序扫描来说，上述调用指定 SeqNext() 函数顺序扫描下一个元组并返回：

如果扫描描述符为空，那么立刻初始化一个，并维护在 SeqScanState 中
调用 table_scan_getnextslot() 获取下一个元组

/* ----------------------------------------------------------------
 *                      Scan Support
 * ----------------------------------------------------------------
 */

/* ----------------------------------------------------------------
 *      SeqNext
 *
 *      This is a workhorse for ExecSeqScan
 * ----------------------------------------------------------------
 */
static TupleTableSlot *
SeqNext(SeqScanState *node)
{
    TableScanDesc scandesc;
    EState     *estate;
    ScanDirection direction;
    TupleTableSlot *slot;

    /*
     * get information from the estate and scan state
     */
    scandesc = node->ss.ss_currentScanDesc;
    estate = node->ss.ps.state;
    direction = estate->es_direction;
    slot = node->ss.ss_ScanTupleSlot;

    if (scandesc == NULL)
    {
        /*
         * We reach here if the scan is not parallel, or if we're serially
         * executing a scan that was planned to be parallel.
         */
        scandesc = table_beginscan(node->ss.ss_currentRelation,
                                   estate->es_snapshot,
                                   0, NULL);
        node->ss.ss_currentScanDesc = scandesc;
    }

    /*
     * get the next tuple from the table
     */
    if (table_scan_getnextslot(scandesc, direction, slot))
        return slot;
    return NULL;
}

/*
 * SeqRecheck -- access method routine to recheck a tuple in EvalPlanQual
 */
static bool
SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
{
    /*
     * Note that unlike IndexScan, SeqScan never use keys in heap_beginscan
     * (and this is very bad) - so, here we do not check are keys ok or not.
     */
    return true;
}

再来看看 ExecScan() 中的通用扫描动作，以及如何调用上述两个回调函数。其中有一个核心死循环，用于不同获取元组：

/* ----------------------------------------------------------------
 *      ExecScan
 *
 *      Scans the relation using the 'access method' indicated and
 *      returns the next qualifying tuple.
 *      The access method returns the next tuple and ExecScan() is
 *      responsible for checking the tuple returned against the qual-clause.
 *
 *      A 'recheck method' must also be provided that can check an
 *      arbitrary tuple of the relation against any qual conditions
 *      that are implemented internal to the access method.
 *
 *      Conditions:
 *        -- the "cursor" maintained by the AMI is positioned at the tuple
 *           returned previously.
 *
 *      Initial States:
 *        -- the relation indicated is opened for scanning so that the
 *           "cursor" is positioned before the first qualifying tuple.
 * ----------------------------------------------------------------
 */
TupleTableSlot *
ExecScan(ScanState *node,
         ExecScanAccessMtd accessMtd,   /* function returning a tuple */
         ExecScanRecheckMtd recheckMtd)
{
    ExprContext *econtext;
    ExprState  *qual;
    ProjectionInfo *projInfo;

    /*
     * Fetch data from node
     */
    qual = node->ps.qual;
    projInfo = node->ps.ps_ProjInfo;
    econtext = node->ps.ps_ExprContext;

    /* interrupt checks are in ExecScanFetch */

    /*
     * If we have neither a qual to check nor a projection to do, just skip
     * all the overhead and return the raw scan tuple.
     */
    if (!qual && !projInfo)
    {
        ResetExprContext(econtext);
        return ExecScanFetch(node, accessMtd, recheckMtd);
    }

    /*
     * Reset per-tuple memory context to free any expression evaluation
     * storage allocated in the previous tuple cycle.
     */
    ResetExprContext(econtext);

    /*
     * get a tuple from the access method.  Loop until we obtain a tuple that
     * passes the qualification.
     */
    for (;;)
    {
        TupleTableSlot *slot;

        slot = ExecScanFetch(node, accessMtd, recheckMtd);

        /*
         * if the slot returned by the accessMtd contains NULL, then it means
         * there is nothing more to scan so we just return an empty slot,
         * being careful to use the projection result slot so it has correct
         * tupleDesc.
         */
        if (TupIsNull(slot))
        {
            if (projInfo)
                return ExecClearTuple(projInfo->pi_state.resultslot);
            else
                return slot;
        }

        /*
         * place the current tuple into the expr context
         */
        econtext->ecxt_scantuple = slot;

        /*
         * check that the current tuple satisfies the qual-clause
         *
         * check for non-null qual here to avoid a function call to ExecQual()
         * when the qual is null ... saves only a few cycles, but they add up
         * ...
         */
        if (qual == NULL || ExecQual(qual, econtext))
        {
            /*
             * Found a satisfactory scan tuple.
             */
            if (projInfo)
            {
                /*
                 * Form a projection tuple, store it in the result tuple slot
                 * and return it.
                 */
                return ExecProject(projInfo);
            }
            else
            {
                /*
                 * Here, we aren't projecting, so just return scan tuple.
                 */
                return slot;
            }
        }
        else
            InstrCountFiltered1(node, 1);

        /*
         * Tuple fails qual, so free per-tuple memory and try again.
         */
        ResetExprContext(econtext);
    }
}

在死循环中，调用 ExecScanFetch() 并传入两个回调函数指针获取元组。在这个函数中折腾了一大堆，在函数返回前调用了 accessMtd 函数指针：

/*
 * Run the node-type-specific access method function to get the next tuple
 */
return (*accessMtd) (node);

Ending

执行结束时，调用 ExecEndNode()。与初始化类似，其中也有一个核心的 switch 函数，根据计划节点的类型，调用相应的 ExecEndXXX()。这里显然将会调用 ExecEndSeqScan()，完成所有的清理工作：

/* ----------------------------------------------------------------
 *      ExecEndSeqScan
 *
 *      frees any storage allocated through C routines.
 * ----------------------------------------------------------------
 */
void
ExecEndSeqScan(SeqScanState *node)
{
    TableScanDesc scanDesc;

    /*
     * get information from node
     */
    scanDesc = node->ss.ss_currentScanDesc;

    /*
     * Free the exprcontext
     */
    ExecFreeExprContext(&node->ss.ps);

    /*
     * clean out the tuple table
     */
    if (node->ss.ps.ps_ResultTupleSlot)
        ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
    ExecClearTuple(node->ss.ss_ScanTupleSlot);

    /*
     * close heap scan
     */
    if (scanDesc != NULL)
        table_endscan(scanDesc);
}

顺序扫描已经是最简单的扫描算子了，后续考虑看一下索引扫描算子，看看实现上有什么特别的地方。暂时没有向下研究至存储级别的实现：heap 表似乎定义了自己的 access method，与具体扫描操作的实现解耦。