Postgres query plan’s execution flow

Samuel Hu
18 min readOct 31, 2018

--

This post is a description of reading source code of Postgres’ query execution flow about its executors, plan trees, call backs, plan states and other staff above storage.

Overall procedures

ExecutorStart/ExecutorRun/ExecutorFinish/ExecutorEnd是执行器的四大接口,顾名思义代表执行器的几个步骤。

void ExecutorStart(QueryDesc *queryDesc, int eflags);
void ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, uint64 count, bool execute_once);
void ExecutorFinish(QueryDesc *queryDesc);
void ExecutorEnd(QueryDesc *queryDesc);

可以看到,处理的参数都是类型为QueryDesc的描述符,这是一个很重要的参数,因为本质上就是就是plan tree的封装(PlannedStmt是本体)。QueryDesc基本上PG day one就有的结构体,有24年的历史,看一下几个关键的成员:

typedef struct QueryDesc
{
CmdType operation; // 查询类型
PlannedStmt *plannedstmt; // Plan tree的头
...
Snapshot snapshot; //SnapshotData,基于'xmin:xmax:xip'实现的MVCC快照读
DestReceiver *dest; //查询出tuples的目的地

/* These fields are set by ExecutorStart */
TupleDesc tupDesc;
EState *estate;
PlanState *planstate;

/* This field is set by ExecutorRun */
bool already_executed;
...
} QueryDesc;

ExecutorStart调用后,会把tupDesc初始化成tuple的schema描述,把estate初始化query的生命周期内所需要的内存,也用来在执行器之间共享状态,把planstate初始化成plan state tree(PG的执行计划分成plan tree和plan state tree,plan tree只读不写,写的操作记录在plan state里,这样可以更纯粹的做plan tree的plan cache。两个tree是node级别一一对应的)。ExecInitNode会从Plan tree的根节点先序遍历地递归初始化对应的plan state tree:

PlanState *
ExecInitNode(Plan *node, EState *estate, int eflags)
{
// 递归推出条件
if (node == NULL)
return NULL;

switch (nodeTag(node))
{
/*
* control nodes
*/
case T_Result:
result = (PlanState *) ExecInitResult((Result *) node,
estate, eflags);
break;
...
/*
* scan nodes
*/
case T_SeqScan:
result = (PlanState *) ExecInitSeqScan((SeqScan *) node,
estate, eflags);
break;

case T_IndexScan:
result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
estate, eflags);
break;
... // 其它还有IndexOnly,TidScan等

/*
* join nodes
*/
case T_NestLoop:
result = (PlanState *) ExecInitNestLoop((NestLoop *) node,
estate, eflags);
break;
...
/*
* materialization nodes
*/
case T_Material:
result = (PlanState *) ExecInitMaterial((Material *) node,
estate, eflags);
break;

case T_Sort:
result = (PlanState *) ExecInitSort((Sort *) node,
estate, eflags);
break;
...
}
...
}

Postgres把Plan分成control, scan, join, materialization四个大类,materialization

这里面有一个值得注意的动作是设置CommandId的动作,cid是PG里面做Visibility Check时候的依据之一,在tuple的header里使用t_cid来表示第几个语句更新了这一行。所以这里针对不同CRUD的情况处理了这个值:

  1. plan中的rowMarks不为空,说明SELECT语句有锁行(FOR UPDATE)操作。
  2. plan中有ModifyingCTE, 比如WITH了一个DELETE的CTE并RETURNING删除行,但是使用SELECT去查询。
  3. INSERT/DELETE/UPDATE语句。

ExecutorRun是执行器的运行入口,除了QueryDesc外,参数direction指定了前后的方向,并读取count数量的tuple,方向如果是NoMovementScanDirection,则不会读表。函数执行完成后结果是把处理的行数存到estate里,tuples放到QueryDesc中的DestReceiver内,DestReceiver是一个C语言实现的多态抽象类,封装了四个回调函数,不同的Dest类型(如发给SELECT INTO,或者TupleStore等)实现了自己需要的callback和定制的数据结构。 如果被处理的查询是SELECT或者INSERT/UPDATE/DELETE...RETURNING这种带有输出的语句,那就会调用Dest中的rStartup去做初始化。比如如果目标是Relation,初始化就会包括判断目标表是基表还是物化视图,获取schema,创建table等。随后就是进入ExecutePlan了。 ExecutePlan的参数都是之前讨论的这些关键数据结构。对ExecutePlan的调用可以认为是runtime和执行plan tree的分界线,之前准备戏台子,准备完了就可以走plan tree了。

static void
ExecutePlan(EState *estate, //已准备
PlanState *planstate,//已准备
bool use_parallel_mode,//并行查询,不赘述
CmdType operation,//已准备
bool sendTuples,//已准备
uint64 numberTuples,//已准备,0代表跑完为止
ScanDirection direction,//已准备
DestReceiver *dest,//已准备
bool execute_once)

对执行树(实际上是plan state tree)的调用由ExecProcNode函数完成。

static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
...
return node->ExecProcNode(node);
}

PlanState是节点的抽象,为了实现c继承的效果,Postgres中所有的实际PlanState都把PlanState作为struct中的第一个field,这样通过指针转化的方式可以使用PlanState抽象或者如JoinState的具象。Postgres中实现了如下具象的PlanState:

twocode@pc:~/others/postgres$ grep -B2 -rsni 'PlanState\s*ps;' . | grep typedef
./src/include/nodes/execnodes.h-1157-typedef struct ResultState
./src/include/nodes/execnodes.h-1172-typedef struct ProjectSetState
./src/include/nodes/execnodes.h-1186-typedef struct ModifyTableState
./src/include/nodes/execnodes.h-1295-typedef struct MergeAppendState
./src/include/nodes/execnodes.h-1320-typedef struct RecursiveUnionState
./src/include/nodes/execnodes.h-1339-typedef struct BitmapAndState
./src/include/nodes/execnodes.h-1350-typedef struct BitmapOrState
./src/include/nodes/execnodes.h-1376-typedef struct ScanState
./src/include/nodes/execnodes.h-1903-typedef struct JoinState
./src/include/nodes/execnodes.h-2447-typedef struct UniqueState
./src/include/nodes/execnodes.h-2460-typedef struct GatherState
./src/include/nodes/execnodes.h-2486-typedef struct GatherMergeState
./src/include/nodes/execnodes.h-2535-typedef struct HashState
./src/include/nodes/execnodes.h-2572-typedef struct SetOpState
./src/include/nodes/execnodes.h-2596-typedef struct LockRowsState
./src/include/nodes/execnodes.h-2627-typedef struct LimitState

而这其中的有些也会派生出其它更具象的PlanState,比如ScanStateJoinState,又可以被进一步分为:

twocode@pc:~/others/postgres$ grep -B2 -rsni 'ScanState\s*ss;' . | grep typedef
./src/include/nodes/execnodes.h-1388-typedef struct SeqScanState
./src/include/nodes/execnodes.h-1398-typedef struct SampleScanState
./src/include/nodes/execnodes.h-1463-typedef struct IndexScanState
./src/include/nodes/execnodes.h-1509-typedef struct IndexOnlyScanState
./src/include/nodes/execnodes.h-1544-typedef struct BitmapIndexScanState
./src/include/nodes/execnodes.h-1629-typedef struct BitmapHeapScanState
./src/include/nodes/execnodes.h-1664-typedef struct TidScanState
./src/include/nodes/execnodes.h-1684-typedef struct TidRangeScanState
./src/include/nodes/execnodes.h-1700-typedef struct SubqueryScanState
./src/include/nodes/execnodes.h-1724-typedef struct FunctionScanState
./src/include/nodes/execnodes.h-1760-typedef struct ValuesScanState
./src/include/nodes/execnodes.h-1776-typedef struct TableFuncScanState
./src/include/nodes/execnodes.h-1805-typedef struct CteScanState
./src/include/nodes/execnodes.h-1828-typedef struct NamedTuplestoreScanState
./src/include/nodes/execnodes.h-1844-typedef struct WorkTableScanState
./src/include/nodes/execnodes.h-1856-typedef struct ForeignScanState
./src/include/nodes/execnodes.h-1882-typedef struct CustomScanState
./src/include/nodes/execnodes.h-2042-typedef struct MaterialState
./src/include/nodes/execnodes.h-2086-typedef struct MemoizeState
./src/include/nodes/execnodes.h-2144-typedef struct SortState
./src/include/nodes/execnodes.h-2201-typedef struct IncrementalSortState
./src/include/nodes/execnodes.h-2228-typedef struct GroupState
./src/include/nodes/execnodes.h-2275-typedef struct AggState
./src/include/nodes/execnodes.h-2360-typedef struct WindowAggState
./doc/src/sgml/custom-scan.sgml-226-typedef struct CustomScanState
twocode@pc:~/others/postgres$ grep -B2 -rsni 'JoinState\s*js;' . | grep typedef
./src/include/nodes/execnodes.h-1920-typedef struct NestLoopState
./src/include/nodes/execnodes.h-1953-typedef struct MergeJoinState
./src/include/nodes/execnodes.h-2005-typedef struct HashJoinState

对于这么多的PlanState,自然有应的ExecProcNode回调函数给执行器。 比如索引扫描(IndexScan)对应的就是IndexScanStateExecIndexScan

下面记录一下典型的SQL算子的实现。还记得上文提到过的,PostgreSQL会把node分成control, scan, join, materialize类。接下去可以分类做一些源码走读。

Scan Nodes

Scan算子(如上面统计的IndexScan, SeqScan, CteScan等等)都是由ExecScan作为入口,并且通过AccessMethod和recheck方法来扩展。

TupleTableSlot *
ExecScan(ScanState *node,
ExecScanAccessMtd accessMtd,
ExecScanRecheckMtd recheckMtd)

比如对于SeqScan,对应的入口为:

static TupleTableSlot *
ExecSeqScan(PlanState *pstate)
{
SeqScanState *node = castNode(SeqScanState, pstate);

return ExecScan(&node->ss,
(ExecScanAccessMtd) SeqNext,
(ExecScanRecheckMtd) SeqRecheck);
}

对于CteScan,对应的入口为:

static TupleTableSlot *
ExecCteScan(PlanState *pstate)
{
CteScanState *node = castNode(CteScanState, pstate);

return ExecScan(&node->ss,
(ExecScanAccessMtd) CteScanNext,
(ExecScanRecheckMtd) CteScanRecheck);
}

以`IndexScan`为例:

IndexScan

如果查询需要对扫描的结果进行排序(比如ORDER BY),那么通过IndexScanState中记录的OrderByKeys可以进行区分,针对这种是否需要重排的特征设置回调函数。

static TupleTableSlot *
ExecIndexScan(PlanState *pstate)
{
...
if (node->iss_NumOrderByKeys > 0)
return ExecScan(&node->ss,
(ExecScanAccessMtd) IndexNextWithReorder,
(ExecScanRecheckMtd) IndexRecheck);
else
return ExecScan(&node->ss,
(ExecScanAccessMtd) IndexNext,
(ExecScanRecheckMtd) IndexRecheck);
}

IndexNextWithReorderIndexNext功能都是调用IndexAM存储接口,区别就在于IndexNextWithReorder会排序的表达式把结果做Reorder。看上去是为了GiST引入的,实际上对与lossy index等需要recheck的场景是通用的,并且只要有多个orderby key,就会使用这个函数而不是IndexNext. 通过index_beginscan的AM接口创建了ScanDesc后,ORDER BY的逻辑会通过state内部的iss_ReorderQueue(一个支持自定义比较函数的二叉堆)来实现。

  1. 如果reorder queue非空,拿到堆顶的值topmost,做排序比较。如果是空的而且index扫描完成了,退出循环。
  2. 使用cmp_orderbyvals去跟topmost比较,比较的原理是按照ORDER BY的顺序依次比较,采用NULL Last原则,如果topmost更大,返回1,否则返回-1,相等返回0. 如果返回了1,把topmost从queue中弹出并当作这一次的函数返回值。
  3. 上述条件都没有满足的情况下,也就是queue里面没有比当前的tuple更大的话。继续沿着index扫描tuple,如果index是lossy,说明需要recheck。这里的recheck规则跟索引方式和数值精度有关,是为了GiST索引服务的,常规业务下不会走到。
  4. 仍旧使用cmp_orderbyvals比较tuple和topmost,如果tuple更大,入queue。

--

--