Postgres query plan’s execution flow

18 min readOct 31, 2018

This post is a description of reading source code of Postgres’ query execution flow about its executors, plan trees, call backs, plan states and other staff above storage.

Overall procedures

ExecutorStart/ExecutorRun/ExecutorFinish/ExecutorEnd是执行器的四大接口，顾名思义代表执行器的几个步骤。

void ExecutorStart(QueryDesc *queryDesc, int eflags);
void ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, uint64 count, bool execute_once);
void ExecutorFinish(QueryDesc *queryDesc);
void ExecutorEnd(QueryDesc *queryDesc);

可以看到，处理的参数都是类型为QueryDesc的描述符，这是一个很重要的参数，因为本质上就是就是plan tree的封装（PlannedStmt是本体）。QueryDesc基本上PG day one就有的结构体，有24年的历史，看一下几个关键的成员：

typedef struct QueryDesc
{
	CmdType operation; // 查询类型
	PlannedStmt *plannedstmt; // Plan tree的头
	...
	Snapshot snapshot; //SnapshotData，基于'xmin:xmax:xip'实现的MVCC快照读
	DestReceiver *dest; //查询出tuples的目的地
	
	/* These fields are set by ExecutorStart */
	TupleDesc tupDesc;
	EState *estate;	
	PlanState *planstate;

	/* This field is set by ExecutorRun */
	bool already_executed;
	...
} QueryDesc;

ExecutorStart调用后，会把tupDesc初始化成tuple的schema描述，把estate初始化query的生命周期内所需要的内存，也用来在执行器之间共享状态，把planstate初始化成plan state tree（PG的执行计划分成plan tree和plan state tree，plan tree只读不写，写的操作记录在plan state里，这样可以更纯粹的做plan tree的plan cache。两个tree是node级别一一对应的)。ExecInitNode会从Plan tree的根节点先序遍历地递归初始化对应的plan state tree:

PlanState *
ExecInitNode(Plan *node, EState *estate, int eflags)
{
	// 递归推出条件
	if (node == NULL)
		return NULL;

	switch (nodeTag(node))
	{
			/*
			 * control nodes
			 */
		case T_Result:
			result = (PlanState *) ExecInitResult((Result *) node,
												  estate, eflags);
			break;
		...	
			/*
			 * scan nodes
			 */
		case T_SeqScan:
			result = (PlanState *) ExecInitSeqScan((SeqScan *) node,
												   estate, eflags);
			break;

		case T_IndexScan:
			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
													 estate, eflags);
			break;
		... // 其它还有IndexOnly，TidScan等

			/*
			 * join nodes
			 */
		case T_NestLoop:
			result = (PlanState *) ExecInitNestLoop((NestLoop *) node,
													estate, eflags);
			break;
		...
			/*
			 * materialization nodes
			 */
		case T_Material:
			result = (PlanState *) ExecInitMaterial((Material *) node,
													estate, eflags);
			break;

		case T_Sort:
			result = (PlanState *) ExecInitSort((Sort *) node,
												estate, eflags);
			break;
		...
	}
	...
}

Postgres把Plan分成control, scan, join, materialization四个大类，materialization

这里面有一个值得注意的动作是设置CommandId的动作，cid是PG里面做Visibility Check时候的依据之一，在tuple的header里使用t_cid来表示第几个语句更新了这一行。所以这里针对不同CRUD的情况处理了这个值：

plan中的rowMarks不为空，说明SELECT语句有锁行（FOR UPDATE）操作。
plan中有ModifyingCTE, 比如WITH了一个DELETE的CTE并RETURNING删除行，但是使用SELECT去查询。
INSERT/DELETE/UPDATE语句。

ExecutorRun是执行器的运行入口，除了QueryDesc外，参数direction指定了前后的方向，并读取count数量的tuple，方向如果是NoMovementScanDirection，则不会读表。函数执行完成后结果是把处理的行数存到estate里，tuples放到QueryDesc中的DestReceiver内，DestReceiver是一个C语言实现的多态抽象类，封装了四个回调函数，不同的Dest类型（如发给SELECT INTO，或者TupleStore等）实现了自己需要的callback和定制的数据结构。如果被处理的查询是SELECT或者INSERT/UPDATE/DELETE...RETURNING这种带有输出的语句，那就会调用Dest中的rStartup去做初始化。比如如果目标是Relation，初始化就会包括判断目标表是基表还是物化视图，获取schema，创建table等。随后就是进入ExecutePlan了。 ExecutePlan的参数都是之前讨论的这些关键数据结构。对ExecutePlan的调用可以认为是runtime和执行plan tree的分界线，之前准备戏台子，准备完了就可以走plan tree了。

static void
ExecutePlan(EState *estate, //已准备
			PlanState *planstate,//已准备
			bool use_parallel_mode,//并行查询，不赘述
			CmdType operation,//已准备
			bool sendTuples,//已准备
			uint64 numberTuples,//已准备，0代表跑完为止
			ScanDirection direction,//已准备
			DestReceiver *dest,//已准备
			bool execute_once)

对执行树（实际上是plan state tree)的调用由ExecProcNode函数完成。

static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
	...
	return node->ExecProcNode(node);
}

PlanState是节点的抽象，为了实现c继承的效果，Postgres中所有的实际PlanState都把PlanState作为struct中的第一个field，这样通过指针转化的方式可以使用PlanState抽象或者如JoinState的具象。Postgres中实现了如下具象的PlanState:

twocode@pc:~/others/postgres$ grep -B2 -rsni 'PlanState\s*ps;' . | grep typedef
./src/include/nodes/execnodes.h-1157-typedef struct ResultState
./src/include/nodes/execnodes.h-1172-typedef struct ProjectSetState
./src/include/nodes/execnodes.h-1186-typedef struct ModifyTableState
./src/include/nodes/execnodes.h-1295-typedef struct MergeAppendState
./src/include/nodes/execnodes.h-1320-typedef struct RecursiveUnionState
./src/include/nodes/execnodes.h-1339-typedef struct BitmapAndState
./src/include/nodes/execnodes.h-1350-typedef struct BitmapOrState
./src/include/nodes/execnodes.h-1376-typedef struct ScanState
./src/include/nodes/execnodes.h-1903-typedef struct JoinState
./src/include/nodes/execnodes.h-2447-typedef struct UniqueState
./src/include/nodes/execnodes.h-2460-typedef struct GatherState
./src/include/nodes/execnodes.h-2486-typedef struct GatherMergeState
./src/include/nodes/execnodes.h-2535-typedef struct HashState
./src/include/nodes/execnodes.h-2572-typedef struct SetOpState
./src/include/nodes/execnodes.h-2596-typedef struct LockRowsState
./src/include/nodes/execnodes.h-2627-typedef struct LimitState

而这其中的有些也会派生出其它更具象的PlanState，比如ScanState和JoinState，又可以被进一步分为:

twocode@pc:~/others/postgres$ grep -B2 -rsni 'ScanState\s*ss;' . | grep typedef
./src/include/nodes/execnodes.h-1388-typedef struct SeqScanState
./src/include/nodes/execnodes.h-1398-typedef struct SampleScanState
./src/include/nodes/execnodes.h-1463-typedef struct IndexScanState
./src/include/nodes/execnodes.h-1509-typedef struct IndexOnlyScanState
./src/include/nodes/execnodes.h-1544-typedef struct BitmapIndexScanState
./src/include/nodes/execnodes.h-1629-typedef struct BitmapHeapScanState
./src/include/nodes/execnodes.h-1664-typedef struct TidScanState
./src/include/nodes/execnodes.h-1684-typedef struct TidRangeScanState
./src/include/nodes/execnodes.h-1700-typedef struct SubqueryScanState
./src/include/nodes/execnodes.h-1724-typedef struct FunctionScanState
./src/include/nodes/execnodes.h-1760-typedef struct ValuesScanState
./src/include/nodes/execnodes.h-1776-typedef struct TableFuncScanState
./src/include/nodes/execnodes.h-1805-typedef struct CteScanState
./src/include/nodes/execnodes.h-1828-typedef struct NamedTuplestoreScanState
./src/include/nodes/execnodes.h-1844-typedef struct WorkTableScanState
./src/include/nodes/execnodes.h-1856-typedef struct ForeignScanState
./src/include/nodes/execnodes.h-1882-typedef struct CustomScanState
./src/include/nodes/execnodes.h-2042-typedef struct MaterialState
./src/include/nodes/execnodes.h-2086-typedef struct MemoizeState
./src/include/nodes/execnodes.h-2144-typedef struct SortState
./src/include/nodes/execnodes.h-2201-typedef struct IncrementalSortState
./src/include/nodes/execnodes.h-2228-typedef struct GroupState
./src/include/nodes/execnodes.h-2275-typedef struct AggState
./src/include/nodes/execnodes.h-2360-typedef struct WindowAggState
./doc/src/sgml/custom-scan.sgml-226-typedef struct CustomScanStatetwocode@pc:~/others/postgres$ grep -B2 -rsni 'JoinState\s*js;' . | grep typedef
./src/include/nodes/execnodes.h-1920-typedef struct NestLoopState
./src/include/nodes/execnodes.h-1953-typedef struct MergeJoinState
./src/include/nodes/execnodes.h-2005-typedef struct HashJoinState

对于这么多的PlanState，自然有应的ExecProcNode回调函数给执行器。比如索引扫描（IndexScan）对应的就是IndexScanState和ExecIndexScan。

下面记录一下典型的SQL算子的实现。还记得上文提到过的，PostgreSQL会把node分成control, scan, join, materialize类。接下去可以分类做一些源码走读。

Scan Nodes

Scan算子（如上面统计的IndexScan, SeqScan, CteScan等等）都是由ExecScan作为入口，并且通过AccessMethod和recheck方法来扩展。

TupleTableSlot *
ExecScan(ScanState *node,
		 ExecScanAccessMtd accessMtd,
		 ExecScanRecheckMtd recheckMtd)

比如对于SeqScan,对应的入口为：

static TupleTableSlot *
ExecSeqScan(PlanState *pstate)
{
	SeqScanState *node = castNode(SeqScanState, pstate);

	return ExecScan(&node->ss,
					(ExecScanAccessMtd) SeqNext,
					(ExecScanRecheckMtd) SeqRecheck);
}

对于CteScan,对应的入口为：

static TupleTableSlot *
ExecCteScan(PlanState *pstate)
{
	CteScanState *node = castNode(CteScanState, pstate);

	return ExecScan(&node->ss,
					(ExecScanAccessMtd) CteScanNext,
					(ExecScanRecheckMtd) CteScanRecheck);
}

以`IndexScan`为例：

IndexScan

如果查询需要对扫描的结果进行排序（比如ORDER BY），那么通过IndexScanState中记录的OrderByKeys可以进行区分，针对这种是否需要重排的特征设置回调函数。

static TupleTableSlot *
ExecIndexScan(PlanState *pstate)
{
	...
	if (node->iss_NumOrderByKeys > 0)
		return ExecScan(&node->ss,
						(ExecScanAccessMtd) IndexNextWithReorder,
						(ExecScanRecheckMtd) IndexRecheck);
	else
		return ExecScan(&node->ss,
						(ExecScanAccessMtd) IndexNext,
						(ExecScanRecheckMtd) IndexRecheck);
}

IndexNextWithReorder与IndexNext功能都是调用IndexAM存储接口，区别就在于IndexNextWithReorder会排序的表达式把结果做Reorder。看上去是为了GiST引入的，实际上对与lossy index等需要recheck的场景是通用的，并且只要有多个orderby key，就会使用这个函数而不是IndexNext. 通过index_beginscan的AM接口创建了ScanDesc后，ORDER BY的逻辑会通过state内部的iss_ReorderQueue（一个支持自定义比较函数的二叉堆）来实现。

如果reorder queue非空，拿到堆顶的值topmost，做排序比较。如果是空的而且index扫描完成了，退出循环。
使用cmp_orderbyvals去跟topmost比较，比较的原理是按照ORDER BY的顺序依次比较，采用NULL Last原则，如果topmost更大，返回1，否则返回-1，相等返回0. 如果返回了1，把topmost从queue中弹出并当作这一次的函数返回值。
上述条件都没有满足的情况下，也就是queue里面没有比当前的tuple更大的话。继续沿着index扫描tuple，如果index是lossy，说明需要recheck。这里的recheck规则跟索引方式和数值精度有关，是为了GiST索引服务的，常规业务下不会走到。
仍旧使用cmp_orderbyvals比较tuple和topmost，如果tuple更大，入queue。

Postgres query plan’s execution flow

Overall procedures

Scan Nodes

IndexScan

Written by Samuel Hu