Micro architectural analysis of in memory OLTP Revisited (EN)

I Recently read this article: Micro-architectural analysis of in-memory OLTP: Revisited , SIGMOD‘21

Database administrators often improve database performances through tuning parameters, sharding, etc.; database kernel developers often unleash the power of hardware. For example, with CPU having a large cache and a large bit width SIMD, you can do vectorized SQL executors; with CPU having hundreds of cores, you need to do all kinds of NUMA-aware data structures and thread affinity; with various persistence media like HDD, SSD, Flash, AEP, you build new indexing and IO strategies.

To know how to optimize the database system, we have to first observe it (metrics, tracing, perf counters, etc..). Usually we use resource monitoring tools like top, htop, sar, etc.; PMU counters tools like perf, SPE, etc.; as well as database monitoring views (SQL Server DMV, Oracle’s AWR, MySQL’s performance schema).

The microarchitecture analysis of workloads, on the other hand, has been a long-standing methodology to observe performance and seems to have been forgotten by many people. In recent years, perhaps the more famous work is Google’s Profiling a warehouse-scale computer, which looked at various loads in data center scenarios from the perspective of microarchitecture and learns stuff like how cpu caches were utilized, branch prediction behaved.

So what does it mean when talking about low level microarchitecture counters? You might could get a feeling reading the following terminologies, which come from a great book Computer Architecture: A Quantitative Approach (The Morgan Kaufmann Series in Computer Architecture and Design)

This paper, Micro-architectural analysis of in-memory OLTP: Revisited, as can be seen from the name, is a sequel, the previous one is Micro-architectural analysis of in-memory OLTP, using a similar method to observe the same database and load, the difference is that the CPU model varies from Xeon E5–2640 (Ivy Bridge) Upgraded to Xeon E5–2680 v4 (Broadwell).

Characteristics of OLTP

Now the most common OLTP database is still disk-based data storage, such as MySQL and PG, according to my personal experience, when running TPCC IPC between 0.7 and 0.8, that is, each clock cycle average only runs 0.7 ~ 0.8 instructions, accompanied by a relatively high LLC miss (20%+) and frontend-bound / backend-bound ratio. The paper cites database workload analysis over the past many years and draws similar conclusions.

OLTP benchmarks are famous for their suboptimal microarchitectural behavior. There is a large body of work that characterizes OLTP benchmarks at the micro-architectural level. They all conclude that OLTP exhibits high stall time (> 50% of the execution cycles) and a low instructions-per-cycle (IPC) value (< 1 IPC on machines that can retire up to 4 instructions in a cycle). The instruction cache misses that mainly stem from the large instruction footprint of transactions are the main source of the stall time, while the next contributing factor is the longlatency data misses from the last-level cache (LLC).

One of the effects of disk-based database on worker observations is data structures for IO features, such as B-tree, B+tree, index organization tables, heap tables, LSM trees, etc., whose data structures are completely different from the perspective of CPU access; the other is the impact of IO runtime on system representation, such as synchronous IO, asynchronous IO, System calls and so on require the CPU’s scheduling strategy to affect the freshness of the cache. Therefore, on the one hand, the study of in-memory databases is more pure for workload analysis, on the other hand, there are more and more use scenarios for in-memory databases, and there are more and more mature products. Therefore, this article is still very interesting for the characteristic analysis of in-memory databases.

Considering the lighter components, cache-friendly data structures, and cleaner codebase of in-memory systems, one expects them to exhibit better cache locality (especially for the instruction cache) and less memory stall time. Due to the distinctive design features of the in-memory systems from the disk-based ones, however, it is not straightforward to extrapolate how OLTP benchmarks behave at the microarchitectural level when run on an in-memory engine solely by looking at the results of previous studies.

The analysis of the paper looked at three in-memory databases and two disk databases, although the Surname is not specified, but from the characteristic description, DBMS M should refer to Hekaton (hash index, Bw tree, JIT), DBMS N refers to VoltDB (red and black tree). Silo uses Masstree. The disk databases are DBMS D (which should be SQL Server) and Shore-MT. I will directly call them by their original names in the subsequent description, and if the inference is wrong, I am not responsible.

Microarchitecture observation metrics

Observing these micro indicators mainly answers these macro questions:
1. Is the CPU mainly spent on fetch latency or retrire execution of instructions?
2. Does the fetch delay come from the instruction or the data?
3. What is the impact of the capacity size of the database on these observations?
4. What is the impact of transaction size on these observational metrics?

CPU cycles

A picture is worth a thousand words, and there are the following conclusions:
1. Under the small amount of data (1MB to 10MB), the proportion of retiring is similar, because most of the data can be placed in the LLC, and the CPU only needs to work, and does not need too much access memory.
2. Under the amount of big data (10G to 100G), the proportion of retiring is reduced, because it takes more time to deal with cache miss visits. Both in-memory databases and disk databases have less than 30% of the retiring, which is a data stall-bound scenario.
3. Due to Intel’s architecture upgrade, the optimization of IFU has greatly reduced the Icache miss rate, so Shore-MT was still Icache-stall-bound in the Ivy Bridge era, and became Dcache-stall-bound under Broadwell.

“State-of-the-art advances in branch prediction algorithms enable accurate fetch requests to run ahead of micro-operation supply to hide instruction TLB and cache misses.”. Hence, instruction fetch unit keeps supplying instructions even though there is an instruction cache miss, which allows overlapping the instruction cache miss latency with useful work.

4. The proportion of Icache-stall in DBMS D (SQL Server) and DBMS M (Hekaton) is too high, because there are too many legacy codebases (NTDBMS code warehouse 300G, understand everything). Hekaton has its own codegen actuator, storage engine, but the shell and runtime (such as SQLOS) are still shared, so the insertion footprint is particularly large, which in turn causes Icache to be easily swapped in and out, resulting in Icache-stall. Hekaton’s throughput performance is about three times that of SQL Server, so it can also be seen that Hekaton’s Dcache usage is much better than SQL Server.
5. DBMS N (VoltDB) doesn’t have legaccy code, so there aren’t many connections either. Consistent with the previous reasons, the IFU upgrade also caused VoltDB to change from Icache-stall to Dcache-stall.
6. Silo’s retirring account is high, mainly because it is only a storage engine, less peripheral components of the database, but when the amount of data is large, it is inevitable to become Dcache-bound.
7. Silo’s brand misprediction accounts for a high percentage, mainly because although everyone has to do tree traversal, the simplicity of the code causes Silo’s Icache stall to be too low, so the proportion of branch prediction failures increases year-on-year.
Everyone has 10–15% decoding-stalls, mainly Intel’s DSB is too small. This can be understood as the cache of front-end decoding is too small.

In memory database vs disk based databases

In the case of small data volume (1MB) and large data volume (100GB), the execution time of each module of the disk database is similar, the buffer manager and the lock latch time ratio is the largest, and the Dcache stall is mainly from the buffer manager; VoltDB because there is no buffer management, plus the partitioning mechanism and the deterministic concurrency control (single thread), So the main time and the main Dcache stall come from during the transaction setup and initialization phases. In addition, at 100GB of data, VoltDB’s index lookup time increased significantly, from 10% to 44%. The conclusion is that Shore-MT needs architectural refactoring to reduce data footprints, and VoltDB needs to optimize index organization. The following figure is a comparison chart under the data volume of 1MB.

Impact of transaction sizes

The transaction size in the article refers to the number of rows fetched and accessed in each transaction. For essentially all databases, as the number of rows accessed increases, the smaller the percentage of Icache-stalls and the more dcache-stalls there are. It is not difficult to understand that on the one hand, the code flow of the executor and the storage engine is more solidified to Icache, on the other hand, the number of rows accessed is more, and the main pressure is in the storage engine, which is more likely to cause Dcache miss.

TPC-B vs TPC-C

Under TPC-B, in-memory database performance is apparently due to disk databases. In the field of in-memory databases, Silo > VoltDB > Hekaton. There is no suspense in this conclusion. The microarchitecture behavior under TPC-C is similar to that of TPC-B, the Icache stall of SQL Server and Hekaton is very high (legacy code), the rest are Dcache stalls or resource conflicts, because TPC-C has better locality than the previous self-made benchmark, so the proportion of retiring is higher.

Impact of other database characteristics on the microarchitecture

The article also analyzes the impact of other database components or features on the microarchitecture, and summarizes it.
> JIT. JIT can greatly optimize the Icache stall (~50%) because: 1. Reducing virtual function overhead (vtable). 2. Inline function, reducing branch prediction. 3. Take advantage of the compiler’s own optimization features.
> Compared with the index type B-tree and the hash index, if the load requires more random access to the data, then the B-tree will have more Dcache-stall ratio (because there are more lookups).
> Data types. Only The Cache-stall that handles string types under Hekaton has a higher proportion than long integers (legacy codes handle memory operations more expensively).
> The Icache stall of multithreaded SQL Server is still very high, and the rest are Dcache stalls. This is also relatively easy to understand, mainly related to the footprint of the code. Hekaton’s multi-threaded Icache stall is much less than single-threaded, and the main thing here is that hekaton’s instructions are better (after all, Hekaton has its own self-contained SQL executor and lighter engine, and the total number of lines of code is very low in SQL Server).

Advanced features (chip or hardware)

Hyperthreading

For multi-threaded workloads, hyperthreading is almost brainless and profitable, and the most critical thing is that you can cut out to do other tasks at the dcache/ICache stall to improve the overall ILP. Not just databases, but also data-intensive scenarios.

Overclocking

The main frequency is increased by the cycle time, and the access time is mainly determined by the physical delay between the LSU of the core and the DRAM, which belongs to the offcore, if you count the cross-DIE overhead under the multi-NUMA architecture, it is not the CPU cycle that can be determined. Therefore, the promotion of the main frequency mainly improves the ALU-bound scenario, so the article observes OLTP systems, and everyone has no benefit.

Hardware prefetching

The effect is limited. In the random access scenario, spatial locality is not enough to improve the prefetch hit rate. For high-performance data structures, prefetching has been taken into account in the software implementation.

Postscript

This article is very long, leaving aside the 24 pages of reference, but in fact, the core is to observe the microarchitecture data in different scenarios, and make some generalizations and summaries. Traditionally, the more familiar analysis method is Intel’s Top Down analysis methodology, which unfolds from the four categories of Frontend-Bound/Backend-Bound/Bad-speculation/Retiring in the instruction lifecycle. This article does not start from here, but stares at the data stall and retiring, in fact, the essence is the same.

This kind of analysis method is actually relatively limited to the developer of the database kernel, I believe that most kernel developers or optimizers will not use this methodology to optimize, this methodology will not clearly tell the kernel staff which module execution time is long, which lock is a hot spot. Relatively speaking, it will be more meaningful to the designers of the underlying microarchitecture. As the data center scenario is absorbed by the cloud, and the cloud vendors are frantically adding their own chips and hardware, it is of great reference significance through this macro methodology. For example, Apple M1 has the ability to design a perverted architecture according to its own scene, which is also of great reference significance for database and data center scenarios, and in the future, the Graviton series or Microsoft’s ARM chip will definitely design its own microarchitecture according to its own workload analysis.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store