A Simple Survey on Schemaless Database Storage Design

Samuel Hu
Jan 14, 2022


Database engines are designed to meet different workloads. Columnar storage is usually favored for analytical, semi-structured workloads (Dremel, AsterixDB). On the other hand, KV storage is used by many vendors (MongoDB, CouchDB, Azure CosmosDB, Yugabyte, etc.); they choose different data localities, encodings, and indexes to optimize for range/join/lookup queries. Updates and deletions usually receive less attention.

Disclaimer: This post is at a very high level and needs much more future work to make it more accurate and complete.

Industrial works

MongoDB/CouchDB/AWS DocumentDB

All of their query feature sets are limited, and analytical (columnar) access is slow. These systems are optimized for point accesses.

Azure DocumentDB [1]

Of all its design goals (automatic indexing, configurable tradeoffs, rich query support, consistency, multitenancy), automatic indexing is the most discussed and sits at its core.

Storage Model

1. Tree representation

A tree is used to represent both the schema and the instance values. Furthermore, the inverted index is also a tree, and it can be serialized to JSON as well.

2. Inverted indexing

The index is constructed from the trees of all documents within the collection. Each node is composed of a term (label) and a posting (document id).
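
To make the tree and index model concrete, here is a minimal Python sketch (my own illustration, not DocumentDB code): it flattens a JSON document into labeled tree paths and builds an inverted index mapping each term (a path of labels ending in a value) to a posting set of document ids.

# Toy illustration of schema-agnostic indexing; names and structure are assumptions.
def flatten(doc, prefix=()):
    """Yield (path, value) pairs, i.e. the leaf nodes of the document tree."""
    if isinstance(doc, dict):
        for key, val in doc.items():
            yield from flatten(val, prefix + (key,))
    elif isinstance(doc, list):
        for i, val in enumerate(doc):
            yield from flatten(val, prefix + (str(i),))
    else:
        yield prefix, doc

def build_inverted_index(docs):
    """Map each term (path segments + leaf value) to its postings (doc ids)."""
    index = {}
    for doc_id, doc in docs.items():
        for path, value in flatten(doc):
            index.setdefault(path + (value,), set()).add(doc_id)
    return index

docs = {
    1: {"headquarters": "Belgium", "exports": [{"city": "Moscow"}]},
    2: {"headquarters": "Italy", "exports": [{"city": "Athens"}]},
}
index = build_inverted_index(docs)
print(index[("headquarters", "Belgium")])          # {1}
print(index[("exports", "0", "city", "Athens")])   # {2}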

3. Logical storage representation

KV tuples are used for the logical storage model. For paths, a partial forward path is chosen for better range queries, and a partial reverse path is chosen for better equality support (joins). Different encodings are used to optimize key paths and values (postings). Users can customize indexes on documents, paths, index types, and index update modes (consistency).
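
A rough sketch of the path choice (my own simplification; the real encoding uses hashing and segment-level tricks not shown here): a partial forward path keeps a subtree contiguous in key order, which favors range scans, while a partial reverse path puts the leaf label first, which favors equality lookups across documents.

# Toy sketch of partial forward vs. partial reverse path keys.
def forward_key(path, value):
    # e.g. "exports/0/city=Athens": good for range scans under "exports/..."
    return "/".join(path) + "=" + str(value)

def reverse_key(path, value):
    # e.g. "city/0/exports=Athens": leaf label first, good for equality lookups
    return "/".join(reversed(path)) + "=" + str(value)

path, value = ("exports", "0", "city"), "Athens"
print(forward_key(path, value))  # exports/0/city=Athens
print(reverse_key(path, value))  # city/0/exports=Athens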

Physical representation

The Bw-tree is used for performance, with three major optimizations: blind incremental updates, efficient index recovery, and resource-governance support. From running Azure DocumentDB in production, several workload insights emerged:

  • Document frequency and unique term length are inversely related.
  • The count of leaf nodes (instance values) completely dwarfs the count of interior nodes (schema).

DD/M/QL, transaction and consistency

Mixed multi-item transactions combining different operations; concurrency control is optimistic (OCC).

Azure CosmosDB

Azure CosmosDB originated from Azure DocumentDB and is now a multi-model database ranked in the top 3 of the document, graph, and key-value categories on db-engines.com. Based on the limited public information about CosmosDB, it uses atom-record-sequence (ARS) as its type system, and I think the underlying storage is still the same.

Yugabyte [2]

Storage Model

A Yugabyte table is a collection, each row is a document, and both are built on a DocDBKey/DocDBValue model (neither is actual Yugabyte terminology) [3]. A DocDBKey is composed of a DocKey, SubDocKeys, and a DocHybridTime [4]; a DocDBValue can be a Primitive, Object, or Collection type. Both key components and values are PrimitiveType encoded [5]. A human-readable sketch of the key layout follows the snippet below.

// kArray is handled slightly differently and hence we only have
// kObject, kRedisTS, kRedisSet, and kRedisList.
constexpr inline bool IsObjectType(const ValueType value_type) {
  return value_type == ValueType::kRedisTS || value_type == ValueType::kObject ||
         value_type == ValueType::kRedisSet || value_type == ValueType::kRedisSortedSet ||
         value_type == ValueType::kSSForward || value_type == ValueType::kSSReverse ||
         value_type == ValueType::kRedisList;
}

constexpr inline bool IsCollectionType(const ValueType value_type) {
  return IsObjectType(value_type) || value_type == ValueType::kArray;
}
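
As a rough illustration of the key layout described above, here is a human-readable Python sketch of my own; the real DocDB encoding in doc_key.h is binary, with typed PrimitiveValues and separator bytes rather than strings.

# Simplified, human-readable sketch of a DocDB key; not the actual binary encoding.
def encode_doc_db_key(doc_key, subkeys, hybrid_time):
    """doc_key: tuple of primary-key components; subkeys: path inside the document."""
    parts = ["DocKey(" + ", ".join(repr(c) for c in doc_key) + ")"]
    parts += ["SubKey(" + repr(s) + ")" for s in subkeys]
    parts.append("HT(" + str(hybrid_time) + ")")
    return " | ".join(parts)

# A row of the msgs table below, pointing at msg_props['subject']:
print(encode_doc_db_key(("user1", 10), ("msg_props", "subject"), 1000))
# DocKey('user1', 10) | SubKey('msg_props') | SubKey('subject') | HT(1000)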

Physical representation

RocksDB and BlockBasedTable.

DD/M/QL, transaction and consistency

CREATE TABLE msgs (user_id text,
msg_id int,
msg text,
msg_props map<text, text>,
PRIMARY KEY ((user_id), msg_id));

INSERT INTO msgs (user_id, msg_id, msg, msg_props)
VALUES ('user1', 10, 'msg1', {'from' : 'a@b.com', 'subject' : 'hello'});
UPDATE msgs
SET msg_props = msg_props + {'read_status' : 'true'}
WHERE user_id = 'user1' AND msg_id = 10;
INSERT INTO msgs (user_id, msg_id, msg, msg_props)
VALUES ('user1', 20, 'msg2', {'from' : 'c@d.com', 'subject' : 'bar'});
DELETE msg_props
FROM msgs
WHERE user_id = 'user1'
AND msg_id = 10;

As a transactional RDBMS, Yugabyte supports RR (repeatable read/snapshot) and Serializable isolation, implemented with row versioning and hybrid logical clocks (HLC).
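
Hybrid logical clocks combine physical time with a logical counter so that timestamps respect causality even across nodes with skewed clocks. Below is a minimal sketch of the standard HLC update rules (Kulkarni et al.), not Yugabyte's actual implementation.

import time

class HLC:
    """Minimal hybrid logical clock: (l, c), where l tracks the max physical time seen."""
    def __init__(self):
        self.l, self.c = 0, 0

    def now(self):
        # Local or send event.
        pt = int(time.time() * 1e6)
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def update(self, ml, mc):
        # Receive event carrying a remote timestamp (ml, mc).
        pt = int(time.time() * 1e6)
        new_l = max(self.l, ml, pt)
        if new_l == self.l == ml:
            self.c = max(self.c, mc) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == ml:
            self.c = mc + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)

clk = HLC()
print(clk.now())
print(clk.update(ml=clk.l + 5, mc=0))  # remote timestamp slightly ahead of us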

Dremel/Parquet

Storage Model

Dremel focuses on Protobuf and mainly deals with repeated and optional fields. The weird part is that Dremel parses and shreds the proto fields of incoming documents into columns, yet still exposes the data in nested form: to reconstruct documents from the columnar data it uses repetition levels and definition levels (the R and D fields within the columns) together with a fairly complex FSM.

Physical representation

Each column is stored as a set of blocks. Each block contains the repetition and definition levels and the compressed field values. Any definition level smaller than the number of repeated and optional fields in a field’s path denotes a NULL. Considerable effort goes into keeping the levels small: they are packed as bit sequences and run-length encoded (RLE) [6].
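
To make the levels concrete, here is a hand-worked toy example of my own (not from the Dremel paper): for a schema with a repeated group name containing an optional string code, the path name.code has max repetition level 1 and max definition level 2; a stored definition level below 2 marks a NULL, and repetition level 0 marks the start of a new record.

# Toy column stripe for path name.code under the schema
#   message Doc { repeated group name { optional string code; } }
# Record 1: name = [{code: "a"}, {}, {code: "b"}]
# Record 2: (no name at all)
MAX_DEF = 2  # one repeated + one optional field on the path

stripe = [
    # (repetition level, definition level, value)
    (0, 2, "a"),    # first value of record 1
    (1, 1, None),   # name[1] exists but code is missing -> NULL
    (1, 2, "b"),
    (0, 0, None),   # record 2: name itself is missing -> NULL
]

def reassemble(stripe, max_def):
    """Group values back into records; d < max_def means a NULL at some level."""
    records, current = [], None
    for r, d, v in stripe:
        if r == 0:               # repetition level 0 starts a new record
            current = []
            records.append(current)
        if d == max_def:
            current.append(v)
        # d < max_def: a NULL; this toy reassembly simply skips it
    return records

print(reassemble(stripe, MAX_DEF))  # [['a', 'b'], []]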

DD/M/QL, transaction and consistency

CREATE TABLE mydataset.table1(
id INT64,
cart JSON
);
INSERT INTO mydataset.table1 VALUES (1, JSON '{"name": "Alice", "age": 30}');

CREATE OR REPLACE TABLE mydataset.table_new AS
SELECT id, SAFE.PARSE_JSON(cart) AS cart_json
FROM mydataset.old_table;

# batch load job
bq load --source_format=CSV mydataset.table1 json.csv id:INTEGER,json_data:JSON
bq show mydataset.table1
Last modified      Schema               Total Rows   Total Bytes
-----------------  -------------------  -----------  ------------
22 Dec 22:10:32    |- id: integer       3            63
                   |- json_data: json

DECLARE int_val INT64 DEFAULT 0;
SELECT cart[CONCAT('it','ems')][int_val + 1].product AS item
FROM mydataset.table1;
+--------+
|  item  |
+--------+
| "food" |
| NULL   |
+--------+

Very little is public about BigQuery’s transaction internals. Parquet itself does not support transactions; Delta Lake adds them on top of versioned Parquet files.

Fun fact:

  • Within a transaction, materialized views are interpreted as logical views. You can still query a materialized view inside a transaction, but it doesn’t result in any performance improvement or cost reduction compared with the equivalent logical view.

AsterixDB [7]

AsterixDB extends Dremel, trying to address two limitations:

  1. Dremel requires schema definition.
  2. Dremel does not support dynamic types.

Storage Model

Repetition levels are replaced by definition levels plus a delimiter. A union type is introduced as the parent of sub-primitive types, and values of different types are stored in different columns.
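
A toy sketch of the “different types to different columns” idea (my own illustration, not AsterixDB’s actual layout): a field whose values are sometimes integers and sometimes strings is modeled as a union, each branch of the union gets its own column, and a per-row tag records which branch (or a missing value) was taken.

# Toy columnar split of a heterogeneous field "id" into per-type columns.
docs = [{"id": 7}, {"id": "abc"}, {}, {"id": 9}]

tags, int_col, str_col = [], [], []
for doc in docs:
    v = doc.get("id")
    if isinstance(v, int):
        tags.append("int");  int_col.append(v)
    elif isinstance(v, str):
        tags.append("str");  str_col.append(v)
    else:
        tags.append("missing")   # plays the role of the definition level / delimiter

print(tags)     # ['int', 'str', 'missing', 'int']
print(int_col)  # [7, 9]
print(str_col)  # ['abc']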

Physical representation

Parquet

DD/M/QL, transaction and consistency

Object-level transaction; multi-statement transaction is not supported.

Academic works

JSON Tiles/Sinew

The design goals are to improve access performance (attribute-oriented), query optimization (statistics are provided), and handling of heterogeneous data (dynamic types). Presumptions and observations:

  1. JSON documents in a collection often have the same set of keys and similar implicit schema.
  2. Types can be derived from key names, e.g. a key like create implying a datetime.
  3. Documents in real-world data sets often have similar structure and they therefore propose extracting a single schema globally (Sinew).

Storage Model

The basic idea is very simple: extract the data into itemsets and group them into different columns and tiles; frequent schemas are materialized. Discovering frequent itemsets falls into the data-mining category and is done with the FP-Growth algorithm. For incoming data that may be schematically variable (non-perfect tiles), tiles are reordered among neighbors and grouped into a partition. Dynamic types are naturally supported through the extraction process.
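
A much-simplified sketch of the extraction step (my own illustration, counting exact key sets directly rather than running FP-Growth as JSON Tiles actually does): the most frequent key set is materialized as columns, and the remaining documents fall back to a raw representation.

from collections import Counter

# Toy stand-in for frequent-itemset mining: count full key sets directly.
docs = [
    {"id": 1, "name": "a", "price": 3.5},
    {"id": 2, "name": "b", "price": 4.0},
    {"id": 3, "name": "c"},
    {"id": 4, "name": "d", "price": 1.0},
]

freq = Counter(frozenset(d.keys()) for d in docs)
frequent_keys, _ = freq.most_common(1)[0]

columns = {k: [] for k in frequent_keys}
fallback = []            # documents that don't match the materialized schema
for d in docs:
    if frozenset(d.keys()) == frequent_keys:
        for k in frequent_keys:
            columns[k].append(d[k])
    else:
        fallback.append(d)

print(sorted(frequent_keys))   # ['id', 'name', 'price']
print(len(fallback))           # 1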

Physical representation

JSONB (for JSON as defined in RFC 8259). Integers, floats, literals, objects, and arrays all use an 8-bit header. Objects (including nested objects and arrays) contain references to all of their key-value pairs.
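
As a toy illustration of a tagged binary encoding with one-byte headers (my own sketch, not the actual JSONB or JSON Tiles layout): each value starts with an 8-bit type tag, and containers prefix their entries with a count.

import struct

# Toy binary JSON encoder: 1-byte type tag per value. Not the real JSONB format.
TAG_INT, TAG_STR, TAG_OBJ = 0x01, 0x02, 0x03

def encode(value):
    if isinstance(value, int):
        return bytes([TAG_INT]) + struct.pack("<q", value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        return bytes([TAG_STR]) + struct.pack("<I", len(data)) + data
    if isinstance(value, dict):
        body = b"".join(encode(k) + encode(v) for k, v in value.items())
        return bytes([TAG_OBJ]) + struct.pack("<I", len(value)) + body
    raise TypeError(value)

blob = encode({"id": 7, "name": "alice"})
print(blob[0] == TAG_OBJ, len(blob))  # True, total encoded size in bytes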

DD/M/QL, transaction and consistency

Overall: good for analytics, fair for point queries, good ingestion, good compression. Updates are inlined. For infrequent attributes, which fall back to generic objects and arrays, performance is likely to be very slow.

Proteus, ReCache [8] [9]

Proteus and ReCache are two research papers that describe two components of a heterogeneous database: Proteus is the query engine, while ReCache is the caching layer. Proteus builds indexes on top of JSON data to speed up accesses; ReCache accelerates processing of heterogeneous formats by automatically caching data accesses according to the query workload.

[1] Schema-Agnostic Indexing with Azure DocumentDB — Microsoft Research

[2] Persistence in YugabyteDB

[3] yugabyte-db/docdb.h at master · yugabyte/yugabyte-db

[4] yugabyte-db/doc_key.h at master · yugabyte/yugabyte-db

[5] yugabyte-db/value_type.h at master · yugabyte/yugabyte-db

[6] Inside Capacitor, BigQuery’s next-generation columnar storage format

[7] Columnar Formats for Schemaless LSM-based Document Stores

[8] Fast queries over heterogeneous data through engine customization

[9] ReCache: reactive caching for fast analytics over heterogeneous data
