Home Technology Show HN: Keybench – Scriptable, extensible performance...
Technology

Show HN: Keybench – Scriptable, extensible performance tool for key value stores

Key Points

guycipher/keybench Folders and files | Name | Name | Last commit date | || |---|---|---|---|---| Repository files navigation keybench ======== A scriptable, extensible performance tool for sorted key value stores.

guycipher/keybench Folders and files | Name | Name | Last commit date | || |---|---|---|---|---| Repository files navigation keybench ======== A scriptable, extensible performance tool for sorted key value stores. You write the workload in Lua. keybench drives it across one or more storage engines, times every operation, and reports throughput and latency. The same script runs unchanged against every engine, so a comparison measures the engines and not the harness. POSIX only currently. What it measures ================ keybench reports two rates and a latency distribution. wu/s workload units per second. One unit is one call to your run() function. This is the rate of whole operations as your script defines them, a "view cart" or a "checkout", regardless of how many key touches each one costs. ops/s primitive operations per second. One primitive op is a single put, get, del, range, or scan call, or one key handled inside an mget, mput, or mdel. A range or a scan counts once no matter how many rows it returns. This is the rate of raw key touches. When every unit is one primitive op the two rates are equal and the report prints one throughput line. When a unit does several primitive ops, such as a batch of B keys or a cart checkout that scans then deletes, the report prints wu/sec and ops/sec on separate lines. For a batched verb ops/sec is exactly B times wu/sec, because each unit performs B key touches, so batching shows as ops/sec rising with B while wu/sec falls, the difference being the fixed per call cost spread over more keys. Latency is recorded per operation kind as a distribution, not an average. Each of put, get, del, range, mget, mput, and mdel has its own histogram, and a scan is timed into the range histogram. The report gives p50, p99, p99.9, and the max for each. How it works ============ Five parts, each replaceable. The engine owns concurrency. keybench spawns the worker threads, splits the op budget across them, and joins them. A worker runs your script in its own Lua state with its own random seed. The harness never holds a lock around an engine call, so a serialized engine reports as serialized and a parallel one reports as parallel. The Lua script stays single threaded and never reasons about locks. The workload is a Lua table. A file returns a table with a name, an optional load function, and a run function. load seeds the store before timing, and the harness runs it on every worker thread at once, each filling its own shard, so a large dataset loads in parallel. run is the measured unit and the harness calls it until the op budget or the time budget is spent. Both receive a ctx table carrying users, items, ops, seed, batch, thread, threads, and the current iter. The store is one verb set over every engine. Your script calls a global kv table. The same eight verbs reach whichever engine is selected. keybench times the call, files the sample in the right histogram, and counts the primitive ops. The backend is a self registering plugin. Each engine lives under backends/ with its own build fragment and registers itself before main runs. Selecting one by name, or several for a comparison, is a command flag. Adding one is a new directory and a vtable, no edit to the core. The reporter is a sink. A run broadcasts its metadata, its probes, its aggregated points, and its live samples to every reporter you asked for. A console reporter prints a human report. A tsv reporter writes machine rows. A timeline reporter writes one row per live sample. You can run several at once. Building ======== keybench vendors its own Lua, so the default build needs only a C compiler and pthreads. make That produces ./keybench with the in memory skiplist engine compiled in. The skiplist needs no external library and is the reference engine. To compile in the persistent engines, name them. make ROCKSDB=1 add the RocksDB engine make TIDESDB=1 add the TidesDB engine make BACKENDS="skiplist rocksdb tidesdb" No special allocator is linked by default, keybench uses the system one. If a distro rocksdb or tidesdb was itself built against jemalloc, link the same allocator first with ALLOC=jemalloc so the whole malloc family agrees, which avoids a crash when a database is reopened across sweep points. ALLOC=tcmalloc works too, and ALLOC_LIBS="..." links any other allocator or a custom path. The engine library and header locations are also overridable, so a custom install works. make ROCKSDB=1 ROCKSDB_CFLAGS=-I/opt/rocksdb/include ROCKSDB_LIBS="-L/opt/rocksdb/lib -lrocksdb" make TIDESDB=1 TIDESDB_CFLAGS=-I/opt/tidesdb/include TIDESDB_LIBS="-L/opt/tidesdb/lib -ltidesdb" make ROCKSDB=1 TIDESDB=1 ALLOC=jemalloc To build against a system Lua instead of the vendored copy. make LUA_SYS=1 Other targets. make run build, then run the cart workload once make all-wl build, then run every workload in workloads/ make clean Running ======= The smallest run names one workload file. ./keybench workloads/mixed.lua That drives 200000 units on one thread against the in memory skiplist, which needs no disk. A persistent engine writes to a drive, so name one with --data-dir, and the same run against RocksDB on that disk prints this. ./keybench --backend rocksdb --data-dir /mnt/disk workloads/mixed.lua keybench 0.1.0: 1 workload(s), engine=rocksdb seed=1 repeat=1 threads: 1 batch: 1 [probe: system] host agpmastersystem os Linux 6.2.0-39-generic x86_64 cpu 11th Gen Intel(R) Core(TM) i7-11700K @ 3.60GHz cpu_cores 16 ram 46.8 GiB page_size 4096 data_fs(/mnt/disk) 76.9 GiB free of 158.9 GiB data_dev /dev/sdb3 on /mnt/disk (ext4) data_disk solid state data_disk_model WDC WDS500G2B0A [probe: build] keybench 0.1.0 compiler 12.3.0 std C 201112 (optimized) allocator jemalloc (glibc 2.37) === mixed === engine=rocksdb 11.1.1 threads=1 batch=1 (median of 1 run) op count p50 p99 p99.9 max put 60404 8.64 us 22.91 us 75.26 us 376.62 us get 99617 10.43 us 31.36 us 46.34 us 108.91 us del 19955 4.19 us 15.17 us 29.82 us 47.02 us range 20024 407.55 us 643.07 us 806.91 us 1.04 ms get hit rate: 92.2% throughput: 17819 /sec (1 op per unit) The system probe records the machine, including the device, filesystem, and solid state or rotational nature of the drive behind --data-dir, which is the disk the benchmark actually exercises. The build probe records the compiler and allocator. The per op table gives the latency distribution. The hit rate is the fraction of gets that found a key. The throughput line gives wu/s and, when it differs, ops/s. Live progress ============= While a point runs, keybench streams a status line to stderr once a second, so a long point looks alive rather than hung. It first prints a seeding line for the untimed load phase, then one line per second, each its own line so the output scrolls as a log rather than rewriting one spot. A time bounded point shows the elapsed and budget seconds and the current rate. seeding cart on rocksdb 11.1.1 with 8 threads ... done (1.8s) cart rocksdb 11.1.1 t8 b1 1.0/5s 776154 wu/s cart rocksdb 11.1.1 t8 b1 2.0/5s 626071 wu/s cart rocksdb 11.1.1 t8 b1 3.0/5s 578264 wu/s A count bounded point shows the elapsed seconds, the rate, and the percent of the op budget done. mixed rocksdb 11.1.1 t1 b1 2.0s 161301 wu/s 43% (341639/800000) mixed rocksdb 11.1.1 t1 b1 3.0s 147362 wu/s 61% (489498/800000) mixed rocksdb 11.1.1 t1 b1 4.0s 138006 wu/s 78% (627934/800000) The status goes to stderr and only when stderr is a terminal, so it never lands in a piped or redirected log and never mixes into the stdout report or the tsv streams. Choosing the work ================= --ops N total units across all threads, then stop. Default 200000. --secs S run for S seconds instead of a fixed count. --users N size of the user population a workload may address. --items N size of the catalog a workload may address. --seed N base RNG seed. Thread t uses seed plus t. These split into two kinds. --ops and --secs are the two ways to bound a run and you use one of them, since giving --secs clears the op count and runs the clock. --users, --items, and --seed are not a choice between, they always apply, each handed to the workload through ctx to size its population, size its catalog, and seed its RNG. So a run is one bound flag plus any of the three knobs at once. Sweeping and comparing ====================== A comma list in a knob turns one run into a grid. keybench runs every point in the grid and, when there is more than one, prints a summary table sorted as it ran. --backend L one engine, or a comma list to compare. e.g. skiplist,rocksdb --threads L one thread count, or a comma list to sweep. e.g. 1,2,4,8 --batch L one batch size, or a comma list to sweep. e.g. 1,8,64 --repeat N run each point N times and report the median. The grid is the cross product of engines, thread counts, and batch sizes. Only a workload that opts into batching, by returning batched true, varies with the batch knob, so the others run once at batch one rather than repeating an identical measurement across the batch list. Each cell is run --repeat times and reduced to a median, which damps the noise of a single run. An example that compares two engines as they scale. ./keybench --backend skiplist,rocksdb --threads 1,4,16 --repeat 3 \ --data-dir /mnt/disk workloads/mixed.lua Name several files to run them in sequence, each isolated in its own fresh store. ./keybench workloads/cart.lua workloads/mixed.lua workloads/scan.lua Reporters and live timeline =========================== --report S a comma list of sinks, each name or name:path. --timeline S sample every S seconds while a run is in flight. --report-dir D create D// and bundle every artifact there. The sink names are console, tsv, and timeline. A bare name, or a name whose path is a single dash, writes to stdout. A name with a path writes to that file. --report console,tsv:points.tsv,timeline:tl.tsv console prints the human report above. tsv writes one row per operation per grid point, ready for a spreadsheet or plot. timeline writes one row per live sample, one metric per row, so you can watch throughput, CPU, resident memory, disk bytes, and engine internals evolve over the run. The timeline sink only produces rows when --timeline names an interval. --report-dir is the convenient whole. It stamps a fresh directory, writes the console report, the points tsv, the timeline tsv, and a replay config into it, and tells you where. Probes ====== A probe records context the benchmark itself does not produce. The system probe reports the host, CPU, memory, free disk, and the device, mount, filesystem, and solid state or rotational nature of the drive backing --data-dir, and samples CPU percent, resident memory, load, faults, and IO while the run proceeds. The build probe reports the keybench version, compiler, and allocator. --probe L a comma list of probes, or none to disable, or all by default. Config files and replay ======================= A run can be written as a config file and replayed exactly. The format is plain INI. A [bench] section sets the run knobs. A section named for an engine tunes that engine and is handed to it at open time. --config F load run knobs from F, and engine tuning from its sections. --save-config F write the effective run to F as a replay config. The sample at samples/rocksdb-vs-tidesdb.cnf compares the two persistent engines across a thread and batch sweep. [bench] backend = rocksdb,tidesdb ops = 2000000 threads = 1,8,16,32,64 batch = 1,64,256,512,1024 repeat = 3 report_dir = ./results data_dir = /mnt/disk workload = workloads/mixed.lua workload = workloads/cart.lua [rocksdb] write_buffer_size = 64M compression = kNoCompression [tidesdb] write_buffer_size = 64M compression = none enable_bloom_filter = 1 Run it. ./keybench --config samples/rocksdb-vs-tidesdb.cnf Because --report-dir also drops a replay config beside its reports, any run can be reproduced from the directory it wrote. That sample doubles as a reference. Every engine knob keybench understands is listed in it, the active lines being the overrides the parity run applies and the rest shown commented next to its real default, so the whole tuning surface is visible in one place. The defaults printed there were read from the installed libraries, so they track the engine versions you actually link. Three rules govern an engine section. Sizes accept K, M, G, and T suffixes that mean powers of 1024, a bare number means bytes, and there is no B suffix, so write 64M and not 64MB. A value with no recognised suffix silently falls back to the default. A comment starts at the beginning of a line or after a space, which lets a value keep an inline semicolon, so a RocksDB nested option string such as block_based_table_factory = {block_cache=64M;filter_policy=bloomfilter:10:false} survives intact. The TidesDB section is a closed set that keybench maps field by field, while the RocksDB section is forwarded verbatim to RocksDB's own option string parser, so any key RocksDB accepts works and keybench sets only create_if_missing itself. Examples ======== The execution model in one breath. A run is a grid, the cross product of the engine list, the thread list, and the batch list, and keybench runs that grid once per workload file, in the order the files are given. Each cell of the grid is measured --repeat times and reduced to a median. Every measurement opens a fresh engine store in its own temporary directory under --data-dir, runs the untimed load to seed it, runs the timed phase, then tears the store down, so no two measurements share state. Worker thread t seeds its RNG with seed plus t, and the repeats of a cell reuse the same seeds, so --repeat measures run to run noise rather than seed variance. Two precedence rules are worth holding. A --config file supplies the defaults and any flag you also pass on the command line overrides it. --report-dir takes over reporting, bundling a console report, a points tsv, a timeline tsv, and a replay config into one stamped directory, and it overrides --report. A persistent engine needs --data-dir, a path on the disk you are benchmarking, so the rocksdb and tidesdb examples below pass one, written here as /mnt/disk. Point it at your own drive. The smallest run, one workload against the default skiplist for 200000 units on one thread. ./keybench workloads/mixed.lua Run for a fixed time instead of a fixed count. ./keybench --secs 30 workloads/mixed.lua Compare engines. The backend list is the first grid dimension. ./keybench --backend skiplist,rocksdb,tidesdb --data-dir /mnt/disk workloads/mixed.lua Scale across threads. The thread list is a sweep, here six points. ./keybench --backend tidesdb --threads 1,2,4,8,16,32 --secs 20 --data-dir /mnt/disk workloads/cart.lua Trace batch amortization. Only batch.lua reads ctx.batch, so the batch list sweeps it while a workload that does not opt in runs once at batch one. ./keybench --backend rocksdb --batch 1,8,64,512 --ops 2000000 --data-dir /mnt/disk workloads/batch.lua A full grid with repeats. Two engines by four thread counts is eight cells, each run three times and reduced to a median, with a summary table at the end. ./keybench --backend rocksdb,tidesdb --threads 1,8,16,32 --repeat 3 --secs 20 --data-dir /mnt/disk workloads/cart.lua Run several workloads in sequence. Each gets the whole grid, isolated in its own fresh store. ./keybench --backend rocksdb,tidesdb --threads 1,8,16 --data-dir /mnt/disk workloads/mixed.lua workloads/cart.lua workloads/scan.lua Shape the workload with the knobs. A larger catalog and user population with a fixed seed for a reproducible keyspace. ./keybench --backend tidesdb --users 500000 --items 1000000 --seed 7 --secs 30 --data-dir /mnt/disk workloads/cart.lua Write machine readable output alongside the console report. tsv is one row per operation per cell, timeline is one row per live sample, both to files. ./keybench --backend rocksdb,tidesdb --threads 1,8,16 --report console,tsv:points.tsv,timeline:tl.tsv --timeline 1 --data-dir /mnt/disk workloads/cart.lua Bundle everything for a result you keep, on a fast disk. --report-dir stamps a directory and drops the console report, the points tsv, the timeline tsv, and a replay config into it, and with --timeline the timeline fills. ./keybench --backend rocksdb,tidesdb --threads 1,8,16,32 --repeat 3 --secs 20 --timeline 1 --report-dir ./results --data-dir /mnt/fast/bench workloads/cart.lua Drive a whole comparison from a config, and override a knob for a quick dry run. The [bench] section holds the run, an engine section tunes that engine, and any flag still wins over the file. ./keybench --config samples/rocksdb-vs-tidesdb.cnf ./keybench --config samples/rocksdb-vs-tidesdb.cnf --secs 3 --threads 1,4 Save the effective run as a replay config, then rerun it byte for byte later. ./keybench --backend rocksdb,tidesdb --threads 1,8,16 --secs 20 --data-dir /mnt/disk --save-config run.cnf workloads/cart.lua ./keybench --config run.cnf Quiet the probes for the leanest output, or name only the one you want. ./keybench --probe none workloads/mixed.lua ./keybench --probe system workloads/mixed.lua Plotting ======== scripts/plot.py renders the tsv artifacts of one or more result directories into figures, throughput bars, scalability curves, batch amortization, latency percentiles, and the live timelines. The system timeline figure carries every metric the system probe records, and the engine internals are drawn one figure per engine, since RocksDB and TidesDB report different metric names and so never share a panel. Every metric an engine emits gets a panel, including the ones that stay flat or zero for the whole run, since a constant is itself a fact worth seeing. Engine colors come from the BACKENDPALETTE file. Figures are written as PNG by default, which is what you want for a slide or a post. Pass --format pdf or --format svg for a vector copy, say for a paper. python3 scripts/plot.py results/ --out figures python3 scripts/plot.py results/ --out figures --format pdf The timeline figures are only as detailed as the samples behind them, so a run needs a sample interval and enough time per point to fill them. A short fixed ops run finishes each point in well under a second and leaves the timelines nearly empty. Give the run a timeline interval and either a seconds budget or a large ops count so the sampler fires many times. Writing a workload ================= A workload file returns a table. return { name = "mine", load = load, run = run } name labels the report. load is optional and seeds the store before timing. run is the measured unit and runs many times. Both take ctx. A workload that reads ctx.batch through kv.mget or kv.mput also returns batched = true, which tells keybench to sweep it over the batch sizes. Without that the batch sweep is skipped for the file and it runs once at batch one. local function load(ctx) for i = ctx.thread, ctx.items - 1, ctx.threads do kv.put(string.format("k:%012d", i), tostring(i)) end end local function run(ctx) local k = string.format("k:%012d", math.random(0, ctx.items - 1)) kv.get(k) end ctx carries users, items, ops, seed, batch, thread, threads, and iter, the one based index of the current call. load runs on every worker thread at once, each in its own Lua state, so seed only this thread's shard, the keys where i is ctx.thread of every ctx.threads, otherwise each thread would write the whole dataset. Grouping the writes into one kv.mput of ctx.batch keys makes a large seed commit as one engine batch. run runs in a fresh state too, so a value captured in load is not visible in it. Read the keyspace size from ctx on every run() call rather than from an upvalue. users and items are knobs the workload interprets, not fixed benchmark settings. keybench only forwards them through ctx. By convention items is the keyspace or catalog size and users is a user population count, but each workload uses only what it needs and gives them its own meaning. A workload that does not read a knob is unaffected by it. The store orders keys by raw byte comparison. Format integer keys to a fixed zero padded width so byte order agrees with numeric order, otherwise "k:2" sorts after "k:10". The bundled workloads are worth reading as patterns. mixed a uniform random blend of get, put, del, and short range. A plain baseline that reflects raw engine behavior. Reads items as its keyspace and ignores users. Seeds items keys at the value size set in the file, so items times that size is the dataset, raise items to push it into the gigabytes. cart an Amazon style shopping cart, one key per line item under a per user prefix, so "view cart" is one range scan. Reads users as the number of carts to seed, so the dataset grows with it, and items as the catalog its products spread across. Traffic skews to hot users. scan the streaming range read kv.scan, which calls back per row and builds no table, so a scan of millions of rows uses constant memory. Reads items as its keyspace and ignores users. batch the multi key verbs kv.mget and kv.mput. Sweeping --batch traces the amortization curve. Reads items as its keyspace and ignores users. The kv verb set ============== Every workload reaches the store through the global kv table. The same verbs hit whichever engine is selected. kv.put(key, value) store one key kv.get(key) return the value, or nil if absent kv.del(key) delete, return true if the key existed kv.range(lo, hi, limit) half open scan, return an array of {key,val} kv.scan(lo, hi, fn, limit) half open scan, call fn(key,val) per row, return a truthy value to stop early kv.mget(keys) array of keys in, array of values out kv.mput(pairs) array of {key=,val=} stored together kv.mdel(keys) delete an array of keys A range is half open, lo included and hi excluded. kv.range materializes the whole result as a Lua array. kv.scan streams and never builds a table, so it is the right verb for large reads. The callback passed to kv.scan must not call kv.put or kv.del, an engine that scans under its own read lock would deadlock on the re entrant write. mget, mput, and mdel record one latency sample but count batch primitive ops, which is how sweeping --batch shows amortization. mput and mdel commit the whole batch as one engine write, a RocksDB write batch or a TidesDB transaction, rather than one write per key. Engines ======= skiplist an in memory probabilistic skiplist guarded by a reader writer lock. No dependency, no persistence, destroyed on close. The reference engine and always compiled in. rocksdb the RocksDB LSM tree, persistent, opened in a fresh temp directory under --data-dir. Any key in its [rocksdb] config section is passed through as a RocksDB option string. tidesdb the TidesDB LSM tree, persistent and transactional, opened in a fresh temp directory under --data-dir. Its [tidesdb] config section tunes write buffer, compression, bloom filters, sync mode, isolation, and compaction. Persistent engines need --data-dir, a directory on the disk you mean to benchmark, and there is no default. A default could land in /tmp, which on many systems is tmpfs and would silently benchmark RAM, so keybench refuses to run a persistent engine without it and warns when the path is on a RAM filesystem. skiplist is in memory and needs none. Hits and misses ============== A get is a hit when the engine returns a value and a miss when it returns nothing. The kv bridge times the call either way, files the sample in the get histogram, and counts one hit or one miss from the return code, and kv.mget counts one per key it carries. The console reports the blend as a hit rate, the fraction of gets that found a key, shown only when the workload issued at least one get. The rate is a logical fact about the keys the workload asked for, not a cache statistic, and it is the context for reading the get latencies, since an engine may serve a hit and a miss at very different costs. Adding an engine =============== Create backends//. Implement the kv_backend vtable from src/bench.h, put, get, del, range, close, and the optional version, stats, datadir, putbatch, and delbatch. The two batch ops write or delete many keys as one engine batch, and when left NULL the store loops put and del instead. Write an open function with the backend_ctor signature and register it with KV_REGISTER_BACKEND, passing 1 if it stores on disk and so needs --data-dir or 0 if it is in memory. Add a backends//backend.mk that appends your source and libraries. Compile it in with make BACKENDS="skiplist ". The core never changes. Repository layout ================ src/ the engine agnostic harness main.c argument parse, worker sampler and progress threads, the grid driver kvlua.c the kv verb set bridged into Lua store.c the thin vtable wrapper over a backend registry.c backend self registration hist.c the latency histogram reporter.c console, tsv, and timeline sinks config.c the INI config parser probe_*.c the system and build probes backends/ one directory per engine workloads/ the bundled Lua workloads scripts/ plot.py and the Lua fetch helper samples/ an example config vendor/ the vendored Lua sources Version 0.1.0.
Keybench (ORG) Scriptable (ORG) Lua (LOCATION) wu (PERSON) del (LOCATION) sec (ORG) ops/sec (ORG) p99 (ORG) max (PERSON) keybench times (ORG)
Originally published by Hacker News Read original →