Chapter 11: Benchmarks and Performance

11.1 Overview

DSSim ships a benchmark suite under benchmarks/. Each script isolates a specific aspect of simulation throughput.

All results on this page were collected on the same machine (Python 3.12.3, SimPy 4.1.1). SimPy 4.1.1 is the performance baseline: unless a cell gives an absolute figure (for example ev/s where SimPy has no equivalent), tables show throughput relative to SimPy (1.0×). Higher is better.

Each DSSim row appears in two variants: [BT] = TQBinTree (default heap-based queue) and [Bi] = TQBisect (sorted-deque queue). See Choosing a time queue for guidance on which to prefer.

Cell colours: green = faster than SimPy (> 1.15×) · neutral = within ±15% of SimPy · red = slower than SimPy (< 0.85×).
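
The colour bands above reduce to two threshold checks. The helper below is a minimal sketch for applying the same bands to your own results; the function name `classify` is an illustrative choice, not part of the benchmark suite.

```python
def classify(ratio: float) -> str:
    """Map a throughput ratio (DSSim / SimPy) to the colour band used in the tables."""
    if ratio > 1.15:
        return "green"    # faster than SimPy
    if ratio < 0.85:
        return "red"      # slower than SimPy
    return "neutral"      # within +/-15% of SimPy


# A few cells taken from the Scenario 1-5 table.
for label, ratio in [("Raw [BT] S2", 5.01), ("Lite [BT] S3", 0.83), ("PubSub [BT] S2", 1.09)]:
    print(f"{label}: {ratio:.2f}x -> {classify(ratio)}")
```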

To run the full DSSim + SimPy cross-TQ comparison on your own hardware:

python benchmarks/bench_simulator.py      --with-simpy --with-tq-bisect
python benchmarks/bench_queue.py          --with-simpy --with-tq-bisect
python benchmarks/bench_queue_priority.py --with-simpy --with-tq-bisect
python benchmarks/bench_resource.py       --with-simpy --with-tq-bisect
python benchmarks/bench_resource_unit.py  --with-simpy --with-tq-bisect

To include salabim (run separately to avoid CPU contention):

python benchmarks/bench_simulator.py --with-salabim --without-dssim-pubsub --without-dssim-lite
python benchmarks/bench_resource.py  --with-salabim

Absolute numbers will vary by machine. The relative ordering has been stable across environments.
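
If you rerun the cross-TQ comparison often, the five commands listed above can be driven from one script. This is a sketch only: the script names and flags are copied from the commands above, while `build_commands` and `run_all` are hypothetical helper names. Scripts are run one after another so that they do not compete for CPU.

```python
import subprocess
import sys

# Script names and flags as listed in the commands above.
SCRIPTS = [
    "bench_simulator.py",
    "bench_queue.py",
    "bench_queue_priority.py",
    "bench_resource.py",
    "bench_resource_unit.py",
]
FLAGS = ["--with-simpy", "--with-tq-bisect"]


def build_commands(python=sys.executable, bench_dir="benchmarks"):
    """Return one argv list per benchmark script."""
    return [[python, f"{bench_dir}/{script}", *FLAGS] for script in SCRIPTS]


def run_all(commands):
    """Run the benchmarks sequentially; stop at the first failure."""
    for cmd in commands:
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling `run_all(build_commands())` from the repository root would then execute the whole suite in order.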

11.2 Simulator Core Throughput

bench_simulator measures how fast the engine processes events in six scenarios, from simple timed callbacks to routed fan-out across multiple workers.

Scenarios 1–5

Scenario	Raw [BT]	Raw [Bi]	Lite [BT]	Lite [Bi]	PubSub [BT]	PubSub [Bi]	SimPy
1 — Timed callbacks	1.14×	1.18×	0.78×	0.79×	0.56×	0.60×	1.0×
2 — Now-burst (zero-time fan-out)	5.01×	5.09×	2.59×	2.73×	1.09×	1.10×	1.0×
3 — Now-chain (self-rescheduling)	2.24×	2.26×	0.83×	0.80×	0.28×	0.28×	1.0×
4 — Generator wakeup (1 waiter + 1 producer)	3.29×	3.16×	1.76×	1.79×	0.67×	0.68×	1.0×
5 — Cross-signal ping-pong (2 peers)	3.36×	3.36×	1.19×	1.16×	0.41×	0.40×	1.0×

Key observations:

DSSim Raw (no Layer 2) outperforms SimPy by 1.1–5.1×, peaking on zero-time dispatch, where DSSim's now-queue bypasses the time queue entirely.
DSSim Lite beats SimPy on zero-time burst (2.6–2.7×), generator wakeup (1.8×), and ping-pong (1.2×). The deficit in timed callbacks (0.78–0.79×) and now-chain (0.80–0.83×) reflects per-event scheduling overhead that SimPy's lighter callback dispatch avoids on those workloads.
DSSim PubSub is slower than SimPy in most scenarios, which is the expected trade-off for tier routing, condition evaluation, and circuit machinery. Zero-time burst (S2) is effectively at parity (1.09–1.10×).
TQBinTree ≈ TQBisect across all S1–S5 scenarios (within about 7%, and within 5% for most cells). TQ selection has negligible impact on these workloads.
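
The advantage on zero-time dispatch comes from keeping "due now" events out of the ordered time queue. The toy loop below illustrates that idea with standard-library containers only. It is not DSSim's implementation, and `TinyLoop` and its attribute names are invented for this sketch.

```python
import heapq
from collections import deque
from itertools import count


class TinyLoop:
    """Toy event loop: a FIFO for zero-delay events, a heap for timed ones."""

    def __init__(self):
        self.now = 0.0
        self.now_queue = deque()   # callbacks due at the current time
        self.time_queue = []       # heap of (time, seq, callback)
        self._seq = count()        # tie-breaker keeps FIFO order per timestamp
        self.heap_pushes = 0       # how often the ordered queue was touched

    def schedule(self, delay, callback):
        if delay == 0:
            self.now_queue.append(callback)   # O(1), no ordering work
        else:
            entry = (self.now + delay, next(self._seq), callback)
            heapq.heappush(self.time_queue, entry)
            self.heap_pushes += 1

    def run(self):
        while self.now_queue or self.time_queue:
            if self.now_queue:
                self.now_queue.popleft()()    # drain zero-time work first
            else:
                self.now, _, callback = heapq.heappop(self.time_queue)
                callback()


loop = TinyLoop()
fired = []


def burst():
    # Fan out three zero-time events; none of them touches the heap.
    for i in range(3):
        loop.schedule(0, lambda i=i: fired.append((loop.now, i)))


loop.schedule(5, burst)
loop.run()
print(fired)             # [(5.0, 0), (5.0, 1), (5.0, 2)]
print(loop.heap_pushes)  # 1
```

Only the single timed event pays for heap maintenance; the three fan-out events cost one `append` and one `popleft` each.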

Scenario 6: Bucketed Burst

Scenario 6 dispatches N events to K workers at L different time buckets — high routing diversity and many concurrent timestamps.

Scenario	Raw [BT]	Raw [Bi]	Lite [BT]	Lite [Bi]	PubSub [BT]	PubSub [Bi]	SimPy
6 — Bucketed burst (K=10, L=100)	7.87×	4.59×	4.95×	3.40×	2.29×	1.93×	1.0×

Key observations:

TQBinTree is the clear winner on bucketed burst: 1.2–1.7× faster than TQBisect across the DSSim variants (about 1.7× for Raw, 1.5× for Lite, 1.2× for PubSub). The heap-plus-bucket structure handles many concurrent distinct timestamps much better than the sorted deque's O(log N) bisection on every insert.
This scenario is the main reason to choose TQBinTree when your model has high concurrent-timestamp diversity.
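
To make the structural difference concrete, here is a generic sketch of the two queue shapes described above: a heap of distinct timestamps with one FIFO bucket per timestamp, and a single sorted sequence maintained by bisection. Both classes are illustrative stand-ins written for this page, not the TQBinTree or TQBisect source, and the sorted variant uses a plain list rather than a deque for brevity.

```python
import bisect
import heapq
from collections import deque


class BucketedHeapQueue:
    """Heap of distinct timestamps plus one FIFO bucket per timestamp."""

    def __init__(self):
        self._times = []       # heap holding each distinct timestamp once
        self._buckets = {}     # timestamp -> deque of events
        self.heap_pushes = 0

    def push(self, time, event):
        bucket = self._buckets.get(time)
        if bucket is None:
            # First event at this timestamp: pay for one heap push.
            bucket = self._buckets[time] = deque()
            heapq.heappush(self._times, time)
            self.heap_pushes += 1
        bucket.append(event)   # later events at the same time are O(1)

    def pop(self):
        time = self._times[0]
        bucket = self._buckets[time]
        event = bucket.popleft()
        if not bucket:
            heapq.heappop(self._times)
            del self._buckets[time]
        return time, event


class BisectQueue:
    """Single sorted list; every push performs a bisection and an insert."""

    def __init__(self):
        self._items = []       # sorted list of (time, seq, event)
        self._seq = 0          # tie-breaker keeps FIFO order per timestamp

    def push(self, time, event):
        bisect.insort(self._items, (time, self._seq, event))
        self._seq += 1

    def pop(self):
        time, _, event = self._items.pop(0)
        return time, event


# 1000 events spread over 100 distinct timestamps.
events = [(i % 100, f"ev{i}") for i in range(1000)]
bucketed, flat = BucketedHeapQueue(), BisectQueue()
for time, event in events:
    bucketed.push(time, event)
    flat.push(time, event)

order_bucketed = [bucketed.pop() for _ in events]
order_flat = [flat.pop() for _ in events]
print(order_bucketed == order_flat)  # True
print(bucketed.heap_pushes)          # 100
```

Both queues deliver events in the same order, but the bucketed variant performs ordering work once per distinct timestamp (100 times here) rather than once per event (1000 times).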

11.3 Queue Throughput

bench_queue compares DSQueue (PubSubLayer2) and DSLiteQueue (LiteLayer2) against SimPy Store on five producer/consumer patterns.

Scenario	DSQueue [BT]	DSQueue [Bi]	DSLiteQueue [BT]	DSLiteQueue [Bi]	SimPy Store
Free-flow (unbounded, 1P+1C)	4.10×	4.07×	5.97×	6.01×	1.0×
Backpressure (cap=10, 1P+1C)	0.45×	0.45×	3.21×	3.21×	1.0×
Many-workers (100P+100C)	2.87×	3.04×	6.02×	5.87×	1.0×
Blocked-getters (100 getters, 1P)	0.23×	0.24×	1.12×	1.17×	1.0×
Cross-notify (100P+100C, cap=1)	0.15×	0.15×	1.05×	1.07×	1.0×

Key observations:

DSLiteQueue leads in every scenario (1.05–6.0× relative to SimPy) by bypassing pubsub routing and condition evaluation entirely.
DSQueue (PubSubLayer2) is fast on free-flow and many-workers but drops below SimPy under high contention (0.15–0.45× on backpressure, blocked-getters, and cross-notify). Each blocked put/get goes through a full condition-wait and subscriber-wakeup cycle. The upside is that DSQueue provides routing, condition filtering, probes, and circuit support that SimPy's Store does not.
TQBinTree ≈ TQBisect for all queue scenarios (within about 6%). TQ selection has no meaningful impact on queue throughput.

11.4 Priority Queue Throughput

bench_queue_priority benchmarks DSQueue and DSLiteQueue used as priority queues against SimPy's PriorityStore.

Scenario	DSQueue [BT]	DSQueue [Bi]	DSLiteQueue [BT]	DSLiteQueue [Bi]	SimPy PriorityStore
Fill-drain (N=100k, worst-case order)	1.45×	1.47×	1.61×	1.61×	1.0×
Burst — put_nowait + gget	1.42×	1.39×	1.59×	1.61×	1.0×
Burst — gput + gget	1.31×	1.30×	1.58×	1.56×	1.0×
Bounded cap=1 (alternating put/get)	0.23×	0.23×	1.05×	1.06×	1.0×

Key observations:

Both DSSim flavours beat SimPy on fill-drain and burst (1.3–1.6×) because heap operations are fast and scheduling overhead is low.
DSQueue drops to 0.23× on the bounded scenario — same contention penalty seen in the regular queue benchmarks. DSLiteQueue maintains 1.05–1.06×.
TQBinTree ≈ TQBisect across all priority queue scenarios (within about 2%). TQ selection is irrelevant for queue-heavy workloads.

11.5 Resource Throughput

bench_resource benchmarks DSResource and DSLiteResource (variable-amount) against SimPy's Container. Priority/preemption scenarios have no Container equivalent, so SimPy rows are skipped there.

Scenario	DSResource [BT]	DSResource [Bi]	DSLiteResource [BT]	DSLiteResource [Bi]	SimPy
Uncontended acquire/release (N=20k)	3.08×	3.27×	3.22×	3.10×	1.0×
Priority dispatch (K=100 waiters)	37k ev/s	38k ev/s	120k ev/s	122k ev/s	—
Preemption delivery (N=20k)	25k ev/s	25k ev/s	52k ev/s	51k ev/s	—
Resource contention (K=16 workers)	0.32×	0.31×	1.28×	1.29×	1.0×
Timed contention (K=16, timed hold)	0.33×	0.34×	1.27×	1.40×	1.0×

Key observations:

Uncontended resources: both DSSim variants are ~3.1–3.3× faster than SimPy Container because uncontended acquire/release goes through a minimal fast path.
Priority dispatch: DSPriorityResource (~37–38k ev/s) is slower than DSLitePriorityResource (~120–122k ev/s) — pubsub routing through 100 subscriber slots dominates for the PubSub variant.
Preemption: DSPriorityResource (~25k ev/s) and DSLitePriorityResource (~52k ev/s) — no SimPy Container equivalent; Lite is ~2× faster by skipping condition evaluation.
Contention (S4, S5): DSResource (PubSub) drops to 0.31–0.34× under contention; DSLiteResource maintains 1.27–1.40×, beating SimPy Container.
TQBinTree ≈ TQBisect across all resource scenarios (within 1–10%). Under timed contention TQBisect has a slight edge.

11.6 Unit Resource Throughput

bench_resource_unit benchmarks DSUnitResource and DSLiteUnitResource (unit-only, 1-token operations) against SimPy's Resource, PriorityResource, and PreemptiveResource.

Scenario	DSUnitResource [BT]	DSUnitResource [Bi]	DSLiteUnitResource [BT]	DSLiteUnitResource [Bi]	SimPy
Uncontended acquire/release (N=20k)	5.74×	6.13×	8.39×	8.25×	1.0× (Resource)
Priority dispatch (K=100 waiters)	0.43×	0.44×	1.37×	1.44×	1.0× (PriorityResource)
Preemption delivery (N=20k)	1.02×	1.02×	2.13×	2.09×	1.0× (PreemptiveResource)
Resource contention (K=16 workers)	0.33×	0.33×	1.57×	1.66×	1.0× (Resource)
Timed contention (K=16, timed hold)	0.28×	0.29×	0.98×	1.06×	1.0× (Resource)

Key observations:

DSLiteUnitResource is the fastest DSSim resource — up to 8.4× faster than SimPy Resource uncontended, and 1.6–1.7× faster under contention. Gains come from eliminating per-request _Waiter allocation and using a simpler dispatch loop (counter check only, no amount comparison).
DSUnitResource (PubSub) has the same contention penalty as DSResource (0.28–0.44× across priority dispatch and both contention scenarios) due to pubsub routing overhead, but matches SimPy on preemption delivery (1.02×).
Priority dispatch: DSLiteUnitResource (via DSLitePriorityResource) beats SimPy PriorityResource by 1.37–1.44×. DSUnitResource drops to 0.43–0.44× as pubsub routing over 100 blocked subscribers dominates.
Timed contention (S5): DSLiteUnitResource is at parity with SimPy Resource (0.98–1.06×). Both frameworks converge because the timed-hold workload is dominated by time_queue scheduling rather than dispatch overhead.
TQBinTree ≈ TQBisect for most unit resource scenarios. TQBisect has a slight advantage in contention-heavy scenarios.
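
The "counter check only" dispatch mentioned above can be pictured with a toy token pool. This is a generic sketch under stated assumptions, not the DSLiteUnitResource code: `UnitPool`, its callback-based `acquire`, and the FIFO wait list are all invented for illustration. Because every request is for exactly one token, a waiter can be stored as a bare callback, with no per-request record carrying an amount and no amount comparison on release.

```python
from collections import deque


class UnitPool:
    """Toy unit-only resource: a free-token counter plus a FIFO of waiting callbacks."""

    def __init__(self, capacity):
        self.free = capacity
        self.waiting = deque()

    def acquire(self, on_granted):
        if self.free > 0:
            self.free -= 1                    # fast path: counter check only
            on_granted()
        else:
            self.waiting.append(on_granted)   # store the callback itself

    def release(self):
        if self.waiting:
            self.waiting.popleft()()          # hand the token straight to the next waiter
        else:
            self.free += 1


pool = UnitPool(capacity=1)
log = []
pool.acquire(lambda: log.append("A granted"))
pool.acquire(lambda: log.append("B granted"))  # blocks: no free token
print(log)        # ['A granted']
pool.release()    # A releases, B is served immediately
print(log)        # ['A granted', 'B granted']
pool.release()    # B releases, token returns to the pool
print(pool.free)  # 1
```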

11.7 Generator vs. Coroutine

bench_generator_vs_coroutine confirms that the choice between generator syntax (yield) and coroutine syntax (async def / await) has negligible throughput impact for timed workloads, with a larger gap on zero-time burst.

Scenario	DSSim raw generator	DSSim raw coroutine	DSSim lite generator	DSSim lite coroutine
Timed dispatch N=200k	~292,000	~278,000	~227,000	~208,000
Now-burst N=200k	~1,537,000	~1,235,000	~800,000	~629,000

Choose based on readability. For timed workloads coroutines deliver ~5–8% less throughput than generators (negligible). For zero-time burst workloads the gap is ~20–21%, so prefer generators in tight now-queue loops if throughput matters.
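
The two syntaxes are interchangeable from a scheduler's point of view because both kinds of object are resumed with `send()`. The sketch below shows this with plain Python and no DSSim imports; `Wait`, `gen_process`, `coro_process`, and `drive` are hypothetical names for this example. The coroutine reaches its suspension point through an extra `__await__` call, which may be one contributor to the overhead measured above, although the benchmark does not isolate the cause.

```python
class Wait:
    """Hypothetical 'wait for an event' request understood by the toy driver."""

    def __init__(self, delay):
        self.delay = delay

    def __await__(self):
        # Suspend the awaiting coroutine and hand this request to the driver.
        event = yield self
        return event


def gen_process(log):
    event = yield Wait(1)            # generator syntax
    log.append(("generator", event))


async def coro_process(log):
    event = await Wait(1)            # coroutine syntax
    log.append(("coroutine", event))


def drive(process, event):
    """Run a process to its first suspension, then resume it with an event."""
    request = process.send(None)
    try:
        process.send(event)
    except StopIteration:
        pass                         # the process finished normally
    return request


log = []
req_gen = drive(gen_process(log), "tick")
req_coro = drive(coro_process(log), "tick")
print(log)                           # [('generator', 'tick'), ('coroutine', 'tick')]
print(req_gen.delay, req_coro.delay) # 1 1
```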

11.8 Real Example: Crossroad Grid Simulation

bench_crossroad benchmarks a complete, domain-level simulation rather than a microbenchmark: a 2×2 grid of traffic-light-controlled intersections running for 1 simulated hour, each junction dispatching vehicles across four arms. The same model is implemented five ways:

DSSim PubSub — crossroad_pubsub.py (PubSubLayer2, DSLitePub + DSLiteCallback)
DSSim Lite — crossroad_lite.py (LiteLayer2, DSLitePub + DSLiteCallback)
DSSim Lite (direct) — crossroad_lite_direct.py (LiteLayer2, simpy-like direct calls via plain callables + ISubscriber)
SimPy — crossroad_simpy.py
salabim — crossroad_salabim.py

To run it yourself:

python benchmarks/bench_crossroad.py

Results are shown relative to SimPy (1.0×). Higher is better.

Scenario 1: Straight-through routing (no travel delays)

Implementation	Median	Speed relative to SimPy
DSSim PubSub [BT]	95.0 ms	0.31×
DSSim PubSub [Bi]	97.5 ms	0.30×
DSSim Lite [BT]	39.8 ms	0.74×
DSSim Lite [Bi]	42.2 ms	0.70×
DSSim Lite (direct) [BT]	20.9 ms	1.41×
DSSim Lite (direct) [Bi]	22.7 ms	1.30×
SimPy	29.4 ms	1.0×
salabim	82.1 ms	0.36×
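
The ratio column can be reproduced from the medians, assuming it is the SimPy median divided by the implementation's median (an inference from the published numbers, not a confirmed detail of bench_crossroad). The values below are copied from the Scenario 1 table.

```python
SIMPY_MS = 29.4

# Median run times from the Scenario 1 table, in milliseconds.
medians_ms = {
    "DSSim PubSub [BT]": 95.0,
    "DSSim PubSub [Bi]": 97.5,
    "DSSim Lite [BT]": 39.8,
    "DSSim Lite [Bi]": 42.2,
    "DSSim Lite (direct) [BT]": 20.9,
    "DSSim Lite (direct) [Bi]": 22.7,
    "salabim": 82.1,
}


def relative_speed(baseline_ms: float, measured_ms: float) -> float:
    """Shorter run time means higher relative speed (1.0 = same as baseline)."""
    return baseline_ms / measured_ms


ratios = {name: relative_speed(SIMPY_MS, ms) for name, ms in medians_ms.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```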

Scenario 2: Aligned travel delays (12 s EW / 15 s NS)

Vehicles carry a timed delay between intersections, which increases the number of in-flight scheduled events at any given time.

Implementation	Median	Speed relative to SimPy
DSSim PubSub [BT]	88.0 ms	0.53×
DSSim PubSub [Bi]	91.5 ms	0.51×
DSSim Lite [BT]	45.2 ms	1.03×
DSSim Lite [Bi]	48.9 ms	0.95×
DSSim Lite (direct) [BT]	21.6 ms	2.16×
DSSim Lite (direct) [Bi]	26.9 ms	1.74×
SimPy	46.7 ms	1.0×
salabim	1332.2 ms	0.04×

Key observations:

DSSim Lite (direct) [BT] is the fastest implementation in both scenarios — 1.41× faster than SimPy in straight-through and 2.16× in the delay scenario — because direct ISubscriber bindings eliminate all pubsub translation overhead.
DSSim Lite [BT] matches SimPy in the delay scenario (1.03×) but trails it in straight-through (0.74×), which suggests that LiteLayer2 with DSLiteCallback is competitive for moderately complex models once timed delays are involved.
TQBinTree beats TQBisect in all three DSSim variants under delays (by roughly 4–25%), consistent with the S6 bucketed-burst finding: when many distinct future timestamps are in-flight simultaneously, the heap-based queue has an advantage.
salabim degrades by 16× from scenario 1 to scenario 2 (82 ms → 1332 ms). Each timed delay spawns a heavyweight Component process, which dominates at scale. This is a fundamental design difference, not a tuning issue.
DSSim PubSub pays a consistent penalty (~4–4.5× slower than Lite direct) from tier routing and condition evaluation on every vehicle dispatch.

11.9 Guidelines for Performance-Sensitive Models

Use LiteLayer2 when you do not need pubsub routing, condition filtering, or circuit composition.
Prefer DSLiteUnitResource when all resource operations are 1-unit (mutex, semaphore, token-pool) — it eliminates per-request _Waiter allocation and uses a simpler dispatch loop, outperforming SimPy Resource by up to 8× uncontended and 1.6× under contention.
Use DSLiteQueue / DSLiteResource for high-contention components with variable amounts unless you need pubsub monitoring (tx_nempty, tx_changed).
Choose TQBinTree (default) when your model has many concurrent distinct timestamps or high event diversity. It wins decisively on bucketed workloads (S6) and stays within about 10% of TQBisect elsewhere.
Choose TQBisect if your model's events are predominantly sequential forward-in-time with few concurrent timestamps. It performs slightly better on timed-callback workloads (S1) and some contention-heavy resource scenarios, and is on par otherwise.
Implement ISubscriber directly instead of wrapping a callable in a DSSub / DSLiteSub object via sim.subscriber(). A class that implements send(event) satisfies the interface with zero translation overhead, eliminating the extra indirection layer that the built-in subscriber wrappers introduce.
Keep subscriber counts small in the CONSUME tier — hit-at-last-position throughput degrades as O(1/S).
Put the most likely consumer first in the subscriber list.
Use NotifierRoundRobin / NotifierPriority only when needed — they carry 20–40% overhead over NotifierDict.
Batch zero-time events where possible — now_queue events bypass the time-queue binary search and are significantly faster than timed events.
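
Two of the guidelines above, implementing `send(event)` directly and placing the most likely consumer first, can be illustrated without importing DSSim. Everything in this sketch is an assumption made for illustration: `SupportsSend` is a stand-in protocol rather than DSSim's ISubscriber, the convention that `send` returns True when the event is consumed is invented here, and `dispatch` is a simplified first-consumer-wins loop rather than the library's CONSUME-tier code.

```python
from typing import Protocol


class SupportsSend(Protocol):
    """Stand-in for a send(event) interface (not DSSim's ISubscriber)."""

    def send(self, event) -> bool: ...


class Lane:
    """A plain class that satisfies the protocol directly, with no wrapper object."""

    def __init__(self, name, accepts):
        self.name = name
        self.accepts = accepts
        self.calls = 0

    def send(self, event) -> bool:
        self.calls += 1
        return event == self.accepts   # assumed convention: True means consumed


def dispatch(subscribers: list[SupportsSend], event) -> int:
    """Offer the event to each subscriber in order; return how many were tried."""
    for tried, subscriber in enumerate(subscribers, start=1):
        if subscriber.send(event):
            return tried               # first consumer wins
    return len(subscribers)


# Nine lanes that never match, followed by the one that does.
lanes = [Lane(f"lane{i}", accepts="bus") for i in range(9)]
lanes.append(Lane("car-lane", accepts="car"))
tried_last = dispatch(lanes, "car")
print(tried_last)    # 10

# Move the likely consumer to the front of the list.
lanes.insert(0, lanes.pop())
tried_first = dispatch(lanes, "car")
print(tried_first)   # 1
```

With S subscribers and the consumer in the last position, each event costs S calls to `send`, which is the O(1/S) throughput behaviour the guideline warns about.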