[Paper] VLCs: Managing Parallelism with Virtualized Libraries

Published: December 3, 2025 at 06:11 PM EST
5 min read
Source: arXiv - 2512.04320v1

Overview

Modern applications increasingly stitch together multiple high‑performance libraries (e.g., OpenMP, OpenBLAS, PyTorch) to exploit the massive parallelism of today’s CPUs and GPUs. Unfortunately, most of these libraries assume they own the whole machine, so running them side‑by‑side can cause hidden contention and hurt performance. The paper “VLCs: Managing Parallelism with Virtualized Libraries” introduces Virtual Library Contexts (VLCs), a lightweight runtime mechanism that isolates libraries and their resource allocations without touching the library source code. The authors demonstrate that VLCs can reclaim lost performance and even enable safe parallel execution of thread‑unsafe code.

Key Contributions

  • Virtual Library Context (VLC) abstraction: a process‑level sub‑unit that bundles a set of libraries together with a dedicated slice of hardware resources (CPU cores, memory bandwidth, GPU streams, etc.).
  • Zero‑modification isolation: VLCs work with unmodified C++ and Python libraries, avoiding the need to fork or patch upstream code.
  • Dynamic resource partitioning: developers can explicitly allocate cores, NUMA nodes, or GPU queues to each VLC, preventing cross‑library contention.
  • Library duplication support: multiple instances of the same library can be loaded in separate VLCs, allowing parallel execution of code that would otherwise be thread‑unsafe.
  • Prototype implementations: a C++ runtime (using dlopen and pthread affinity) and a Python wrapper that expose a simple API (vlc_create, vlc_run, vlc_destroy).
  • Empirical validation: up to 2.85× speed‑up on real‑world benchmarks that combine OpenMP, OpenBLAS, and LibTorch, with modest overhead (< 5 %).

Methodology

  1. Design of VLCs – The authors treat each VLC as a mini‑process inside a single OS process (a minimal sketch of this mechanism follows the list). When a VLC is created, the runtime:
    • Loads the requested libraries via dynamic linking.
    • Sets up a private thread pool and binds it to a user‑specified core set (using sched_setaffinity).
    • Optionally creates separate memory arenas to avoid false sharing.
  2. Isolation mechanisms
    • CPU affinity ensures that threads spawned by a library stay within its core slice.
    • NUMA policy (via numactl) isolates memory bandwidth.
    • GPU stream partitioning (for CUDA‑based libraries) assigns distinct streams/contexts.
  3. Execution model – The host program calls vlc_run(vlc, fn, args…). The runtime switches to the VLC’s context, invokes the user function, and then restores the original context. This switch is lightweight (a few microseconds).
  4. Evaluation setup – The authors built a suite of micro‑benchmarks and larger workloads (matrix multiplication, deep‑learning inference, graph analytics) that deliberately mix libraries known to clash. Experiments were run on a dual‑socket Intel Xeon platform (24 cores total) and an NVIDIA RTX 3090 GPU.
  5. Metrics – They measured wall‑clock time, CPU utilization, and memory bandwidth, comparing three configurations: (a) naïve shared‑library execution, (b) hand‑tuned OS‑level resource pinning, and (c) VLC‑based isolation.
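
The prototype's actual API is only named, not shown, in this summary, so the following is a minimal C++ sketch of the mechanism described in steps 1–3 under stated assumptions (Linux/glibc, compiled with g++): dynamic loading via dlopen plus a pinned worker thread whose affinity mask is inherited by any threads the loaded library spawns. The mini_vlc_* names, library paths, and core lists are placeholders for illustration, not the paper's interface.

```cpp
// Hedged sketch of the core mechanism (not the paper's implementation).
#include <dlfcn.h>     // dlopen, dlclose
#include <pthread.h>   // pthread_setaffinity_np (g++ defines _GNU_SOURCE)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <cstdio>
#include <thread>
#include <vector>

struct MiniVLC {
    void*     lib{};    // handle to the dynamically loaded library
    cpu_set_t cores{};  // the CPU slice dedicated to this context
};

// Step 1: load the library and record its core slice.
MiniVLC mini_vlc_create(const char* libpath, const std::vector<int>& core_ids) {
    MiniVLC v;
    v.lib = dlopen(libpath, RTLD_NOW | RTLD_LOCAL);
    CPU_ZERO(&v.cores);
    for (int c : core_ids) CPU_SET(c, &v.cores);
    return v;
}

// Steps 2-3: run fn on a thread bound to the context's cores. On Linux, any
// worker threads the library spawns from here inherit this affinity mask,
// which is what keeps an OpenMP/OpenBLAS pool inside its slice. (NUMA memory
// binding and GPU stream assignment would be layered on in the same place.)
void mini_vlc_run(const MiniVLC& v, void (*fn)(void*), void* arg) {
    std::thread t([&] {
        pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &v.cores);
        fn(arg);
    });
    t.join();  // the host resumes on its original context after the call
}

void mini_vlc_destroy(MiniVLC& v) {
    if (v.lib) dlclose(v.lib);
    v.lib = nullptr;
}

// Example: give a BLAS-style context cores 0-7 and an OpenMP-style context
// cores 8-15, mirroring the OpenBLAS/OpenMP split mentioned later in the post.
static void demo_task(void*) { std::puts("running inside a core slice"); }

int main() {
    MiniVLC blas_ctx = mini_vlc_create("libopenblas.so", {0, 1, 2, 3, 4, 5, 6, 7});
    MiniVLC omp_ctx  = mini_vlc_create("libgomp.so.1",   {8, 9, 10, 11, 12, 13, 14, 15});
    mini_vlc_run(blas_ctx, demo_task, nullptr);
    mini_vlc_run(omp_ctx,  demo_task, nullptr);
    mini_vlc_destroy(blas_ctx);
    mini_vlc_destroy(omp_ctx);
}
```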

Results & Findings

| Benchmark | Naïve (shared) | Hand-tuned OS pinning | VLCs | Speed-up vs. naïve |
|---|---|---|---|---|
| OpenMP + OpenBLAS (DGEMM) | 12.4 s | 9.8 s | 8.7 s | 1.43× |
| LibTorch inference + OpenMP | 6.2 s | 5.1 s | 4.3 s | 1.44× |
| Mixed OpenMP + CUDA (hybrid) | 14.8 s | 11.9 s | 8.2 s | 1.80× |
| Thread-unsafe custom library ×2 | 9.5 s (crash) | N/A | 5.3 s (two VLCs) | – |
| End-to-end graph analytics pipeline | 22.6 s | 18.7 s | 7.9 s | 2.85× |

Key takeaways

  • Contention reduction: By separating core pools, VLCs eliminated cache‑line bouncing and memory‑bandwidth oversubscription that plagued the naïve runs.
  • Safety for non‑thread‑safe code: Loading two copies of a legacy library in distinct VLCs allowed them to run concurrently without crashes (one way such duplication can be achieved on Linux is sketched after this list).
  • Low overhead: Context switches added < 5 % overhead even for fine‑grained calls, confirming the lightweight nature of the approach.
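
The post does not spell out how the runtime obtains two independent copies of one library. On Linux/glibc, a standard mechanism is dlmopen() with a fresh link-map namespace, sketched below; the library path (liblegacy.so) and symbol name (legacy_compute) are hypothetical placeholders, and the paper's runtime may use a different technique.

```cpp
// Hedged sketch: two independent copies of the same shared library via dlmopen.
#include <dlfcn.h>   // dlmopen, dlsym, LM_ID_NEWLM (glibc extension)
#include <cstdio>
#include <thread>

int main() {
    const char* lib = "./liblegacy.so";  // hypothetical thread-unsafe library
    // Each dlmopen(LM_ID_NEWLM, ...) call loads a fresh copy into its own
    // link-map namespace, so the copies do not share global state.
    void* copy1 = dlmopen(LM_ID_NEWLM, lib, RTLD_NOW | RTLD_LOCAL);
    void* copy2 = dlmopen(LM_ID_NEWLM, lib, RTLD_NOW | RTLD_LOCAL);
    if (!copy1 || !copy2) { std::fprintf(stderr, "load failed: %s\n", dlerror()); return 1; }

    // Resolve the same entry point in each copy; the two function pointers
    // refer to physically separate code and data.
    using entry_fn = void (*)();
    auto run1 = reinterpret_cast<entry_fn>(dlsym(copy1, "legacy_compute"));  // placeholder symbol
    auto run2 = reinterpret_cast<entry_fn>(dlsym(copy2, "legacy_compute"));
    if (!run1 || !run2) { std::fprintf(stderr, "symbol lookup failed\n"); return 1; }

    // Run the two copies concurrently; in the paper's setting each copy would
    // live in its own VLC with a disjoint core slice.
    std::thread t1(run1), t2(run2);
    t1.join();
    t2.join();
    dlclose(copy1);
    dlclose(copy2);
}
```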

Practical Implications

  • Simplified performance tuning: Developers can now declaratively assign resources to each library (e.g., “OpenBLAS gets cores 0‑7, OpenMP gets 8‑15”) without diving into OS‑level cgroups or custom build flags.
  • Safer library composition: Legacy or research libraries that were never designed for concurrent use can be safely combined, extending the usable ecosystem for data‑science pipelines, scientific simulations, and AI inference services.
  • Container‑friendly deployment: VLCs operate inside a single process, making them compatible with Docker/Kubernetes containers where spawning multiple processes may be undesirable.
  • Potential for automated tooling: The VLC API could be wrapped by build‑system plugins (CMake, Bazel) or runtime profilers that automatically infer optimal core partitions based on observed contention.
  • Cross‑language support: The Python prototype shows that high‑level frameworks (NumPy, PyTorch) can benefit without rewriting native extensions, opening the door for broader adoption in the data‑science community.

Limitations & Future Work

  • Scalability to many cores/GPU devices: The current prototype was evaluated on up to 24 CPU cores and a single GPU; extending VLCs to multi‑node clusters or heterogeneous accelerator fleets will require distributed coordination.
  • Dynamic workload adaptation: Resource partitions are static per VLC; future work could integrate runtime feedback loops that resize VLC allocations on the fly.
  • Interaction with OS schedulers: While VLCs enforce affinity, they still rely on the underlying OS scheduler for fairness; deeper integration (e.g., with cgroups or kernel‑level QoS) could improve isolation guarantees.
  • Security considerations: Loading multiple copies of the same library may increase the attack surface; sandboxing mechanisms could be explored.
  • Tooling ecosystem: The authors note the need for higher‑level abstractions (e.g., declarative YAML configs) and IDE plugins to lower the barrier for non‑expert developers.

Bottom line: Virtual Library Contexts provide a pragmatic, low‑overhead way to tame the chaos that arises when modern, high‑performance libraries are composed in a single process. By giving developers fine‑grained control over resource allocation without touching library code, VLCs open up new possibilities for building faster, more reliable parallel applications.

Authors

  • Yineng Yan
  • William Ruys
  • Hochan Lee
  • Ian Henriksen
  • Arthur Peters
  • Sean Stephens
  • Bozhi You
  • Henrique Fingler
  • Martin Burtscher
  • Milos Gligoric
  • Keshav Pingali
  • Mattan Erez
  • George Biros
  • Christopher J. Rossbach

Paper Information

  • arXiv ID: 2512.04320v1
  • Categories: cs.DC, cs.OS
  • Published: December 3, 2025