[Paper] FLEX: Leveraging FPGA-CPU Synergy for Mixed-Cell-Height Legalization Acceleration

Published: December 4, 2025

Source: arXiv - 2512.04527v1

Overview

The paper introduces FLEX, a hybrid FPGA‑CPU accelerator designed to speed up mixed‑cell‑height legalization—a critical step in physical design automation for modern chips. By intelligently splitting work between an FPGA and a CPU and applying a novel multi‑granularity pipeline, FLEX delivers order‑of‑magnitude performance gains while also improving placement quality.

Key Contributions

  • Hybrid Task Partitioning: Assigns the most parallelizable parts of legalization to the FPGA while keeping control‑flow‑heavy tasks on the CPU, exploiting each platform’s strengths.
  • Multi‑Granularity Pipelining: Operates at both coarse (macro‑level) and fine (cell‑level) granularities, dramatically accelerating the “finding optimal placement” (FOP) stage.
  • Optimized Cell‑Shifting Engine: A custom FPGA design that aligns perfectly with the pipeline, handling the computationally intensive cell‑shifting step with minimal overhead.
  • Performance Gains: Up to 18.3× speedup over a state‑of‑the‑art CPU‑GPU legalizer and 5.4× over a multi‑threaded CPU legalizer, plus a 4% improvement in legalization quality (lower cost) versus the CPU‑GPU baseline and 1% versus the multi‑threaded CPU baseline.
  • Scalability: Maintains its speedup advantage as design size grows.

Methodology

  1. Problem Decomposition – The legalization flow is broken into three logical phases:
    a. preprocessing & dependency analysis
    b. FOP (searching for legal positions)
    c. cell shifting (adjusting placements)
  2. Task Assignment (a minimal sketch of this split appears after this list)
    • CPU handles preprocessing, global routing constraints, and coordination logic.
    • FPGA executes the highly parallel FOP search and the cell‑shifting kernel.
  3. Multi‑Granularity Pipeline
    • Coarse‑Grain Stage: Processes groups of cells (e.g., clusters of the same height) to quickly prune infeasible regions.
    • Fine‑Grain Stage: Refines the placement of individual cells within the surviving candidate windows.
      The pipeline overlaps these stages so that while the FPGA works on fine‑grain data for one batch, the CPU prepares the next batch’s coarse‑grain information (see the second sketch after this list).
  4. FPGA Design Optimizations – Custom datapaths and on‑chip memory buffers are tuned to the shifting algorithm’s access pattern, reducing latency and avoiding pipeline stalls.
  5. Integration & Synchronization – A lightweight host‑side driver orchestrates data movement via PCIe, using double‑buffering to hide transfer costs.
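
To make the task split concrete, the sketch below models the three phases as plain Python functions. All names (`Cell`, `cpu_preprocess`, `fpga_fop`, `fpga_cell_shift`) are illustrative stand‑ins for this summary, not the paper’s implementation, and the “FPGA” kernels are simulated on the host.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Cell:
    name: str
    x: float           # global-placement coordinate (possibly illegal)
    y: float
    height_rows: int   # mixed cell height: 1, 2, ... placement rows

def cpu_preprocess(cells: List[Cell]) -> List[List[Cell]]:
    """Phase (a), on the CPU: dependency analysis. Here we simply batch
    cells by row height so each batch can be searched independently."""
    batches: Dict[int, List[Cell]] = {}
    for c in cells:
        batches.setdefault(c.height_rows, []).append(c)
    return list(batches.values())

def fpga_fop(batch: List[Cell]):
    """Phase (b), on the FPGA in FLEX: the parallel FOP search.
    Modeled here by snapping each cell to the nearest integer site."""
    return [(c, round(c.x), round(c.y)) for c in batch]

def fpga_cell_shift(candidates):
    """Phase (c), on the FPGA: the cell-shifting engine. Modeled as a
    pass-through; the real kernel resolves residual overlaps."""
    return candidates

def legalize(cells: List[Cell]):
    placed = []
    for batch in cpu_preprocess(cells):             # CPU: control-flow-heavy
        placed += fpga_cell_shift(fpga_fop(batch))  # FPGA: parallel kernels
    return placed

print(legalize([Cell("a", 1.4, 2.2, 1), Cell("b", 3.7, 2.9, 2)]))
```

In the real system the two `fpga_*` calls would be DMA‑driven kernel launches; the point of the split is that all data‑dependent control flow stays in `cpu_preprocess`.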
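
The overlap itself can be sketched with a producer‑consumer pair, again as an illustration under stated assumptions rather than the paper’s driver: a depth‑2 queue stands in for the double‑buffered PCIe transfers of step 5, so the coarse‑grain producer (CPU role) runs ahead while the fine‑grain consumer (FPGA role) drains the other buffer.

```python
import queue
import threading

def cpu_coarse_prep(batches, buf):
    """Coarse-grain stage (CPU role): prune each batch, then hand it off.
    put() blocks only when both buffer slots are already in flight."""
    for batch in batches:
        pruned = [c for c in batch if c >= 0]   # stand-in for pruning
        buf.put(pruned)
    buf.put(None)                               # sentinel: stream done

def fpga_fine_refine(buf, results):
    """Fine-grain stage (FPGA role): refine batch i while the producer
    is already preparing batch i+1 in the other buffer slot."""
    while (batch := buf.get()) is not None:
        results.append([c * 10 for c in batch]) # stand-in for refinement

buf = queue.Queue(maxsize=2)    # two slots ~ double buffering
results = []
consumer = threading.Thread(target=fpga_fine_refine, args=(buf, results))
consumer.start()
cpu_coarse_prep([[0, 1], [2, -1, 3], [4]], buf)
consumer.join()
print(results)                  # [[0, 10], [20, 30], [40]]
```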

Results & Findings

| Baseline | Speedup | Quality Δ (legalization cost) |
| --- | --- | --- |
| CPU‑GPU legalizer (state of the art) | 18.3× | +4% (lower cost) |
| Multi‑threaded CPU legalizer | 5.4× | +1% |
| Scalability test (design size ↑) | Remains > 4× up to 2× larger benchmarks | Improvement stays within 1–4% |

Key takeaways:

  • The FPGA handles the bulk of the compute‑heavy search, turning a previously serial bottleneck into a massively parallel operation.
  • The pipeline eliminates idle periods, achieving near‑continuous utilization of both CPU and FPGA resources.
  • Legalization quality improves because the fine‑grain stage can explore more candidate positions without the time pressure that limits pure‑CPU approaches.

Practical Implications

  • Faster Tape‑out Cycles: Design teams can shrink the physical design verification window, enabling more iterative optimization within a given project timeline.
  • Cost‑Effective Acceleration: Compared to GPU clusters, an FPGA‑CPU board (e.g., Xilinx Alveo or Intel Agilex) offers comparable or better performance per watt, making it attractive for fab‑less startups and midsize companies.
  • Integration into Existing EDA Flows: FLEX’s host‑side API mirrors typical CPU‑only legalizer calls, so tool vendors can drop the accelerator in with minimal code changes (see the sketch after this list).
  • Potential for Cloud‑Based Services: The modular task partitioning maps well to heterogeneous cloud instances (CPU + FPGA), opening the door for on‑demand legalization-as-a‑service.
  • Extensibility to Other Placement Tasks: The multi‑granularity pipeline concept can be reused for timing‑driven placement, congestion analysis, or even post‑silicon floorplanning, where similar search‑and‑refine patterns appear.
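
As an illustration of what such a drop‑in might look like (the class and method names below are hypothetical; this summary does not describe the paper’s actual API), the host‑side call site stays identical whichever backend is handed in:

```python
class CpuLegalizer:
    """Stand-in for an existing CPU-only legalizer."""
    def legalize(self, cells):
        return sorted(cells)               # placeholder for the CPU flow

class FlexLegalizer:
    """Hypothetical FLEX wrapper exposing the same call signature."""
    def __init__(self, device="fpga0"):    # PCIe board handle (assumed)
        self.device = device
    def legalize(self, cells):
        return sorted(cells)               # placeholder for FPGA offload

def run_placement_flow(legalizer, cells):
    # Tool code is agnostic to which backend it was handed.
    return legalizer.legalize(cells)

cells = [3, 1, 2]
assert run_placement_flow(CpuLegalizer(), cells) == \
       run_placement_flow(FlexLegalizer(), cells)  # same call site, same API
```

Because only the constructor differs, switching backends amounts to a one‑line change in the flow script.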

Limitations & Future Work

  • FPGA Resource Constraints: Very large designs may exceed on‑chip memory, requiring additional off‑chip buffering that could erode some speedup.
  • Portability: The current implementation targets a specific FPGA family; retargeting to other vendors may need non‑trivial redesign of the custom kernels.
  • Dynamic Workloads: The static partitioning assumes a relatively stable workload; adaptive scheduling for highly irregular designs is left as an open problem.
  • Future Directions: The authors plan to explore hierarchical partitioning across multiple FPGAs, integrate machine‑learning‑guided candidate pruning, and extend the pipeline to support mixed‑technology (FinFET + emerging) node legalization.

Authors

  • Xingyu Liu
  • Jiawei Liang
  • Linfeng Du
  • Yipu Zhang
  • Chaofang Ma
  • Hanwei Fan
  • Jiang Xu
  • Wei Zhang

Paper Information

  • arXiv ID: 2512.04527v1
  • Categories: cs.AR, cs.DC
  • Published: December 4, 2025
  • PDF: https://arxiv.org/pdf/2512.04527v1