[Paper] FLEX: Leveraging FPGA-CPU Synergy for Mixed-Cell-Height Legalization Acceleration

Published: December 4, 2025

Source: arXiv - 2512.04527v1

Overview

The paper introduces FLEX, a hybrid FPGA‑CPU accelerator designed to speed up mixed‑cell‑height legalization—a critical step in physical design automation for modern chips. By intelligently splitting work between an FPGA and a CPU and applying a novel multi‑granularity pipeline, FLEX delivers order‑of‑magnitude performance gains while also improving placement quality.

Key Contributions

  • Hybrid Task Partitioning: Assigns the most parallelizable parts of legalization to the FPGA while keeping control‑flow‑heavy tasks on the CPU, exploiting each platform’s strengths.
  • Multi‑Granularity Pipelining: Operates at both coarse (macro‑level) and fine (cell‑level) granularities, dramatically accelerating the “finding optimal placement” (FOP) stage.
  • Optimized Cell‑Shifting Engine: A custom FPGA design that aligns perfectly with the pipeline, handling the computationally intensive cell‑shifting step with minimal overhead.
  • Performance Gains: Up to 18.3× speedup over a state‑of‑the‑art CPU‑GPU legalizer and 5.4× over a multi‑threaded CPU legalizer, plus a 4% improvement in legalization quality (lower cost) versus the CPU‑GPU baseline and 1% versus the multi‑threaded CPU baseline.
  • Scalability: Maintains its speedup advantage as design size grows.

Methodology

  1. Problem Decomposition – The legalization flow is broken into three logical phases:
    a. preprocessing & dependency analysis
    b. FOP (searching for legal positions)
    c. cell shifting (adjusting placements)
  2. Task Assignment (a minimal sketch of this split appears after this list)
    • CPU handles preprocessing, global routing constraints, and coordination logic.
    • FPGA executes the highly parallel FOP search and the cell‑shifting kernel.
  3. Multi‑Granularity Pipeline
    • Coarse‑Grain Stage: Processes groups of cells (e.g., clusters of the same height) to quickly prune infeasible regions.
    • Fine‑Grain Stage: Refines the placement of individual cells within the surviving candidate windows.
      The pipeline overlaps these stages so that while the FPGA works on fine‑grain data for one batch, the CPU prepares the next batch’s coarse‑grain information (see the second sketch after this list).
  4. FPGA Design Optimizations – Custom datapaths and on‑chip memory buffers are tuned to the shifting algorithm’s access pattern, reducing latency and avoiding pipeline stalls.
  5. Integration & Synchronization – A lightweight host‑side driver orchestrates data movement via PCIe, using double‑buffering to hide transfer costs.
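
To make the task split concrete, the sketch below models the three phases as plain Python functions. All names (`Cell`, `cpu_preprocess`, `fpga_fop`, `fpga_cell_shift`) are illustrative stand‑ins for this summary, not the paper’s implementation, and the “FPGA” kernels are simulated on the host.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Cell:
    name: str
    x: float           # global-placement coordinate (possibly illegal)
    y: float
    height_rows: int   # mixed cell height: 1, 2, ... placement rows

def cpu_preprocess(cells: List[Cell]) -> List[List[Cell]]:
    """Phase (a), on the CPU: dependency analysis. Here we simply batch
    cells by row height so each batch can be searched independently."""
    batches: Dict[int, List[Cell]] = {}
    for c in cells:
        batches.setdefault(c.height_rows, []).append(c)
    return list(batches.values())

def fpga_fop(batch: List[Cell]):
    """Phase (b), on the FPGA in FLEX: the parallel FOP search.
    Modeled here by snapping each cell to the nearest integer site."""
    return [(c, round(c.x), round(c.y)) for c in batch]

def fpga_cell_shift(candidates):
    """Phase (c), on the FPGA: the cell-shifting engine. Modeled as a
    pass-through; the real kernel resolves residual overlaps."""
    return candidates

def legalize(cells: List[Cell]):
    placed = []
    for batch in cpu_preprocess(cells):             # CPU: control-flow-heavy
        placed += fpga_cell_shift(fpga_fop(batch))  # FPGA: parallel kernels
    return placed

print(legalize([Cell("a", 1.4, 2.2, 1), Cell("b", 3.7, 2.9, 2)]))
```

In the real system the two `fpga_*` calls would be DMA‑driven kernel launches; the point of the split is that all data‑dependent control flow stays in `cpu_preprocess`.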
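
The overlap itself can be sketched with a producer‑consumer pair, again as an illustration under stated assumptions rather than the paper’s driver: a depth‑2 queue stands in for the double‑buffered PCIe transfers of step 5, so the coarse‑grain producer (CPU role) runs ahead while the fine‑grain consumer (FPGA role) drains the other buffer.

```python
import queue
import threading

def cpu_coarse_prep(batches, buf):
    """Coarse-grain stage (CPU role): prune each batch, then hand it off.
    put() blocks only when both buffer slots are already in flight."""
    for batch in batches:
        pruned = [c for c in batch if c >= 0]   # stand-in for pruning
        buf.put(pruned)
    buf.put(None)                               # sentinel: stream done

def fpga_fine_refine(buf, results):
    """Fine-grain stage (FPGA role): refine batch i while the producer
    is already preparing batch i+1 in the other buffer slot."""
    while (batch := buf.get()) is not None:
        results.append([c * 10 for c in batch]) # stand-in for refinement

buf = queue.Queue(maxsize=2)    # two slots ~ double buffering
results = []
consumer = threading.Thread(target=fpga_fine_refine, args=(buf, results))
consumer.start()
cpu_coarse_prep([[0, 1], [2, -1, 3], [4]], buf)
consumer.join()
print(results)                  # [[0, 10], [20, 30], [40]]
```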

Results & Findings

| Baseline | Speedup | Quality Δ (legalization cost) |
| --- | --- | --- |
| CPU‑GPU legalizer (state of the art) | 18.3× | +4% (lower cost) |
| Multi‑threaded CPU legalizer | 5.4× | +1% |
| Scalability test (design size ↑) | Remains > 4× up to 2× larger benchmarks | Improvement stays within 1–4% |

Key takeaways:

  • The FPGA handles the bulk of the compute‑heavy search, turning a previously serial bottleneck into a massively parallel operation.
  • The pipeline eliminates idle periods, achieving near‑continuous utilization of both CPU and FPGA resources.
  • Legalization quality improves because the fine‑grain stage can explore more candidate positions without the time pressure that limits pure‑CPU approaches.

Practical Implications

  • Faster Tape‑out Cycles: Design teams can shrink the physical design verification window, enabling more iterative optimization within a given project timeline.
  • Cost‑Effective Acceleration: Compared to GPU clusters, an FPGA‑CPU board (e.g., Xilinx Alveo or Intel Agilex) offers comparable or better performance per watt, making it attractive for fab‑less startups and midsize companies.
  • Integration into Existing EDA Flows: FLEX’s host‑side API mirrors typical CPU‑only legalizer calls, so tool vendors can drop the accelerator in with minimal code changes (see the sketch after this list).
  • Potential for Cloud‑Based Services: The modular task partitioning maps well to heterogeneous cloud instances (CPU + FPGA), opening the door for on‑demand legalization-as-a‑service.
  • Extensibility to Other Placement Tasks: The multi‑granularity pipeline concept can be reused for timing‑driven placement, congestion analysis, or even post‑silicon floorplanning, where similar search‑and‑refine patterns appear.
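
As an illustration of what such a drop‑in might look like (the class and method names below are hypothetical; this summary does not describe the paper’s actual API), the host‑side call site stays identical whichever backend is handed in:

```python
class CpuLegalizer:
    """Stand-in for an existing CPU-only legalizer."""
    def legalize(self, cells):
        return sorted(cells)               # placeholder for the CPU flow

class FlexLegalizer:
    """Hypothetical FLEX wrapper exposing the same call signature."""
    def __init__(self, device="fpga0"):    # PCIe board handle (assumed)
        self.device = device
    def legalize(self, cells):
        return sorted(cells)               # placeholder for FPGA offload

def run_placement_flow(legalizer, cells):
    # Tool code is agnostic to which backend it was handed.
    return legalizer.legalize(cells)

cells = [3, 1, 2]
assert run_placement_flow(CpuLegalizer(), cells) == \
       run_placement_flow(FlexLegalizer(), cells)  # same call site, same API
```

Because only the constructor differs, switching backends amounts to a one‑line change in the flow script.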

Limitations & Future Work

  • FPGA Resource Constraints: Very large designs may exceed on‑chip memory, requiring additional off‑chip buffering that could erode some speedup.
  • Portability: The current implementation targets a specific FPGA family; retargeting to other vendors may need non‑trivial redesign of the custom kernels.
  • Dynamic Workloads: The static partitioning assumes a relatively stable workload; adaptive scheduling for highly irregular designs is left as an open problem.
  • Future Directions: The authors plan to explore hierarchical partitioning across multiple FPGAs, integrate machine‑learning‑guided candidate pruning, and extend the pipeline to support mixed‑technology (FinFET + emerging) node legalization.

Authors

  • Xingyu Liu
  • Jiawei Liang
  • Linfeng Du
  • Yipu Zhang
  • Chaofang Ma
  • Hanwei Fan
  • Jiang Xu
  • Wei Zhang

Paper Information

  • arXiv ID: 2512.04527v1
  • Categories: cs.AR, cs.DC
  • Published: December 4, 2025
  • PDF: https://arxiv.org/pdf/2512.04527v1