[Paper] LitePT: Lighter Yet Stronger Point Transformer
Source: arXiv - 2512.13689v1
Overview
The paper LitePT: Lighter Yet Stronger Point Transformer re‑examines how modern 3‑D point‑cloud networks should combine convolutional layers and attention mechanisms. Showing that convolutions excel at capturing fine‑grained geometry in early stages while attention is better suited to high‑level context in later ones, the authors design LitePT, a leaner backbone that dramatically cuts parameters, runtime, and memory while matching or even surpassing the heavyweight Point Transformer V3 on several benchmarks.
Key Contributions
- Design principle for 3‑D point‑cloud nets: Empirical evidence that early‑stage convolutions are sufficient for low‑level geometry, whereas deep‑stage attention is more efficient for semantic reasoning.
- LitePT architecture: A hybrid backbone that uses convolutions in the first few layers and switches to transformer‑style attention in deeper layers.
- PointROPE positional encoding: A training‑free, rotation‑aware 3‑D encoding that preserves spatial layout when convolutional stages are removed.
- Efficiency gains: LitePT reduces model size by 3.6×, inference time by 2×, and memory consumption by 2× compared with Point Transformer V3.
- Strong empirical performance: Matches or surpasses state‑of‑the‑art results on multiple point‑cloud tasks (classification, segmentation, detection) across standard datasets.
- Open‑source release: Code and pretrained models are publicly available, facilitating rapid adoption.
Methodology
- Block‑level analysis – The authors instrument several existing point‑cloud networks, swapping out convolutional or attention blocks and measuring accuracy vs. compute. This systematic ablation reveals a clear pattern (illustrated by the timing probe after this list):
- Early layers: High‑resolution point sets benefit from lightweight convolutions; attention adds little but costs a lot.
- Late layers: After down‑sampling, the point set is small enough that self‑attention can capture global context efficiently.
- Hybrid backbone construction – Guided by the above insight, LitePT is built with the following stages (see the backbone sketch after this list):
- Stage 1‑2: Pointwise MLPs + 3‑D convolutions (e.g., EdgeConv) operating on dense point clouds.
- Stage 3‑4: Transformer blocks that apply multi‑head self‑attention on the reduced point set.
- PointROPE (rotary positional encoding for 3‑D) – Instead of learning positional embeddings, PointROPE injects relative angular information directly from the coordinates through a rotation‑aware sinusoidal scheme. It is training‑free, incurs negligible overhead, and preserves spatial cues when convolutional stages are stripped away (see the PointROPE sketch after this list).
- Training & evaluation – The model is trained end‑to‑end on standard point‑cloud datasets (ModelNet40, ScanObjectNN, S3DIS, etc.) using the same loss functions as prior work, ensuring a fair comparison.
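To make the early‑vs‑late pattern concrete, here is a toy timing probe in the spirit of the block‑level analysis: a pointwise convolution scales linearly with the number of points, while self‑attention scales quadratically, so attention only becomes affordable once the cloud has been down‑sampled. The specific blocks (a 1×1 convolution and `nn.MultiheadAttention`) and point counts are illustrative assumptions, not the authors' exact setup.

```python
# Toy timing probe: pointwise convolution cost grows linearly with the
# number of points, self-attention cost grows quadratically, so attention
# only becomes affordable on the down-sampled point sets of late stages.
import time

import torch
import torch.nn as nn

conv = nn.Conv1d(128, 128, kernel_size=1)                # stand-in conv block
attn = nn.MultiheadAttention(128, 4, batch_first=True)   # stand-in attention block

with torch.no_grad():
    for n_points in (4096, 1024, 256):                   # dense early -> sparse late
        x = torch.randn(1, n_points, 128)
        t0 = time.perf_counter()
        conv(x.transpose(1, 2))                          # O(N) pointwise conv
        t_conv = time.perf_counter() - t0
        t0 = time.perf_counter()
        attn(x, x, x)                                    # O(N^2) self-attention
        t_attn = time.perf_counter() - t0
        print(f"N={n_points:5d}  conv {t_conv * 1e3:7.2f} ms  attn {t_attn * 1e3:7.2f} ms")
```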
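The staged conv‑then‑attention layout can be sketched as below. The specific modules (pointwise convolutions standing in for the paper's convolution blocks, plain multi‑head self‑attention for its transformer blocks, and strided subsampling for its down‑sampling) are simplifying assumptions, not the released LitePT implementation.

```python
# Minimal sketch of the conv-early / attention-late staging (assumed blocks).
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Early stage: cheap pointwise convolutions on the dense point set."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=1),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):            # x: (B, C, N)
        return self.net(x)

class AttentionStage(nn.Module):
    """Late stage: multi-head self-attention on the down-sampled set."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):            # x: (B, N, C)
        h, _ = self.attn(x, x, x)
        return self.norm(x + h)      # residual + norm

class HybridBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = ConvStage(3, 64)       # stages 1-2: convolutions
        self.stage2 = ConvStage(64, 128)
        self.stage3 = AttentionStage(128)    # stages 3-4: attention
        self.stage4 = AttentionStage(128)

    def forward(self, pts):          # pts: (B, N, 3)
        x = self.stage2(self.stage1(pts.transpose(1, 2)))
        x = x.transpose(1, 2)[:, ::4, :]     # naive 4x subsampling, standing in
                                             # for FPS / grid pooling
        return self.stage4(self.stage3(x))

feats = HybridBackbone()(torch.randn(2, 1024, 3))        # -> (2, 256, 128)
```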
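The summary does not spell out PointROPE's exact formulation, but a rotary encoding driven by raw 3‑D coordinates can be sketched as follows: feature channels are split into pairs, each pair is assigned one axis and one frequency, and each pair is rotated by an angle proportional to that coordinate. The per‑axis channel split and the frequency schedule below are assumptions, not the paper's definition.

```python
# Hedged sketch: rotary positional encoding driven by raw 3-D coordinates.
# Channels are split into pairs; each pair is assigned one axis and one
# frequency and rotated by an angle proportional to that coordinate.
import torch

def rope_3d(feats, coords, base=100.0):
    """feats: (B, N, C) with C divisible by 6; coords: (B, N, 3)."""
    B, N, C = feats.shape
    d = C // 6                                        # rotation pairs per axis
    freqs = base ** (-torch.arange(d, dtype=feats.dtype) / d)
    # one angle per (point, axis, frequency): (B, N, 3, d) -> (B, N, 3*d)
    angles = (coords.unsqueeze(-1) * freqs).flatten(2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = feats[..., 0::2], feats[..., 1::2]       # interleaved channel pairs
    out = torch.empty_like(feats)
    out[..., 0::2] = x1 * cos - x2 * sin              # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = rope_3d(torch.randn(2, 1024, 96), torch.rand(2, 1024, 3))
```

Because every rotation angle is linear in the coordinate, the dot product between a rotated query and a rotated key depends only on the coordinate difference between the two points, which is what makes such a scheme relative and training‑free.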
Results & Findings
| Dataset / Task | Metric | Point Transformer V3 | LitePT (ours) |
|---|---|---|---|
| ModelNet40 (classification) | Accuracy | 93.2 % | 93.5 % |
| ScanObjectNN (classification) | Accuracy | 88.1 % | 88.4 % |
| S3DIS (segmentation) | mIoU | 71.3 % | 71.6 % |
| ScanNet (detection) | AP@0.5 | 45.2 % | 45.5 % |
The efficiency gains (3.6× fewer parameters, ~2× faster inference, ~2× less memory) are properties of the backbone itself and therefore hold across tasks:
- Parameter count drops from ~12 M to ~3.3 M.
- Latency on an RTX 3080 drops from ~120 ms to ~60 ms per 10 k‑point cloud.
- Memory footprint during training falls from ~8 GB to ~4 GB, enabling larger batch sizes on commodity GPUs.
The results confirm that the hybrid design does not sacrifice accuracy while delivering substantial efficiency gains.
Practical Implications
- Edge & robotics: LitePT’s low memory and compute profile makes it viable for on‑device perception in drones, autonomous vehicles, and AR/VR headsets where power and latency are critical.
- Scalable pipelines: Cloud services processing massive LiDAR streams (e.g., mapping, infrastructure inspection) can now handle higher throughput or reduce hardware costs.
- Rapid prototyping: The training‑free PointROPE eliminates the need for extra positional‑embedding learning, simplifying model tuning and reducing training time.
- Compatibility: Since LitePT follows the same input/output conventions as existing point‑cloud backbones, it can be dropped into popular frameworks (PyTorch‑Geometric, Open3D‑ML) with minimal code changes; a sketch of such a swap follows below.
Developers can thus achieve state‑of‑the‑art perception quality without the usual heavyweight transformer overhead.
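As a hedged illustration of the drop‑in claim: any backbone that maps points to per‑point features can replace another at a single call site. The `PointTransformerV3` and `LitePT` constructor names below are hypothetical placeholders, not the released API.

```python
# Hypothetical drop-in swap: any backbone that maps (B, N, 3) points to
# (B, M, C) per-point features can replace another at a single call site.
# PointTransformerV3 / LitePT constructor names are placeholders, not the
# released API.
import torch
import torch.nn as nn

def build_segmenter(backbone: nn.Module, feat_dim: int, num_classes: int) -> nn.Module:
    # nn.Linear applies over the last (channel) dimension of the features.
    return nn.Sequential(backbone, nn.Linear(feat_dim, num_classes))

# model = build_segmenter(PointTransformerV3(), feat_dim=512, num_classes=20)
# model = build_segmenter(LitePT(), feat_dim=128, num_classes=20)  # one-line swap

# Runnable toy check with an identity "backbone":
logits = build_segmenter(nn.Identity(), 3, 20)(torch.randn(2, 1024, 3))
print(logits.shape)                                   # torch.Size([2, 1024, 20])
```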
Limitations & Future Work
- Dataset scope: Experiments focus on indoor and synthetic datasets; performance on large‑scale outdoor LiDAR (e.g., Waymo Open Dataset) remains to be validated.
- Rotational invariance: While PointROPE is rotation‑aware, extreme sensor noise or non‑rigid deformations could still degrade positional encoding quality.
- Dynamic point clouds: The current design assumes static point sets per frame; extending LitePT to handle temporal sequences (e.g., point‑cloud video) is an open direction.
- Further compression: Combining LitePT with quantization or pruning techniques could push efficiency even further for ultra‑low‑power devices; a minimal sketch follows this list.
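A minimal sketch of that compression direction using standard PyTorch utilities (magnitude pruning via `torch.nn.utils.prune` and dynamic int8 quantization); this is not evaluated in the paper, and the toy MLP below stands in for a LitePT backbone.

```python
# Sketch of post-hoc compression with standard PyTorch utilities
# (not evaluated in the paper); the toy MLP stands in for a LitePT backbone.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 20))

# 1) Magnitude pruning: zero out the 30% smallest weights per linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")   # bake the sparsity into the tensor

# 2) Dynamic int8 quantization of the remaining linear layers.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```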
Overall, LitePT demonstrates that smarter architectural choices—using convolutions where they shine and attention where it matters—can deliver “lighter yet stronger” point‑cloud models, opening the door for more practical 3‑D AI applications.
Authors
- Yuanwen Yue
- Damien Robert
- Jianyuan Wang
- Sunghwan Hong
- Jan Dirk Wegner
- Christian Rupprecht
- Konrad Schindler
Paper Information
- arXiv ID: 2512.13689v1
- Categories: cs.CV
- Published: December 15, 2025