[Paper] Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings
Source: arXiv - 2602.15791v1
Overview
The paper proposes swapping traditional one‑hot encodings for large language model (LLM) embeddings when training AI models that need to understand building semantics (e.g., BIM object sub‑types). By feeding richer, context‑aware vectors from models like GPT‑4 and LLaMA into a GraphSAGE classifier, the authors show measurable gains in subtype classification across real‑world high‑rise residential BIM datasets.
Key Contributions
- LLM‑based encoding pipeline: Introduces a straightforward way to replace one‑hot vectors with high‑dimensional embeddings generated by off‑the‑shelf LLMs.
- Dimensionality‑reduction via Matryoshka: Demonstrates that compacted 1,024‑dim embeddings retain most of the semantic signal while being more practical for downstream models.
- Empirical validation on BIM data: Trains GraphSAGE on 42 object sub‑types from five high‑rise residential BIMs, achieving a weighted‑average F1‑score of 0.8766 (LLM) vs. 0.8475 (one‑hot).
- Benchmark across multiple LLMs: Evaluates raw embeddings from GPT‑4‑style (1,536‑dim), LLaMA‑2 (3,072‑dim), LLaMA‑3 (4,096‑dim) and their compacted versions, providing a clear performance‑vs‑size trade‑off.
- Open‑source reproducibility: Supplies code and data processing scripts, enabling developers to plug the encoding step into existing graph‑based pipelines.
Methodology
- Data preparation – Extracted object names (e.g., “exterior wall – insulated concrete”) from BIM files and mapped each to one of 42 predefined sub‑types.
- Embedding generation – Sent each object label through an LLM API (OpenAI GPT, Meta LLaMA) to obtain a dense vector; for the compacted variant, Matryoshka representation learning truncated the raw vectors to 1,024 dimensions while retaining most of the semantic signal.
- Graph construction – Built a heterogeneous graph where nodes represent BIM objects and edges capture spatial or functional relationships (e.g., “adjacent to”, “supports”).
- Model training – Trained a GraphSAGE network to predict the object sub‑type using either one‑hot or LLM‑derived node features. Hyperparameters (learning rate, number of layers, neighbor sampling) were kept identical across experiments.
- Evaluation – Measured weighted‑average F1‑score across a stratified test split, comparing each embedding strategy against the one‑hot baseline.
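The encoding step above can be sketched in a few lines. The `embed_label` stub below is a hypothetical, deterministic stand-in for a real LLM embedding API call (the paper queries OpenAI GPT and Meta LLaMA endpoints), the second label is illustrative, and the truncation mirrors Matryoshka-style compaction:

```python
import zlib
import numpy as np

# Hypothetical stand-in for an LLM embedding API call; a real pipeline would
# query an OpenAI or LLaMA endpoint here. Deterministic so the sketch runs offline.
def embed_label(label: str, dim: int = 4096) -> np.ndarray:
    rng = np.random.default_rng(zlib.crc32(label.encode()))
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)  # embedding APIs typically return unit vectors

def matryoshka_truncate(vec: np.ndarray, dim: int = 1024) -> np.ndarray:
    # Matryoshka-style compaction: keep the leading coordinates and re-normalize;
    # this is meaningful because Matryoshka-trained models front-load semantic signal.
    prefix = vec[:dim]
    return prefix / np.linalg.norm(prefix)

labels = ["exterior wall - insulated concrete", "interior wall - gypsum board"]
raw = np.stack([embed_label(l) for l in labels])            # raw 4,096-dim features
compact = np.stack([matryoshka_truncate(v) for v in raw])   # compacted 1,024-dim features
```

Either `raw` or `compact` can then be attached to the graph as node features in place of the 42-dim one-hot vectors, leaving the rest of the GraphSAGE pipeline unchanged.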
Results & Findings
| Encoding | Dimensionality | Weighted‑Avg F1 |
|---|---|---|
| One‑hot | 42 | 0.8475 |
| GPT (raw) | 1,536 | 0.8621 |
| LLaMA‑2 (raw) | 3,072 | 0.8684 |
| LLaMA‑3 (raw) | 4,096 | 0.8709 |
| LLaMA‑3 (Matryoshka) | 1,024 | 0.8766 |
- Semantic richness matters – Even the smallest raw LLM embedding outperformed one‑hot, confirming that contextual information captured by LLMs helps differentiate closely related sub‑types.
- Compact representations win – The Matryoshka‑compressed LLaMA‑3 vector achieved the highest F1 despite being far smaller than the raw 4,096‑dim vector, highlighting effective dimensionality reduction.
- Robustness across models – All LLM variants consistently beat the baseline, suggesting the approach is model‑agnostic.
Practical Implications
- Plug‑and‑play feature engineering – Developers can replace a simple categorical column with an API call to an LLM, gaining richer features without redesigning the whole pipeline.
- Improved BIM analytics – More accurate sub-type classification makes downstream tasks such as automated clash detection, cost estimation, and facility management more precise.
- Scalable graph‑AI – Compact embeddings keep memory footprints low, making it feasible to run GraphSAGE (or similar GNNs) on large‑scale construction projects or cloud‑based BIM services.
- Cross‑domain transfer – The same encoding strategy could be applied to other domain‑specific taxonomies (e.g., mechanical parts, electrical components) where one‑hot falls short.
- Rapid prototyping – Since the embeddings are generated on‑the‑fly via public LLM APIs, teams can experiment with new semantic vocabularies without waiting for custom label encoders.
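The plug-and-play point interacts with the API-cost caveat: BIM models repeat the same labels many times, so a small on-disk cache keeps each distinct label to one API request. A minimal sketch, where `embed` is a hypothetical deterministic stub standing in for a real provider call:

```python
import hashlib
import json
from pathlib import Path

def embed(label: str) -> list[float]:
    # Hypothetical stub: a real pipeline would call the LLM provider's
    # embeddings API here. A hash-derived vector keeps the sketch offline.
    digest = hashlib.sha256(label.encode()).digest()
    return [b / 255.0 for b in digest[:8]]

class EmbeddingCache:
    """Disk-backed label -> embedding cache; misses trigger one embed() call."""
    def __init__(self, path: str = "embeddings.json"):
        self.path = Path(path)
        self.store = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, label: str) -> list[float]:
        if label not in self.store:
            self.store[label] = embed(label)          # cache miss: single API call
            self.path.write_text(json.dumps(self.store))
        return self.store[label]

cache = EmbeddingCache()
vec = cache.get("exterior wall - insulated concrete")
```

A production version would batch misses and use the provider's batch endpoint, but the structure is the same.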
Limitations & Future Work
- Dependency on LLM APIs – Real‑time embedding generation incurs latency and cost; offline caching or open‑source LLMs may be needed for production.
- Static label focus – The study only encoded object names; richer BIM metadata (geometry, material properties) was not incorporated.
- Generalization to other building types – Experiments were limited to five high‑rise residential models; performance on commercial, infrastructure, or historic BIMs remains untested.
- Dimensionality‑reduction trade‑offs – While Matryoshka worked well here, other reduction techniques (e.g., PCA, autoencoders) could be benchmarked for speed vs. accuracy.
- Explainability – Dense embeddings obscure the exact semantic cues the model uses; future work could explore attention‑based visualizations to aid interpretability for engineers.
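As a starting point for the reduction-technique comparison suggested above, PCA and Matryoshka-style truncation differ in one line each; the data below is random and purely illustrative (real embeddings front-load signal by training, which random data does not capture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 512))  # stand-in for 200 raw label embeddings

def pca_reduce(X: np.ndarray, k: int) -> np.ndarray:
    # Classic PCA via SVD: project centered data onto the top-k principal axes.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def matryoshka_reduce(X: np.ndarray, k: int) -> np.ndarray:
    # Matryoshka-style: simply keep the leading k coordinates of each vector.
    return X[:, :k]

k = 64
X_pca = pca_reduce(X, k)
X_mat = matryoshka_reduce(X, k)
# PCA maximizes retained variance for any fixed data; Matryoshka instead relies
# on the embedding model having concentrated meaning in early coordinates.
var_pca = X_pca.var(axis=0).sum()
var_mat = X_mat.var(axis=0).sum()
```

Benchmarking these (plus an autoencoder) on the actual LLM embeddings would quantify the speed-vs-accuracy trade-off the authors leave open.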
Authors
- Suhyung Jang
- Ghang Lee
- Jaekun Lee
- Hyunjun Lee
Paper Information
- arXiv ID: 2602.15791v1
- Categories: cs.AI, cs.CL
- Published: February 17, 2026