[Paper] Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings
Source: arXiv - 2602.15791v1
Overview
The paper proposes swapping traditional one‑hot encodings for large language model (LLM) embeddings when training AI models that need to understand building semantics (e.g., BIM object sub‑types). By feeding richer, context‑aware vectors from models like GPT‑4 and LLaMA into a GraphSAGE classifier, the authors show measurable gains in subtype classification across real‑world high‑rise residential BIM datasets.
Key Contributions
- LLM‑based encoding pipeline: Introduces a straightforward way to replace one‑hot vectors with high‑dimensional embeddings generated by off‑the‑shelf LLMs.
- Dimensionality‑reduction via Matryoshka: Demonstrates that compacted 1,024‑dim embeddings retain most of the semantic signal while being more practical for downstream models.
- Empirical validation on BIM data: Trains GraphSAGE on 42 object sub‑types from five high‑rise residential BIMs, achieving a weighted‑average F1‑score of 0.8766 (LLM) vs. 0.8475 (one‑hot).
- Benchmark across multiple LLMs: Evaluates raw embeddings from GPT‑4‑style (1,536‑dim), LLaMA‑2 (3,072‑dim), LLaMA‑3 (4,096‑dim) and their compacted versions, providing a clear performance‑vs‑size trade‑off.
- Open‑source reproducibility: Supplies code and data processing scripts, enabling developers to plug the encoding step into existing graph‑based pipelines.
Methodology
- Data preparation – Extracted object names (e.g., “exterior wall – insulated concrete”) from BIM files and mapped each to one of 42 predefined sub‑types.
- Embedding generation – Sent each object label through an LLM API (OpenAI GPT, Meta LLaMA) to obtain a dense vector; for the compacted variant, Matryoshka representation learning truncated the raw vectors to 1,024 dimensions while retaining most of the semantic signal.
- Graph construction – Built a heterogeneous graph where nodes represent BIM objects and edges capture spatial or functional relationships (e.g., “adjacent to”, “supports”).
- Model training – Trained a GraphSAGE network to predict the object sub‑type using either one‑hot or LLM‑derived node features. Hyperparameters (learning rate, number of layers, neighbor sampling) were kept identical across experiments.
- Evaluation – Measured weighted‑average F1‑score across a stratified test split, comparing each embedding strategy against the one‑hot baseline.
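The encoding step above can be sketched in a few lines. The `embed_label` stub below is a hypothetical, deterministic stand-in for a real LLM embedding API call (the paper queries OpenAI GPT and Meta LLaMA endpoints), the second label is illustrative, and the truncation mirrors Matryoshka-style compaction:

```python
import zlib
import numpy as np

# Hypothetical stand-in for an LLM embedding API call; a real pipeline would
# query an OpenAI or LLaMA endpoint here. Deterministic so the sketch runs offline.
def embed_label(label: str, dim: int = 4096) -> np.ndarray:
    rng = np.random.default_rng(zlib.crc32(label.encode()))
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)  # embedding APIs typically return unit vectors

def matryoshka_truncate(vec: np.ndarray, dim: int = 1024) -> np.ndarray:
    # Matryoshka-style compaction: keep the leading coordinates and re-normalize;
    # this is meaningful because Matryoshka-trained models front-load semantic signal.
    prefix = vec[:dim]
    return prefix / np.linalg.norm(prefix)

labels = ["exterior wall - insulated concrete", "interior wall - gypsum board"]
raw = np.stack([embed_label(l) for l in labels])            # raw 4,096-dim features
compact = np.stack([matryoshka_truncate(v) for v in raw])   # compacted 1,024-dim features
```

Either `raw` or `compact` can then be attached to the graph as node features in place of the 42-dim one-hot vectors, leaving the rest of the GraphSAGE pipeline unchanged.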
Results & Findings
| Encoding | Dimensionality | Weighted‑Avg F1 |
|---|---|---|
| One‑hot | 42 | 0.8475 |
| GPT (raw) | 1,536 | 0.8621 |
| LLaMA‑2 (raw) | 3,072 | 0.8684 |
| LLaMA‑3 (raw) | 4,096 | 0.8709 |
| LLaMA‑3 (Matryoshka) | 1,024 | 0.8766 |
- Semantic richness matters – Even the smallest raw LLM embedding outperformed one‑hot, confirming that contextual information captured by LLMs helps differentiate closely related sub‑types.
- Compact representations win – The Matryoshka‑compressed LLaMA‑3 vector achieved the highest F1 despite being far smaller than the raw 4,096‑dim vector, highlighting effective dimensionality reduction.
- Robustness across models – All LLM variants consistently beat the baseline, suggesting the approach is model‑agnostic.
Practical Implications
- Plug‑and‑play feature engineering – Developers can replace a simple categorical column with an API call to an LLM, gaining richer features without redesigning the whole pipeline.
- Improved BIM analytics – More accurate sub-type classification makes downstream tasks such as automated clash detection, cost estimation, and facility management more precise.
- Scalable graph‑AI – Compact embeddings keep memory footprints low, making it feasible to run GraphSAGE (or similar GNNs) on large‑scale construction projects or cloud‑based BIM services.
- Cross‑domain transfer – The same encoding strategy could be applied to other domain‑specific taxonomies (e.g., mechanical parts, electrical components) where one‑hot falls short.
- Rapid prototyping – Since the embeddings are generated on‑the‑fly via public LLM APIs, teams can experiment with new semantic vocabularies without waiting for custom label encoders.
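The plug-and-play point interacts with the API-cost caveat: BIM models repeat the same labels many times, so a small on-disk cache keeps each distinct label to one API request. A minimal sketch, where `embed` is a hypothetical deterministic stub standing in for a real provider call:

```python
import hashlib
import json
from pathlib import Path

def embed(label: str) -> list[float]:
    # Hypothetical stub: a real pipeline would call the LLM provider's
    # embeddings API here. A hash-derived vector keeps the sketch offline.
    digest = hashlib.sha256(label.encode()).digest()
    return [b / 255.0 for b in digest[:8]]

class EmbeddingCache:
    """Disk-backed label -> embedding cache; misses trigger one embed() call."""
    def __init__(self, path: str = "embeddings.json"):
        self.path = Path(path)
        self.store = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, label: str) -> list[float]:
        if label not in self.store:
            self.store[label] = embed(label)          # cache miss: single API call
            self.path.write_text(json.dumps(self.store))
        return self.store[label]

cache = EmbeddingCache()
vec = cache.get("exterior wall - insulated concrete")
```

A production version would batch misses and use the provider's batch endpoint, but the structure is the same.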
Limitations & Future Work
- Dependency on LLM APIs – Real‑time embedding generation incurs latency and cost; offline caching or open‑source LLMs may be needed for production.
- Static label focus – The study only encoded object names; richer BIM metadata (geometry, material properties) was not incorporated.
- Generalization to other building types – Experiments were limited to five high‑rise residential models; performance on commercial, infrastructure, or historic BIMs remains untested.
- Dimensionality‑reduction trade‑offs – While Matryoshka worked well here, other reduction techniques (e.g., PCA, autoencoders) could be benchmarked for speed vs. accuracy.
- Explainability – Dense embeddings obscure the exact semantic cues the model uses; future work could explore attention‑based visualizations to aid interpretability for engineers.
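As a starting point for the reduction-technique comparison suggested above, PCA and Matryoshka-style truncation differ in one line each; the data below is random and purely illustrative (real embeddings front-load signal by training, which random data does not capture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 512))  # stand-in for 200 raw label embeddings

def pca_reduce(X: np.ndarray, k: int) -> np.ndarray:
    # Classic PCA via SVD: project centered data onto the top-k principal axes.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def matryoshka_reduce(X: np.ndarray, k: int) -> np.ndarray:
    # Matryoshka-style: simply keep the leading k coordinates of each vector.
    return X[:, :k]

k = 64
X_pca = pca_reduce(X, k)
X_mat = matryoshka_reduce(X, k)
# PCA maximizes retained variance for any fixed data; Matryoshka instead relies
# on the embedding model having concentrated meaning in early coordinates.
var_pca = X_pca.var(axis=0).sum()
var_mat = X_mat.var(axis=0).sum()
```

Benchmarking these (plus an autoencoder) on the actual LLM embeddings would quantify the speed-vs-accuracy trade-off the authors leave open.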
Authors
- Suhyung Jang
- Ghang Lee
- Jaekun Lee
- Hyunjun Lee
Paper Information
- arXiv ID: 2602.15791v1
- Categories: cs.AI, cs.CL
- Published: February 17, 2026