[Paper] Knowledge-Embedded Latent Projection for Robust Representation Learning
Source: arXiv - 2602.16709v1
Overview
The paper introduces Knowledge‑Embedded Latent Projection (KELP), a new way to learn low‑dimensional representations from high‑dimensional, sparse data such as electronic health records (EHRs). By weaving in publicly available semantic embeddings of medical concepts, KELP stabilizes representation learning when the number of patients (rows) is far smaller than the number of features (columns)—a common “imbalanced” regime in healthcare analytics.
Key Contributions
- Semantic regularization: Treats column embeddings as smooth functions of external concept embeddings (e.g., clinical word vectors) using a reproducing‑kernel Hilbert space (RKHS) mapping.
- Two‑step scalable estimator:
- Constructs a semantically guided subspace via kernel PCA on the side information.
- Refines the latent factors with projected gradient descent, keeping computation linear in the number of patients.
- Theoretical guarantees: Derives finite‑sample error bounds that separate statistical error (due to limited data) from approximation error (due to kernel projection), and proves local convergence of the non‑convex optimization.
- Empirical validation: Shows through simulations and a real‑world EHR cohort that KELP outperforms standard latent factor models (e.g., matrix factorization, Poisson PCA) in predictive accuracy and embedding quality.
Methodology
- Problem setting:
- Data matrix X ∈ ℝⁿˣᵖ (n patients, p clinical codes).
- n ≪ p, making classical low‑rank factorization unstable.
- Side information S ∈ ℝᵖˣd provides a d‑dimensional semantic embedding for each code (e.g., embeddings learned from large medical corpora).
- Kernel‑based column mapping:
- Assume each column embedding vⱼ can be expressed as vⱼ = f(sⱼ) where sⱼ is the j‑th row of S and f belongs to an RKHS defined by a kernel K(·,·) (e.g., Gaussian).
- This forces columns that are semantically similar to have similar latent representations, acting as a strong regularizer.
- Two‑step estimation:
- Step 1 – Subspace construction: Perform kernel PCA on S to obtain a low‑dimensional basis Uₖ that captures most of the semantic variance.
- Step 2 – Projected gradient descent: Optimize the latent factor model (e.g., generalized linear model for count data) while constraining column factors to lie in the span of Uₖ. The projection step is cheap because Uₖ is low‑rank.
- Optimization details:
- The objective is non‑convex (product of row and column factors).
- The authors use a projected stochastic gradient scheme with line search, and prove that, starting from a reasonable initialization, the iterates converge to a local optimum that satisfies the statistical error bound.
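The two-step procedure above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under a squared-error loss, not the authors' implementation; the function name `kelp_sketch`, the Gaussian-kernel bandwidth `gamma`, and the step sizes are all assumptions made for the sketch.

```python
import numpy as np

def kelp_sketch(X, S, r=10, gamma=0.1, n_iter=300, lr=5e-3, seed=0):
    """Illustrative two-step KELP-style estimator (squared-error loss).

    X : (n, p) patient-by-code data matrix.
    S : (p, d) side-information embeddings, one row per code.
    Returns row factors Z (n, r) and column factors V (p, r), with V
    constrained to the span of a kernel-PCA basis built from S.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Step 1: kernel PCA on the side information (Gaussian kernel).
    sq = np.sum(S ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * S @ S.T))
    H = np.eye(p) - np.full((p, p), 1.0 / p)   # centering matrix
    _, eigvecs = np.linalg.eigh(H @ K @ H)     # eigenvalues ascending
    U_k = eigvecs[:, -r:]                      # top-r semantic basis (p, r)

    # Step 2: fit X ~ Z @ (U_k @ B).T by gradient descent. Because the
    # column factors are parameterized as V = U_k @ B, every iterate
    # already lies in span(U_k), so the projection step is implicit.
    Z = 0.01 * rng.standard_normal((n, r))
    B = 0.01 * rng.standard_normal((r, r))
    for _ in range(n_iter):
        V = U_k @ B                  # (p, r) column factors
        R = Z @ V.T - X              # residual of the current fit
        Z -= lr * (R @ V)            # gradient step in the row factors
        B -= lr * (U_k.T @ R.T @ Z)  # gradient in B (chain rule via U_k)
    return Z, U_k @ B
```

The paper uses a generalized linear model for count data; swapping the squared-error residual for the corresponding GLM gradient keeps the same two-step structure.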
Results & Findings
| Setting | Baseline (e.g., standard matrix factorization) | KELP | Relative gain |
|---|---|---|---|
| Simulated imbalanced data (n=500, p=10 000) | RMSE = 0.42 | RMSE = 0.28 | 33 % reduction |
| Real EHR cohort (n≈2 000 patients, p≈5 000 codes) | AUC‑ROC = 0.71 (predicting 30‑day readmission) | AUC‑ROC = 0.78 | +7 pts |
| Embedding quality (nearest‑neighbor semantic coherence) | 62 % of top‑5 neighbors share same clinical group | 84 % | +22 pts |
- Statistical error bound: Estimation error scales as O(√(r log p / n) + εₖ), where r is the latent rank and εₖ is the kernel approximation error.
- Approximation trade‑off: Richer kernels reduce εₖ but increase computational cost; a Gaussian kernel bandwidth tuned via cross‑validation provided a good balance.
- Convergence: Projected gradient descent converges within 50–100 iterations, far faster than generic alternating least squares run on the full parameter space.
Practical Implications
- Robust patient phenotyping: Generate stable low‑dimensional patient embeddings even with rare diseases or small trial cohorts, improving downstream clustering or risk‑stratification pipelines.
- Feature reduction for predictive models: Embedding thousands of diagnosis/procedure codes into a compact, semantically guided space speeds up model training (e.g., deep nets, gradient‑boosted trees) and reduces over‑fitting.
- Transferable knowledge: Leverages publicly released medical concept embeddings (e.g., from UMLS, PubMed, or MIMIC‑III), allowing organizations to inject domain knowledge without sharing proprietary patient data.
- Scalable deployment: Two‑step algorithm fits naturally into existing data‑engineering stacks—kernel PCA can be run offline on the side‑information matrix, and the projected gradient step can be parallelized across patient batches.
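As a concrete instance of the feature-reduction use case, the learned column factors can compress raw code counts into dense patient features. A minimal sketch, assuming a fitted (p, r) column-factor matrix `V` from a KELP-style model; the helper name `embed_patients` and the least-squares scoring rule are illustrative, not the paper's exact procedure:

```python
import numpy as np

def embed_patients(X_counts, V):
    """Map raw (n, p) code-count rows to r-dimensional patient features.

    Uses the least-squares row scores X V (V^T V)^{-1}; this is one
    simple scoring rule for new patients, not necessarily the paper's.
    The resulting dense features can feed any downstream classifier.
    """
    return X_counts @ V @ np.linalg.inv(V.T @ V)
```

With orthonormal column factors this reduces to the plain projection `X_counts @ V`; either way, a few dozen dense features replace thousands of sparse code indicators in a downstream model such as a readmission classifier.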
Limitations & Future Work
- Dependence on quality of side information: Noisy or misaligned external embeddings may degrade performance.
- Kernel choice sensitivity: Theoretical bounds assume the true column mapping lies in the chosen RKHS; misspecifying the kernel can increase approximation error.
- Local optimum guarantee: Convergence is proven only to a local stationary point; global optimality remains open.
- Future directions suggested by the authors:
- Extending KELP to handle multi‑modal side information (e.g., lab test embeddings, imaging features).
- Developing adaptive kernel learning to automatically select the best RKHS for a given dataset.
- Investigating privacy‑preserving variants where the side embeddings are encrypted or differentially private.
Authors
- Weijing Tang
- Ming Yuan
- Zongqi Xia
- Tianxi Cai
Paper Information
- arXiv ID: 2602.16709v1
- Categories: cs.LG, math.ST, stat.ME
- Published: February 18, 2026