[Paper] Knowledge-Embedded Latent Projection for Robust Representation Learning
Source: arXiv - 2602.16709v1
Overview
The paper introduces Knowledge‑Embedded Latent Projection (KELP), a new way to learn low‑dimensional representations from high‑dimensional, sparse data such as electronic health records (EHRs). By weaving in publicly available semantic embeddings of medical concepts, KELP stabilizes representation learning when the number of patients (rows) is far smaller than the number of features (columns)—a common “imbalanced” regime in healthcare analytics.
Key Contributions
- Semantic regularization: Treats column embeddings as smooth functions of external concept embeddings (e.g., clinical word vectors) using a reproducing‑kernel Hilbert space (RKHS) mapping.
- Two‑step scalable estimator:
- Constructs a semantically guided subspace via kernel PCA on the side information.
- Refines the latent factors with projected gradient descent, keeping computation linear in the number of patients.
- Theoretical guarantees: Derives finite‑sample error bounds that separate statistical error (due to limited data) from approximation error (due to kernel projection), and proves local convergence of the non‑convex optimization.
- Empirical validation: Shows through simulations and a real‑world EHR cohort that KELP outperforms standard latent factor models (e.g., matrix factorization, Poisson PCA) in predictive accuracy and embedding quality.
Methodology
- Problem setting:
- Data matrix X ∈ ℝⁿˣᵖ (n patients, p clinical codes).
- n ≪ p, making classical low‑rank factorization unstable.
- Side information S ∈ ℝᵖˣd provides a d‑dimensional semantic embedding for each code (e.g., embeddings learned from large medical corpora).
- Kernel‑based column mapping:
- Assume each column embedding vⱼ can be expressed as vⱼ = f(sⱼ) where sⱼ is the j‑th row of S and f belongs to an RKHS defined by a kernel K(·,·) (e.g., Gaussian).
- This forces columns that are semantically similar to have similar latent representations, acting as a strong regularizer.
- Two‑step estimation:
- Step 1 – Subspace construction: Perform kernel PCA on S to obtain a low‑dimensional basis Uₖ that captures most of the semantic variance.
- Step 2 – Projected gradient descent: Optimize the latent factor model (e.g., generalized linear model for count data) while constraining column factors to lie in the span of Uₖ. The projection step is cheap because Uₖ is low‑rank.
- Optimization details:
- The objective is non‑convex (product of row and column factors).
- The authors use a projected stochastic gradient scheme with line search, and prove that, starting from a reasonable initialization, the iterates converge to a local optimum that satisfies the statistical error bound.
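The two-step procedure above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under a squared-error loss, not the authors' implementation; the function name `kelp_sketch`, the Gaussian-kernel bandwidth `gamma`, and the step sizes are all assumptions made for the sketch.

```python
import numpy as np

def kelp_sketch(X, S, r=10, gamma=0.1, n_iter=300, lr=5e-3, seed=0):
    """Illustrative two-step KELP-style estimator (squared-error loss).

    X : (n, p) patient-by-code data matrix.
    S : (p, d) side-information embeddings, one row per code.
    Returns row factors Z (n, r) and column factors V (p, r), with V
    constrained to the span of a kernel-PCA basis built from S.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape

    # Step 1: kernel PCA on the side information (Gaussian kernel).
    sq = np.sum(S ** 2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * S @ S.T))
    H = np.eye(p) - np.full((p, p), 1.0 / p)   # centering matrix
    _, eigvecs = np.linalg.eigh(H @ K @ H)     # eigenvalues ascending
    U_k = eigvecs[:, -r:]                      # top-r semantic basis (p, r)

    # Step 2: fit X ~ Z @ (U_k @ B).T by gradient descent. Because the
    # column factors are parameterized as V = U_k @ B, every iterate
    # already lies in span(U_k), so the projection step is implicit.
    Z = 0.01 * rng.standard_normal((n, r))
    B = 0.01 * rng.standard_normal((r, r))
    for _ in range(n_iter):
        V = U_k @ B                  # (p, r) column factors
        R = Z @ V.T - X              # residual of the current fit
        Z -= lr * (R @ V)            # gradient step in the row factors
        B -= lr * (U_k.T @ R.T @ Z)  # gradient in B (chain rule via U_k)
    return Z, U_k @ B
```

The paper uses a generalized linear model for count data; swapping the squared-error residual for the corresponding GLM gradient keeps the same two-step structure.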
Results & Findings
| Setting | Baseline (e.g., standard matrix factorization) | KELP | Relative gain |
|---|---|---|---|
| Simulated imbalanced data (n=500, p=10 000) | RMSE = 0.42 | RMSE = 0.28 | 33 % reduction |
| Real EHR cohort (n≈2 000 patients, p≈5 000 codes) | AUC‑ROC = 0.71 (predicting 30‑day readmission) | AUC‑ROC = 0.78 | +7 pts |
| Embedding quality (nearest‑neighbor semantic coherence) | 62 % of top‑5 neighbors share same clinical group | 84 % | +22 pts |
- Statistical error bound: Estimation error scales as O(√(r log p / n) + εₖ), where r is the latent rank and εₖ is the kernel approximation error.
- Approximation trade‑off: Richer kernels reduce εₖ but increase computational cost; a Gaussian kernel bandwidth tuned via cross‑validation provided a good balance.
- Convergence: Projected gradient descent converges within 50–100 iterations, far faster than generic alternating least squares run on the full parameter space.
Practical Implications
- Robust patient phenotyping: Generate stable low‑dimensional patient embeddings even with rare diseases or small trial cohorts, improving downstream clustering or risk‑stratification pipelines.
- Feature reduction for predictive models: Embedding thousands of diagnosis/procedure codes into a compact, semantically guided space speeds up model training (e.g., deep nets, gradient‑boosted trees) and reduces over‑fitting.
- Transferable knowledge: Leverages publicly released medical concept embeddings (e.g., from UMLS, PubMed, or MIMIC‑III), allowing organizations to inject domain knowledge without sharing proprietary patient data.
- Scalable deployment: Two‑step algorithm fits naturally into existing data‑engineering stacks—kernel PCA can be run offline on the side‑information matrix, and the projected gradient step can be parallelized across patient batches.
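As a concrete instance of the feature-reduction use case, the learned column factors can compress raw code counts into dense patient features. A minimal sketch, assuming a fitted (p, r) column-factor matrix `V` from a KELP-style model; the helper name `embed_patients` and the least-squares scoring rule are illustrative, not the paper's exact procedure:

```python
import numpy as np

def embed_patients(X_counts, V):
    """Map raw (n, p) code-count rows to r-dimensional patient features.

    Uses the least-squares row scores X V (V^T V)^{-1}; this is one
    simple scoring rule for new patients, not necessarily the paper's.
    The resulting dense features can feed any downstream classifier.
    """
    return X_counts @ V @ np.linalg.inv(V.T @ V)
```

With orthonormal column factors this reduces to the plain projection `X_counts @ V`; either way, a few dozen dense features replace thousands of sparse code indicators in a downstream model such as a readmission classifier.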
Limitations & Future Work
- Dependence on quality of side information: Noisy or misaligned external embeddings may degrade performance.
- Kernel choice sensitivity: Theoretical bounds assume the true column mapping lies in the chosen RKHS; misspecifying the kernel can increase approximation error.
- Local optimum guarantee: Convergence is proven only to a local stationary point; global optimality remains open.
- Future directions suggested by the authors:
- Extending KELP to handle multi‑modal side information (e.g., lab test embeddings, imaging features).
- Developing adaptive kernel learning to automatically select the best RKHS for a given dataset.
- Investigating privacy‑preserving variants where the side embeddings are encrypted or differentially private.
Authors
- Weijing Tang
- Ming Yuan
- Zongqi Xia
- Tianxi Cai
Paper Information
- arXiv ID: 2602.16709v1
- Categories: cs.LG, math.ST, stat.ME
- Published: February 18, 2026