[Paper] Rethinking Genomic Modeling Through Optical Character Recognition

Published: February 2, 2026 at 07:12 AM EST
4 min read
Source: arXiv - 2602.02014v1

Overview

The paper introduces OpticalDNA, a novel way to model genomic data by treating DNA sequences like images of text rather than long strings of characters. By rendering DNA into visual layouts and applying OCR‑style vision‑language models, the authors achieve far higher efficiency and accuracy on large‑scale genomic tasks, cutting the effective token count by up to 20× while still beating heavyweight language‑model baselines.

Key Contributions

  • Vision‑first genomic representation – DNA is visualized as structured “documents” and encoded with a dedicated visual DNA encoder, moving away from the traditional 1‑D token stream.
  • Compact, reconstructible visual tokens – The encoder learns a highly compressed token set that can be decoded back to the original sequence with negligible loss, enabling aggressive token‑budget reductions.
  • Prompt‑conditioned multimodal objectives – Four core tasks (reading, region grounding, subsequence retrieval, masked span completion) are framed as OCR‑style prompts, encouraging the model to understand both content and layout.
  • Parameter‑efficient fine‑tuning – Only 256 k trainable parameters are needed to adapt the large backbone, making the approach practical for labs with limited compute.
  • State‑of‑the‑art performance on long genomes – On benchmarks with sequences up to 450 k bases, OpticalDNA outperforms prior models while using ~20× fewer effective tokens and up to 985× fewer activated parameters.

Methodology

  1. Rendering DNA as an image – Raw nucleotide strings are laid out on a canvas using a fixed‑width font, optionally adding visual cues (e.g., color‑coding for gene annotations, line breaks for regulatory regions). This yields a high‑resolution image that preserves the natural “document” structure of genomes; a minimal rendering sketch follows this list.
  2. Visual DNA Encoder – A vision transformer (ViT) processes the image, extracting patch embeddings that become the visual tokens. A lightweight reconstruction head ensures these tokens can be decoded back into the original sequence, making the compression effectively lossless.
  3. Document Decoder (Vision‑Language Model) – A transformer decoder, pretrained on OCR and document‑understanding tasks, consumes the visual tokens together with textual prompts (e.g., “Find the promoter region of gene X”). The decoder outputs either text (nucleotide subsequences) or positional information (grounded regions).
  4. Prompt‑conditioned training objectives
    • Reading – Predict the full nucleotide string from the visual tokens (standard reconstruction).
    • Region Grounding – Given a gene name, output bounding boxes that locate the gene in the image.
    • Subsequence Retrieval – Retrieve a specific subsequence based on a textual query.
    • Masked Span Completion – Mask random spans in the visual layout and ask the model to fill them, encouraging contextual reasoning.
  5. Fine‑tuning strategy – The backbone weights stay frozen; only a small adapter layer (≈256 k parameters) is trained on each downstream genomic task, dramatically reducing compute and memory footprints (a sketch of this adapter‑only setup appears after the rendering example below).
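
As a concrete illustration of step 1, the sketch below renders a nucleotide string onto a fixed‑width grid with Pillow. It is only a stand‑in for the paper's renderer: the canvas geometry, cell size, font, and per‑base colour map are all assumptions.

```python
# Minimal sketch of step 1: render a nucleotide string as a "document" image.
# Canvas geometry, cell size, font, and colour map are illustrative assumptions,
# not the paper's actual rendering parameters.
from PIL import Image, ImageDraw, ImageFont

BASE_COLORS = {
    "A": (0, 128, 0),     # green
    "C": (0, 0, 200),     # blue
    "G": (200, 120, 0),   # orange
    "T": (200, 0, 0),     # red
    "N": (128, 128, 128), # grey for ambiguous bases
}

def render_dna(seq: str, bases_per_line: int = 128, cell: int = 10) -> Image.Image:
    """Lay the sequence out row by row, one coloured glyph per base."""
    n_lines = (len(seq) + bases_per_line - 1) // bases_per_line
    img = Image.new("RGB", (bases_per_line * cell, n_lines * cell), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a true fixed-width font in a real pipeline
    for i, base in enumerate(seq):
        x = (i % bases_per_line) * cell
        y = (i // bases_per_line) * cell
        draw.text((x, y), base, fill=BASE_COLORS.get(base, (0, 0, 0)), font=font)
    return img

if __name__ == "__main__":
    render_dna("ACGT" * 64).save("dna_document.png")  # input image for the visual encoder
```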
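
Step 5's parameter‑efficient fine‑tuning can likewise be sketched as a residual bottleneck adapter trained on top of a frozen backbone. The stand‑in backbone, hidden size, bottleneck width, and task head below are illustrative choices, picked only so the trainable budget lands near the reported ≈256 k parameters.

```python
# Sketch of step 5: freeze the backbone, train only a small adapter plus task head.
# The stand-in backbone and all sizes below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Residual bottleneck adapter applied to the frozen backbone's outputs."""
    def __init__(self, hidden: int = 768, bottleneck: int = 164):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, x):
        return x + self.up(F.gelu(self.down(x)))

# Stand-in for the pretrained OCR-style backbone described in the paper.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2
)
for p in backbone.parameters():
    p.requires_grad = False  # backbone stays frozen

adapter = Adapter()
head = nn.Linear(768, 2)  # tiny task head, e.g. binary enhancer/promoter prediction

trainable = list(adapter.parameters()) + list(head.parameters())
print(sum(p.numel() for p in trainable))  # ~254 k, on the order of the reported budget

optimizer = torch.optim.AdamW(trainable, lr=1e-4)
visual_tokens = torch.randn(4, 128, 768)   # batch of visual tokens (dummy data)
labels = torch.randint(0, 2, (4,))
logits = head(adapter(backbone(visual_tokens)).mean(dim=1))
loss = F.cross_entropy(logits, labels)
loss.backward()                            # gradients flow only into adapter + head
optimizer.step()
```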

Results & Findings

| Benchmark | Sequence Length | Effective Tokens | Relative Performance vs. LLM‑style baselines |
| --- | --- | --- | --- |
| Gene‑annotation classification | ≤ 100 k bp | 5 k tokens | +7.2 % F1 |
| Long‑range enhancer‑promoter prediction | 250 k bp | 12 k tokens | +9.5 % AUROC |
| Whole‑genome variant calling (simulated) | 450 k bp | 22 k tokens | +5.8 % accuracy |
| Subsequence retrieval (prompt‑based) | 300 k bp | 15 k tokens | +12.3 % exact‑match |

  • Token efficiency: OpticalDNA uses ~20× fewer tokens than a comparable 1‑D transformer while preserving (or improving) downstream accuracy (a quick sanity check of this ratio follows below).
  • Parameter efficiency: The model matches or exceeds baselines that have up to 985× more activated parameters during inference.
  • Scalability: Performance gains grow with sequence length, confirming that the visual layout mitigates the “background noise” problem of long, low‑information genomic regions.
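
The ~20× token‑efficiency figure can be sanity‑checked directly from the table above, assuming a naive 1‑D baseline that spends one token per base (k‑mer or BPE tokenizers would shift the exact ratio):

```python
# Back-of-the-envelope check of the ~20x token reduction, assuming one token per base
# for the 1-D baseline (k-mer/BPE tokenizers would change the exact numbers).
benchmarks = {
    "gene-annotation classification": (100_000, 5_000),
    "enhancer-promoter prediction":   (250_000, 12_000),
    "variant calling (simulated)":    (450_000, 22_000),
    "subsequence retrieval":          (300_000, 15_000),
}
for name, (bases, visual_tokens) in benchmarks.items():
    print(f"{name}: {bases / visual_tokens:.1f}x fewer effective tokens")
# -> roughly 20x in every setting, matching the headline claim
```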

Practical Implications

  • Faster inference for large genomes – Bioinformatics pipelines (e.g., variant calling, gene annotation) can process whole chromosomes in a fraction of the time and memory required by current language‑model approaches.
  • Edge‑device deployment – The compact token representation and tiny adapter make it feasible to run genomic analyses on modest GPUs or even specialized ASICs in clinical labs.
  • Prompt‑driven genomics – Researchers can ask natural‑language questions (“Show me the CpG islands near gene TP53”) and receive precise, grounded answers without writing custom scripts.
  • Cross‑modal integration – Because the backbone is OCR‑ready, future extensions could ingest mixed data (e.g., gel images, microscopy slides) alongside DNA, enabling richer multi‑omics diagnostics.
  • Cost‑effective model updates – Adding new annotations or organism‑specific knowledge only requires fine‑tuning the small adapter, avoiding expensive full‑model retraining.

Limitations & Future Work

  • Visualization overhead – Converting DNA to images adds a preprocessing step and may be less straightforward for streaming or real‑time data sources.
  • Resolution constraints – Extremely long sequences still need to be tiled into multiple images; optimal tiling strategies remain an open question.
  • Domain‑specific tokenization – While the visual tokens are compact, they are not yet biologically interpretable (e.g., they don’t map directly to motifs), which could limit explainability.
  • Generalization to non‑model organisms – Current experiments focus on well‑annotated human genome datasets; performance on highly repetitive or poorly annotated genomes needs validation.

Future research directions include adaptive tiling algorithms, hybrid models that combine visual tokens with traditional k‑mer embeddings, and extending the prompt language to cover epigenetic and 3‑D chromatin structure queries.

Authors

  • Hongxin Xiang
  • Pengsen Ma
  • Yunkang Cao
  • Di Yu
  • Haowen Chen
  • Xinyu Yang
  • Xiangxiang Zeng

Paper Information

  • arXiv ID: 2602.02014v1
  • Categories: cs.CV, cs.AI, cs.CL, cs.LG
  • Published: February 2, 2026