[Paper] Rethinking Genomic Modeling Through Optical Character Recognition

Published: February 2, 2026 at 07:12 AM EST
4 min read
Source: arXiv - 2602.02014v1

Overview

The paper introduces OpticalDNA, a novel way to model genomic data by treating DNA sequences like images of text rather than long strings of characters. By rendering DNA into visual layouts and applying OCR‑style vision‑language models, the authors achieve far higher efficiency and accuracy on large‑scale genomic tasks, cutting the effective token count by up to 20× while still beating heavyweight language‑model baselines.

Key Contributions

  • Vision‑first genomic representation – DNA is visualized as structured “documents” and encoded with a dedicated visual DNA encoder, moving away from the traditional 1‑D token stream.
  • Compact, reconstructible visual tokens – The encoder learns a highly compressed token set that can be decoded back to the original sequence with negligible loss, enabling aggressive token‑budget reductions.
  • Prompt‑conditioned multimodal objectives – Four core tasks (reading, region grounding, subsequence retrieval, masked span completion) are framed as OCR‑style prompts, encouraging the model to understand both content and layout.
  • Parameter‑efficient fine‑tuning – Only 256 k trainable parameters are needed to adapt the large backbone, making the approach practical for labs with limited compute.
  • State‑of‑the‑art performance on long genomes – On benchmarks with sequences up to 450 k bases, OpticalDNA outperforms prior models while using ~20× fewer effective tokens and up to 985× fewer activated parameters.

Methodology

  1. Rendering DNA as an image – Raw nucleotide strings are laid out on a canvas using a fixed‑width font, optionally adding visual cues (e.g., color‑coding for gene annotations, line breaks for regulatory regions). This yields a high‑resolution image that preserves the natural “document” structure of genomes; a minimal rendering sketch follows this list.
  2. Visual DNA Encoder – A vision transformer (ViT) processes the image, extracting patch embeddings that become the visual tokens. A lightweight reconstruction head ensures these tokens can be decoded back into the original sequence, making the compression effectively lossless.
  3. Document Decoder (Vision‑Language Model) – A transformer decoder, pretrained on OCR and document‑understanding tasks, consumes the visual tokens together with textual prompts (e.g., “Find the promoter region of gene X”). The decoder outputs either text (nucleotide subsequences) or positional information (grounded regions).
  4. Prompt‑conditioned training objectives
    • Reading – Predict the full nucleotide string from the visual tokens (standard reconstruction).
    • Region Grounding – Given a gene name, output bounding boxes that locate the gene in the image.
    • Subsequence Retrieval – Retrieve a specific subsequence based on a textual query.
    • Masked Span Completion – Mask random spans in the visual layout and ask the model to fill them, encouraging contextual reasoning.
  5. Fine‑tuning strategy – The backbone weights stay frozen; only a small adapter layer (≈256 k parameters) is trained on each downstream genomic task, dramatically reducing compute and memory footprints (a sketch of this adapter‑only setup appears after the rendering example below).
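
As a concrete illustration of step 1, the sketch below renders a nucleotide string onto a fixed‑width grid with Pillow. It is only a stand‑in for the paper's renderer: the canvas geometry, cell size, font, and per‑base colour map are all assumptions.

```python
# Minimal sketch of step 1: render a nucleotide string as a "document" image.
# Canvas geometry, cell size, font, and colour map are illustrative assumptions,
# not the paper's actual rendering parameters.
from PIL import Image, ImageDraw, ImageFont

BASE_COLORS = {
    "A": (0, 128, 0),     # green
    "C": (0, 0, 200),     # blue
    "G": (200, 120, 0),   # orange
    "T": (200, 0, 0),     # red
    "N": (128, 128, 128), # grey for ambiguous bases
}

def render_dna(seq: str, bases_per_line: int = 128, cell: int = 10) -> Image.Image:
    """Lay the sequence out row by row, one coloured glyph per base."""
    n_lines = (len(seq) + bases_per_line - 1) // bases_per_line
    img = Image.new("RGB", (bases_per_line * cell, n_lines * cell), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a true fixed-width font in a real pipeline
    for i, base in enumerate(seq):
        x = (i % bases_per_line) * cell
        y = (i // bases_per_line) * cell
        draw.text((x, y), base, fill=BASE_COLORS.get(base, (0, 0, 0)), font=font)
    return img

if __name__ == "__main__":
    render_dna("ACGT" * 64).save("dna_document.png")  # input image for the visual encoder
```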
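
Step 5's parameter‑efficient fine‑tuning can likewise be sketched as a residual bottleneck adapter trained on top of a frozen backbone. The stand‑in backbone, hidden size, bottleneck width, and task head below are illustrative choices, picked only so the trainable budget lands near the reported ≈256 k parameters.

```python
# Sketch of step 5: freeze the backbone, train only a small adapter plus task head.
# The stand-in backbone and all sizes below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Residual bottleneck adapter applied to the frozen backbone's outputs."""
    def __init__(self, hidden: int = 768, bottleneck: int = 164):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, x):
        return x + self.up(F.gelu(self.down(x)))

# Stand-in for the pretrained OCR-style backbone described in the paper.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2
)
for p in backbone.parameters():
    p.requires_grad = False  # backbone stays frozen

adapter = Adapter()
head = nn.Linear(768, 2)  # tiny task head, e.g. binary enhancer/promoter prediction

trainable = list(adapter.parameters()) + list(head.parameters())
print(sum(p.numel() for p in trainable))  # ~254 k, on the order of the reported budget

optimizer = torch.optim.AdamW(trainable, lr=1e-4)
visual_tokens = torch.randn(4, 128, 768)   # batch of visual tokens (dummy data)
labels = torch.randint(0, 2, (4,))
logits = head(adapter(backbone(visual_tokens)).mean(dim=1))
loss = F.cross_entropy(logits, labels)
loss.backward()                            # gradients flow only into adapter + head
optimizer.step()
```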

Results & Findings

| Benchmark | Sequence Length | Effective Tokens | Relative Performance vs. LLM‑style baselines |
| --- | --- | --- | --- |
| Gene‑annotation classification | ≤ 100 k bp | 5 k tokens | +7.2 % F1 |
| Long‑range enhancer‑promoter prediction | 250 k bp | 12 k tokens | +9.5 % AUROC |
| Whole‑genome variant calling (simulated) | 450 k bp | 22 k tokens | +5.8 % accuracy |
| Subsequence retrieval (prompt‑based) | 300 k bp | 15 k tokens | +12.3 % exact‑match |

  • Token efficiency: OpticalDNA uses ~20× fewer tokens than a comparable 1‑D transformer while preserving (or improving) downstream accuracy (a quick sanity check of this ratio follows below).
  • Parameter efficiency: The model matches or exceeds baselines that have up to 985× more activated parameters during inference.
  • Scalability: Performance gains grow with sequence length, confirming that the visual layout mitigates the “background noise” problem of long, low‑information genomic regions.
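
The ~20× token‑efficiency figure can be sanity‑checked directly from the table above, assuming a naive 1‑D baseline that spends one token per base (k‑mer or BPE tokenizers would shift the exact ratio):

```python
# Back-of-the-envelope check of the ~20x token reduction, assuming one token per base
# for the 1-D baseline (k-mer/BPE tokenizers would change the exact numbers).
benchmarks = {
    "gene-annotation classification": (100_000, 5_000),
    "enhancer-promoter prediction":   (250_000, 12_000),
    "variant calling (simulated)":    (450_000, 22_000),
    "subsequence retrieval":          (300_000, 15_000),
}
for name, (bases, visual_tokens) in benchmarks.items():
    print(f"{name}: {bases / visual_tokens:.1f}x fewer effective tokens")
# -> roughly 20x in every setting, matching the headline claim
```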

Practical Implications

  • Faster inference for large genomes – Bioinformatics pipelines (e.g., variant calling, gene annotation) can process whole chromosomes in a fraction of the time and memory required by current language‑model approaches.
  • Edge‑device deployment – The compact token representation and tiny adapter make it feasible to run genomic analyses on modest GPUs or even specialized ASICs in clinical labs.
  • Prompt‑driven genomics – Researchers can ask natural‑language questions (“Show me the CpG islands near gene TP53”) and receive precise, grounded answers without writing custom scripts.
  • Cross‑modal integration – Because the backbone is OCR‑ready, future extensions could ingest mixed data (e.g., gel images, microscopy slides) alongside DNA, enabling richer multi‑omics diagnostics.
  • Cost‑effective model updates – Adding new annotations or organism‑specific knowledge only requires fine‑tuning the small adapter, avoiding expensive full‑model retraining.

Limitations & Future Work

  • Visualization overhead – Converting DNA to images adds a preprocessing step and may be less straightforward for streaming or real‑time data sources.
  • Resolution constraints – Extremely long sequences still need to be tiled into multiple images; optimal tiling strategies remain an open question.
  • Domain‑specific tokenization – While the visual tokens are compact, they are not yet biologically interpretable (e.g., they don’t map directly to motifs), which could limit explainability.
  • Generalization to non‑model organisms – Current experiments focus on well‑annotated human genome datasets; performance on highly repetitive or poorly annotated genomes needs validation.

Future research directions include adaptive tiling algorithms, hybrid models that combine visual tokens with traditional k‑mer embeddings, and extending the prompt language to cover epigenetic and 3‑D chromatin structure queries.

Authors

  • Hongxin Xiang
  • Pengsen Ma
  • Yunkang Cao
  • Di Yu
  • Haowen Chen
  • Xinyu Yang
  • Xiangxiang Zeng

Paper Information

  • arXiv ID: 2602.02014v1
  • Categories: cs.CV, cs.AI, cs.CL, cs.LG
  • Published: February 2, 2026