[Paper] Co-Evolution of Types and Dependencies: Towards Repository-Level Type Inference for Python Code
Source: arXiv - 2512.21591v1
Overview
The paper introduces Co‑Evolution of Types and Dependencies (CoTyDe), a technique that uses large language models (LLMs) to infer types across an entire Python codebase, not just isolated files or functions. By modeling how objects and their type relationships evolve together, CoTyDe substantially improves the accuracy of repository‑level type annotation, a long‑standing pain point for large‑scale Python projects.
Key Contributions
- Entity Dependency Graph (EDG): A novel graph representation that captures objects, functions, and their cross‑module type dependencies throughout a repository.
- Iterative Co‑evolution Inference: Types and dependencies are refined together in multiple passes, allowing earlier guesses to inform later ones and vice‑versa.
- Type‑Checker‑in‑the‑Loop: An integrated static type checker validates each inference step, automatically correcting mistakes and preventing error propagation.
- Empirical Validation: Evaluation on 12 real‑world Python repositories shows a 27 % relative improvement in TypeSim and a 40 % relative improvement in TypeExact over the strongest prior tool, while eliminating 92.7 % of newly introduced type errors.
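To make the EDG idea more concrete, here is a minimal sketch of such a graph for a single module, built with Python's standard `ast` module. The function name `build_edg`, the edge labels, and the restriction to top-level classes and functions are all assumptions made for illustration; the paper's actual graph is repository-wide and richer than this.

```python
import ast

def build_edg(source: str) -> dict[str, set[tuple[str, str]]]:
    """Return {entity: {(relation, target), ...}} for one module's source.

    Hypothetical simplification: only top-level classes and functions become
    nodes, and only base classes and direct-name calls become edges.
    """
    tree = ast.parse(source)
    edges: dict[str, set[tuple[str, str]]] = {}
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            targets = edges.setdefault(node.name, set())
            for base in node.bases:
                if isinstance(base, ast.Name):
                    targets.add(("inherits", base.id))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            targets = edges.setdefault(node.name, set())
            for sub in ast.walk(node):
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    targets.add(("calls", sub.func.id))
    return edges

SAMPLE = '''
class Base: ...
class Child(Base): ...
def make(): return Child()
def use(): return make()
'''

edg = build_edg(SAMPLE)
for entity, targets in edg.items():
    print(entity, sorted(targets))
```

In this toy graph, inferring the return type of `use` depends on `make`, which in turn depends on `Child`. That chain is the kind of cross-entity dependency the EDG is meant to expose.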
Methodology
- Graph Construction:
  - Parse the whole repository to extract entities (classes, functions, variables).
  - Build the EDG where nodes are entities and edges encode “uses”, “inherits”, or “calls” relationships, enriched with any existing type hints.
- LLM‑Powered Inference Loop:
  - Feed each node (and its local graph context) to a pre‑trained LLM (e.g., GPT‑4) that proposes a candidate type.
  - Update the node’s type annotation in the EDG.
- Co‑evolution Cycle:
  - After a round of LLM predictions, run a static type checker (e.g., mypy) on the partially annotated code.
  - The checker reports conflicts; these are fed back to the LLM as corrective prompts so that it revises the problematic nodes.
  - Repeat until the graph stabilizes (no new conflicts) or a maximum iteration count is reached.
- Final Validation:
  - Run a full repository‑wide type check, then compute the final TypeSim (semantic similarity to ground‑truth types) and TypeExact (exact match) scores.
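The propose, check, and revise cycle described above can be sketched as a small loop. This is an illustration under stated assumptions, not the authors' implementation: `co_evolve`, `toy_propose`, and `toy_check` are hypothetical names, the LLM and the type checker are replaced by deterministic stand-ins, and the checker stand-in simply knows the expected types so that the example terminates.

```python
from typing import Callable

# Hypothetical interfaces; the paper does not specify these signatures.
Propose = Callable[[str, dict[str, str], list[str]], str]  # entity, types, feedback -> type
Check = Callable[[dict[str, str]], dict[str, list[str]]]   # types -> {entity: messages}

def co_evolve(entities: list[str], propose: Propose, check: Check,
              max_iters: int = 5) -> tuple[dict[str, str], int]:
    """Iterate proposals and checks until no conflicts remain or the cap is hit."""
    types: dict[str, str] = {}
    for entity in entities:                      # initial inference pass
        types[entity] = propose(entity, types, [])
    for round_no in range(1, max_iters + 1):
        conflicts = check(types)                 # checker-in-the-loop step
        if not conflicts:
            return types, round_no               # graph has stabilized
        for entity, messages in conflicts.items():
            types[entity] = propose(entity, types, messages)
    return types, max_iters

def toy_propose(entity: str, types: dict[str, str], feedback: list[str]) -> str:
    # Stand-in for an LLM call: guess "Any" first, then follow the checker's hint.
    if feedback:
        return feedback[-1].split("expected ")[-1]
    return "Any"

def toy_check(types: dict[str, str]) -> dict[str, list[str]]:
    # Stand-in for a static type checker that reports one message per conflict.
    expected = {"parse": "int", "render": "str"}
    return {e: [f"expected {expected[e]}"] for e, t in types.items() if t != expected[e]}

final_types, rounds = co_evolve(["parse", "render"], toy_propose, toy_check)
print(final_types, rounds)
```

The `max_iters` cap mirrors the maximum iteration count in the description. With a real LLM there is no guarantee that the conflicts ever disappear, so the cap is what bounds the cost.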
Results & Findings
| Metric | CoTyDe | Best Baseline |
|---|---|---|
| TypeSim | 0.89 | 0.70 |
| TypeExact | 0.84 | 0.60 |
| New Type Errors Introduced | 7.3 % (i.e., 92.7 % removed) | 30 %+ |
- The iterative co‑evolution reduces cascading errors: each correction narrows the search space for subsequent inferences.
- The EDG enables the LLM to reason about global relationships (e.g., a class used across many modules) rather than isolated snippets, which accounts for the large performance jump.
- Runtime overhead is modest: on average, a 500‑file repository is processed in ~15 minutes on a single GPU, making the approach feasible for CI pipelines.
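As a quick sanity check, the relative gains quoted for TypeSim and TypeExact can be recomputed from the scores in the table. The helper name `relative_gain` is only for illustration.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100

typesim_gain = relative_gain(0.89, 0.70)    # CoTyDe vs. best baseline
typeexact_gain = relative_gain(0.84, 0.60)

print(f"TypeSim: +{typesim_gain:.1f} %, TypeExact: +{typeexact_gain:.1f} %")
```

Both figures are relative improvements over the baseline score, not absolute differences. The absolute gaps are 0.19 and 0.24 respectively.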
Practical Implications
- Automated Annotation for Legacy Code: Teams can run CoTyDe on existing monoliths to generate high‑quality type hints, unlocking static analysis, IDE autocompletion, and safer refactoring.
- CI/CD Integration: Because the tool produces a type‑checker‑validated output, it can be added as a gate in CI pipelines to enforce type‑safety without manual review.
- Improved Tooling Ecosystem: IDEs and linters can consume the generated stubs to provide better diagnostics, reducing the “dynamic‑typing surprise” that often leads to runtime crashes.
- Facilitates Migration to Typed Python: Projects aiming to adopt typing‑heavy codebases (e.g., for mypy strict mode or Pyright) get a solid starting point, cutting migration effort by an order of magnitude.
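A CI gate of the kind mentioned above can be as small as a script that runs the type checker and fails the build on a non-zero exit status. The sketch below assumes mypy is installed in the CI environment and that the code lives under `src/`. The function name `type_check_gate` and the path are hypothetical, and nothing here is specific to CoTyDe's output format.

```python
import subprocess
import sys

def type_check_gate(command: list[str]) -> bool:
    """Run a checker command and return True only if it exits with status 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the checker's report so the CI log shows what failed.
        print(result.stdout, end="")
    return result.returncode == 0

# Assumed invocation: mypy in strict mode on a hypothetical src/ directory.
MYPY_COMMAND = [sys.executable, "-m", "mypy", "--strict", "src/"]
```

In a pipeline step, the script would call `type_check_gate(MYPY_COMMAND)` and exit with a non-zero status when it returns `False`. Taking the command as a parameter makes it possible to swap in Pyright or another checker without changing the gate.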
Limitations & Future Work
- LLM Dependency: The quality of inferred types hinges on the underlying LLM; smaller or open‑source models may not match the reported gains.
- Scalability to Very Large Repos: While 15 minutes is acceptable for medium‑size codebases, repositories with tens of thousands of files may need graph partitioning or distributed inference.
- Handling Dynamic Metaprogramming: Heavy use of exec, eval, or runtime attribute injection remains challenging for static graph construction.
- Future Directions: The authors plan to (1) explore model‑agnostic prompting strategies to reduce reliance on proprietary LLMs, (2) integrate incremental graph updates for continuous development, and (3) extend the EDG to capture runtime‑generated types via hybrid static‑dynamic analysis.
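To gauge how much the dynamic metaprogramming limitation might affect a given codebase, one option is to scan for the constructs that static graph construction cannot see through. The following is a rough sketch, not part of the paper: `find_dynamic_sites` is a hypothetical helper, and treating `exec`, `eval`, and `setattr` as the set of problematic calls is an assumption.

```python
import ast

# Assumed set of calls that hide types from a purely static analysis.
DYNAMIC_CALLS = {"exec", "eval", "setattr"}

def find_dynamic_sites(source: str) -> list[tuple[int, str]]:
    """Return (line number, call name) for each direct call to a dynamic builtin."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in DYNAMIC_CALLS):
            sites.append((node.lineno, node.func.id))
    return sorted(sites)

SAMPLE = "x = eval('1 + 1')\nsetattr(obj, 'name', x)\nprint(x)\n"
sites = find_dynamic_sites(SAMPLE)
print(sites)
```

The scan only parses the source and never executes it, so it is safe to run on untrusted code. It misses aliased calls such as `e = eval; e(...)`, which is one small example of why these constructs are hard to handle statically.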
Authors
- Shuo Sun
- Shixin Zhang
- Jiwei Yan
- Jun Yan
- Jian Zhang
Paper Information
- arXiv ID: 2512.21591v1
- Categories: cs.SE
- Published: December 25, 2025