[Paper] From Obfuscated to Obvious: A Comprehensive JavaScript Deobfuscation Tool for Security Analysis
Source: arXiv - 2512.14070v1
Overview
JavaScript’s ubiquity on the web makes it a prime target for attackers who hide malicious payloads behind layers of obfuscation. The paper From Obfuscated to Obvious introduces JSIMPLIFIER, a deobfuscation framework that addresses three shortcomings of existing tools (limited input handling, narrow technique coverage, and unreadable output) by combining static analysis, dynamic tracing, and Large Language Model (LLM)‑driven identifier renaming. The authors also release what they describe as the largest real‑world dataset of obfuscated JavaScript to date, giving the community a new benchmark.
Key Contributions
- JSIMPLIFIER pipeline: a four‑stage process (pre‑processing → AST‑based static analysis → dynamic execution tracing → LLM‑enhanced renaming).
- Comprehensive coverage: handles 20 distinct obfuscation techniques and processes 100 % of inputs, regardless of format (minified, packed, or mixed).
- Multi‑dimensional evaluation metrics: blends control‑flow/data‑flow analysis, code‑complexity reduction, entropy measurement, and LLM‑based readability scores.
- Largest public dataset: 44,421 real‑world samples (23,212 malicious, 21,209 benign) released under an open‑source license.
- Empirical superiority: achieves 88.2 % reduction in code complexity and >4× readability improvement compared with the best prior tools, while maintaining 100 % correctness on curated test subsets.
Methodology
- Pre‑processing – Normalizes input (e.g., handling different encodings, stripping comments, detecting embedded resources) to create a uniform code base.
- AST‑based static analysis – Parses the JavaScript into an Abstract Syntax Tree, then applies pattern‑matching and data‑flow analyses to identify and simplify typical obfuscation constructs (string encoders, control‑flow flattening, dead code).
- Dynamic execution tracing – Executes the code in a sandboxed environment (Node.js + Chrome V8) while recording runtime values, branch outcomes, and side‑effects. This step resolves constructs that are impossible to simplify statically (e.g., runtime‑generated code via eval).
- LLM‑enhanced identifier renaming – Feeds the partially deobfuscated code to a fine‑tuned LLM (e.g., GPT‑4) that suggests human‑readable variable/function names based on context, usage patterns, and common naming conventions. The suggestions are then validated against the control‑flow graph to avoid breaking semantics.
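To make the static stage more concrete, here is a minimal sketch of two simplifications of the kind it performs: decoding hex escapes and folding adjacent string literals. This is not JSIMPLIFIER's code. The function names are hypothetical, and the sketch uses regular expressions on raw text purely for brevity, whereas the paper describes working on an AST. A regex cannot tell a string literal from a comment or a regex literal, and it ignores operator precedence, so treat this as an illustration only.

```python
import re

# Hypothetical helpers for illustration; not taken from the paper.
HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")
# Two double-quoted literals joined by '+', with no quotes or backslashes inside.
CONCAT = re.compile(r'"([^"\\]*)"\s*\+\s*"([^"\\]*)"')


def decode_hex_escapes(source: str) -> str:
    """Replace \\xNN escapes with the character they encode when that is safe."""
    def repl(match):
        ch = chr(int(match.group(1), 16))
        # Keep quotes, backslashes, and non-printable characters escaped.
        if ch.isprintable() and ch not in "\"'\\":
            return ch
        return match.group(0)
    return HEX_ESCAPE.sub(repl, source)


def fold_string_concat(source: str) -> str:
    """Merge "a" + "b" into "ab", repeating until no adjacent pair remains."""
    previous = None
    while previous != source:
        previous = source
        source = CONCAT.sub(
            lambda m: '"' + m.group(1) + m.group(2) + '"', source
        )
    return source


def simplify(source: str) -> str:
    """Run both toy simplifications in sequence."""
    return fold_string_concat(decode_hex_escapes(source))


obfuscated = r'var f = window["\x65\x76" + "\x61\x6c"];'
print(simplify(obfuscated))  # var f = window["eval"];
```

The example also shows why the stages complement each other: once the property name is readable, the remaining call to eval is exactly the kind of construct the dynamic tracing stage is meant to resolve.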
The pipeline is orchestrated by a lightweight controller that feeds a stage's output back into the preceding stage whenever further simplification is possible, so the code is reduced as far as the passes allow before the final output is produced.
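One plausible reading of this controller is a fixpoint loop: run the passes repeatedly until a full round leaves the code unchanged. The sketch below assumes that reading. The two passes and all names are hypothetical stand-ins, chosen so that one pass unlocks work for the other, and they operate on raw text rather than an AST.

```python
import re
from typing import Callable, List

Pass = Callable[[str], str]

# Parentheses around a lone string literal, unless they follow an identifier
# or closing bracket (which would make them a call or index expression).
PARENS = re.compile(r'(?<![\w$\])])\(\s*("[^"\\]*")\s*\)')
CONCAT = re.compile(r'"([^"\\]*)"\s*\+\s*"([^"\\]*)"')


def unwrap_parens(source: str) -> str:
    """Toy pass: turn ("x") into "x"."""
    return PARENS.sub(r"\1", source)


def fold_concat(source: str) -> str:
    """Toy pass: turn "a" + "b" into "ab" (single sweep)."""
    return CONCAT.sub(lambda m: '"' + m.group(1) + m.group(2) + '"', source)


def run_to_fixpoint(source: str, passes: List[Pass], max_rounds: int = 10) -> str:
    """Apply all passes in order, repeating until a round changes nothing."""
    for _ in range(max_rounds):
        before = source
        for simplification in passes:
            source = simplification(source)
        if source == before:
            break  # fixpoint reached: no pass found anything left to do
    return source


print(run_to_fixpoint('("a") + (("b"))', [unwrap_parens, fold_concat]))  # "ab"
```

In this input the folding pass finds nothing in the first round because one operand is still parenthesized; it only succeeds in the second round. The max_rounds cap is an added safeguard against passes that keep undoing each other.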
Results & Findings
| Metric | Prior State‑of‑the‑Art | JSIMPLIFIER |
|---|---|---|
| Processing coverage | ~70 % (many inputs rejected) | 100 % |
| Obfuscation techniques handled | 8–10 | 20 |
| Correctness on ground‑truth subset | 92 % | 100 % |
| Code complexity reduction (Cyclomatic + AST depth) | 45 % | 88.2 % |
| Readability gain (LLM‑based score) | 1.8× | >4× |
| Entropy drop (measure of randomness) | 30 % | ≈55 % |
The authors also performed a user study where security analysts rated the deobfuscated output on a 5‑point Likert scale; JSIMPLIFIER’s results averaged 4.3, compared to 2.7 for the closest competitor. The large dataset enabled a statistically robust evaluation, confirming that the tool scales to real‑world traffic volumes.
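For readers unfamiliar with the entropy and complexity rows in the table above, the following sketch shows how such quantities could be approximated. The paper's exact formulas are not reproduced here. The sketch assumes character-level Shannon entropy and a crude keyword count as a stand-in for cyclomatic complexity, and every function name is hypothetical.

```python
import math
import re
from collections import Counter

# Rough set of branching constructs; a real metric would walk the AST.
BRANCH = re.compile(r"\b(?:if|for|while|case|catch)\b|&&|\|\|")


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def cyclomatic_estimate(source: str) -> int:
    """Approximate cyclomatic complexity as 1 + number of decision points."""
    return 1 + len(BRANCH.findall(source))


def relative_reduction(before: float, after: float) -> float:
    """Percentage drop from 'before' to 'after'."""
    if before == 0:
        return 0.0
    return 100.0 * (before - after) / before


print(shannon_entropy("abab"))  # 1.0
print(cyclomatic_estimate("if (a && b) { x() } else { y() }"))  # 3
print(relative_reduction(8, 4))  # 50.0
```

Comparing these values before and after deobfuscation gives percentages of the same shape as the table entries, although the numbers will not match the paper's because the metric definitions here are simplified assumptions.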
Practical Implications
- Threat hunting & incident response – Analysts can feed suspicious scripts directly into JSIMPLIFIER and obtain clean, readable code, dramatically cutting the time needed to understand payload behavior.
- Automated sandboxing pipelines – Security platforms (e.g., Cuckoo, VirusTotal) can integrate the tool as a pre‑processing step, improving detection rates for heavily obfuscated malware.
- Secure development tooling – Build‑time linters could use the static‑analysis component to flag unintentionally obfuscated code (e.g., from third‑party libraries) before deployment.
- Compliance & code‑review automation – Enterprises can run JSIMPLIFIER on codebases to ensure that no hidden, potentially malicious transformations are present in shipped JavaScript bundles.
- Research acceleration – The released dataset and evaluation framework give the community a common benchmark, fostering reproducible research and enabling rapid iteration on new deobfuscation techniques.
Limitations & Future Work
- Dynamic analysis overhead – Executing every script in a sandbox adds latency; the authors note that for high‑throughput environments a selective “static‑first” mode may be needed.
- LLM dependency – The quality of identifier renaming hinges on the underlying LLM; proprietary models could limit reproducibility or increase cost.
- Evasion arms race – Sophisticated attackers may adopt anti‑sandbox tricks (e.g., timing checks) that could bypass the dynamic tracing stage.
- Language scope – The current implementation focuses on ECMAScript 5/6; newer features (e.g., async/await, modules) are only partially supported.
Future directions include optimizing the dynamic tracing phase with lightweight instrumentation, exploring open‑source LLM alternatives for renaming, and extending support to modern JavaScript syntax and emerging obfuscation patterns (e.g., WebAssembly‑based payloads).
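The selective "static‑first" mode mentioned among the limitations could take many forms. As a minimal sketch under stated assumptions, a triage step might send a script to the sandbox only when it still contains constructs that generate code at runtime. The hint list and function names below are hypothetical and deliberately incomplete.

```python
import re

# Hypothetical hints that code is generated or injected at runtime.
DYNAMIC_HINTS = re.compile(
    r"\beval\s*\("                    # direct eval call
    r"|\bFunction\s*\("               # Function constructor, with or without new
    r"|\bsetTimeout\s*\(\s*[\"']"     # setTimeout with a string argument
    r"|\bdocument\.write\s*\("        # markup injected at runtime
)


def needs_dynamic_tracing(source: str) -> bool:
    """Return True if the script appears to build code at runtime."""
    return DYNAMIC_HINTS.search(source) is not None


def choose_mode(source: str) -> str:
    """Pick the cheaper static-only path unless a dynamic hint is present."""
    return "static+dynamic" if needs_dynamic_tracing(source) else "static-only"


print(choose_mode('eval(atob("YWxlcnQoMSk="))'))  # static+dynamic
print(choose_mode("var total = price * quantity;"))  # static-only
```

Obfuscation can hide these names (for example behind computed property access), so a check like this is only meaningful on code that static simplification has already processed.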
Authors
- Dongchao Zhou
- Lingyun Ying
- Huajun Chai
- Dongbin Wang
Paper Information
- arXiv ID: 2512.14070v1
- Categories: cs.CR, cs.SE
- Published: December 16, 2025