[Paper] Approximate Optimal Active Learning of Decision Trees

Published: December 3, 2025, 12:03 PM EST
4 min read
Source: arXiv - 2512.03971v1

Overview

The paper introduces a symbolic active‑learning framework for reconstructing an unknown binary decision tree using only membership queries (i.e., “what is the output for this input?”). By representing the entire space of bounded‑depth trees as a SAT formula and leveraging approximate model counting, the authors can pick near‑optimal queries without enumerating every candidate tree. The result is a learner that converges to the correct tree with very few queries while still offering formal guarantees suitable for verification‑heavy domains.

Key Contributions

  • SAT‑based encoding of the hypothesis space: All decision trees up to a given depth are captured compactly in a CNF formula, avoiding explicit enumeration.
  • Query selection via approximate model counting: Uses ApproxMC to estimate how much each potential query would shrink the version space, enabling near‑optimal, information‑theoretic query choices.
  • Incremental version‑space refinement: After each query, the CNF is tightened with the observed outcome, keeping the representation up‑to‑date without rebuilding from scratch.
  • Functional equivalence fallback: When ApproxMC stops making progress, a SAT‑based equivalence check confirms whether the remaining hypotheses are functionally identical, allowing early termination.
  • Empirical validation: Experiments on synthetic and benchmark decision‑tree datasets show convergence with only a handful of queries, outperforming heuristic‑based active learners.

Methodology

  1. Symbolic hypothesis encoding

    • The learner fixes a maximum depth d.
    • Each internal node is represented by Boolean variables indicating which feature it tests and which child branch is taken.
    • The entire set of feasible trees becomes a CNF formula Φ that is satisfiable exactly by the trees consistent with the current observations.
  2. Query candidate generation

    • For every possible input vector x (or a sampled subset), the learner creates two conditional formulas: one assuming the answer is 0, the other assuming 1.
  3. Approximate model counting

    • ApproxMC estimates the number of satisfying assignments of Φ under each assumption.
    • The information gain of asking about x is approximated by how evenly the two possible answers split the model count: gain(x) ≈ |Φ| − max(|Φ ∧ f(x)=0|, |Φ ∧ f(x)=1|), where |·| denotes the number of satisfying assignments.
  4. Query selection

    • The input with the highest estimated gain is queried from the oracle (the unknown tree).
  5. Version‑space update

    • The CNF Φ is conjoined with the clause that forces the observed answer, shrinking the space of admissible trees.
  6. Equivalence check (fallback)

    • If ApproxMC reports the same count for many consecutive steps, a SAT solver checks whether all remaining models compute the same Boolean function. If they do, learning stops even though multiple syntactic trees may still exist.

The loop repeats until the version space collapses to a single functional model.
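
To make the loop concrete, here is a minimal, runnable sketch. It is an illustration, not the paper's implementation: the hypothesis space is a tiny explicit set of depth‑1 “stumps” rather than a CNF over tree‑structure variables, and exact counts stand in for ApproxMC’s estimates. All identifiers (make_stump, worst_case, functionally_equivalent) are ours.

```python
from itertools import product

# Toy version of the loop above: the hypothesis space is a small explicit
# set of candidate Boolean functions, and exact counting stands in for
# ApproxMC's estimates. In the paper the space is a CNF formula over
# tree-structure variables instead of a Python list.

N_FEATURES = 3

def make_stump(feature, negate):
    """Depth-1 'tree': returns x[feature], optionally negated."""
    return lambda x: bool(x[feature] ^ negate)

# Candidate hypotheses: every depth-1 stump plus the two constant functions.
version_space = [make_stump(f, neg) for f in range(N_FEATURES) for neg in (0, 1)]
version_space += [lambda x: False, lambda x: True]

oracle = make_stump(1, 1)            # hidden target (unknown to the learner)
inputs = list(product([0, 1], repeat=N_FEATURES))

def functionally_equivalent(space):
    """Fallback check: do all surviving hypotheses compute the same function?"""
    return all(len({h(x) for h in space}) == 1 for x in inputs)

queries = 0
while not functionally_equivalent(version_space):
    # Query selection: pick the input whose worst-case answer keeps the fewest
    # hypotheses alive -- the greedy halving rule that ApproxMC approximates
    # symbolically in the paper.
    def worst_case(x):
        ones = sum(h(x) for h in version_space)
        return max(ones, len(version_space) - ones)

    x = min(inputs, key=worst_case)
    answer = oracle(x)               # membership query to the black box
    queries += 1

    # Version-space update: keep only hypotheses consistent with the answer.
    version_space = [h for h in version_space if h(x) == answer]

print(f"converged after {queries} queries; "
      f"{len(version_space)} syntactically distinct hypotheses remain")
```

The `min` over `worst_case` is the greedy halving rule of step 4, and `functionally_equivalent` plays the role of the SAT‑based fallback of step 6.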

Results & Findings

| Metric | Observation |
| --- | --- |
| Queries needed | Typically ≤ log₂ of the number of candidate trees, i.e., close to the information‑theoretic optimum. |
| Runtime | Dominated by ApproxMC calls; still sub‑second for trees up to depth 5 on a standard desktop. |
| Accuracy | The learner always recovered the exact target tree (or a functionally equivalent one) on all benchmark instances. |
| Comparison | Needed 30–50 % fewer queries than entropy‑based active learners (e.g., ID3‑style heuristics) and avoided the “local‑optimum” pitfalls of greedy impurity measures. |

The experiments confirm that approximate counting provides a reliable proxy for true information gain, enabling near‑optimal query strategies without the combinatorial blow‑up of exact counting.
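
As a toy illustration of this counting‑as‑gain rule (the counts below are invented for the example, not taken from the paper’s experiments):

```python
# Illustrative numbers only: model counts for three candidate queries,
# plugged into the gain rule from the Methodology section,
#   gain(x) = |Φ| - max(|Φ ∧ f(x)=0|, |Φ ∧ f(x)=1|).
total = 1024                       # |Φ|: models consistent with answers so far
candidates = {
    "x_a": (512, 512),             # perfectly balanced split
    "x_b": (900, 124),             # lopsided split: little information
    "x_c": (700, 324),
}
gains = {x: total - max(c0, c1) for x, (c0, c1) in candidates.items()}
print(gains)                       # {'x_a': 512, 'x_b': 124, 'x_c': 324}
print(max(gains, key=gains.get))   # x_a: halving the version space is optimal
```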

Practical Implications

  • Model extraction for security audits – An attacker (or auditor) can reconstruct a proprietary decision‑tree classifier with minimal probing, useful for reverse‑engineering black‑box models while still providing formal guarantees about the extraction process.
  • Test‑case generation in software verification – When a component’s behavior is known to be representable as a bounded‑depth decision tree, the method can automatically synthesize a minimal set of inputs that fully characterize its logic, accelerating regression testing and formal proof generation.
  • Interactive debugging tools – IDE plugins could ask developers targeted “what‑if” questions to pinpoint the exact decision rule causing a bug, reducing the number of manual test runs.
  • Resource‑constrained active learning – In IoT or edge scenarios where each query incurs a cost (e.g., sensor reading, network round‑trip), the approach ensures each query yields maximal reduction in uncertainty, extending battery life and bandwidth.

Because the core engine is a SAT solver plus an approximate counter, the technique can be plugged into existing verification pipelines (e.g., using Z3, MiniSat, or ApproxMC) without reinventing custom learning code.
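
As a sketch of what that plugging‑in can look like, assuming the pyapproxmc Python bindings (the clauses below are placeholders, not the paper’s actual tree encoding):

```python
# Driving ApproxMC from Python via the pyapproxmc bindings
# (pip install pyapproxmc). The tiny CNF is a stand-in for the
# tree-encoding formula Φ; variables 1..3 are illustrative.
import pyapproxmc

counter = pyapproxmc.Counter()     # default (epsilon, delta) tolerances

# Placeholder Φ: (x1 ∨ x2) ∧ (¬x1 ∨ x3)
counter.add_clause([1, 2])
counter.add_clause([-1, 3])

# The paper's version-space update amounts to conjoining a unit clause
# that fixes the observed answer before re-counting.
counter.add_clause([2])

cells, hashes = counter.count()    # estimate = cells * 2**hashes
print(f"approximate model count: {cells} * 2**{hashes}")
```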

Limitations & Future Work

  • Scalability to deep trees: The CNF size grows exponentially with depth; while ApproxMC mitigates enumeration, very deep trees (depth > 7) become memory‑intensive.
  • Feature space assumptions: The method assumes a finite, discrete set of Boolean features. Extending to numeric or categorical attributes would require additional encoding tricks (e.g., binarization).
  • Reliance on ApproxMC’s guarantees: Approximate counting introduces probabilistic error bounds; in pathological cases the query selection may be sub‑optimal.
  • Future directions: the authors suggest hybrid symbolic‑statistical models that combine the SAT encoding with gradient‑based learners to handle mixed‑type data.