[Paper] Confidence and Calibration of Activation Oracles for Reliable Interpretation of Language Model Internals
Activation oracles aim to make the activations of other models legible to humans and yield promising results compared to white-box interpretability techniques. ...