Paper Title

Auditing Visualizations: Transparency Methods Struggle to Detect Anomalous Behavior

Paper Authors

Jean-Stanislas Denain, Jacob Steinhardt

Paper Abstract

Model visualizations provide information that outputs alone might miss. But can we trust that model visualizations reflect model behavior? For instance, can they diagnose abnormal behavior such as planted backdoors or overregularization? To evaluate visualization methods, we test whether they assign different visualizations to anomalously trained models and normal models. We find that while existing methods can detect models with starkly anomalous behavior, they struggle to identify more subtle anomalies. Moreover, they often fail to recognize the inputs that induce anomalous behavior, e.g. images containing a spurious cue. These results reveal blind spots and limitations of some popular model visualizations. By introducing a novel evaluation framework for visualizations, our work paves the way for developing more reliable model transparency methods in the future.
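
The evaluation idea described in the abstract is to compare the visualizations a method assigns to a normal reference model against those it assigns to a possibly anomalous model, and to flag the pair when the visualizations diverge. Below is a minimal illustrative sketch of this comparison, not the paper's exact pipeline: it uses plain gradient saliency maps as the visualization method and cosine similarity as a hypothetical divergence score, and the models and inputs are stand-ins.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

def saliency_map(model, x, targets):
    # Gradient of each target logit w.r.t. the input: a basic visualization.
    x = x.clone().requires_grad_(True)
    logits = model(x)
    logits[torch.arange(len(targets)), targets].sum().backward()
    return x.grad.abs().amax(dim=1)  # max over color channels -> (N, H, W)

def visualization_divergence(model_a, model_b, x, targets):
    # 1 minus cosine similarity between the two models' saliency maps, per image.
    sa = saliency_map(model_a, x, targets).flatten(1)
    sb = saliency_map(model_b, x, targets).flatten(1)
    return 1.0 - F.cosine_similarity(sa, sb, dim=1)

# Stand-ins: a "normal" reference model and a possibly anomalous suspect
# (in the paper's setting, the suspect would be anomalously trained,
# e.g. backdoored; here both are just randomly initialized ResNets).
reference = models.resnet18(weights=None).eval()
suspect = models.resnet18(weights=None).eval()
images = torch.randn(4, 3, 224, 224)   # stand-in for real audit inputs
labels = torch.tensor([0, 1, 2, 3])
print(visualization_divergence(reference, suspect, images, labels))
# High divergence across many inputs would flag the suspect as anomalous;
# the paper's finding is that for subtle anomalies this gap is often small.
```

A detector built this way only works if the visualization method actually changes when behavior changes; the abstract's negative result is that for subtle anomalies, popular visualizations often do not.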
