Paper Title
Interpreting Neural Networks through the Polytope Lens
Paper Authors
Paper Abstract
Mechanistic interpretability aims to explain what a neural network has learned at a nuts-and-bolts level. What are the fundamental primitives of neural network representations? Previous mechanistic descriptions have used individual neurons or their linear combinations to understand the representations a network has learned. But there are clues that neurons and their linear combinations are not the correct fundamental units of description: directions cannot describe how neural networks use nonlinearities to structure their representations. Moreover, many instances of individual neurons and their combinations are polysemantic (i.e., they have multiple unrelated meanings). Polysemanticity makes interpreting the network in terms of neurons or directions challenging since we can no longer assign a specific feature to a neural unit. In order to find a basic unit of description that does not suffer from these problems, we zoom in beyond just directions to study the way that piecewise linear activation functions (such as ReLU) partition the activation space into numerous discrete polytopes. We call this perspective the polytope lens. The polytope lens makes concrete predictions about the behavior of neural networks, which we evaluate through experiments on both convolutional image classifiers and language models. Specifically, we show that polytopes can be used to identify monosemantic regions of activation space (while directions are not in general monosemantic) and that the density of polytope boundaries reflects semantic boundaries. We also outline a vision for what mechanistic interpretability might look like through the polytope lens.
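The following minimal Python sketch is not from the paper; the toy network, weights, and function names are hypothetical stand-ins. Under those assumptions, it illustrates the core idea behind the polytope lens: the binary pattern of which ReLUs fire identifies the polytope an input lies in (where the network acts as a single affine map), and counting pattern changes along an interpolation path gives a crude estimate of polytope-boundary density between two inputs.

```python
# Minimal sketch of the polytope-lens idea on a toy ReLU MLP (hypothetical example,
# not the paper's code): activation sign patterns index polytopes, and pattern flips
# along a path approximate polytope-boundary density.
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer ReLU MLP with random weights (stand-in for a trained model).
W1, b1 = rng.normal(size=(32, 8)), rng.normal(size=32)
W2, b2 = rng.normal(size=(32, 32)), rng.normal(size=32)

def forward(x):
    """Return pre-activations of both ReLU layers."""
    h1 = W1 @ x + b1
    h2 = W2 @ np.maximum(h1, 0) + b2
    return h1, h2

def polytope_code(x):
    """Binary activation pattern: which ReLUs fire. Inputs sharing this code
    lie in the same polytope, on which the network is a single affine map."""
    h1, h2 = forward(x)
    return np.concatenate([(h1 > 0), (h2 > 0)]).astype(int)

def boundary_crossings(x_a, x_b, steps=200):
    """Count interpolation steps where the activation pattern changes --
    a crude proxy for polytope-boundary density between two inputs."""
    codes = [polytope_code((1 - t) * x_a + t * x_b)
             for t in np.linspace(0.0, 1.0, steps)]
    return sum(np.any(codes[i] != codes[i - 1]) for i in range(1, len(codes)))

x_a, x_b = rng.normal(size=8), rng.normal(size=8)
print("polytope-boundary crossings between x_a and x_b:", boundary_crossings(x_a, x_b))
```

In this framing, a higher crossing count between two inputs suggests they sit in semantically more distant regions of activation space, which is the kind of prediction the paper evaluates on image classifiers and language models.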