Paper Title
Generative Adversarial Simulator
Paper Authors
Paper Abstract
Knowledge distillation between machine learning models has opened many new avenues for reducing parameter counts, improving performance, or amortizing training time when the architecture changes between the teacher and student network. In the case of reinforcement learning, this technique has also been applied to distill teacher policies into student policies. Until now, policy distillation has required access to a simulator or to real-world trajectories. In this paper we introduce a simulator-free approach to knowledge distillation in the context of reinforcement learning. A key challenge is having the student learn the multiplicity of cases that correspond to a given action. While prior work has shown that data-free knowledge distillation is possible with supervised learning models by generating synthetic examples, these approaches are prone to producing only a single prototype example for each class. We propose an extension that explicitly handles multiple observations per output class, seeking to find as many exemplars as possible for a given output class by reinitializing our data generator and making use of an adversarial loss. To the best of our knowledge, this is the first demonstration of simulator-free knowledge distillation between a teacher and a student policy. The new approach improves over the state of the art in data-free learning of student networks on benchmark datasets (MNIST, Fashion-MNIST, CIFAR-10), and we also demonstrate that it specifically tackles issues with multiple input modes. We also identify open problems when distilling agents trained in high-dimensional environments such as Pong, Breakout, or Seaquest.
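The abstract describes the mechanism only at a high level: a learned generator proposes synthetic observations, the student is trained to imitate a frozen teacher policy on them, and the generator is driven by an adversarial loss plus periodic reinitialization so it keeps uncovering new observation modes rather than collapsing to one prototype per action. The sketch below is a minimal, hypothetical PyTorch instantiation of that loop, not the authors' released code; the network sizes, optimizers, loss weighting, and reinitialization schedule are illustrative assumptions.

    # Hedged sketch of simulator-free policy distillation (assumptions noted above).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    OBS_DIM, ACT_DIM, Z_DIM = 64, 6, 32  # illustrative sizes, not from the paper

    def make_generator():
        return nn.Sequential(nn.Linear(Z_DIM, 256), nn.ReLU(), nn.Linear(256, OBS_DIM))

    def policy_net():
        return nn.Sequential(nn.Linear(OBS_DIM, 256), nn.ReLU(), nn.Linear(256, ACT_DIM))

    teacher = policy_net()
    teacher.eval()                      # frozen, pretrained teacher policy
    student = policy_net()
    generator = make_generator()
    opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
    opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)

    def disagreement(obs):
        # KL between teacher and student action distributions on synthetic observations
        return F.kl_div(F.log_softmax(student(obs), dim=-1),
                        F.softmax(teacher(obs), dim=-1),
                        reduction="batchmean")

    for step in range(10_000):
        # Generator step: adversarially maximize teacher/student disagreement,
        # pushing the generator toward observations the student has not mastered.
        obs = generator(torch.randn(128, Z_DIM))
        opt_g.zero_grad()
        (-disagreement(obs)).backward()
        opt_g.step()

        # Student step: match the teacher's action distribution on fresh samples.
        obs = generator(torch.randn(128, Z_DIM)).detach()
        opt_s.zero_grad()
        disagreement(obs).backward()
        opt_s.step()

        # Periodic reinitialization (per the abstract) encourages coverage of
        # multiple observation modes per action instead of a single prototype.
        if step % 2_000 == 1_999:
            generator = make_generator()
            opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)

The min–max structure (student minimizes, generator maximizes the same divergence) is what the abstract calls the adversarial loss; the reinitialization step is the proposed mechanism for recovering additional exemplars of each output class.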