Paper Title

On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets

Paper Authors

Hyounghun Kim, Aishwarya Padmakumar, Di Jin, Mohit Bansal, Dilek Hakkani-Tur

Paper Abstract

Natural language guided embodied task completion is a challenging problem since it requires understanding natural language instructions, aligning them with egocentric visual observations, and choosing appropriate actions to execute in the environment to produce desired changes. We experiment with augmenting a transformer model for this task with modules that effectively utilize a wider field of view and learn to choose whether the next step requires a navigation or a manipulation action. We observe that the proposed modules result in improved, and in fact state-of-the-art, performance on an unseen validation set of a popular benchmark dataset, ALFRED. However, our best model, selected using the unseen validation set, underperforms on the unseen test split of ALFRED, indicating that performance on the unseen validation set may not by itself be a sufficient indicator of whether model improvements generalize to unseen test sets. We highlight this result because we believe it may be a wider phenomenon in machine learning tasks, but one that is primarily noticeable only in benchmarks that limit evaluations on test splits, and it underscores the need to modify benchmark design to better account for variance in model performance.
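To make the two proposed additions concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of what a wider-field-of-view fusion module and a navigation-vs-manipulation selection head might look like on top of a transformer state. The module names (WideViewEncoder, ActionTypeSelector), dimensions, and wiring are illustrative assumptions only.

```python
# Hypothetical sketch of the two ideas described in the abstract:
# (1) fusing features from several egocentric views into one state vector,
# (2) a small head that decides whether the next step is navigation or manipulation.
# All names, sizes, and wiring are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class WideViewEncoder(nn.Module):
    """Pools visual features from multiple surrounding views into one state vector."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)

    def forward(self, view_feats: torch.Tensor, lang_state: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, feat_dim) features, one row per camera view
        # lang_state: (batch, hidden_dim) instruction/state representation used as the query
        views = self.proj(view_feats)
        fused, _ = self.attn(lang_state.unsqueeze(1), views, views)
        return fused.squeeze(1)  # (batch, hidden_dim) fused wide-view state


class ActionTypeSelector(nn.Module):
    """Predicts whether the next action should be navigation or manipulation."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 2)  # 0 = navigation, 1 = manipulation

    def forward(self, fused_state: torch.Tensor) -> torch.Tensor:
        return self.head(fused_state)  # logits over the two action types


if __name__ == "__main__":
    encoder, selector = WideViewEncoder(), ActionTypeSelector()
    views = torch.randn(4, 5, 512)        # 4 examples, 5 egocentric views each
    instruction = torch.randn(4, 768)     # instruction/state embeddings
    state = encoder(views, instruction)
    print(selector(state).argmax(dim=-1))  # predicted action type per example
```

In such a design, the predicted action type could gate which decoder (navigation or object interaction) produces the next low-level action; whether the paper uses gating or a joint prediction is not specified here, so this is only one plausible arrangement.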
