WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents

Paper Title

WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents

Authors

Shunyu Yao, Howard Chen, John Yang, Karthik Narasimhan

Abstract

Existing benchmarks for grounding language in interactive environments either lack real-world linguistic elements, or prove difficult to scale up due to substantial human involvement in the collection of data or feedback signals. To bridge this gap, we develop WebShop -- a simulated e-commerce website environment with $1.18$ million real-world products and $12,087$ crowd-sourced text instructions. Given a text instruction specifying a product requirement, an agent needs to navigate multiple types of webpages and issue diverse actions to find, customize, and purchase an item. WebShop provides several challenges for language grounding including understanding compositional instructions, query (re-)formulation, comprehending and acting on noisy text in webpages, and performing strategic exploration. We collect over $1,600$ human demonstrations for the task, and train and evaluate a diverse range of agents using reinforcement learning, imitation learning, and pre-trained image and language models. Our best model achieves a task success rate of $29\%$, which outperforms rule-based heuristics ($9.6\%$) but is far lower than human expert performance ($59\%$). We also analyze agent and human trajectories and ablate various model components to provide insights for developing future agents with stronger language understanding and decision making abilities. Finally, we show that agents trained on WebShop exhibit non-trivial sim-to-real transfer when evaluated on amazon.com and ebay.com, indicating the potential value of WebShop in developing practical web-based agents that can operate in the wild.
