Paper Title
Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus
Paper Authors
Paper Abstract
Mobile UI understanding is important for enabling various interaction tasks such as UI automation and accessibility. Previous mobile UI modeling often depends on the view hierarchy information of a screen, which directly provides the structural data of the UI, in the hope of bypassing the challenging task of visual modeling from screen pixels. However, view hierarchies are not always available, and are often corrupted by missing object descriptions or misaligned structure information. As a result, although the use of view hierarchies can offer short-term gains, it may ultimately hinder the applicability and performance of the model. In this paper, we propose Spotlight, a vision-only approach for mobile UI understanding. Specifically, we enhance a vision-language model that takes only the screenshot of the UI and a region of interest on the screen -- the focus -- as the input. This general architecture of Spotlight is easily scalable and capable of performing a range of UI modeling tasks. Our experiments show that our model establishes SoTA results on several representative UI tasks and outperforms previous methods that use both screenshots and view hierarchies as inputs. Furthermore, we explore the multi-task learning and few-shot prompting capabilities of the proposed model, demonstrating promising results in the multi-task learning direction.
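To make the screenshot-plus-focus input concrete, here is a minimal PyTorch sketch of one plausible way such a focus-conditioned encoder could be wired up: the screenshot is split into patch tokens, and a query derived from the focus region's bounding box attends over those tokens to produce a region summary for a downstream language decoder. All module names, dimensions, and design choices below are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only: encodes a screenshot and a focus bounding box
# into a single region-summary embedding via cross-attention.
import torch
import torch.nn as nn

class FocusRegionEncoder(nn.Module):
    def __init__(self, patch=16, dim=256, heads=4):
        super().__init__()
        # ViT-style patch embedding for the screenshot (simplified).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Embed the focus bounding box (x0, y0, x1, y1), normalized to [0, 1].
        self.box_embed = nn.Linear(4, dim)
        # The focus query attends over all screen tokens, summarizing the
        # region of interest in the context of the full screen.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, screenshot, focus_box):
        # screenshot: (B, 3, H, W); focus_box: (B, 4)
        tokens = self.patch_embed(screenshot).flatten(2).transpose(1, 2)  # (B, N, dim)
        query = self.box_embed(focus_box).unsqueeze(1)                    # (B, 1, dim)
        summary, _ = self.attn(query, tokens, tokens)                     # (B, 1, dim)
        return summary  # would be fed to a language decoder per UI task

model = FocusRegionEncoder()
shot = torch.randn(2, 3, 224, 224)         # dummy screenshots
box = torch.tensor([[0.1, 0.2, 0.4, 0.3],  # dummy focus regions
                    [0.5, 0.5, 0.9, 0.8]])
print(model(shot, box).shape)  # torch.Size([2, 1, 256])
```

Because the task is specified entirely by the focus region and the decoder's text output, the same encoder can serve multiple UI tasks (e.g., widget captioning or tappability prediction) without task-specific heads, which is consistent with the multi-task setting the abstract describes.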