Paper Title
Detect Only What You Specify: Object Detection with Linguistic Target
Paper Authors
Paper Abstract
Object detection is a computer vision task of predicting a set of bounding boxes and category labels for the objects of interest in a given image. Each category corresponds to a linguistic symbol such as 'dog' or 'person', and there should be relationships among these symbols. However, object detectors only learn to classify the categories and do not treat them as linguistic symbols. Multi-modal models often use a pre-trained object detector to extract object features from the image, but the models are separate from the detector, and the extracted visual features do not change with the linguistic input. We rethink object detection as a vision-and-language reasoning task. We then propose the targeted detection task, where detection targets are given in natural language and the goal is to detect only the target objects in a given image. Nothing is detected if no target is given. Commonly used modern object detectors have many hand-designed components, such as anchors, which make it difficult to fuse textual inputs into their complex pipelines. We thus propose the Language-Targeted Detector (LTD) for targeted detection, built on a recently proposed Transformer-based detector. LTD is an encoder-decoder architecture, and our conditional decoder allows the model to reason about the encoded image with the textual input as linguistic context. We evaluate the detection performance of LTD on the COCO object detection dataset and show that our model improves detection results by grounding the textual input to visual objects.
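The abstract does not give the exact architecture, but the core idea of a conditional decoder can be sketched as DETR-style object queries cross-attending jointly over encoded image tokens and embedded text targets, so the decoding is conditioned on the language input. The following is a minimal NumPy sketch under that assumption; the function name, shapes, and single-head attention are illustrative, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conditional_cross_attention(queries, img_tokens, txt_tokens):
    """One cross-attention step of a hypothetical conditional decoder:
    object queries attend over the concatenation of encoded image tokens
    and embedded text targets, so the output depends on the language input."""
    # Memory sequence = image tokens followed by text tokens, shape (M, d).
    memory = np.concatenate([img_tokens, txt_tokens], axis=0)
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)   # (Q, M) scaled dot-product
    weights = softmax(scores, axis=-1)         # attention over image + text
    return weights @ memory                    # (Q, d) updated object queries

rng = np.random.default_rng(0)
d = 16
queries = rng.standard_normal((100, d))    # 100 object queries (DETR-style)
img_tokens = rng.standard_normal((49, d))  # e.g. a 7x7 encoded image grid
txt_tokens = rng.standard_normal((3, d))   # embedded target words, e.g. "dog"

out = conditional_cross_attention(queries, img_tokens, txt_tokens)
print(out.shape)  # (100, 16)
```

With no text tokens the memory degenerates to image features only, which mirrors the task definition: if no target is given, the queries have no linguistic context to match and the model should detect nothing.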