Paper Title


Bounding Boxes Are All We Need: Street View Image Classification via Context Encoding of Detected Buildings

Paper Authors

Kun Zhao, Yongkun Liu, Siyuan Hao, Shaoxing Lu, Hongbin Liu, Lijian Zhou

Paper Abstract


Street view image classification aiming at urban land use analysis is difficult because the class labels (e.g., commercial area) are concepts with a higher abstraction level than those of general visual tasks (e.g., persons and cars). Therefore, classification models using only visual features often fail to achieve satisfactory performance. In this paper, a novel approach based on a "Detector-Encoder-Classifier" framework is proposed. Instead of directly using visual features of the whole image, as common image-level models based on convolutional neural networks (CNNs) do, the proposed framework first obtains the bounding boxes of buildings in street view images from a detector. Their contextual information, such as the co-occurrence patterns of building classes and their layout, is then encoded into metadata by the proposed algorithm "CODING" (Context encOding of Detected buildINGs). Finally, these bounding box metadata are classified by a recurrent neural network (RNN). In addition, we built a dual-labeled dataset named "BEAUTY" (Building dEtection And Urban funcTional-zone portraYing) of 19,070 street view images and 38,857 buildings based on the existing BIC GSV [1]. The dataset can be used not only for street view image classification but also for multi-class building detection. Experiments on "BEAUTY" show that the proposed approach achieves a 12.65% improvement in macro-precision and 12% in macro-recall over image-level CNN-based models. Our code and dataset are available at https://github.com/kyle-one/Context-Encoding-of-Detected-Buildings/
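The abstract does not spell out how "CODING" turns detected buildings into metadata, but the general idea (preserving class co-occurrence and layout as an ordered sequence for a sequence classifier) can be sketched as follows. The left-to-right ordering, the normalized-geometry vector `(class_id, cx, cy, rw, rh)`, and the function name are illustrative assumptions, not the paper's actual encoding:

```python
# Hypothetical sketch of the bounding-box context-encoding idea: convert a
# detector's building boxes into an ordered sequence of metadata vectors
# that a sequence model (e.g., an RNN) could then classify.
# Assumptions (not from the paper): boxes are ordered left to right, and
# each is encoded as (class_id, center_x, center_y, width, height) with
# geometry normalized to [0, 1] by the image size.

def encode_detected_buildings(boxes, img_w, img_h):
    """boxes: list of (class_id, x, y, w, h) in pixel coordinates.
    Returns a left-to-right sequence of (class_id, cx, cy, rw, rh),
    preserving building-class co-occurrence and spatial layout."""
    seq = []
    for class_id, x, y, w, h in sorted(boxes, key=lambda b: b[1]):
        cx = (x + w / 2) / img_w   # normalized box center x
        cy = (y + h / 2) / img_h   # normalized box center y
        seq.append((class_id, cx, cy, w / img_w, h / img_h))
    return seq

# Two detected buildings in a 640x480 street view image:
boxes = [(2, 400, 100, 200, 300),   # e.g., a commercial building
         (0, 50, 120, 150, 250)]    # e.g., a residential building
sequence = encode_detected_buildings(boxes, img_w=640, img_h=480)
print(sequence)
```

The resulting fixed-length vectors can be fed step by step into any sequence classifier; what matters for the approach described in the abstract is that the encoding keeps which building classes appear together and where they sit relative to each other, rather than raw pixels.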
