Paper Title
Improving generalization by mimicking the human visual diet
Paper Authors
Paper Abstract
We present a new perspective on bridging the generalization gap between biological and computer vision -- mimicking the human visual diet. While computer vision models rely on internet-scraped datasets, humans learn from limited 3D scenes under diverse real-world transformations with objects in natural context. Our results demonstrate that incorporating variations and contextual cues ubiquitous in the human visual training data (visual diet) significantly improves generalization to real-world transformations such as lighting, viewpoint, and material changes. This improvement also extends to generalizing from synthetic to real-world data -- all models trained with a human-like visual diet outperform specialized architectures by large margins when tested on natural image data. These experiments are enabled by our two key contributions: a novel dataset capturing scene context and diverse real-world transformations to mimic the human visual diet, and a transformer model tailored to leverage these aspects of the human visual diet. All data and source code can be accessed at https://github.com/Spandan-Madan/human_visual_diet.