Paper Title
Rethinking Cooking State Recognition with Vision Transformers
Paper Authors
Paper Abstract
To ensure proper knowledge representation of the kitchen environment, it is vital for kitchen robots to recognize the states of the food items that are being cooked. Although the domain of object detection and recognition has been extensively studied, the task of object state classification remains relatively unexplored. The high intra-class similarity of ingredients across different states of cooking makes the task even more challenging. Researchers have recently proposed Deep Learning based strategies; however, these are yet to achieve high performance. In this study, we utilized the self-attention mechanism of the Vision Transformer (ViT) architecture for the Cooking State Recognition task. The proposed approach encapsulates the globally salient features of images, while also exploiting the weights learned from a larger dataset. This global attention allows the model to withstand the similarities between samples of different cooking objects, while the use of transfer learning helps to overcome the lack of inductive bias by utilizing pretrained weights. To improve recognition accuracy, several augmentation techniques have been employed as well. Evaluation of our proposed framework on the 'Cooking State Recognition Challenge Dataset' achieved an accuracy of 94.3%, which significantly outperforms the state-of-the-art.
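The abstract credits ViT's self-attention with capturing globally salient features: every image patch attends to every other patch, so the representation of each region is informed by the whole image. A minimal sketch of single-head scaled dot-product self-attention over patch embeddings (toy sizes and random weights, purely illustrative; the paper's actual model and hyperparameters are not reproduced here):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence of patch embeddings.

    X: (n_patches, d_model) patch embeddings.
    Wq, Wk, Wv: (d_model, d_k) projection matrices.
    Each output row is an attention-weighted mix of ALL patches, which is the
    'global attention' property the abstract attributes to ViT.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # pairwise patch similarity
    scores -= scores.max(axis=-1, keepdims=True)           # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over patches
    return weights @ V, weights

rng = np.random.default_rng(0)
n_patches, d_model = 4, 8                                  # hypothetical toy sizes
X = rng.normal(size=(n_patches, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape)   # one d_model-dim vector per patch
```

In the full ViT this block is repeated with multiple heads, residual connections, and MLP layers, and a classification head is attached to a learned class token; transfer learning then amounts to loading pretrained weights for everything except that final head.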