Paper title
DcaseNet: An integrated pretrained deep neural network for detecting and classifying acoustic scenes and events
Paper authors
Paper abstract
Although acoustic scenes and events include many related tasks, their combined detection and classification have been scarcely investigated. We propose three architectures of deep neural networks that are integrated to simultaneously perform acoustic scene classification, audio tagging, and sound event detection. The first two architectures are inspired by human cognitive processes. The first architecture resembles the short-term perception for scene classification of adults, who can detect various sound events that are then used to identify the acoustic scene. The second architecture resembles the long-term learning of babies, being also the concept underlying self-supervised learning. Babies first observe the effects of abstract notions such as gravity and then learn specific tasks using such perceptions. The third architecture adds a few layers to the second one that solely perform a single task before its corresponding output layer. We aim to build an integrated system that can serve as a pretrained model to perform the three abovementioned tasks. Experiments on three datasets demonstrate that the proposed architecture, called DcaseNet, can be either directly used for any of the tasks while providing suitable results or fine-tuned to improve the performance of one task. The code and pretrained DcaseNet weights are available at https://github.com/Jungjee/DcaseNet.
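The data flow of the first architecture described in the abstract (sound events are detected first, then used to identify the scene) can be illustrated with a toy sketch. This is not the authors' implementation, which is a deep neural network available at the repository above. Everything here is an assumption made for illustration: the single linear layers, the random weights, the feature and class counts, and the use of max-pooling over time to turn frame-level detections into clip-level tags.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def linear(vec, weights, bias):
    # weights holds one row per output unit
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def random_layer(rng, n_in, n_out):
    # Hypothetical untrained layer; a real model would learn these values.
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def forward(frames, sed_layer, asc_layer):
    # Sound event detection: independent event probabilities per frame.
    sed = [[sigmoid(z) for z in linear(f, *sed_layer)] for f in frames]
    # Audio tagging (assumed pooling): max over time for each event class.
    tags = [max(per_class) for per_class in zip(*sed)]
    # Acoustic scene classification: scene posterior from the clip-level tags.
    scene = softmax(linear(tags, *asc_layer))
    return sed, tags, scene


# Illustrative sizes only; none of these come from the paper.
N_FEATS, N_EVENTS, N_SCENES, N_FRAMES = 8, 4, 3, 10

rng = random.Random(0)
sed_layer = random_layer(rng, N_FEATS, N_EVENTS)
asc_layer = random_layer(rng, N_EVENTS, N_SCENES)
frames = [[rng.gauss(0, 1) for _ in range(N_FEATS)] for _ in range(N_FRAMES)]

sed, tags, scene = forward(frames, sed_layer, asc_layer)
print(len(sed), len(sed[0]), len(tags), len(scene))  # prints "10 4 4 3"
```

The three return values correspond to the three tasks named in the abstract: a frame-by-event matrix for sound event detection, one probability per event for audio tagging, and a distribution over scenes for acoustic scene classification. Because all three come from one forward pass, the sketch also shows why a single model can serve any of the tasks directly.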