Paper Title
Show and Speak: Directly Synthesize Spoken Description of Images
Paper Authors
Paper Abstract
This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible.
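The encoder-decoder pipeline the abstract describes can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: all dimensions, weight matrices, and the fixed frame count are assumptions, random projections stand in for the learned image encoder, and the WaveNet vocoder step (spectrogram to waveform) is omitted. It only shows the data flow: an image is encoded to an embedding, and mel-spectrogram frames are predicted autoregressively, each conditioned on the embedding and the previous frame.

```python
import numpy as np

rng = np.random.default_rng(0)

IMG_DIM = 64 * 64   # flattened toy "image" (assumed size)
ENC_DIM = 32        # embedding size (assumed)
N_MELS = 80         # mel bins, a typical TTS value (assumed)
N_FRAMES = 5        # fixed for the demo; real models decode until a stop signal

# Random projections stand in for learned parameters.
W_enc = rng.standard_normal((IMG_DIM, ENC_DIM)) * 0.01
W_dec_in = rng.standard_normal((ENC_DIM + N_MELS, ENC_DIM)) * 0.01
W_dec_out = rng.standard_normal((ENC_DIM, N_MELS)) * 0.01

def encode(image):
    """Map an image to a fixed-size embedding (stand-in for a CNN encoder)."""
    return np.tanh(image.reshape(-1) @ W_enc)

def decode(embedding, n_frames=N_FRAMES):
    """Autoregressively predict mel frames, each conditioned on the
    image embedding and the previously generated frame."""
    prev = np.zeros(N_MELS)  # all-zero "go" frame
    frames = []
    for _ in range(n_frames):
        h = np.tanh(np.concatenate([embedding, prev]) @ W_dec_in)
        prev = h @ W_dec_out
        frames.append(prev)
    return np.stack(frames)  # shape: (n_frames, N_MELS)

image = rng.standard_normal((64, 64))
spec = decode(encode(image))
print(spec.shape)  # (5, 80)
```

In the full system, the predicted spectrogram would then be passed to a WaveNet vocoder to produce the final speech audio.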