Paper Title
DialogSum Challenge: Results of the Dialogue Summarization Shared Task
Paper Authors
Paper Abstract
We report the results of the DialogSum Challenge, the shared task on summarizing real-life scenario dialogues at INLG 2022. Four teams participate in this shared task, and three of them submit their system reports, exploring different methods to improve the performance of dialogue summarization. Although there is a great improvement over the baseline models on automatic evaluation metrics such as ROUGE scores, we find a salient gap between model-generated outputs and human-annotated summaries under human evaluation from multiple aspects. These findings demonstrate the difficulty of dialogue summarization and suggest that more fine-grained evaluation metrics are needed.