H2-Stereo: High-Speed, High-Resolution Stereoscopic Video System

Paper Title

H2-Stereo: High-Speed, High-Resolution Stereoscopic Video System

Authors

Ming Cheng, Yiling Xu, Wang Shen, M. Salman Asif, Chao Ma, Jun Sun, Zhan Ma

Abstract

High-speed, high-resolution stereoscopic (H2-Stereo) video allows us to perceive dynamic 3D content at fine granularity. Acquiring H2-Stereo video with commodity cameras, however, remains challenging. Existing spatial super-resolution and temporal frame interpolation methods offer compromised solutions that lack temporal or spatial details, respectively. To alleviate this problem, we propose a dual-camera system in which one camera captures high-spatial-resolution low-frame-rate (HSR-LFR) video with rich spatial details, and the other captures low-spatial-resolution high-frame-rate (LSR-HFR) video with smooth temporal details. We then devise a Learned Information Fusion network (LIFnet) that exploits cross-camera redundancies to enhance both camera views to high spatiotemporal resolution (HSTR), effectively reconstructing the H2-Stereo video. A disparity network transfers spatiotemporal information across views even in large-disparity scenes; building on it, we propose disparity-guided flow-based warping for the LSR-HFR view and complementary warping for the HSR-LFR view. A multi-scale fusion method in the feature domain minimizes occlusion-induced warping ghosts and holes in the HSR-LFR view. LIFnet is trained end to end on a high-quality stereo video dataset that we collected from YouTube. Extensive experiments demonstrate that our model outperforms existing state-of-the-art methods for both views on synthetic data and on camera-captured real data with large disparity. Ablation studies explore various aspects of our system, including spatiotemporal resolution, camera baseline, camera desynchronization, long/short exposures, and applications, to fully understand its capability for potential applications.
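
To make the two input streams concrete, the following is a minimal sketch of how HSR-LFR and LSR-HFR inputs might be synthesized from a ground-truth stereo pair. It is an illustration only, not the authors' data pipeline: the function name `simulate_dual_camera`, the factor-of-4 defaults, and the choice of average pooling for spatial downsampling are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def simulate_dual_camera(left_video, right_video,
                         spatial_factor=4, temporal_factor=4):
    """Hypothetical helper: derive the two camera streams from a stereo pair.

    left_video, right_video: arrays of shape (T, H, W, C).
    Returns (hsr_lfr, lsr_hfr).
    The factors and the pooling choice are assumptions, not taken from the paper.
    """
    # HSR-LFR view: keep full spatial resolution, but only every
    # `temporal_factor`-th frame of the left view.
    hsr_lfr = left_video[::temporal_factor]

    # LSR-HFR view: keep every frame of the right view, but reduce the
    # spatial resolution by average pooling over non-overlapping blocks.
    t, h, w, c = right_video.shape
    hc = h - h % spatial_factor  # crop so the height divides evenly
    wc = w - w % spatial_factor  # crop so the width divides evenly
    cropped = right_video[:, :hc, :wc].astype(np.float64)
    blocks = cropped.reshape(
        t,
        hc // spatial_factor, spatial_factor,
        wc // spatial_factor, spatial_factor,
        c,
    )
    lsr_hfr = blocks.mean(axis=(2, 4))

    return hsr_lfr, lsr_hfr
```

For an 8-frame stereo pair of 8x8 RGB frames and the default factors, `hsr_lfr` has shape `(2, 8, 8, 3)` and `lsr_hfr` has shape `(8, 2, 2, 3)`. A fusion network such as LIFnet would then take both streams and aim to restore each view to the full `(8, 8, 8, 3)` resolution.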
