Paper Title
Fast Autofocusing using Tiny Transformer Networks for Digital Holographic Microscopy
Paper Authors
Paper Abstract
The numerical wavefront backpropagation principle of digital holography confers unique extended-focus capabilities, without mechanical displacements along the z-axis. However, determining the correct focusing distance is a non-trivial and time-consuming issue. A deep learning (DL) solution is proposed to cast autofocusing as a regression problem and is tested over both experimental and simulated holograms. Single-wavelength digital holograms were recorded by a Digital Holographic Microscope (DHM) with a 10$\mathrm{x}$ microscope objective from a patterned target moving in 3D over an axial range of 92 $\mu$m. Tiny DL models are proposed and compared, such as a tiny Vision Transformer (TViT), a tiny VGG16 (TVGG) and a tiny Swin-Transformer (TSwinT). The proposed tiny networks are compared with their original versions (ViT/B16, VGG16 and Swin-Transformer Tiny) and with the main neural networks used in digital holography, such as LeNet and AlexNet. The experiments show that the predicted focusing distance $Z_R^{\mathrm{Pred}}$ is inferred with an average accuracy of 1.2 $\mu$m, compared with the DHM depth of field of 15 $\mu$m. Numerical simulations show that all tiny models give $Z_R^{\mathrm{Pred}}$ with an error below 0.3 $\mu$m. Such a prospect would significantly improve the current capabilities of computer vision position sensing in applications such as 3D microscopy for life sciences or micro-robotics. Moreover, all models reach an inference time on CPU below 25 ms per inference. In terms of robustness to occlusions, TViT, owing to its Transformer architecture, is the most robust.
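As context for the abstract's "numerical wavefront backpropagation", digital holography typically refocuses with the angular spectrum method: the hologram's spectrum is multiplied by the free-space transfer function for a propagation distance z, so a negative z backpropagates the wavefront without any mechanical displacement. Below is a minimal NumPy sketch of that standard technique, not code from the paper; the 532 nm wavelength, 5.5 µm pixel pitch, and z value are illustrative assumptions.

```python
import numpy as np

def angular_spectrum_propagate(field, wavelength, pixel_pitch, z):
    """Propagate a complex wavefront by distance z (meters) with the
    angular spectrum method. Negative z backpropagates, which is how
    digital holography refocuses numerically."""
    ny, nx = field.shape
    fx = np.fft.fftfreq(nx, d=pixel_pitch)  # spatial frequencies (1/m)
    fy = np.fft.fftfreq(ny, d=pixel_pitch)
    FX, FY = np.meshgrid(fx, fy)
    # Free-space transfer function; evanescent components are suppressed.
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    kz = (2 * np.pi / wavelength) * np.sqrt(np.maximum(arg, 0.0))
    H = np.exp(1j * z * kz) * (arg > 0)
    return np.fft.ifft2(np.fft.fft2(field) * H)

# Example: refocus a toy hologram at a predicted focusing distance.
hologram = np.random.randn(256, 256) + 1j * np.random.randn(256, 256)
z_pred = -45e-6  # hypothetical prediction, mid-range of a 92 um axial span
refocused = angular_spectrum_propagate(hologram, 532e-9, 5.5e-6, z_pred)
```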
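Likewise, "casting autofocusing as a regression problem" amounts to training a network that maps a hologram to a single scalar focusing distance under an L2 loss, rather than classifying discrete focus bins. The PyTorch sketch below illustrates this formulation with a hypothetical tiny CNN; TinyAutofocusNet is a stand-in, not one of the paper's TViT/TVGG/TSwinT architectures, and only the 92 µm target range is taken from the abstract.

```python
import torch
import torch.nn as nn

class TinyAutofocusNet(nn.Module):
    """Minimal CNN regressor: hologram image in, focusing distance out."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling -> (B, 32, 1, 1)
        )
        self.head = nn.Linear(32, 1)          # regression output: z in um

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = TinyAutofocusNet()
holograms = torch.randn(8, 1, 224, 224)       # batch of hologram intensities
z_true = torch.empty(8, 1).uniform_(0, 92)    # labels over a 92 um axial range
loss = nn.functional.mse_loss(model(holograms), z_true)
loss.backward()                                # one regression training step
```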