Paper title
Data mining techniques on astronomical spectra data. II: Classification Analysis
Authors
Abstract
Classification is valuable and necessary in spectral analysis, especially for data-driven mining. With the rapid development of spectral surveys, a variety of classification techniques have been successfully applied to astronomical data processing. However, because algorithmic ideas and data characteristics differ, it is difficult to select an appropriate classification method in practical scenarios. Here, we present the second work in the data mining series: a review of spectral classification techniques. This work also consists of three parts: a systematic overview of the current literature, experimental analyses of commonly used classification algorithms, and the source code used in this paper. Firstly, we carefully investigate the current classification methods in the astronomical literature and organize them into ten types based on their algorithmic ideas. For each type of algorithm, the analysis is organized from three perspectives: (1) its current applications and usage frequency in spectral classification are summarized; (2) its basic ideas are introduced and preliminarily analysed; (3) its advantages and caveats are discussed. Secondly, the classification performance of different algorithms on unified data sets is analysed. Experimental data are selected from the LAMOST survey and SDSS survey. Six groups of spectral data sets are designed according to data characteristics, data quality, and data volume to examine the performance of these algorithms. The scores of nine basic algorithms are then shown and discussed in the experimental analysis. Finally, Python source code for the nine basic algorithms is provided, together with manuals for usage and improvement.