Paper title
Extreme-K categorical samples problem
Paper authors
Abstract
With histograms as its foundation, we develop Categorical Exploratory Data Analysis (CEDA) under the extreme-$K$ sample problem, and illustrate its universal applicability through four 1D categorical datasets. Given a sizable $K$, CEDA's ultimate goal amounts to discovering the data's information content by carrying out two data-driven computational tasks: 1) establish a tree geometry upon the $K$ populations as a platform for discovering a wide spectrum of patterns among populations; 2) evaluate each geometric pattern's reliability. In developing CEDA, each population gives rise to a row vector of category proportions. Along the row axis of the resulting data matrix, we discuss the pros and cons of Euclidean distance against its weighted version for building a binary clustering tree geometry. The choice between the two rests on the degree of uniformity within the column blocks framed by this binary clustering tree. Each tree leaf (population) is then encoded with a binary code sequence, and so is each tree-based pattern. To evaluate reliability, we adopt row-wise multinomial randomness to generate an ensemble of matrix mimicries, and hence an ensemble of mimicked binary trees. The reliability of any observed pattern is its recurrence rate within the tree ensemble. A high reliability value indicates a deterministic pattern. Our four applications of CEDA illuminate four significant aspects of extreme-$K$ sample problems.
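The two computational tasks described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: plain (unweighted) Euclidean distance, complete linkage for the binary clustering tree, a "pattern" read as a clade (the set of populations under one internal node of the tree), and the toy count matrix. The weighted distance and the binary coding of leaves are not attempted, since the abstract does not specify them. All function names (`proportions`, `tree_clades`, `mimic`, `reliability`) are hypothetical.

```python
import math
import random
from collections import Counter


def proportions(counts):
    """Turn each row of category counts into a row of category proportions."""
    return [[c / sum(row) for c in row] for row in counts]


def euclidean(p, q):
    """Plain Euclidean distance between two proportion vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def tree_clades(rows):
    """Build a binary clustering tree by agglomerative merging.

    Complete linkage is an assumption made for this sketch. The tree is
    returned as the set of its clades (frozensets of population indices),
    one clade per internal node.
    """
    clusters = [frozenset([i]) for i in range(len(rows))]
    clades = set()
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: largest pairwise distance between clusters.
                d = max(euclidean(rows[a], rows[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] | clusters[j]
        clades.add(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clades


def mimic(counts, rng):
    """Redraw every row from a multinomial with its observed size and proportions."""
    mimicked = []
    for row in counts:
        draws = Counter(rng.choices(range(len(row)), weights=row, k=sum(row)))
        mimicked.append([draws.get(c, 0) for c in range(len(row))])
    return mimicked


def reliability(counts, n_mimics=200, seed=0):
    """Recurrence rate of each observed clade across the mimicked trees."""
    rng = random.Random(seed)
    observed = tree_clades(proportions(counts))
    hits = Counter()
    for _ in range(n_mimics):
        mimicked_tree = tree_clades(proportions(mimic(counts, rng)))
        hits.update(observed & mimicked_tree)
    return {clade: hits[clade] / n_mimics for clade in observed}


# Toy data (made up): 4 populations (rows) by 3 categories (columns).
toy_counts = [[900, 50, 50], [880, 60, 60], [50, 900, 50], [60, 880, 60]]
for clade, rate in sorted(reliability(toy_counts, n_mimics=50).items(),
                          key=lambda item: sorted(item[0])):
    print(sorted(clade), rate)
```

In this reading, a clade whose recurrence rate is close to 1 would count as a reliable pattern, while a clade that appears in only a small fraction of the mimicked trees would not. The pairwise search over clusters is cubic or worse in $K$, so for a genuinely large $K$ a dedicated hierarchical clustering routine would be the more practical choice.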