Title
Nowcasting the Financial Time Series with Streaming Data Analytics under Apache Spark
Authors
Abstract
This paper proposes nowcasting high-frequency financial datasets in real time at 5-minute intervals using the streaming analytics feature of Apache Spark. The proposed two-stage method models chaos in the first stage and, in the second stage, uses a sliding-window approach to train machine learning algorithms available in Apache Spark's MLlib, namely Lasso Regression, Ridge Regression, Generalised Linear Model, Gradient Boosting Tree, and Random Forest. To test the effectiveness of the proposed methodology, we used three datasets: two from stock markets, namely the National Stock Exchange and the Bombay Stock Exchange, and one Bitcoin-INR conversion dataset. To evaluate the methodology, we used metrics such as Symmetric Mean Absolute Percentage Error, Directional Symmetry, and Theil's U coefficient. We tested the statistical significance of the difference in forecast accuracy between each pair of models using the Diebold-Mariano (DM) test.
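As an illustration of the sliding-window step and the evaluation metrics named in the abstract, the following is a minimal sketch in plain Python, not Spark MLlib code. The abstract gives no formulas, so the function names (`sliding_windows`, `smape`, `directional_symmetry`, `theil_u1`), the `lag` parameter, and the specific metric variants used here are assumptions based on commonly used textbook definitions; the paper's own definitions may differ.

```python
import math
from typing import List, Sequence, Tuple


def sliding_windows(series: Sequence[float], lag: int) -> Tuple[List[List[float]], List[float]]:
    """Turn a univariate series into (features, target) pairs.

    Each feature row holds `lag` consecutive observations; the target is
    the observation that immediately follows that row.
    """
    if lag < 1 or lag >= len(series):
        raise ValueError("lag must satisfy 1 <= lag < len(series)")
    n_rows = len(series) - lag
    features = [list(series[i:i + lag]) for i in range(n_rows)]
    targets = [series[i + lag] for i in range(n_rows)]
    return features, targets


def smape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Symmetric MAPE in percent (assumed variant: |a - p| / ((|a| + |p|) / 2))."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        if denom > 0:  # a pair of zeros contributes no error
            total += abs(a - p) / denom
    return 100.0 * total / len(actual)


def directional_symmetry(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Percentage of steps where actual and predicted move in the same direction.

    Assumed convention: a zero product of the two changes counts as a hit.
    """
    hits = sum(
        1
        for t in range(1, len(actual))
        if (actual[t] - actual[t - 1]) * (predicted[t] - predicted[t - 1]) >= 0
    )
    return 100.0 * hits / (len(actual) - 1)


def theil_u1(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Theil's U1 inequality coefficient: 0 for a perfect forecast, at most 1."""
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    norm_a = math.sqrt(sum(a ** 2 for a in actual) / n)
    norm_p = math.sqrt(sum(p ** 2 for p in predicted) / n)
    return rmse / (norm_a + norm_p)


if __name__ == "__main__":
    # Hypothetical 5-minute closing prices, for illustration only.
    prices = [100.0, 101.5, 101.0, 102.2, 103.0, 102.8]
    X, y = sliding_windows(prices, lag=3)
    # Naive baseline: predict the last value in each window.
    baseline = [row[-1] for row in X]
    print(smape(y, baseline), directional_symmetry(y, baseline), theil_u1(y, baseline))
```

In a streaming setting, the window rows would be rebuilt as each new 5-minute observation arrives and passed to whichever regressor is being trained; the naive last-value baseline above only stands in for that model.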