Acta Petrolei Sinica, 2018, Vol. 39, Issue (2): 240-246. DOI: 10.7623/syxb201802013

• SPECIAL CONTRIBUTION •

Optimization of common data mining algorithms for petroleum exploration and development

Li Dawei, Shi Guangren   

  1. PetroChina Research Institute of Petroleum Exploration and Development, Beijing 100083, China
  • Received: 2017-01-12; Revised: 2018-01-17; Online: 2018-02-25; Published: 2018-03-09

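  • Corresponding author and author biography: Li Dawei, male, born May 1969; received a bachelor's degree from China University of Geosciences (Wuhan) in 1991 and a Ph.D. from China University of Geosciences (Beijing) in 1996; currently a senior engineer at the PetroChina Research Institute of Petroleum Exploration and Development, working mainly on informatization, applications, and data mining for overseas exploration and development. Email: leedw@petrochina.com.cn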
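  • Funding:
    Supported by the National Science and Technology Major Project "Research on Global Oil and Gas Resource Assessment and Selection of Zones and Plays" (2016ZX05029).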

Abstract: In the era of big data, the petroleum industry needs to fully exploit the large potential value of its data. Although data mining has achieved remarkable results in many industries, its application to hydrocarbon exploration and development is still at an early stage, mainly because the data and their applications in this field have their own particularities. Common data mining algorithms can be divided into regression, classification, clustering, estimation, prediction, association analysis, and so on. Of these, regression and classification are the most mature and most widely used. However, each regression or classification algorithm suits some research objects, research questions, and data sources better than others, so the algorithm best suited to the data set must be selected for each specific problem. Taking oil test data from the Tahe oilfield as an example, with formation factor and reservoir classification as the mining targets, the applicability of common regression and classification algorithms is analyzed in detail. The results show that, for common petroleum industry data and study objects: (1) the best regression algorithm is the back-propagation neural network (BPNN), followed by support vector machine regression (R-SVM) and multivariate regression analysis (MRA); (2) the best classification algorithm is support vector machine classification (C-SVM), followed by Bayesian stepwise discrimination (BAYSD); (3) MRA and BAYSD can also be used for data dimensionality reduction, with BAYSD performing better; (4) R-type cluster analysis (RCA) can also be used for data dimensionality reduction, while Q-type cluster analysis (QCA) can be used for sample reduction; (5) in any specific data mining application, the algorithm must be selected according to the specific data set.

Key words: big data, data mining, regression, classification, data cleaning, optimization, formation factor, oil layer classification
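
The selection step the abstract argues for (fit several candidate algorithms on the same data set, compare their errors on held-out samples, and keep the best one) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the data are synthetic rather than Tahe oil test data, and the two candidates (a one-variable least-squares line and a k-nearest-neighbour regressor) are simple stand-ins for MRA, R-SVM, and BPNN. All function names are hypothetical.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def fit_linear(xs, ys):
    """One-variable least squares (stand-in for a linear method such as MRA)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=3):
    """k-nearest-neighbour regression (stand-in for a nonlinear method)."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return mean([y for _, y in nearest])

    return predict


def mean_abs_rel_error(predict, xs, ys):
    """Mean absolute relative error of a fitted model on (xs, ys)."""
    return mean([abs(predict(x) - y) / abs(y) for x, y in zip(xs, ys)])


def select_algorithm(candidates, train, test):
    """Fit every candidate on `train`, score on `test`, return the best name."""
    scores = {
        name: mean_abs_rel_error(fit(*train), *test)
        for name, fit in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic, deliberately nonlinear data (assumption: y = 2 + x^2 + noise).
rng = random.Random(0)
xs = [rng.uniform(0.0, 5.0) for _ in range(200)]
ys = [2.0 + x * x + rng.gauss(0.0, 0.1) for x in xs]
train = (xs[:150], ys[:150])
test = (xs[150:], ys[150:])

candidates = {"linear": fit_linear, "knn": fit_knn}
best, scores = select_algorithm(candidates, train, test)
print(best, scores)
```

Because the synthetic relationship is nonlinear, the nonlinear candidate scores better here; on a different data set the ranking could reverse, which is the reason for repeating the comparison per data set.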

