Practice: Predicting Flight Delays with Python + Scikit-learn
This post follows the steps of the original blog article. Since PIG is not installed on my system, the training and test data were not generated the way the article describes; Spark was used instead. Environment: JDK 1.7, Spark 1.2.0, Scala 2.10.4, Python 2.7. Python is best installed through an integrated distribution such as Anaconda, which ships most of the extension packages.
1. Install pydoop
The pydoop library provides HDFS access from Python. Download it, unpack the archive, and run the following from the package's root directory:
python setup.py build
python setup.py install --skip-build
2. Generate feature data from the raw data
Spark is used here to generate the feature data; see the previous post for installing the joda-time package. Running the following code directly in IntelliJ IDEA writes the generated data to HDFS.
```scala
import org.apache.spark.rdd._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import au.com.bytecode.opencsv.CSVReader
import java.io._
import org.joda.time._
import org.joda.time.format._

case class DelayRec(year: String, month: String, dayOfMonth: String, dayOfWeek: String,
                    crsDepTime: String, depDelay: String, origin: String,
                    distance: String, cancelled: String) {

  val holidays = List(
    "01/01/2007", "01/15/2007", "02/19/2007", "05/28/2007", "06/07/2007", "07/04/2007",
    "09/03/2007", "10/08/2007", "11/11/2007", "11/22/2007", "12/25/2007",
    "01/01/2008", "01/21/2008", "02/18/2008", "05/22/2008", "05/26/2008", "07/04/2008",
    "09/01/2008", "10/13/2008", "11/11/2008", "11/27/2008", "12/25/2008")

  def gen_features: String = {
    "%s,%s,%s,%s,%s,%s,%d".format(depDelay, month, dayOfMonth, dayOfWeek,
      get_hour(crsDepTime), distance,
      days_from_nearest_holiday(year.toInt, month.toInt, dayOfMonth.toInt))
  }

  def get_hour(depTime: String): String = "%04d".format(depTime.toInt).take(2)

  def to_date(year: Int, month: Int, day: Int) = "%04d%02d%02d".format(year, month, day)

  def days_from_nearest_holiday(year: Int, month: Int, day: Int): Int = {
    val sampleDate = new DateTime(year, month, day, 0, 0)
    holidays.foldLeft(3000) { (r, c) =>
      val holiday = DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime(c)
      val distance = Math.abs(Days.daysBetween(holiday, sampleDate).getDays)
      math.min(r, distance)
    }
  }
}

object MyApp {
  // Preprocessing step: parse one raw CSV file into a filtered RDD of DelayRec records
  def prepFlightDelays(sc: SparkContext, infile: String): RDD[DelayRec] = {
    val data = sc.textFile(infile)
    data.map { line =>
      val reader = new CSVReader(new StringReader(line))
      reader.readAll().asScala.toList.map(rec =>
        DelayRec(rec(0), rec(1), rec(2), rec(3), rec(5), rec(15), rec(16), rec(18), rec(21)))
    }.map(list => list(0))
      .filter(rec => rec.year != "Year")    // drop the CSV header row
      .filter(rec => rec.cancelled == "0")  // keep only non-cancelled flights
      .filter(rec => rec.origin == "ORD")   // keep only flights originating from ORD
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MyApp")
      .setMaster("local")
      .set("spark.executor.memory", "600m")
    val sc = new SparkContext(conf)
    prepFlightDelays(sc, "hdfs://node1:9000/airline/delay/2007.csv")
      .map(rec => rec.gen_features)
      .saveAsTextFile("hdfs://node1:9000/airline/delay/ord_2007_1")
    prepFlightDelays(sc, "hdfs://node1:9000/airline/delay/2008.csv")
      .map(rec => rec.gen_features)
      .saveAsTextFile("hdfs://node1:9000/airline/delay/ord_2008_1")
    sc.stop()
  }
}
```
3. Launch Spyder, paste the following code into a new .py file, run it, and inspect the results. The code uses Scikit-learn's logistic regression and random forest algorithms for classification.
```python
# Python library imports: numpy, random, sklearn, pandas, etc.
import warnings
warnings.filterwarnings('ignore')
import sys
import random
import numpy as np
from sklearn import linear_model, cross_validation, metrics, svm
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
import pydoop.hdfs as hdfs

# function to read an HDFS directory of CSV part-files into a DataFrame using PyDoop
def read_csv_from_hdfs(path, cols, col_types=None):
    files = hdfs.ls(path)
    pieces = []
    for f in files:
        pieces.append(pd.read_csv(hdfs.open(f), names=cols, dtype=col_types))
    return pd.concat(pieces, ignore_index=True)

# read the feature files generated by the Spark job
cols = ['delay', 'month', 'day', 'dow', 'hour', 'distance', 'days_from_holiday']
col_types = {'delay': int, 'month': int, 'day': int, 'dow': int,
             'hour': int, 'distance': int, 'days_from_holiday': int}
data_2007 = read_csv_from_hdfs('hdfs://node1:9000/airline/delay/ord_2007_1', cols, col_types)
data_2008 = read_csv_from_hdfs('hdfs://node1:9000/airline/delay/ord_2008_1', cols, col_types)

# A flight counts as "delayed" if its departure delay is 15 minutes or more
data_2007['DepDelayed'] = data_2007['delay'].apply(lambda x: x >= 15)
print "total flights: " + str(data_2007.shape[0])
print "total delays: " + str(data_2007['DepDelayed'].sum())

# Compute and plot the fraction of delayed ORD flights per month
grouped = data_2007[['DepDelayed', 'month']].groupby('month').mean()
grouped.plot(kind='bar')

# Compute and plot the fraction of delayed flights by hour of day
grouped = data_2007[['DepDelayed', 'hour']].groupby('hour').mean()
grouped.plot(kind='bar')

# Create training set (2007 data) and test set (2008 data)
cols = ['month', 'day', 'dow', 'hour', 'distance', 'days_from_holiday']
train_y = data_2007['delay'] >= 15
train_x = data_2007[cols]
test_y = data_2008['delay'] >= 15
test_x = data_2008[cols]

# Create logistic regression model with L2 regularization
clf_lr = linear_model.LogisticRegression(penalty='l2', class_weight='auto')
clf_lr.fit(train_x, train_y)

# Predict output labels on the test set
pr = clf_lr.predict(test_x)

# display evaluation metrics
cm = confusion_matrix(test_y, pr)
print("Confusion matrix")
print(pd.DataFrame(cm))
report_lr = precision_recall_fscore_support(list(test_y), list(pr), average='micro')
print "\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\n" % \
    (report_lr[0], report_lr[1], report_lr[2], accuracy_score(list(test_y), list(pr)))

# Create Random Forest classifier with 50 trees
clf_rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
clf_rf.fit(train_x, train_y)

# Evaluate on the test set
pr = clf_rf.predict(test_x)

# print results
cm = confusion_matrix(test_y, pr)
print("Confusion matrix")
print(pd.DataFrame(cm))
report_rf = precision_recall_fscore_support(list(test_y), list(pr), average='micro')
print "\nprecision = %0.2f, recall = %0.2f, F1 = %0.2f, accuracy = %0.2f\n" % \
    (report_rf[0], report_rf[1], report_rf[2], accuracy_score(list(test_y), list(pr)))
```
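Beyond accuracy, the random forest also exposes feature_importances_, which indicates how much each of the six features contributed to the trees' splits. A small sketch on synthetic data (the arrays and the delay rule here are made up for illustration, not the actual flight data; with real data you would pass train_x and train_y as above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
cols = ['month', 'day', 'dow', 'hour', 'distance', 'days_from_holiday']

# Synthetic stand-in for the flight features: the label depends only on 'hour'
X = rng.randint(0, 24, size=(500, len(cols)))
y = X[:, 3] >= 17  # pretend evening departures are usually delayed

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)

# Importances sum to 1; higher means the feature mattered more for the splits
for name, imp in sorted(zip(cols, clf.feature_importances_), key=lambda t: -t[1]):
    print("%-18s %.3f" % (name, imp))
```

On the synthetic data 'hour' dominates by construction; on the real 2007 training set the ranking would show which features actually drive ORD delays.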