Kaggle Code: House Prices: Advanced Regression Techniques (Regression)
In [1]:
#invite people for the Kaggle party
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
In [2]:
#bring in the six packs
df_train = pd.read_csv('../input/train.csv')
In [3]:
#check the decoration
df_train.columns
Out[3]:
In [4]:
#descriptive statistics summary
df_train['SalePrice'].describe()
Out[4]:
In [5]:
#histogram
sns.distplot(df_train['SalePrice']);
In [6]:
#skewness and kurtosis
print("Skewness: %f" % df_train['SalePrice'].skew())
print("Kurtosis: %f" % df_train['SalePrice'].kurt())
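The skew() and kurt() calls above quantify the long right tail visible in the histogram: positive skewness means the tail stretches toward higher prices. A toy check on synthetic, deliberately right-skewed data (not the competition file) shows the same signature:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(np.exp(rng.normal(size=5000)))  # lognormal: heavy right tail

print("Skewness: %f" % s.skew())  # clearly positive for a right-skewed sample
print("Kurtosis: %f" % s.kurt())  # large positive excess kurtosis: heavy tails
```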
In [7]:
#scatter plot grlivarea/saleprice
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
In [8]:
#scatter plot totalbsmtsf/saleprice
var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
In [9]:
#box plot overallqual/saleprice
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
In [10]:
var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);
In [11]:
#correlation matrix
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);
In [12]:
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
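The nlargest idiom above selects the k features most correlated with SalePrice (the target itself always ranks first, with correlation 1.0). A minimal sketch of the same selection on synthetic data, with hypothetical column names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    'y': x,                                            # the "target"
    'strong': x + rng.normal(scale=0.1, size=200),     # highly correlated
    'weak': rng.normal(size=200),                      # unrelated noise
})

# rank rows of the correlation matrix by their correlation with 'y'
top2 = df.corr().nlargest(2, 'y')['y'].index
print(list(top2))  # 'y' first, then 'strong'
```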
In [13]:
#scatterplot
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], size = 2.5)
plt.show();
In [14]:
#missing data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
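The Percent column in this table is simply the null count divided by the row count, per feature. The same pattern on a tiny hand-made frame (hypothetical columns, not the competition data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, np.nan], 'b': [1, 2, 3, 4]})

total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum() / df.isnull().count()).sort_values(ascending=False)
missing = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing)  # 'a' has 2 nulls out of 4 rows -> Percent 0.5; 'b' has none
```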
Out[14]:
In [15]:
#dealing with missing data
df_train = df_train.drop((missing_data[missing_data['Total'] > 1]).index, axis=1)
df_train = df_train.drop(df_train.loc[df_train['Electrical'].isnull()].index)
df_train.isnull().sum().max() #just checking that there's no missing data missing...
Out[15]:
In [16]:
#standardizing data
saleprice_scaled = StandardScaler().fit_transform(df_train['SalePrice'][:, np.newaxis])
low_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][:10]
high_range = saleprice_scaled[saleprice_scaled[:,0].argsort()][-10:]
print('outer range (low) of the distribution:')
print(low_range)
print('\nouter range (high) of the distribution:')
print(high_range)
In [17]:
#bivariate analysis saleprice/grlivarea
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
In [18]:
#deleting points
df_train.sort_values(by='GrLivArea', ascending=False)[:2]
df_train = df_train.drop(df_train[df_train['Id'] == 1299].index)
df_train = df_train.drop(df_train[df_train['Id'] == 524].index)
In [19]:
#bivariate analysis saleprice/totalbsmtsf
var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0,800000));
In [20]:
#histogram and normal probability plot
sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
In [21]:
#applying log transformation
df_train['SalePrice'] = np.log(df_train['SalePrice'])
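One practical consequence of training on a log-transformed target: any prediction the model produces lives in log space and must be inverted with np.exp before scoring or submission. A minimal round-trip sketch with made-up prices:

```python
import numpy as np

prices = np.array([100000.0, 200000.0, 450000.0])
log_prices = np.log(prices)    # forward transform, as in the cell above
recovered = np.exp(log_prices) # inverse transform for predictions

print(np.allclose(recovered, prices))
# note: np.log1p / np.expm1 are the zero-safe pair when a value can be 0
```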
In [22]:
#transformed histogram and normal probability plot
sns.distplot(df_train['SalePrice'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
In [23]:
#histogram and normal probability plot
sns.distplot(df_train['GrLivArea'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['GrLivArea'], plot=plt)
In [24]:
#data transformation
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])
In [25]:
#transformed histogram and normal probability plot
sns.distplot(df_train['GrLivArea'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['GrLivArea'], plot=plt)
In [26]:
#histogram and normal probability plot
sns.distplot(df_train['TotalBsmtSF'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train['TotalBsmtSF'], plot=plt)
In [27]:
#create column for new variable (one is enough because it's a binary categorical feature)
#if area>0 it gets 1, for area==0 it gets 0
df_train['HasBsmt'] = pd.Series(len(df_train['TotalBsmtSF']), index=df_train.index)
df_train['HasBsmt'] = 0
df_train.loc[df_train['TotalBsmtSF']>0, 'HasBsmt'] = 1
In [28]:
#transform data
df_train.loc[df_train['HasBsmt']==1, 'TotalBsmtSF'] = np.log(df_train['TotalBsmtSF'])
In [29]:
#histogram and normal probability plot
sns.distplot(df_train[df_train['TotalBsmtSF']>0]['TotalBsmtSF'], fit=norm);
fig = plt.figure()
res = stats.probplot(df_train[df_train['TotalBsmtSF']>0]['TotalBsmtSF'], plot=plt)
In [30]:
#scatter plot
plt.scatter(df_train['GrLivArea'], df_train['SalePrice']);
In [31]:
#scatter plot
plt.scatter(df_train[df_train['TotalBsmtSF']>0]['TotalBsmtSF'], df_train[df_train['TotalBsmtSF']>0]['SalePrice']);
In [32]:
#convert categorical variable into dummy
df_train = pd.get_dummies(df_train)
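A caveat with pd.get_dummies: if the test set is encoded in a separate call, a category absent from one frame produces mismatched columns. One common fix (not part of the original notebook) is reindexing the test dummies to the training columns. A minimal sketch with a hypothetical 'Street' feature:

```python
import pandas as pd

train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
test = pd.DataFrame({'Street': ['Pave', 'Pave']})  # 'Grvl' never appears

train_d = pd.get_dummies(train)
# align test to train's dummy columns; missing categories become all-zero
test_d = pd.get_dummies(test).reindex(columns=train_d.columns, fill_value=0)

print(list(test_d.columns))  # identical to train's dummy columns
```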