Machine Learning Methods in Regression

In [2]:
#Import some standard libraries for data analysis, visualization and machine learning

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
%matplotlib inline

from scipy.special import boxcox1p

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error, recall_score
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC, LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor, RandomForestClassifier
from sklearn.kernel_ridge import KernelRidge
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone

import xgboost as xgb
import lightgbm as lgb

from bayes_opt import BayesianOptimization

<a id = "intro"></a>

1. Introduction

The goal of a regression problem is to predict the value of some continuous random variable $Y$, called the target variable, given the values of some input random variables $X_i$, which we call the feature variables. While the target variable must be continuous, the features can be continuous, discrete, or a mix of the two. In this notebook the target variable we are interested in will be housing prices in Ames, Iowa, and our features will be extracted from 79 possibilities.

We will first do some exploratory data analysis on these features, to determine the shape of their distributions and how they interact with the target sales price variable. Then we will do some feature extraction and selection, to determine which variables to feed into our models. Next we fit our regression models to our chosen features. Finally, we use stacking methods to combine our models into a final "megazord" model.

<a id = "EDA"></a>

2. Exploratory Data Analysis

In this section we perform some exploratory data analysis. First we look for any null values and outliers. Then we examine the distributions of our target variable and our features, and look at how the target variable and features interact. Throughout this section we also try to look at whether there are any noticable differences in the training set and the testing set.

First, let's read in the data and take a quick look at the first few rows.

In [70]:
train = pd.read_csv(r"C:\Users\ASUS\Documents\Data\kaggle\HousePrices\train.csv")
test = pd.read_csv(r"C:\Users\ASUS\Documents\Data\kaggle\HousePrices\test.csv")
In [85]:
train_id = train['Id']
test_id = test['Id']
n_train = train.shape[0]
n_test = test.shape[0]
data = pd.concat((train.drop(['Id', 'SalePrice'], axis = 1), test.drop('Id', axis = 1))).reset_index(drop=True)

y = train[['SalePrice']]
In [72]:
data.head().transpose()
Out[72]:
0 1 2 3 4
MSSubClass 60 20 60 70 60
MSZoning RL RL RL RL RL
LotFrontage 65 80 68 60 84
LotArea 8450 9600 11250 9550 14260
Street Pave Pave Pave Pave Pave
Alley NaN NaN NaN NaN NaN
LotShape Reg Reg IR1 IR1 IR1
LandContour Lvl Lvl Lvl Lvl Lvl
Utilities AllPub AllPub AllPub AllPub AllPub
LotConfig Inside FR2 Inside Corner FR2
LandSlope Gtl Gtl Gtl Gtl Gtl
Neighborhood CollgCr Veenker CollgCr Crawfor NoRidge
Condition1 Norm Feedr Norm Norm Norm
Condition2 Norm Norm Norm Norm Norm
BldgType 1Fam 1Fam 1Fam 1Fam 1Fam
HouseStyle 2Story 1Story 2Story 2Story 2Story
OverallQual 7 6 7 7 8
OverallCond 5 8 5 5 5
YearBuilt 2003 1976 2001 1915 2000
YearRemodAdd 2003 1976 2002 1970 2000
RoofStyle Gable Gable Gable Gable Gable
RoofMatl CompShg CompShg CompShg CompShg CompShg
Exterior1st VinylSd MetalSd VinylSd Wd Sdng VinylSd
Exterior2nd VinylSd MetalSd VinylSd Wd Shng VinylSd
MasVnrType BrkFace None BrkFace None BrkFace
MasVnrArea 196 0 162 0 350
ExterQual Gd TA Gd TA Gd
ExterCond TA TA TA TA TA
Foundation PConc CBlock PConc BrkTil PConc
BsmtQual Gd Gd Gd TA Gd
... ... ... ... ... ...
HalfBath 1 0 1 0 1
BedroomAbvGr 3 3 3 3 4
KitchenAbvGr 1 1 1 1 1
KitchenQual Gd TA Gd Gd Gd
TotRmsAbvGrd 8 6 6 7 9
Functional Typ Typ Typ Typ Typ
Fireplaces 0 1 1 1 1
FireplaceQu NaN TA TA Gd TA
GarageType Attchd Attchd Attchd Detchd Attchd
GarageYrBlt 2003 1976 2001 1998 2000
GarageFinish RFn RFn RFn Unf RFn
GarageCars 2 2 2 3 3
GarageArea 548 460 608 642 836
GarageQual TA TA TA TA TA
GarageCond TA TA TA TA TA
PavedDrive Y Y Y Y Y
WoodDeckSF 0 298 0 0 192
OpenPorchSF 61 0 42 35 84
EnclosedPorch 0 0 0 272 0
3SsnPorch 0 0 0 0 0
ScreenPorch 0 0 0 0 0
PoolArea 0 0 0 0 0
PoolQC NaN NaN NaN NaN NaN
Fence NaN NaN NaN NaN NaN
MiscFeature NaN NaN NaN NaN NaN
MiscVal 0 0 0 0 0
MoSold 2 5 9 2 12
YrSold 2008 2007 2008 2006 2008
SaleType WD WD WD WD WD
SaleCondition Normal Normal Normal Abnorml Normal

79 rows × 5 columns

We can already see some NaN's. Let's get a better look at the breakdown of null values.

In [75]:
train_null = data[:n_train].isnull().sum().sort_values(ascending = False)
test_null = data[n_train:].isnull().sum().sort_values(ascending = False)


f, ax = plt.subplots(figsize = (15,20), ncols = 2, nrows = 1)
sns.barplot(y= train_null[train_null >0].index, x = train_null[train_null >0], orient = 'h', ax = ax[0], palette="RdYlGn")
ax[0].set(title = 'Nulls in the training set')
sns.barplot(y= test_null[test_null >0].index, x = test_null[test_null >0], orient = 'h', ax = ax[1], palette="RdYlGn")
ax[1].set(title = 'Nulls in the testing set')
Out[75]:
[Text(0.5,1,'Nulls in the testing set')]

The breakdown looks pretty similar between the test set and training set, except there are a bunch of columns in the test set with just a couple of null values, which won't have a substantial effect on our predictions. We will consider the problem of replacing the null values in the modeling stage.

Next we take a look at the distribution of the target value.

In [125]:
f, (axs1, axs2) = plt.subplots(figsize = (15,5), ncols = 2)
sns.distplot(y, ax = axs1, color = sns.xkcd_rgb['faded green'])
axs1.set_title('Histogram of Sale Price')

sns.violinplot(y, ax = axs2, color = sns.xkcd_rgb['faded green'])
axs2.set_title('Violin Plot of Sale Price')
Out[125]:
Text(0.5,1,'Violin Plot of Sale Price')

We see that the SalePrice distribution has a heavy positive skew. This will negatively affect some of our models. To counteract this, we apply a log transform to y.

In [91]:
log_y = np.log(y)
In [127]:
f, (axs1, axs2) = plt.subplots(figsize = (15,5), ncols = 2, nrows = 1)
sns.distplot(log_y, ax = axs1,color = sns.xkcd_rgb['faded green'])
axs1.set(title = 'Histogram of Sale Price', yticks = [])

sns.violinplot(log_y, ax = axs2, color = sns.xkcd_rgb['faded green'])
axs2.set_title('Violin Plot of Sale Price')
Out[127]:
Text(0.5,1,'Violin Plot of Sale Price')

That looks much more normal. Next we will get a look at the distribution of our feature and see how they interact with SalePrice. We separate the continuous, nominal and categorical variables and visualize them in different ways

In [11]:
data.dtypes.value_counts()
Out[11]:
object     43
int64      25
float64    11
dtype: int64
In [130]:
continuous = [x for x in data if data[x].dtypes in ['float64', 'int64'] and data[x].value_counts().shape[0] > 20]

for x in continuous:
    
    f, ax = plt.subplots(figsize = (20, 5))
    sns.regplot(x = x, y = 'SalePrice', color = sns.xkcd_rgb['faded green'],
                data = pd.concat([data[:n_train][x], log_y], axis = 1).dropna(), ax=ax)
    ax.set(ylabel = 'Log of Sale Price')
    
    f, ax = plt.subplots(figsize = (20, 5), ncols = 2)
    sns.distplot(data[:n_train][x].dropna(), color = sns.xkcd_rgb['faded green'], ax = ax[0])
    ax[0].set(title = 'Training Set Distribution')
    sns.distplot(data[n_train:][x].dropna(), ax = ax[1])
    ax[1].set(title = 'Testing Set Distribution')
C:\Users\ASUS\Anaconda3\lib\site-packages\matplotlib\pyplot.py:523: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

There are a few thing we notice from these plots.

  1. Some of the variables are time-based variables as opposed to continuous variables. We separate these into their own category.

  2. There are many variables which are mostly 0. We create a new categorical variable for these which tell us whether the value is 0 or 1.

  3. Several variables are skewed. We apply a box plot transform to them to alleviate this defect.

In [107]:
temporal = ['YearBuilt',
            'YearRemodAdd',
            'GarageYrBlt',
            ]
continuous = [x for x in continuous if x not in temporal]

skewed = (data[continuous] == 0).sum()[(data[continuous] == 0).sum() > 500]
Out[107]:
MasVnrArea       1738
BsmtFinSF1        929
BsmtFinSF2       2571
2ndFlrSF         1668
LowQualFinSF     2879
WoodDeckSF       1523
OpenPorchSF      1298
EnclosedPorch    2460
3SsnPorch        2882
ScreenPorch      2663
MiscVal          2816
dtype: int64
In [99]:
normalized = pd.DataFrame()
for feat in continuous:
    normalized[feat] = boxcox1p(data[feat].dropna(), .15)
In [131]:
for x in normalized:
    
    f, ax = plt.subplots(figsize = (20, 5))
    sns.regplot(x = x, y = 'SalePrice', color = sns.xkcd_rgb['faded green'],
                data = pd.concat([normalized[:n_train][normalized[:n_train] >0 ][x], log_y], axis = 1).dropna())
    ax.set(ylabel = 'Log of Sale Price')
    
    f, ax = plt.subplots(figsize = (20, 5), ncols = 2)
    sns.distplot(normalized[:n_train][normalized[:n_train] >0 ][x].dropna(), color = sns.xkcd_rgb['faded green'], ax = ax[0])
    sns.distplot(normalized[:n_train][normalized[:n_train] >0 ][x].dropna(), ax = ax[1])
C:\Users\ASUS\Anaconda3\lib\site-packages\matplotlib\pyplot.py:523: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

With those changes, we can see there is a linear relationship between our target variable and some of our features. Other relationships can be inferred from the following heatmap of correlations.

In [120]:
f, ax = plt.subplots(figsize = (20, 20))
corr = pd.concat((data[continuous][:n_train], y), axis = 1).dropna().corr()
sns.heatmap(corr,annot=True, cmap = "RdYlGn")
Out[120]:
<matplotlib.axes._subplots.AxesSubplot at 0x22266ac7358>
In [146]:
for i in range(len(temporal)):
    f, ax = plt.subplots(figsize = (20,3))
    ax = sns.pointplot(data[temporal].iloc[:n_train,i], y = log_y, fit_reg = False, color = sns.xkcd_rgb['rust orange'])
    ax.set(ylabel = 'Log of Sale Price')
    labels = ax.get_xticklabels()
    for i,l in enumerate(labels):
        if(i%5 != 0): labels[i] = '' 
    ax.set_xticklabels(labels, rotation=30)
In [1]:
ordinal = [x for x in data if data[x].dtypes in ['float64', 'int64'] and data[x].value_counts().shape[0] <= 20]

for x in ordinal:
    
    f, ax = plt.subplots(figsize = (20, 5))
    sns.pointplot(x = x, y = 'SalePrice', 
                data = pd.concat([data[:n_train][x], log_y], axis = 1).dropna())
    ax.set(ylabel = 'Log of Sale Price')
    
    f, ax = plt.subplots(figsize = (20, 5), ncols = 2)
    sns.countplot(data[:n_train][x].dropna(), color = sns.xkcd_rgb['faded green'], ax = ax[0])
    sns.countplot(data[n_train:][x].dropna(), ax = ax[1], color = sns.xkcd_rgb['faded blue'])
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-b90744f3480e> in <module>()
----> 1 ordinal = [x for x in data if data[x].dtypes in ['float64', 'int64'] and data[x].value_counts().shape[0] <= 20]
      2 
      3 for x in ordinal:
      4 
      5     f, ax = plt.subplots(figsize = (20, 5))

NameError: name 'data' is not defined
In [349]:
f, ax = plt.subplots(figsize = (20, 20))
corr = pd.concat((data[nominal].iloc[:n_train,:], y), axis = 1).corr()
sns.heatmap(corr,annot=True, cmap = "RdYlGn")
Out[349]:
<matplotlib.axes._subplots.AxesSubplot at 0x2bd38d10ac8>
In [163]:
categorical = [x for x in data if data[x].dtypes in ['object']]

for x in categorical:
    f, ax = plt.subplots(figsize = (20, 5))
    sns.lvplot(x = x, y = 'SalePrice', 
                data = pd.concat([data[:n_train][x], log_y], axis = 1).dropna())
    ax.set(ylabel = 'Log of Sale Price')
    
    f, ax = plt.subplots(figsize = (20, 5), ncols = 2)
    sns.countplot(data[:n_train][x].dropna(), color = sns.xkcd_rgb['faded green'], ax = ax[0])
    sns.countplot(data[n_train:][x].dropna(), ax = ax[1], color = sns.xkcd_rgb['faded blue'])
C:\Users\ASUS\Anaconda3\lib\site-packages\matplotlib\pyplot.py:523: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  max_open_warning, RuntimeWarning)

<a id = "pred"> </a>

3. Predictions

In this section we are going to make our predictions. We will fit a variety of models, using Bayesian Optimization to choose our parameters.

In [165]:
data_drop = pd.get_dummies(data.dropna(axis = 1))
In [175]:
X_train, X_test, y_train, y_test = train_test_split(data_drop[:n_train], log_y, test_size = .2)
In [172]:
def xgb_evaluate(min_child_weight,
                 colsample_bytree,
                 max_depth,
                 subsample,
                 gamma,
                 eta,
                 alpha):

    params['min_child_weight'] = int(min_child_weight)
    params['cosample_bytree'] = max(min(colsample_bytree, 1), 0)
    params['max_depth'] = int(max_depth)
    params['subsample'] = max(min(subsample, 1), 0)
    params['gamma'] = max(gamma, 0)
    params['eta'] = max(eta, 0)
    params['alpha'] = max(alpha, 0)


    cv_result = xgb.cv(params, xgtrain, num_boost_round=num_rounds, nfold=5,
             seed=random_state,
             callbacks=[xgb.callback.early_stop(50)])

    return -cv_result['test-rmse-mean'].values[-1]


def prepare_data():
    xgtrain = xgb.DMatrix(X_train, label=y_train)

    return xgtrain
In [176]:
xgtrain = prepare_data()

num_rounds = 300
random_state = 2016
num_iter = 25
init_points = 5
params = {
    'silent': 1,
    'eval_metric': 'rmse',
    'verbose_eval': True,
    'seed': random_state
}

xgbBO = BayesianOptimization(xgb_evaluate, {'min_child_weight': (1, 20),
                                            'colsample_bytree': (0.1, 1),
                                            'max_depth': (5, 15),
                                            'subsample': (0.5, 1),
                                            'gamma': (0, 10),
                                            'eta': (.1,1),
                                            'alpha': (0, 10),
                                            })

xgbBO.maximize(init_points=init_points, n_iter=num_iter)
Initialization
---------------------------------------------------------------------------------------------------------------------------------------
 Step |   Time |      Value |     alpha |   colsample_bytree |       eta |     gamma |   max_depth |   min_child_weight |   subsample | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[106]	train-rmse:0.233946+0.00740024	test-rmse:0.241471+0.0244472

    1 | 00m02s |   -0.24147 |    5.1822 |             0.6602 |    0.5374 |    4.7521 |     13.3953 |            15.1858 |      0.5184 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[219]	train-rmse:0.194239+0.00614475	test-rmse:0.218339+0.00884284

    2 | 00m07s |   -0.21834 |    1.3357 |             0.4021 |    0.9072 |    1.9608 |      8.0977 |             2.8182 |      0.8444 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[18]	train-rmse:0.247243+0.00500388	test-rmse:0.250032+0.0294304

    3 | 00m01s |   -0.25003 |    2.2306 |             0.3971 |    0.8211 |    8.8132 |     11.4214 |             6.6806 |      0.9538 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[61]	train-rmse:0.212436+0.00245033	test-rmse:0.219428+0.0203557

    4 | 00m02s |   -0.21943 |    2.7039 |             0.7019 |    0.2877 |    5.4981 |     11.8453 |            19.5134 |      0.8291 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[157]	train-rmse:0.249759+0.00347087	test-rmse:0.254802+0.026096

    5 | 00m03s |   -0.25480 |    8.5973 |             0.7555 |    0.2593 |    9.5659 |     10.5661 |            15.9529 |      0.8965 | 
Bayesian Optimization
---------------------------------------------------------------------------------------------------------------------------------------
 Step |   Time |      Value |     alpha |   colsample_bytree |       eta |     gamma |   max_depth |   min_child_weight |   subsample | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[70]	train-rmse:0.102158+0.00148938	test-rmse:0.169027+0.00953702

    6 | 00m07s |   -0.16903 |    0.7449 |             0.4718 |    0.8087 |    0.0534 |      5.1422 |            19.1144 |      0.7317 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[4]	train-rmse:0.137061+0.0033851	test-rmse:0.18413+0.00419682

    7 | 00m17s |   -0.18413 |    0.0000 |             1.0000 |    1.0000 |    0.0000 |      5.0000 |            20.0000 |      1.0000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
    8 | 00m10s |   -0.17320 |   10.0000 |             0.1000 |    0.1000 |    0.0000 |      5.0000 |            20.0000 |      0.5000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[220]	train-rmse:0.254172+0.00299967	test-rmse:0.257273+0.0275143

    9 | 00m10s |   -0.25727 |    0.0000 |             0.1000 |    0.1000 |    8.6971 |      5.0000 |            19.3678 |      0.5000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
   10 | 00m14s |   -0.29791 |   10.0000 |             0.1000 |    0.1000 |   10.0000 |      5.0000 |             1.0000 |      0.5000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[157]	train-rmse:0.0013986+0.000279735	test-rmse:0.151896+0.00532008

   11 | 00m17s |   -0.15190 |    0.0000 |             0.1000 |    0.1000 |    0.0000 |     15.0000 |             1.0000 |      1.0000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[93]	train-rmse:0.0661824+0.00163296	test-rmse:0.14464+0.0055696

   12 | 00m15s |   -0.14464 |    0.0000 |             0.1000 |    0.1000 |    0.0000 |     15.0000 |            20.0000 |      1.0000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[95]	train-rmse:0.061+0.00158753	test-rmse:0.144229+0.00457612

   13 | 00m14s |   -0.14423 |    0.0000 |             0.1000 |    0.1000 |    0.0000 |      9.1261 |            11.7048 |      1.0000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[113]	train-rmse:0.121448+0.00355288	test-rmse:0.172948+0.0167438

   14 | 00m12s |   -0.17295 |   10.0000 |             0.1000 |    0.1000 |    0.0000 |     15.0000 |             1.0000 |      1.0000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[140]	train-rmse:0.136772+0.000758777	test-rmse:0.161437+0.0121558

   15 | 00m11s |   -0.16144 |   10.0000 |             0.1000 |    0.1000 |    0.0000 |     14.4344 |            20.0000 |      1.0000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
   16 | 00m14s |   -0.14332 |    0.0464 |             0.1000 |    0.1000 |    0.1073 |      6.3651 |            19.5111 |      0.8626 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
   17 | 00m13s |   -0.14785 |    2.2147 |             0.2550 |    0.1053 |    0.0432 |     10.3388 |            19.0581 |      0.9321 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[106]	train-rmse:0.0871092+0.000720051	test-rmse:0.142686+0.00615836

   18 | 00m10s |   -0.14269 |    0.0000 |             0.1000 |    0.1000 |    0.0000 |      5.0000 |            15.7883 |      0.8518 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[85]	train-rmse:0.0586026+0.000709885	test-rmse:0.171218+0.00956596

   19 | 00m10s |   -0.17122 |    1.1712 |             0.1049 |    0.7417 |    0.0003 |      7.8559 |            16.9492 |      0.9201 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[192]	train-rmse:0.287076+0.00397964	test-rmse:0.290744+0.033543

   20 | 00m15s |   -0.29074 |   10.0000 |             0.1000 |    0.1000 |   10.0000 |     15.0000 |             1.0000 |      0.5612 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[72]	train-rmse:0.124206+0.00211708	test-rmse:0.158269+0.00997683

   21 | 00m08s |   -0.15827 |    6.6518 |             0.1000 |    0.1780 |    0.0000 |      6.3120 |             6.4357 |      1.0000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[157]	train-rmse:0.0508168+0.00132904	test-rmse:0.145073+0.00742828

   22 | 00m11s |   -0.14507 |    0.0000 |             0.1000 |    0.1000 |    0.0000 |      5.0000 |             1.0000 |      0.5000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[231]	train-rmse:0.151161+0.00200079	test-rmse:0.177958+0.011941

   23 | 00m10s |   -0.17796 |    8.8461 |             0.4096 |    0.2104 |    0.0602 |      5.4902 |             1.7929 |      0.5271 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
   24 | 00m16s |   -0.14816 |    0.8569 |             0.1122 |    0.1050 |    0.1129 |     14.1386 |             9.8778 |      0.9577 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[104]	train-rmse:0.08578+0.00131836	test-rmse:0.141664+0.00590819

   25 | 00m10s |   -0.14166 |    0.0000 |             1.0000 |    0.1000 |    0.0000 |      5.0000 |             7.1967 |      0.5000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[57]	train-rmse:0.0861834+0.000494653	test-rmse:0.147854+0.00432513

   26 | 00m10s |   -0.14785 |    0.0000 |             0.1000 |    0.1779 |    0.0000 |      5.0000 |             6.2466 |      0.5103 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[220]	train-rmse:0.26215+0.00360585	test-rmse:0.265079+0.0283394

   27 | 00m15s |   -0.26508 |    0.0000 |             1.0000 |    0.1000 |   10.0000 |     15.0000 |            20.0000 |      0.5000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
   28 | 00m11s |   -0.15060 |    3.9157 |             0.1000 |    0.1000 |    0.0000 |      6.1917 |            11.9866 |      0.5000 | 
C:\Users\ASUS\Anaconda3\lib\site-packages\sklearn\gaussian_process\gpr.py:457: UserWarning: fmin_l_bfgs_b terminated abnormally with the  state: {'grad': array([  5.17741965e-05]), 'task': b'ABNORMAL_TERMINATION_IN_LNSRCH', 'funcalls': 57, 'nit': 5, 'warnflag': 2}
  " state: %s" % convergence_dict)
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
Stopping. Best iteration:
[75]	train-rmse:0.0709766+0.00134991	test-rmse:0.146021+0.00410328

   29 | 00m13s |   -0.14602 |    0.0000 |             1.0000 |    0.1000 |    0.0000 |     10.7365 |             5.3112 |      0.5000 | 
Multiple eval metrics have been passed: 'test-rmse' will be used for early stopping.

Will train until test-rmse hasn't improved in 50 rounds.
   30 | 00m16s |   -0.14470 |    2.2589 |             0.1038 |    0.1000 |    0.0000 |      5.0063 |            19.8838 |      0.6391 | 
In [177]:
par = xgbBO.res['max']
par
Out[177]:
{'max_params': {'alpha': 0.0,
  'colsample_bytree': 1.0,
  'eta': 0.10000000000000001,
  'gamma': 0.0,
  'max_depth': 5.0,
  'min_child_weight': 7.1966945846891486,
  'subsample': 0.5},
 'max_val': -0.14166380000000001}
In [178]:
model_xgb = xgb.XGBRegressor(silent = 1, eval_metric = 'mae', verbose_eval = True, **par)
In [182]:
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468, 
                             learning_rate=0.05, max_depth=3, 
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state =7, nthread = -1)
In [800]:
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin = 55, bagging_fraction = 0.8,
                              bagging_freq = 5, feature_fraction = 0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
In [183]:
np.sqrt(np.mean(-cross_val_score(model_xgb, data_drop[:n_train], log_y, scoring = 'neg_mean_squared_error', n_jobs = 5)))
Out[183]:
0.14092588215444402
In [798]:
np.sqrt(mean_squared_error(model_xgb.predict(X_test), y_test.values.ravel()))
Out[798]:
0.13976753147419943
In [802]:
model_lgb.fit(X_train, y_train.values.ravel())
np.sqrt(np.mean(-cross_val_score(model_lgb, X_train, y_train.values.ravel(), scoring = 'neg_mean_squared_error', n_jobs = 5)))
Out[802]:
0.13641285365312741
In [ ]:
model_xgb.fit()
In [698]:
df = pd.DataFrame()
df['xgb'] = model_xgb.predict(X_test)
df['lgb'] = model_lgb.predict(X_test)
In [701]:
df = pd.DataFrame()
df['xgb'] = model_xgb.predict(data[n_train:])
df['lgb'] = model_lgb.predict(data[n_train:])
In [700]:
model_rfr = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
model_rfr.fit(df, y_test.values.ravel())
np.sqrt(np.mean(-cross_val_score(model_rfr, df, y_test.values.ravel(), scoring = 'neg_mean_squared_error', n_jobs = 5)))
Out[700]:
0.12973250642180528
In [704]:
prediction = model_rfr.predict(df)
In [753]:
model_xgb.fit(data[:n_train], y)
Out[753]:
XGBRegressor(alpha=0.41258, base_score=0.5, booster='gbtree',
       colsample_bylevel=1, colsample_bytree=0.98935, eta=0.1,
       eval_metric='mae', gamma=0.0162872, learning_rate=0.1,
       max_delta_step=0, max_depth=6, max_val=-0.0958,
       min_child_weight=13.052, missing=None, n_estimators=100, n_jobs=1,
       nthread=None, objective='reg:linear', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=1,
       subsample=0.675489, verbose_eval=True)
In [758]:
prediction = model_xgb.predict(data[n_train:])
In [759]:
sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = np.exp(prediction)
sub.to_csv(r'C:\Users\ASUS\Documents\Data\kaggle\HousePrices\submission.csv',index=False)
In [598]:
#model_rfr = RandomForestClassifier(n_estimators=100, bootstrap = 20, class_weight={0:20, 1:1})
model_lgb.fit(X_train[[a for a, b in x[1:15]]], holdout_err_1)
#a = [np.mean(cross_val_score(model_rfr, X_train[[a for a, b in x[1:15]]], y_1, scoring = 'recall', n_jobs = 5)) for i in range(10,20)]
Out[598]:
LGBMRegressor(bagging_fraction=0.8, bagging_freq=5, bagging_seed=9,
       boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
       feature_fraction=0.2319, feature_fraction_seed=9,
       learning_rate=0.05, max_bin=55, max_depth=-1, min_child_samples=20,
       min_child_weight=0.001, min_data_in_leaf=6, min_split_gain=0.0,
       min_sum_hessian_in_leaf=11, n_estimators=720, n_jobs=-1,
       num_leaves=5, objective='regression', random_state=None,
       reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
       subsample_for_bin=200000, subsample_freq=1)
In [601]:
sns.regplot(model_lgb.predict(X_test[[a for a, b in x[1:15]]]), holdout_err_2)
Out[601]:
<matplotlib.axes._subplots.AxesSubplot at 0x24cc36aceb8>
In [585]:
sns.regplot(np.arange(10,20), a)
Out[585]:
<matplotlib.axes._subplots.AxesSubplot at 0x24cc24638d0>
In [552]:
sns.distplot(holdout_err_2[model_rfr.predict(X_test) == 1])
sns.distplot(holdout_err_2[model_rfr.predict(X_test) == 0])
Out[552]:
<matplotlib.axes._subplots.AxesSubplot at 0x24cb1030d30>
In [516]:
x = [(a,b) for a, b in zip(X_train.columns, model_rfr.feature_importances_)]
x.sort(key=lambda x: x[1], reverse = True)
[a for a, b in x[1:6]]
Out[516]:
['model1', '1stFlrSF', 'GrLivArea', 'YearBuilt', 'OpenPorchSF']
In [441]:
sns.distplot(model_rfr.predict(X_train_2))
sns.distplot((model_xgb.predict(X_train_2.drop('model1', axis = 1)) - y_train_1.values.ravel()))
Out[441]:
<matplotlib.axes._subplots.AxesSubplot at 0x24cc1da4e10>
In [442]:
sns.regplot((model_xgb.predict(X_train_2.drop('model1', axis = 1)) - y_train_1.values.ravel()), model_rfr.predict(X_train_2))
Out[442]:
<matplotlib.axes._subplots.AxesSubplot at 0x24cc1da4f98>
In [417]:
np.corrcoef(e,p)
Out[417]:
array([[ 1.        ,  0.13995331],
       [ 0.13995331,  1.        ]])
In [383]:
sns.regplot((model_xgb.predict(X_test) - y_test.values.ravel()), model_rfr.predict(X_test))
Out[383]:
<matplotlib.axes._subplots.AxesSubplot at 0x24cc1b6c240>
In [370]:
df = pd.DataFrame()
df['yb'] = X_train_1['YearBuilt']
df['err'] = (model_xgb.predict(X_train_2) - y_train_2.values.ravel())
df['approx'] = model_rfr.predict(X_train_2)
In [372]:
 
Out[372]:
<matplotlib.axes._subplots.AxesSubplot at 0x24cc1961dd8>
In [288]:
np.sum(np.abs(model_rfr.predict(X_train_1)) > .03)/len(y_train_2.values.ravel())
Out[288]:
0.57534246575342463
In [252]:
err_prediction_1 = model_rfr.predict(X_train_2)
np.percentile(np.abs(err_prediction_1), q = 80)
Out[252]:
0.002621699083649758
In [210]:
model_lgb.fit(X_train, y_train.values.ravel())
Out[210]:
LGBMRegressor(bagging_fraction=0.8, bagging_freq=5, bagging_seed=9,
       boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
       feature_fraction=0.2319, feature_fraction_seed=9,
       learning_rate=0.05, max_bin=55, max_depth=-1, min_child_samples=20,
       min_child_weight=0.001, min_data_in_leaf=6, min_split_gain=0.0,
       min_sum_hessian_in_leaf=11, n_estimators=720, n_jobs=-1,
       num_leaves=5, objective='regression', random_state=None,
       reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
       subsample_for_bin=200000, subsample_freq=1)
In [373]:
xgb_prediction = model_xgb.predict(X_test)
In [155]:
lgb_prediction = model_lgb.predict(X_test)
In [374]:
np.sqrt(mean_squared_error(xgb_prediction, y_test))
Out[374]:
0.13316217144520887
In [140]:
np.sqrt(mean_squared_error(lgb_prediction, y_test))
Out[140]:
0.1215717094435354
In [143]:
np.sqrt(mean_squared_error((xgb_prediction + lgb_prediction)/2, y_test))
Out[143]:
0.11946591554322981
In [149]:
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
In [156]:
stack_X = pd.DataFrame()
stack_X['xgb_prediction'] = xgb_prediction
stack_X['lgb_prediction'] = lgb_prediction
In [153]:
KRR.fit(stack_X, y_holdout)
Out[153]:
KernelRidge(alpha=0.6, coef0=2.5, degree=2, gamma=None, kernel='polynomial',
      kernel_params=None)
In [158]:
stack_prediction = KRR.predict(stack_X)
In [159]:
np.sqrt(mean_squared_error(stack_prediction, y_test))
Out[159]:
0.12908870498531413
In [160]:
X_holdout.shape
Out[160]:
(234, 149)
In [163]:
model_xgb.fit(X_holdout, y_holdout)
Out[163]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=0.4603, gamma=0.0468, learning_rate=0.05,
       max_delta_step=0, max_depth=3, min_child_weight=1.7817,
       missing=None, n_estimators=2200, n_jobs=1, nthread=-1,
       objective='reg:linear', random_state=7, reg_alpha=0.464,
       reg_lambda=0.8571, scale_pos_weight=1, seed=None, silent=1,
       subsample=0.5213)
In [165]:
main_prediction = model_xgb.predict(X_train)
In [194]:
err_train = (main_prediction - y_train.values.ravel())/main_prediction
In [195]:
model_rfr = RandomForestRegressor()
model_rfr.fit(X_train, err_train)
np.sqrt(np.mean(-cross_val_score(model_rfr, X_train, err_train, scoring = 'neg_mean_squared_error', n_jobs = 5)))
Out[195]:
0.013093034851714348
In [199]:
avg = np.repeat(np.mean(err_train), err_train.shape[0])
In [201]:
np.sqrt(mean_squared_error(avg, err_train))
Out[201]:
0.013173352321748284