# Import some standard libraries for data analysis, visualization, and machine learning
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
%matplotlib inline
from scipy.special import boxcox1p
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error, recall_score
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC, LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, RandomForestClassifier
from sklearn.kernel_ridge import KernelRidge
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
import xgboost as xgb
import lightgbm as lgb
from bayes_opt import BayesianOptimization
<a id = "intro"></a>
The goal of a regression problem is to predict the value of some continuous random variable $Y$, called the target variable, given the values of some input random variables $X_i$, which we call the feature variables. While the target variable must be continuous, the features can be continuous, discrete, or a mix of the two. In this notebook the target variable we are interested in will be housing prices in Ames, Iowa, and our features will be extracted from 79 possibilities.
We will first do some exploratory data analysis on these features, to determine the shape of their distributions and how they interact with the target sales price variable. Then we will do some feature extraction and selection, to determine which variables to feed into our models. Next we fit our regression models to our chosen features. Finally, we use stacking methods to combine our models into a final "megazord" model.
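Concretely, since we will work with log prices, the score we report for our regression models below is the root mean squared error on the log scale,
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log y_i - \log \hat{y}_i\right)^2},$$
where $y_i$ is the observed sale price of house $i$ and $\hat{y}_i$ is our prediction.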
<a id = "EDA"></a>
In this section we perform some exploratory data analysis. First we look for any null values and outliers. Then we examine the distributions of our target variable and our features, and look at how the features interact with the target sale price. Throughout this section we also check whether there are any noticeable differences between the training set and the testing set.
First, let's read in the data and take a quick look at the first few rows.
train = pd.read_csv(r"C:\Users\ASUS\Documents\Data\kaggle\HousePrices\train.csv")
test = pd.read_csv(r"C:\Users\ASUS\Documents\Data\kaggle\HousePrices\test.csv")
train_id = train['Id']
test_id = test['Id']
n_train = train.shape[0]
n_test = test.shape[0]
data = pd.concat((train.drop(['Id', 'SalePrice'], axis = 1), test.drop('Id', axis = 1))).reset_index(drop=True)
y = train[['SalePrice']]
data.head().transpose()
We can already see some NaNs. Let's get a better look at the breakdown of null values.
train_null = data[:n_train].isnull().sum().sort_values(ascending = False)
test_null = data[n_train:].isnull().sum().sort_values(ascending = False)
f, ax = plt.subplots(figsize = (15,20), ncols = 2, nrows = 1)
sns.barplot(y= train_null[train_null >0].index, x = train_null[train_null >0], orient = 'h', ax = ax[0], palette="RdYlGn")
ax[0].set(title = 'Nulls in the training set')
sns.barplot(y= test_null[test_null >0].index, x = test_null[test_null >0], orient = 'h', ax = ax[1], palette="RdYlGn")
ax[1].set(title = 'Nulls in the testing set')
The breakdown looks pretty similar between the test set and training set, except there are a bunch of columns in the test set with just a couple of null values, which won't have a substantial effect on our predictions. We will consider the problem of replacing the null values in the modeling stage.
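In the modeling section we will simply drop the columns that contain nulls (dropna(axis = 1) before one-hot encoding), but imputation is a reasonable alternative. A minimal sketch of what that could look like, filling numeric columns with their median and categorical columns with a 'None' label; this is purely illustrative and not a step we apply later.
# Illustrative imputation sketch; the modeling section instead drops columns containing nulls.
data_imputed = data.copy()
for col in data_imputed:
    if data_imputed[col].dtype == 'object':
        data_imputed[col] = data_imputed[col].fillna('None')
    else:
        data_imputed[col] = data_imputed[col].fillna(data_imputed[col].median())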
Next we take a look at the distribution of the target value.
f, (axs1, axs2) = plt.subplots(figsize = (15,5), ncols = 2)
sns.distplot(y, ax = axs1, color = sns.xkcd_rgb['faded green'])
axs1.set_title('Histogram of Sale Price')
sns.violinplot(y, ax = axs2, color = sns.xkcd_rgb['faded green'])
axs2.set_title('Violin Plot of Sale Price')
We see that the SalePrice distribution has a heavy positive skew. This will negatively affect some of our models. To counteract this, we apply a log transform to y.
log_y = np.log(y)
f, (axs1, axs2) = plt.subplots(figsize = (15,5), ncols = 2, nrows = 1)
sns.distplot(log_y, ax = axs1,color = sns.xkcd_rgb['faded green'])
axs1.set(title = 'Histogram of Log Sale Price', yticks = [])
sns.violinplot(log_y, ax = axs2, color = sns.xkcd_rgb['faded green'])
axs2.set_title('Violin Plot of Log Sale Price')
That looks much more normal. Next we will look at the distributions of our features and see how they interact with SalePrice. We separate the continuous, ordinal, and categorical variables and visualize each group in a different way.
data.dtypes.value_counts()
continuous = [x for x in data if data[x].dtypes in ['float64', 'int64'] and data[x].value_counts().shape[0] > 20]
for x in continuous:
    f, ax = plt.subplots(figsize = (20, 5))
    sns.regplot(x = x, y = 'SalePrice', color = sns.xkcd_rgb['faded green'],
                data = pd.concat([data[:n_train][x], log_y], axis = 1).dropna(), ax = ax)
    ax.set(ylabel = 'Log of Sale Price')
    f, ax = plt.subplots(figsize = (20, 5), ncols = 2)
    sns.distplot(data[:n_train][x].dropna(), color = sns.xkcd_rgb['faded green'], ax = ax[0])
    ax[0].set(title = 'Training Set Distribution')
    sns.distplot(data[n_train:][x].dropna(), ax = ax[1])
    ax[1].set(title = 'Testing Set Distribution')
There are a few things we notice from these plots.
Some of the variables are time-based (year) variables rather than genuinely continuous ones. We separate these into their own category.
Many variables are mostly 0. For these we can create a new indicator variable recording whether the value is zero or not (sketched below, after we compute which features are mostly zero).
Several variables are skewed. We apply a Box-Cox transform to them to alleviate this defect.
temporal = ['YearBuilt',
'YearRemodAdd',
'GarageYrBlt',
]
continuous = [x for x in continuous if x not in temporal]
skewed = (data[continuous] == 0).sum()[(data[continuous] == 0).sum() > 500]
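The skewed series just computed picks out the features that are zero for more than 500 houses. A minimal sketch of the indicator variables described above could look like the following; these columns are purely illustrative and are not fed into the models later in the notebook.
# Illustrative only: binary flags marking whether each mostly-zero feature is nonzero.
zero_indicators = pd.DataFrame({feat + '_nonzero': (data[feat] > 0).astype(int) for feat in skewed.index})
zero_indicators.head()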
normalized = pd.DataFrame()
for feat in continuous:
    normalized[feat] = boxcox1p(data[feat].dropna(), .15)
for x in normalized:
    f, ax = plt.subplots(figsize = (20, 5))
    sns.regplot(x = x, y = 'SalePrice', color = sns.xkcd_rgb['faded green'],
                data = pd.concat([normalized[:n_train][normalized[:n_train] > 0][x], log_y], axis = 1).dropna(), ax = ax)
    ax.set(ylabel = 'Log of Sale Price')
    f, ax = plt.subplots(figsize = (20, 5), ncols = 2)
    # Training set distribution on the left, testing set distribution on the right.
    sns.distplot(normalized[:n_train][normalized[:n_train] > 0][x].dropna(), color = sns.xkcd_rgb['faded green'], ax = ax[0])
    sns.distplot(normalized[n_train:][normalized[n_train:] > 0][x].dropna(), ax = ax[1])
With those changes, we can see there is a linear relationship between our target variable and some of our features. Other relationships can be inferred from the following heatmap of correlations.
f, ax = plt.subplots(figsize = (20, 20))
corr = pd.concat((data[continuous][:n_train], y), axis = 1).dropna().corr()
sns.heatmap(corr,annot=True, cmap = "RdYlGn")
for i in range(len(temporal)):
    f, ax = plt.subplots(figsize = (20,3))
    sns.pointplot(x = data[temporal].iloc[:n_train, i], y = log_y.values.ravel(),
                  color = sns.xkcd_rgb['rust orange'], ax = ax)
    ax.set(ylabel = 'Log of Sale Price')
    # Only keep every fifth year label so the x axis stays readable.
    labels = ax.get_xticklabels()
    for j, l in enumerate(labels):
        if j % 5 != 0: labels[j] = ''
    ax.set_xticklabels(labels, rotation = 30)
ordinal = [x for x in data if data[x].dtypes in ['float64', 'int64'] and data[x].value_counts().shape[0] <= 20]
for x in ordinal:
    f, ax = plt.subplots(figsize = (20, 5))
    sns.pointplot(x = x, y = 'SalePrice',
                  data = pd.concat([data[:n_train][x], log_y], axis = 1).dropna())
    ax.set(ylabel = 'Log of Sale Price')
    f, ax = plt.subplots(figsize = (20, 5), ncols = 2)
    sns.countplot(data[:n_train][x].dropna(), color = sns.xkcd_rgb['faded green'], ax = ax[0])
    sns.countplot(data[n_train:][x].dropna(), ax = ax[1], color = sns.xkcd_rgb['faded blue'])
f, ax = plt.subplots(figsize = (20, 20))
corr = pd.concat((data[ordinal].iloc[:n_train, :], y), axis = 1).corr()
sns.heatmap(corr, annot = True, cmap = "RdYlGn")
categorical = [x for x in data if data[x].dtypes in ['object']]
for x in categorical:
    f, ax = plt.subplots(figsize = (20, 5))
    sns.lvplot(x = x, y = 'SalePrice',
               data = pd.concat([data[:n_train][x], log_y], axis = 1).dropna())
    ax.set(ylabel = 'Log of Sale Price')
    f, ax = plt.subplots(figsize = (20, 5), ncols = 2)
    sns.countplot(data[:n_train][x].dropna(), color = sns.xkcd_rgb['faded green'], ax = ax[0])
    sns.countplot(data[n_train:][x].dropna(), ax = ax[1], color = sns.xkcd_rgb['faded blue'])
<a id = "pred"> </a>
In this section we are going to make our predictions. We will fit a variety of models, using Bayesian Optimization to choose our parameters.
data_drop = pd.get_dummies(data.dropna(axis = 1))
X_train, X_test, y_train, y_test = train_test_split(data_drop[:n_train], log_y, test_size = .2)
def xgb_evaluate(min_child_weight,
                 colsample_bytree,
                 max_depth,
                 subsample,
                 gamma,
                 eta,
                 alpha):
    # Clip the suggested hyperparameters into valid ranges and run 5-fold CV.
    params['min_child_weight'] = int(min_child_weight)
    params['colsample_bytree'] = max(min(colsample_bytree, 1), 0)
    params['max_depth'] = int(max_depth)
    params['subsample'] = max(min(subsample, 1), 0)
    params['gamma'] = max(gamma, 0)
    params['eta'] = max(eta, 0)
    params['alpha'] = max(alpha, 0)
    cv_result = xgb.cv(params, xgtrain, num_boost_round=num_rounds, nfold=5,
                       seed=random_state,
                       callbacks=[xgb.callback.early_stop(50)])
    # BayesianOptimization maximizes its objective, so return the negative RMSE.
    return -cv_result['test-rmse-mean'].values[-1]
def prepare_data():
    xgtrain = xgb.DMatrix(X_train, label=y_train)
    return xgtrain
xgtrain = prepare_data()
num_rounds = 300
random_state = 2016
num_iter = 25
init_points = 5
params = {
'silent': 1,
'eval_metric': 'rmse',
'verbose_eval': True,
'seed': random_state
}
xgbBO = BayesianOptimization(xgb_evaluate, {'min_child_weight': (1, 20),
'colsample_bytree': (0.1, 1),
'max_depth': (5, 15),
'subsample': (0.5, 1),
'gamma': (0, 10),
'eta': (.1,1),
'alpha': (0, 10),
})
xgbBO.maximize(init_points=init_points, n_iter=num_iter)
par = xgbBO.res['max']['max_params']
par['max_depth'] = int(par['max_depth'])
par['min_child_weight'] = int(par['min_child_weight'])
par
model_xgb = xgb.XGBRegressor(silent = 1, eval_metric = 'rmse', verbose_eval = True, **par)
# The hand-tuned parameter set below overrides this model and is the one we use from here on.
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
learning_rate=0.05, max_depth=3,
min_child_weight=1.7817, n_estimators=2200,
reg_alpha=0.4640, reg_lambda=0.8571,
subsample=0.5213, silent=1,
random_state =7, nthread = -1)
model_lgb = lgb.LGBMRegressor(objective='regression',num_leaves=5,
learning_rate=0.05, n_estimators=720,
max_bin = 55, bagging_fraction = 0.8,
bagging_freq = 5, feature_fraction = 0.2319,
feature_fraction_seed=9, bagging_seed=9,
min_data_in_leaf =6, min_sum_hessian_in_leaf = 11)
# Fit both base models on the training split and check cross-validated and holdout RMSE.
model_xgb.fit(X_train, y_train.values.ravel())
model_lgb.fit(X_train, y_train.values.ravel())
np.sqrt(np.mean(-cross_val_score(model_xgb, data_drop[:n_train], log_y.values.ravel(), scoring = 'neg_mean_squared_error', n_jobs = 5)))
np.sqrt(np.mean(-cross_val_score(model_lgb, X_train, y_train.values.ravel(), scoring = 'neg_mean_squared_error', n_jobs = 5)))
np.sqrt(mean_squared_error(model_xgb.predict(X_test), y_test.values.ravel()))
# Build stacking features: base-model predictions for the holdout set and for the Kaggle test set.
df = pd.DataFrame()
df['xgb'] = model_xgb.predict(X_test)
df['lgb'] = model_lgb.predict(X_test)
df_test = pd.DataFrame()
df_test['xgb'] = model_xgb.predict(data_drop[n_train:])
df_test['lgb'] = model_lgb.predict(data_drop[n_train:])
# Kernel ridge regression as the second-level ("stacking") model, fit on the holdout predictions.
model_krr = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
model_krr.fit(df, y_test.values.ravel())
np.sqrt(np.mean(-cross_val_score(model_krr, df, y_test.values.ravel(), scoring = 'neg_mean_squared_error', n_jobs = 5)))
prediction = model_krr.predict(df_test)
# Refit XGBoost on the full training set and predict the test set
# (this plain XGBoost prediction is the one we submit, overriding the stacked prediction above).
model_xgb.fit(data_drop[:n_train], log_y.values.ravel())
prediction = model_xgb.predict(data_drop[n_train:])
sub = pd.DataFrame()
sub['Id'] = test_id
sub['SalePrice'] = np.exp(prediction)
sub.to_csv(r'C:\Users\ASUS\Documents\Data\kaggle\HousePrices\submission.csv', index=False)
# Exploratory scratch work: can we predict *where* the main model errs?
# The cells below rely on a second split of the training data (X_train_1,
# X_train_2, y_train_1, y_train_2), on the corresponding holdout errors
# (holdout_err_1, holdout_err_2), and on a random forest error classifier
# model_rfr (sketched in the commented line below); the cells defining those
# objects are not included in this notebook.
#model_rfr = RandomForestClassifier(n_estimators=100, class_weight={0:20, 1:1})
# Feature importances of the error classifier, sorted from most to least important.
x = [(feat, imp) for feat, imp in zip(X_train.columns, model_rfr.feature_importances_)]
x.sort(key = lambda pair: pair[1], reverse = True)
[feat for feat, imp in x[1:6]]
# Fit LightGBM on the most important features to predict the holdout errors.
model_lgb.fit(X_train[[feat for feat, imp in x[1:15]]], holdout_err_1)
#a = [np.mean(cross_val_score(model_rfr, X_train[[feat for feat, imp in x[1:15]]], y_1, scoring = 'recall', n_jobs = 5)) for i in range(10,20)]
sns.regplot(model_lgb.predict(X_test[[feat for feat, imp in x[1:15]]]), holdout_err_2)
sns.regplot(np.arange(10,20), a)
# Distribution of the holdout errors, split by the error classifier's prediction.
sns.distplot(holdout_err_2[model_rfr.predict(X_test) == 1])
sns.distplot(holdout_err_2[model_rfr.predict(X_test) == 0])
# Compare the actual errors of the main model with the predicted errors.
sns.distplot(model_rfr.predict(X_train_2))
sns.distplot((model_xgb.predict(X_train_2.drop('model1', axis = 1)) - y_train_1.values.ravel()))
sns.regplot((model_xgb.predict(X_train_2.drop('model1', axis = 1)) - y_train_1.values.ravel()), model_rfr.predict(X_train_2))
np.corrcoef(e, p)
sns.regplot((model_xgb.predict(X_test) - y_test.values.ravel()), model_rfr.predict(X_test))
df = pd.DataFrame()
df['yb'] = X_train_1['YearBuilt']
df['err'] = (model_xgb.predict(X_train_2) - y_train_2.values.ravel())
df['approx'] = model_rfr.predict(X_train_2)
# Fraction of predicted errors larger than 0.03 in absolute value.
np.sum(np.abs(model_rfr.predict(X_train_1)) > .03)/len(y_train_2.values.ravel())
err_prediction_1 = model_rfr.predict(X_train_2)
np.percentile(np.abs(err_prediction_1), q = 80)
# Refit both base models on the training split and compare them, and their simple average, on the holdout set.
model_xgb.fit(X_train, y_train.values.ravel())
model_lgb.fit(X_train, y_train.values.ravel())
xgb_prediction = model_xgb.predict(X_test)
lgb_prediction = model_lgb.predict(X_test)
np.sqrt(mean_squared_error(xgb_prediction, y_test))
np.sqrt(mean_squared_error(lgb_prediction, y_test))
np.sqrt(mean_squared_error((xgb_prediction + lgb_prediction)/2, y_test))
# Stack the two base models with a kernel ridge meta-model fit on the holdout predictions.
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
stack_X = pd.DataFrame()
stack_X['xgb_prediction'] = xgb_prediction
stack_X['lgb_prediction'] = lgb_prediction
KRR.fit(stack_X, y_test.values.ravel())
stack_prediction = KRR.predict(stack_X)
np.sqrt(mean_squared_error(stack_prediction, y_test))
# Error-prediction experiment: fit the main model on a holdout set, predict its
# relative error on the training split, and check whether a random forest can
# predict that error better than a constant baseline.
# (Assumption: we reuse the earlier X_test/y_test split as the holdout.)
X_holdout, y_holdout = X_test, y_test
X_holdout.shape
model_xgb.fit(X_holdout, y_holdout.values.ravel())
main_prediction = model_xgb.predict(X_train)
err_train = (main_prediction - y_train.values.ravel())/main_prediction
model_rfr = RandomForestRegressor()
model_rfr.fit(X_train, err_train)
np.sqrt(np.mean(-cross_val_score(model_rfr, X_train, err_train, scoring = 'neg_mean_squared_error', n_jobs = 5)))
# Baseline: always predict the mean relative error.
avg = np.repeat(np.mean(err_train), err_train.shape[0])
np.sqrt(mean_squared_error(avg, err_train))