Hands-On Machine Learning: Chapter 2 Notes

Notes on Chapter 2 of *Hands-On Machine Learning* (《机器学习实战》).

Performance Metrics

Root Mean Square Error (RMSE)

$$
RMSE(\mathbf{X}, h) \:=\:
\sqrt{\frac{1}{m}\sum^{m}_{i=1}{\Big(h(\mathbf{x}^{(i)})-y^{(i)}\Big)^2}}
$$

Mean Absolute Error (MAE)

$$
MAE(\mathbf{X}, h) \:=\: \frac{1}{m} \sum^m_{i=1}\Big|h(\mathbf{x}^{(i)})-y^{(i)}\Big|
$$
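
Both metrics can be computed directly from their definitions; a minimal NumPy sketch (the arrays here are toy placeholders):

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # labels y^(i)
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # predictions h(x^(i))

rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
mae = np.mean(np.abs(y_pred - y_true))
print(rmse, mae)

RMSE penalizes large errors more heavily than MAE, so MAE is the more robust choice when the data contains outliers.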

Data Selection

Random Split

To get the same split on every run, set a random seed.

import numpy as np
np.random.seed(42)  # global seed; train_test_split below also takes its own random_state

from sklearn.model_selection import train_test_split
# dataset is a DataFrame; 30% of the rows go to the test set
train_set, test_set = train_test_split(dataset, test_size=0.3, random_state=42)

Unique Identifier

When the dataset is later updated, the method above no longer guarantees that the new training set contains the previous one. Instead, create a unique identifier column and write a deterministic selection rule over it, so that each row always lands in the same split.

import hashlib
import numpy as np

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    # Hash the identifier; the row goes to the test set when the last
    # byte of the digest falls below test_ratio * 256
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio
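
A minimal sketch of how this check can drive the actual split (the book uses a similar helper; dataset and the "index" id column are placeholders):

def split_train_test_by_id(data, test_ratio, id_column):
    # Rows whose hashed id falls below the threshold go to the test set
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# e.g. use a stable row index as the identifier
dataset_with_id = dataset.reset_index()  # adds an "index" column
train_set, test_set = split_train_test_by_id(dataset_with_id, 0.3, "index")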

Stratified Sampling

When one column is especially important, stratify the split on that column so that its proportions in the training set match the original dataset.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# dataset: the full dataset; "income" is the column to stratify on
for train_index, test_index in split.split(dataset, dataset["income"]):
    strat_train_set = dataset.loc[train_index]
    strat_test_set = dataset.loc[test_index]
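
To verify that the stratification worked, compare the category proportions in the test set with those in the full dataset (assuming "income" is categorical):

print(strat_test_set["income"].value_counts() / len(strat_test_set))
print(dataset["income"].value_counts() / len(dataset))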

Data Visualization

import pandas as pd
import matplotlib.pyplot as plt

# pandas.DataFrame can call matplotlib directly
dataset.hist(bins=20, figsize=(10, 10))  # histogram of every numeric column
# Scatter plot; a low alpha reveals density
dataset.plot(kind="scatter", x="x_column", y="y_column", alpha=0.1)
# Scatter plot with more dimensions: marker size (s) and color (c)
dataset.plot(kind="scatter", x="x_column", y="y_column", alpha=0.4,
             s="z_column", label="z_column", figsize=(10, 7),
             c="k_column", cmap=plt.get_cmap("jet"), colorbar=True,
             sharex=False)
plt.legend()
# Scatter matrix to eyeball pairwise correlations
# dataset.corr() computes the correlation matrix
from pandas.plotting import scatter_matrix
scatter_matrix(dataset[attr], figsize=(12, 8))  # with many features, pass a subset attr

Data Cleaning

Handling Missing Values

  • Drop the rows with missing values
  • Drop the whole attribute (column)
  • Fill in the missing values (see the pandas sketch below)
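
All three options are pandas one-liners; a sketch with a placeholder column name "some_column" (the SimpleImputer block after it automates the third option):

dataset.dropna(subset=["some_column"])  # option 1: drop rows with missing values
dataset.drop("some_column", axis=1)     # option 2: drop the whole column
median = dataset["some_column"].median()
dataset["some_column"].fillna(median)   # option 3: fill with the median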
# Fill missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
imputer.fit(dataset)  # the median strategy requires all columns to be numeric
X = imputer.transform(dataset)
dataset_tr = pd.DataFrame(X, columns=dataset.columns, index=dataset.index)

Text Attributes and One-Hot Encoding

# Convert text categories to numbers
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
data_text_encoded = ordinal_encoder.fit_transform(data_text)  # returns an array
ordinal_encoder.categories_  # the learned categories

# One-hot encoding; accepts text directly
from sklearn.preprocessing import OneHotEncoder
# OneHotEncoder(sparse=False) returns a dense NumPy array; sparse defaults to True
text_encoder = OneHotEncoder()
data_text_1hot = text_encoder.fit_transform(data_text)  # SciPy sparse matrix
data_text_1hot.toarray()  # convert to a dense array

Custom Transformers

# Can be dropped into a Pipeline
from sklearn.preprocessing import FunctionTransformer

def test(X, k=True):
    # input:  X: array
    # output: array
    if k:
        return X
    else:
        return X[:10]

test_func = FunctionTransformer(test, validate=False, kw_args={"k": False})
test_func_result = test_func.fit_transform(dataset.values)

Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler  # standardization

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('test_func', FunctionTransformer(test, validate=False)),
    ('std_scaler', StandardScaler()),
])

dataset_tr = num_pipeline.fit_transform(dataset)

# Combine the outputs of several pipelines
from sklearn.pipeline import FeatureUnion

old_full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", old_num_pipeline),
    ("cat_pipeline", old_cat_pipeline),
])
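
Since scikit-learn 0.20, ColumnTransformer is the usual way to send numeric and categorical columns through different pipelines; a sketch assuming num_attribs and cat_attribs are lists of column names:

from sklearn.compose import ColumnTransformer

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),     # numeric columns
    ("cat", OneHotEncoder(), cat_attribs),  # categorical columns
])
dataset_prepared = full_pipeline.fit_transform(dataset)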

Models

Computing RMSE and MAE

from sklearn.metrics import mean_squared_error  # RMSE
from sklearn.metrics import mean_absolute_error  # MAE
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)
y_hat = lin_reg.predict(X)  # predict on the features, not the labels
lin_mse = mean_squared_error(y, y_hat)
lin_rmse = np.sqrt(lin_mse)
lin_mae = mean_absolute_error(y, y_hat)

Cross-Validation

from sklearn.model_selection import cross_val_score

# scikit-learn scorers follow "greater is better", hence the negated MSE
scores = cross_val_score(lin_reg, X, y, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-scores)

See the scikit-learn documentation for the available values of the scoring parameter.
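
A small helper, similar to the one in the book, for summarizing the per-fold scores:

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(lin_rmse_scores)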

Hyperparameter Tuning

Grid Search

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(X, y)
grid_search.best_params_
grid_search.best_estimator_

# Print the RMSE for every parameter combination
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
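
The book follows up by ranking features with the importances of the best estimator, which the feature-selection transformer later in these notes also relies on (attributes is a placeholder list of feature names):

feature_importances = grid_search.best_estimator_.feature_importances_
sorted(zip(feature_importances, attributes), reverse=True)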

Randomized Search

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform, geom, expon, reciprocal
# uniform: continuous uniform; randint: uniform over integers;
# geom: geometric; expon: exponential

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(X, y)

from sklearn.svm import SVR
param_distribs = {
    'kernel': ['linear', 'rbf'],
    'C': reciprocal(20, 200000),
    'gamma': expon(scale=1.0),
}

svm_reg = SVR()
rnd_search = RandomizedSearchCV(svm_reg, param_distributions=param_distribs,
                                n_iter=50, cv=5, scoring='neg_mean_squared_error',
                                verbose=2, n_jobs=4, random_state=42)
rnd_search.fit(X, y)

The probability density function of reciprocal is
$$
f(x, a, b) = \frac{1}{x\log(b/a)} \qquad \textrm{for } a \le x \le b, \: b > a > 0
$$
For more distributions, see the SciPy documentation.
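
Before plugging an unfamiliar distribution into a search, it can help to sample from it and check the range it covers; a quick sketch:

from scipy.stats import reciprocal
samples = reciprocal(20, 200000).rvs(10000, random_state=42)
print(samples.min(), samples.max())  # spans several orders of magnitude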

Confidence Intervals

from scipy import stats
# final_predictions: predictions of the final model on the test set
confidence = 0.95  # e.g. a 95% confidence interval
squared_errors = (final_predictions - y_test) ** 2
mean = squared_errors.mean()
m = len(squared_errors)
# t-distribution
np.sqrt(stats.t.interval(confidence, m - 1,
                         loc=np.mean(squared_errors),
                         scale=stats.sem(squared_errors)))
# stats.sem computes the standard error of the mean, with ddof=1 by default
# or
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - tmargin), np.sqrt(mean + tmargin)

# normal distribution
zscore = stats.norm.ppf((1 + confidence) / 2)
zmargin = zscore * squared_errors.std(ddof=1) / np.sqrt(m)
np.sqrt(mean - zmargin), np.sqrt(mean + zmargin)

Selecting the Top k Features

from sklearn.base import BaseEstimator, TransformerMixin

# Return the (sorted) indices of the k largest values
def indices_of_top_k(arr, k):
    return np.sort(np.argpartition(np.array(arr), -k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k
    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self
    def transform(self, X):
        return X[:, self.feature_indices_]

Grid Search Including Data Preprocessing

To include preprocessing parameters in the grid search, put the preprocessing steps and the model into a single Pipeline and reference nested parameters with the __ (double underscore) syntax.

rp = Pipeline([
    ('imputer', SimpleImputer()),
    ('test_func', FunctionTransformer(test, validate=False)),
    ('std_scaler', StandardScaler()),
    ('feature_selection', TopFeatureSelector(feature_importances, k)),
    ('lin_reg', LinearRegression()),
])
# feature_importances = grid_search.best_estimator_.feature_importances_
param_grid = [{
    'imputer__strategy': ['mean', 'median', 'most_frequent'],
    'feature_selection__k': list(range(1, len(feature_importances) + 1))
}]

grid_search_prep = GridSearchCV(rp, param_grid, cv=5,
                                scoring='neg_mean_squared_error',
                                verbose=2, n_jobs=4)
grid_search_prep.fit(X, y)
grid_search_prep.best_params_
