## Performance Metrics
### Root Mean Square Error (RMSE)
$$
RMSE(\mathbf{X}, h) \:=\:
\sqrt{\frac{1}{m}\sum^{m}_{i=1}{\Big(h(\mathbf{x}^{(i)})-y^{(i)}\Big)^2}}
$$
### Mean Absolute Error (MAE)
$$
MAE(\mathbf{X}, h) \:=\: \frac{1}{m} \sum^m_{i=1}\Big|h(\mathbf{x}^{(i)})-y^{(i)}\Big|
$$
## Splitting the Data
### Random Split
To make the split reproducible from run to run, set a random seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(42)  # redundant here, since random_state already fixes the split
train_set, test_set = train_test_split(dataset, test_size=0.3, random_state=42)
```
### Unique Identifiers
Once the dataset is updated, the method above no longer guarantees that the new training set contains the previous one. Instead, give each row a unique identifier and decide the split from that identifier, so a given row always lands in the same subset.

```python
import hashlib

import numpy as np

def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    # A row goes to the test set when the last byte of its hash
    # falls below 256 * test_ratio
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio
```
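A minimal sketch of how this check might drive the split, assuming `dataset` is a pandas DataFrame; the helper name `split_train_test_by_id` and the use of the row index as identifier are illustrative:

```python
import pandas as pd

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Use the row index as a stable identifier (new rows must only be appended)
dataset_with_id = dataset.reset_index()  # adds an "index" column
train_set, test_set = split_train_test_by_id(dataset_with_id, 0.3, "index")
```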
### Stratified Sampling
When one field is especially important, sample in strata over that field so that the training and test sets preserve the same proportions as the original dataset.

```python
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# dataset: the full dataset; "income" is the field to stratify on
for train_index, test_index in split.split(dataset, dataset["income"]):
    strat_train_set = dataset.loc[train_index]
    strat_test_set = dataset.loc[test_index]
```
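Note that the stratification field must hold a small number of discrete categories. If `income` were continuous, it would first need to be binned; a sketch with illustrative bin edges:

```python
import numpy as np
import pandas as pd

# Discretize a continuous column into strata (bin edges are illustrative)
dataset["income_cat"] = pd.cut(dataset["income"],
                               bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                               labels=[1, 2, 3, 4, 5])
```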
## Data Visualization
```python
import pandas as pd
```
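Only the import above survives from the original snippet. A sketch of the kind of exploratory plot this section points at, where the column names are illustrative:

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Pairwise scatter plots of a few columns ("income" and "age" are illustrative)
scatter_matrix(dataset[["income", "age"]], figsize=(8, 6))
plt.show()
```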
## Data Cleaning
### Handling Missing Values
- Drop the samples that contain missing values
- Drop the attributes that contain missing values
- Fill in the missing values (see the sketch below)
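A minimal sketch of the three options, assuming `dataset` is a pandas DataFrame and `"income"` stands in for a column with missing entries:

```python
from sklearn.impute import SimpleImputer

# Option 1: drop samples with missing values
dataset.dropna(subset=["income"])
# Option 2: drop the attribute entirely
dataset.drop("income", axis=1)
# Option 3: fill in missing values, here with the per-column median
imputer = SimpleImputer(strategy="median")
num_data = dataset.select_dtypes(include="number")  # imputers need numeric input
dataset_num = imputer.fit_transform(num_data)
```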
### Text Processing and One-Hot Encoding

```python
# Convert text categories to numbers
```
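Only the comment above survives from the original snippet. A sketch using scikit-learn's encoders, where `"category"` is an illustrative text column:

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

cat_data = dataset[["category"]]  # illustrative text column

# Map each text category to an integer code
ordinal_encoder = OrdinalEncoder()
cat_encoded = ordinal_encoder.fit_transform(cat_data)

# One-hot encode instead (returns a sparse matrix by default)
onehot_encoder = OneHotEncoder()
cat_onehot = onehot_encoder.fit_transform(cat_data)
```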
### Custom Transformers

```python
# Can be placed in a pipeline
```
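A sketch of a custom transformer that plugs into a pipeline; the class name and the added ratio feature are illustrative:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Append the ratio of the first two columns as a new feature (illustrative)."""

    def __init__(self, add_ratio=True):  # plain keyword args keep get_params() working
        self.add_ratio = add_ratio

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        if self.add_ratio:
            return np.c_[X, X[:, 0] / X[:, 1]]
        return X
```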
### Pipeline
```python
from sklearn.pipeline import Pipeline
```
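The rest of the original snippet is lost; a minimal pipeline sketch consistent with the steps used later in this post:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
prepared = num_pipeline.fit_transform(num_data)  # num_data: numeric columns
```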
## Models
### Computing RMSE and MAE
```python
from sklearn.metrics import mean_squared_error  # RMSE
```
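A sketch of computing both metrics from the formulas above, assuming `model` is a fitted estimator:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

predictions = model.predict(X)  # model: a fitted estimator (assumption)
rmse = np.sqrt(mean_squared_error(y, predictions))
mae = mean_absolute_error(y, predictions)
```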
### Cross-Validation
```python
from sklearn.model_selection import cross_val_score
```
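A sketch of 10-fold cross-validation; scikit-learn's scoring functions are utilities (higher is better), hence the negated MSE:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y,
                         scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-scores)  # flip the sign back before taking the root
```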
For the available values of the `scoring` parameter, see the scikit-learn documentation.
## Hyperparameter Selection
### Grid Search
```python
from sklearn.model_selection import GridSearchCV
```
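A sketch of an exhaustive search over a small grid; the estimator and grid values are illustrative:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = [{"n_estimators": [10, 30, 100], "max_features": [4, 6, 8]}]
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                           cv=5, scoring="neg_mean_squared_error")
grid_search.fit(X, y)
grid_search.best_params_
```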
### Randomized Search
```python
from sklearn.model_selection import RandomizedSearchCV
```
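A sketch that samples hyperparameters from distributions instead of enumerating a grid; the SVR estimator and the distributions are illustrative:

```python
from scipy.stats import expon, reciprocal
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Sample C log-uniformly via reciprocal (see its density below),
# and gamma from an exponential distribution
param_distribs = {
    "C": reciprocal(20, 200000),
    "gamma": expon(scale=1.0),
}
rnd_search = RandomizedSearchCV(SVR(kernel="rbf"),
                                param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(X, y)
```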
The probability density function of `reciprocal` is:
$$
f(x, a, b) = \frac{1}{x\log(b/a)} \qquad \textrm{for } a \le x \le b, \: b > a > 0
$$
For more distributions, see the official SciPy documentation.
## Confidence Interval
```python
from scipy import stats
```
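A sketch of a 95% t-based confidence interval for the generalization RMSE, assuming `predictions` and `y` from the earlier metric code:

```python
import numpy as np
from scipy import stats

confidence = 0.95
squared_errors = (predictions - y) ** 2
rmse_interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                         loc=squared_errors.mean(),
                                         scale=stats.sem(squared_errors)))
```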
## Selecting the k Most Important Features
```python
from sklearn.base import BaseEstimator, TransformerMixin
```
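Only the import survives; a sketch of a selector compatible with the `TopFeatureSelector(feature_importances, k)` step used in the next section:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

def indices_of_top_k(arr, k):
    # Indices of the k largest values, returned in ascending index order
    return np.sort(np.argpartition(np.asarray(arr), -k)[-k:])

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k

    def fit(self, X, y=None):
        self.feature_indices_ = indices_of_top_k(self.feature_importances, self.k)
        return self

    def transform(self, X):
        return X[:, self.feature_indices_]
```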
## Grid Search Including Data Preprocessing
To search over data-cleaning parameters as well, put the cleaning steps into a pipeline and address each parameter as `<step name>__<parameter name>`.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rp = Pipeline([
    ('imputer', SimpleImputer()),
    ('test_func', FunctionTransformer(test, validate=False)),  # test: user-defined function
    ('std_scaler', StandardScaler()),
    ('feature_selection', TopFeatureSelector(feature_importances, k)),
    ('lin_reg', LinearRegression()),
])
# feature_importances = grid_search.best_estimator_.feature_importances_
param_grid = [{
    'imputer__strategy': ['mean', 'median', 'most_frequent'],
    'feature_selection__k': list(range(1, len(feature_importances) + 1))
}]
grid_search_prep = GridSearchCV(rp, param_grid, cv=5,
                                scoring='neg_mean_squared_error',
                                verbose=2, n_jobs=4)
grid_search_prep.fit(X, y)
grid_search_prep.best_params_
```