Using pandas get_dummies for machine-learning feature processing

There are two kinds of categorical features:

  • Nominal categories: sex, color
  • Ordinal categories: rating, grade

A rating can be converted directly to 1, 2, 3, 4, 5, because the values have an inherent order and magnitude.
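As a quick sketch, an ordinal feature like a rating can be mapped to integers with `Series.map`; the rating labels and the mapping below are made-up assumptions for illustration:

```python
import pandas as pd

# Hypothetical ordinal rating column
df = pd.DataFrame({"rating": ["poor", "fair", "good", "great", "excellent"]})

# Explicit label -> integer mapping that preserves the order of the levels
order = {"poor": 1, "fair": 2, "good": 3, "great": 4, "excellent": 5}
df["rating_code"] = df["rating"].map(order)
print(df["rating_code"].tolist())  # [1, 2, 3, 4, 5]
```

An explicit mapping dict keeps the order under your control, unlike factorize-style encodings that assign codes by order of appearance.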

But for a category like color, encoding it directly as 1/2/3/4/5/6/7 is inappropriate, because the model would wrongly assume an order and magnitude among those numbers.

get_dummies is for features like color and sex; this transformation is also called one-hot encoding.

For example:

  • male: 1 0
  • female: 0 1

This is one-hot encoding, the standard way machine learning handles categorical features.
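A minimal sketch of what `pd.get_dummies` produces for a toy color Series (the color values are made up for illustration):

```python
import pandas as pd

# Toy nominal feature: each color becomes its own 0/1 column
colors = pd.Series(["red", "green", "blue", "red"])
dummies = pd.get_dummies(colors)
print(dummies)
```

Each row has exactly one "hot" column, and the columns are ordered alphabetically (blue, green, red), so no artificial ordering is imposed on the colors.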

1. Read the Titanic dataset

import pandas as pd
df_train = pd.read_csv("./datas/titanic/titanic_train.csv")
df_train.head()
   PassengerId  Survived  Pclass                                                Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
df_train.drop(columns=["Name", "Ticket", "Cabin"], inplace=True)
df_train.head()
   PassengerId  Survived  Pclass     Sex   Age  SibSp  Parch     Fare Embarked
0            1         0       3    male  22.0      1      0   7.2500        S
1            2         1       1  female  38.0      1      0  71.2833        C
2            3         1       3  female  26.0      0      0   7.9250        S
3            4         1       1  female  35.0      1      0  53.1000        S
4            5         0       3    male  35.0      0      0   8.0500        S
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB

Feature notes:

  • Numeric feature: Fare
  • Ordinal categorical feature: Age
  • Nominal categorical features: PassengerId, Pclass, Sex, SibSp, Parch, Embarked

Survived is the label to predict.

2. Ordinal categorical features can be handled as numbers

# Fill missing values with the mean age
df_train["Age"] = df_train["Age"].fillna(df_train["Age"].mean())
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.8+ KB

3. Nominal (unordered) categorical features can be encoded with get_dummies

This is simply one-hot encoding.

# encode a single Series
pd.get_dummies(df_train["Sex"]).head()
   female  male
0       0     1
1       1     0
2       1     0
3       1     0
4       0     1

Note that one-hot encoding generally drops one column; otherwise you run into the dummy variable trap: a person is either male or female, so one column is fully determined by the other. https://www.geeksforgeeks.org/ml-dummy-variable-trap-in-regression-models/
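A small sketch of `drop_first=True` on a toy Sex Series (hypothetical data): the first level in alphabetical order, `female`, is dropped, and the remaining `male` column alone carries the full information:

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"])
# drop_first=True removes the "female" column; male=0 now implies female
encoded = pd.get_dummies(sex, drop_first=True)
print(encoded)
```

With k levels, `drop_first=True` produces k-1 columns, which removes the perfect collinearity that causes the dummy variable trap in linear models.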

# convenient form: encode the whole DataFrame at once
needcode_cat_columns = ["Pclass","Sex","SibSp","Parch","Embarked"]
df_coded = pd.get_dummies(
    df_train,
    # columns to encode
    columns=needcode_cat_columns,
    # prefix for the generated column names
    prefix=needcode_cat_columns,
    # encode missing values as well
    dummy_na=True,
    # drop the first level of each feature (avoids the dummy variable trap)
    drop_first=True
)
df_coded.head()
   PassengerId  Survived   Age     Fare  Pclass_2.0  Pclass_3.0  Pclass_nan  Sex_male  Sex_nan  SibSp_1.0  ...  Parch_1.0  Parch_2.0  Parch_3.0  Parch_4.0  Parch_5.0  Parch_6.0  Parch_nan  Embarked_Q  Embarked_S  Embarked_nan
0            1         0  22.0   7.2500           0           1           0         1        0          1  ...          0          0          0          0          0          0          0           0           1             0
1            2         1  38.0  71.2833           0           0           0         0        0          1  ...          0          0          0          0          0          0          0           0           0             0
2            3         1  26.0   7.9250           0           1           0         0        0          0  ...          0          0          0          0          0          0          0           0           1             0
3            4         1  35.0  53.1000           0           0           0         0        0          1  ...          0          0          0          0          0          0          0           0           1             0
4            5         0  35.0   8.0500           0           1           0         1        0          0  ...          0          0          0          0          0          0          0           0           1             0

5 rows × 26 columns
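Instead of listing the categorical columns by hand, `DataFrame.select_dtypes` can find the string-typed (`object`) columns automatically; the tiny frame below is a made-up stand-in for the cleaned Titanic data:

```python
import pandas as pd

# Hypothetical frame mirroring a few of the cleaned Titanic columns
df = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Fare": [7.25, 71.28],
    "Embarked": ["S", "C"],
})

# object-dtype columns are the string-valued candidates for one-hot encoding
obj_cols = df.select_dtypes(include="object").columns.tolist()
print(obj_cols)  # ['Sex', 'Embarked']
```

Note this would not catch numeric-coded categories like Pclass or SibSp, which is why the tutorial lists `needcode_cat_columns` explicitly.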

4. Train a machine-learning model

y = df_coded.pop("Survived")
y.head()
0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
X = df_coded
X.head()
   PassengerId   Age     Fare  Pclass_2.0  Pclass_3.0  Pclass_nan  Sex_male  Sex_nan  SibSp_1.0  SibSp_2.0  ...  Parch_1.0  Parch_2.0  Parch_3.0  Parch_4.0  Parch_5.0  Parch_6.0  Parch_nan  Embarked_Q  Embarked_S  Embarked_nan
0            1  22.0   7.2500           0           1           0         1        0          1          0  ...          0          0          0          0          0          0          0           0           1             0
1            2  38.0  71.2833           0           0           0         0        0          1          0  ...          0          0          0          0          0          0          0           0           0             0
2            3  26.0   7.9250           0           1           0         0        0          0          0  ...          0          0          0          0          0          0          0           0           1             0
3            4  35.0  53.1000           0           0           0         0        0          1          0  ...          0          0          0          0          0          0          0           0           1             0
4            5  35.0   8.0500           0           1           0         1        0          0          0  ...          0          0          0          0          0          0          0           0           1             0

5 rows × 25 columns

from sklearn.linear_model import LogisticRegression
# create the model object
logreg = LogisticRegression(solver='liblinear')
 
# train the model
logreg.fit(X, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)
logreg.score(X, y)
0.8148148148148148
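Note that the 0.8148 above is accuracy measured on the same data the model was trained on, which is optimistic. A sketch of 5-fold cross-validation with `cross_val_score` gives a more honest estimate; a synthetic dataset from `make_classification` stands in for X and y here so the snippet is self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded Titanic (X, y)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Train on 4 folds, score on the held-out 5th, repeated 5 times
scores = cross_val_score(LogisticRegression(solver="liblinear"), X_demo, y_demo, cv=5)
print(scores.mean())
```

On the real Titanic features, replacing `X_demo, y_demo` with `X, y` would typically report a lower (but more trustworthy) accuracy than the training-set score.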