Pandas怎样实现DataFrame的Merge
Pandas的Merge,相当于Sql的Join,将不同的表按key关联到一个表
merge的语法:
pd.merge(left, right, how=‘inner’, on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=(‘_x’, ‘_y’), copy=True, indicator=False,
validate=None)
- left,right:要merge的dataframe或者有name的Series
- how:join类型,‘left’, ‘right’, ‘outer’, ‘inner’
- on:join的key,left和right都需要有这个key
- left_on:left的df或者series的key
- right_on:right的df或者seires的key
- left_index,right_index:使用index而不是普通的column做join
- suffixes:两个元素的后缀,如果列有重名,自动添加后缀,默认是(‘_x’, ‘_y’)
文档地址:https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
本次讲解提纲:
- 电影数据集的join实例
- 理解merge时一对一、一对多、多对多的数量对齐关系
- 理解left join、right join、inner join、outer join的区别
- 如果出现非Key的字段重名怎么办
1、电影数据集的join实例
电影评分数据集
是推荐系统研究的很好的数据集
位于本代码目录:./datas/movielens-1m
包含三个文件:
- 用户对电影的评分数据 ratings.dat
- 用户本身的信息数据 users.dat
- 电影本身的数据 movies.dat
可以关联三个表,得到一个完整的大表
数据集官方地址:https://grouplens.org/datasets/movielens/
| UserID | MovieID | Rating | Timestamp |
---|
0 | 1 | 1193 | 5 | 978300760 |
1 | 1 | 661 | 3 | 978302109 |
2 | 1 | 914 | 3 | 978301968 |
3 | 1 | 3408 | 4 | 978300275 |
4 | 1 | 2355 | 5 | 978824291 |
| UserID | Gender | Age | Occupation | Zip-code |
---|
0 | 1 | F | 1 | 10 | 48067 |
1 | 2 | M | 56 | 16 | 70072 |
2 | 3 | M | 25 | 15 | 55117 |
3 | 4 | M | 45 | 7 | 02460 |
4 | 5 | M | 25 | 20 | 55455 |
| MovieID | Title | Genres |
---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
| UserID | MovieID | Rating | Timestamp | Gender | Age | Occupation | Zip-code |
---|
0 | 1 | 1193 | 5 | 978300760 | F | 1 | 10 | 48067 |
1 | 1 | 661 | 3 | 978302109 | F | 1 | 10 | 48067 |
2 | 1 | 914 | 3 | 978301968 | F | 1 | 10 | 48067 |
3 | 1 | 3408 | 4 | 978300275 | F | 1 | 10 | 48067 |
4 | 1 | 2355 | 5 | 978824291 | F | 1 | 10 | 48067 |
| UserID | MovieID | Rating | Timestamp | Gender | Age | Occupation | Zip-code | Title | Genres |
---|
0 | 1 | 1193 | 5 | 978300760 | F | 1 | 10 | 48067 | One Flew Over the Cuckoo's Nest (1975) | Drama |
1 | 2 | 1193 | 5 | 978298413 | M | 56 | 16 | 70072 | One Flew Over the Cuckoo's Nest (1975) | Drama |
2 | 12 | 1193 | 4 | 978220179 | M | 25 | 12 | 32793 | One Flew Over the Cuckoo's Nest (1975) | Drama |
3 | 15 | 1193 | 4 | 978199279 | M | 25 | 7 | 22903 | One Flew Over the Cuckoo's Nest (1975) | Drama |
4 | 17 | 1193 | 5 | 978158471 | M | 50 | 1 | 95350 | One Flew Over the Cuckoo's Nest (1975) | Drama |
5 | 18 | 1193 | 4 | 978156168 | F | 18 | 3 | 95825 | One Flew Over the Cuckoo's Nest (1975) | Drama |
6 | 19 | 1193 | 5 | 982730936 | M | 1 | 10 | 48073 | One Flew Over the Cuckoo's Nest (1975) | Drama |
7 | 24 | 1193 | 5 | 978136709 | F | 25 | 7 | 10023 | One Flew Over the Cuckoo's Nest (1975) | Drama |
8 | 28 | 1193 | 3 | 978125194 | F | 25 | 1 | 14607 | One Flew Over the Cuckoo's Nest (1975) | Drama |
9 | 33 | 1193 | 5 | 978557765 | M | 45 | 3 | 55421 | One Flew Over the Cuckoo's Nest (1975) | Drama |
2、理解merge时数量的对齐关系
以下关系要正确理解:
- one-to-one:一对一关系,关联的key都是唯一的
- 比如(学号,姓名) merge (学号,年龄)
- 结果条数为:1*1
- one-to-many:一对多关系,左边唯一key,右边不唯一key
- 比如(学号,姓名) merge (学号,[语文成绩、数学成绩、英语成绩])
- 结果条数为:1*N
- many-to-many:多对多关系,左边右边都不是唯一的
- 比如(学号,[语文成绩、数学成绩、英语成绩]) merge (学号,[篮球、足球、乒乓球])
- 结果条数为:M*N
2.1 one-to-one 一对一关系的merge
| sno | name |
---|
0 | 11 | name_a |
1 | 12 | name_b |
2 | 13 | name_c |
3 | 14 | name_d |
| sno | age |
---|
0 | 11 | 21 |
1 | 12 | 22 |
2 | 13 | 23 |
3 | 14 | 24 |
| sno | name | age |
---|
0 | 11 | name_a | 21 |
1 | 12 | name_b | 22 |
2 | 13 | name_c | 23 |
3 | 14 | name_d | 24 |
2.2 one-to-many 一对多关系的merge
注意:数据会被复制
| sno | name |
---|
0 | 11 | name_a |
1 | 12 | name_b |
2 | 13 | name_c |
3 | 14 | name_d |
| sno | grade |
---|
0 | 11 | 语文88 |
1 | 11 | 数学90 |
2 | 11 | 英语75 |
3 | 12 | 语文66 |
4 | 12 | 数学55 |
5 | 13 | 英语29 |
| sno | name | grade |
---|
0 | 11 | name_a | 语文88 |
1 | 11 | name_a | 数学90 |
2 | 11 | name_a | 英语75 |
3 | 12 | name_b | 语文66 |
4 | 12 | name_b | 数学55 |
5 | 13 | name_c | 英语29 |
2.3 many-to-many 多对多关系的merge
注意:结果数量会出现乘法
| sno | 爱好 |
---|
0 | 11 | 篮球 |
1 | 11 | 羽毛球 |
2 | 12 | 乒乓球 |
3 | 12 | 篮球 |
4 | 12 | 足球 |
| sno | grade |
---|
0 | 11 | 语文88 |
1 | 11 | 数学90 |
2 | 11 | 英语75 |
3 | 12 | 语文66 |
4 | 12 | 数学55 |
5 | 13 | 英语29 |
| sno | 爱好 | grade |
---|
0 | 11 | 篮球 | 语文88 |
1 | 11 | 篮球 | 数学90 |
2 | 11 | 篮球 | 英语75 |
3 | 11 | 羽毛球 | 语文88 |
4 | 11 | 羽毛球 | 数学90 |
5 | 11 | 羽毛球 | 英语75 |
6 | 12 | 乒乓球 | 语文66 |
7 | 12 | 乒乓球 | 数学55 |
8 | 12 | 篮球 | 语文66 |
9 | 12 | 篮球 | 数学55 |
10 | 12 | 足球 | 语文66 |
11 | 12 | 足球 | 数学55 |
3、理解left join、right join、inner join、outer join的区别
| key | A | B |
---|
0 | K0 | A0 | B0 |
1 | K1 | A1 | B1 |
2 | K2 | A2 | B2 |
3 | K3 | A3 | B3 |
| key | C | D |
---|
0 | K0 | C0 | D0 |
1 | K1 | C1 | D1 |
2 | K4 | C4 | D4 |
3 | K5 | C5 | D5 |
3.1 inner join,默认
左边和右边的key都有,才会出现在结果里
| key | A | B | C | D |
---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
3.2 left join
左边的都会出现在结果里,右边的如果无法匹配则为Null
| key | A | B | C | D |
---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | A2 | B2 | NaN | NaN |
3 | K3 | A3 | B3 | NaN | NaN |
3.3 right join
右边的都会出现在结果里,左边的如果无法匹配则为Null
| key | A | B | C | D |
---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K4 | NaN | NaN | C4 | D4 |
3 | K5 | NaN | NaN | C5 | D5 |
3.4 outer join
左边、右边的都会出现在结果里,如果无法匹配则为Null
| key | A | B | C | D |
---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | A2 | B2 | NaN | NaN |
3 | K3 | A3 | B3 | NaN | NaN |
4 | K4 | NaN | NaN | C4 | D4 |
5 | K5 | NaN | NaN | C5 | D5 |
4、如果出现非Key的字段重名怎么办
| key | A | B |
---|
0 | K0 | A0 | B0 |
1 | K1 | A1 | B1 |
2 | K2 | A2 | B2 |
3 | K3 | A3 | B3 |
| key | A | D |
---|
0 | K0 | A10 | D0 |
1 | K1 | A11 | D1 |
2 | K4 | A12 | D4 |
3 | K5 | A13 | D5 |
| key | A_x | B | A_y | D |
---|
0 | K0 | A0 | B0 | A10 | D0 |
1 | K1 | A1 | B1 | A11 | D1 |
| key | A_left | B | A_right | D |
---|
0 | K0 | A0 | B0 | A10 | D0 |
1 | K1 | A1 | B1 | A11 | D1 |