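12. What the Pandas index is for

Uses of the Pandas index

Data stored in an ordinary column can also be queried, so what is gained by using an index?

In summary, the index provides:

more convenient data queries;
better query performance;
automatic data alignment;
support for more, and more powerful, data structures.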

import pandas as pd

df = pd.read_csv("./datas/ml-latest-small/ratings.csv")

df.head()

	userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931

df.count()

userId       100836
movieId      100836
rating       100836
timestamp    100836
dtype: int64

1. Querying data with the index

# drop=False keeps userId as a regular column as well as making it the index
df.set_index("userId", inplace=True, drop=False)

df.head()

	userId	movieId	rating	timestamp
userId
1	1	1	4.0	964982703
1	1	3	4.0	964981247
1	1	6	4.0	964982224
1	1	47	5.0	964983815
1	1	50	5.0	964982931

df.index

Int64Index([  1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
            ...
            610, 610, 610, 610, 610, 610, 610, 610, 610, 610],
           dtype='int64', name='userId', length=100836)

# Query by index label
df.loc[500].head(5)

	userId	movieId	rating	timestamp
userId
500	500	1	4.0	1005527755
500	500	11	1.0	1005528017
500	500	39	1.0	1005527926
500	500	101	1.0	1005527980
500	500	104	4.0	1005528065

# Query by a boolean condition on the column
df.loc[df["userId"] == 500].head()

	userId	movieId	rating	timestamp
userId
500	500	1	4.0	1005527755
500	500	11	1.0	1005528017
500	500	39	1.0	1005527926
500	500	101	1.0	1005527980
500	500	104	4.0	1005528065
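The two lookups above return the same rows. A minimal sketch that checks this on a small made-up table (the `toy` values are illustrative and do not come from ratings.csv):

```python
import pandas as pd

# Hypothetical toy data standing in for ratings.csv
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
})
toy_indexed = toy.set_index("userId", drop=False)

# Label lookup on the index; passing a list always returns a DataFrame
by_index = toy_indexed.loc[[2]]
# Boolean mask built from the column
by_column = toy_indexed.loc[toy_indexed["userId"] == 2]

print(by_index.equals(by_column))
```

The index form is shorter to write, which is the convenience this section refers to.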

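2. Using the index improves query performance

If the index is unique, Pandas uses a hash table, so a lookup is O(1);
If the index is not unique but is sorted, Pandas uses binary search, so a lookup is O(log N);
If the index is neither unique nor sorted, every lookup has to scan the whole index, so a lookup is O(N).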
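The three cases can be told apart with `Index.is_unique` and `Index.is_monotonic_increasing`. The sketch below builds one synthetic Series per case (the sizes and labels are assumptions chosen for illustration) and times a single-label lookup with the standard `timeit` module; absolute timings depend on the machine and the pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Non-unique labels in random order
random_idx = pd.Series(np.arange(n), index=rng.integers(0, 1_000, size=n))
# Same labels, sorted: non-unique but monotonic
sorted_idx = random_idx.sort_index()
# A permutation of 0..n-1: unique labels in random order
unique_idx = pd.Series(np.arange(n), index=rng.permutation(n))

for name, s in [("random", random_idx), ("sorted", sorted_idx), ("unique", unique_idx)]:
    seconds = timeit.timeit(lambda: s.loc[500], number=200)
    print(
        name,
        "unique:", s.index.is_unique,
        "monotonic:", s.index.is_monotonic_increasing,
        f"{seconds / 200 * 1e6:.1f} µs per lookup",
    )
```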

Experiment 1: lookup on a randomly ordered index

# Shuffle the rows into random order
from sklearn.utils import shuffle
df_shuffle = shuffle(df)

df_shuffle.head()

	userId	movieId	rating	timestamp
userId
160	160	2340	1.0	985383314
129	129	1136	3.5	1167375403
167	167	44191	4.5	1154718915
536	536	276	3.0	832839990
67	67	5952	2.0	1501274082

# Is the index monotonically increasing?
df_shuffle.index.is_monotonic_increasing

False

df_shuffle.index.is_unique

False

# Time the lookup of the rows with userId == 500
%timeit df_shuffle.loc[500]

376 µs ± 52.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
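The shuffle in this experiment comes from scikit-learn. If that dependency is not installed, `DataFrame.sample(frac=1)` is a pandas-only way to get a random row order. A minimal sketch on a made-up table (the `toy` data and the fixed `random_state` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the ratings table
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5, 1.0],
}).set_index("userId", drop=False)

# frac=1 draws every row exactly once, which gives a random permutation;
# random_state is fixed only to make the run repeatable
toy_shuffle = toy.sample(frac=1, random_state=42)

# Same labels as before, only the order may differ
print(sorted(toy_shuffle.index) == sorted(toy.index))
```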

Experiment 2: lookup after sorting the index

df_sorted = df_shuffle.sort_index()

df_sorted.head()

	userId	movieId	rating	timestamp
userId
1	1	2985	4.0	964983034
1	1	2617	2.0	964982588
1	1	3639	4.0	964982271
1	1	6	4.0	964982224
1	1	733	4.0	964982400

# Is the index monotonically increasing?
df_sorted.index.is_monotonic_increasing

True

df_sorted.index.is_unique

False

%timeit df_sorted.loc[500]

203 µs ± 20.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

3. The index aligns data automatically

This applies to both Series and DataFrame.

s1 = pd.Series([1,2,3], index=list("abc"))

s1

a    1
b    2
c    3
dtype: int64

s2 = pd.Series([2,3,4], index=list("bcd"))

s2

b    2
c    3
d    4
dtype: int64

s1+s2

a    NaN
b    4.0
c    6.0
d    NaN
dtype: float64
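Labels that exist on only one side produce NaN, as `a` and `d` do above. Two follow-up sketches: `Series.add` with `fill_value` treats a missing side as 0, and adding two DataFrames aligns on row labels and column labels at once. The frames `df1` and `df2` are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list("abc"))
s2 = pd.Series([2, 3, 4], index=list("bcd"))

# A label missing on one side counts as 0 instead of yielding NaN
filled = s1.add(s2, fill_value=0)
print(filled)

# Hypothetical frames with partly overlapping rows and columns
df1 = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
df2 = pd.DataFrame({"x": [10, 20], "y": [5, 6]}, index=["b", "c"])

# Only the cell present in both frames, row "b" / column "x", gets a value
total = df1 + df2
print(total)
```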

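4. The index supports more, and more powerful, data structures

Pandas provides several powerful index types:

CategoricalIndex, an index backed by categorical data, which can improve performance;
MultiIndex, a multi-level index, used for example for the result of a groupby over several keys;
DatetimeIndex, a time-based index with rich date and time methods.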
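A short sketch of each index type in use. All data below is made up for illustration; the `genre` column in particular is hypothetical and not part of ratings.csv.

```python
import pandas as pd

# DatetimeIndex: select a whole month with a partial date string
ts = pd.Series(
    range(4),
    index=pd.to_datetime(["2021-01-30", "2021-01-31", "2021-02-01", "2021-02-02"]),
)
january = ts.loc["2021-01"]

# MultiIndex: grouping by two keys yields a two-level index
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "genre": ["a", "b", "a", "a"],
    "rating": [4.0, 3.0, 5.0, 1.0],
})
mean_rating = toy.groupby(["userId", "genre"])["rating"].mean()

# CategoricalIndex: labels drawn from a small fixed set of categories
cat = pd.Series([1, 2, 3], index=pd.CategoricalIndex(["low", "high", "low"]))

print(january)
print(mean_rating.loc[(2, "a")])
print(cat.loc["low"])
```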

ZhangX1n's Blog
