Internship: Kaggle OTTO Competition Review

2024-03-14 | Study Notes | Kaggle

1. Competition Task

The goal of this competition is to predict e-commerce clicks, cart additions, and orders. You’ll build a multi-objective recommender system based on previous events in a user session.

The training data contains full e-commerce session information. For each session in the test data, your task is to predict the aid values for each session type that occur after the last timestamp ts in the test session. In other words, the test data contains sessions truncated by timestamp, and you are to predict what occurs after the point of truncation.

Note: the prediction for each session type is a list of 20 item IDs (aid values).

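Summary: given the item IDs of each user's actions (click, add-to-cart, order) over time, predict for each of the three action types the 20 items the user is most likely to interact with next.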

2. Dataset

Data description:

2,899,779 sessions
1,855,603 items
216,716,096 events
194,720,954 clicks
16,896,191 carts
5,098,951 orders

Training data:

Sample submission file:

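3. Approach

Candidate Generation

The Candidate Generation method

Why: a dataset of this size cannot be fed into a model directly, so the work is split into two stages: generate candidates first, then re-rank them.

Step 1 - Generate Candidates
Some criteria for selecting candidate items:

Previously purchased items
Repurchased items
Overall most popular items
Similar items based on some clustering technique
Similar items based on something like a co-visitation matrix

Step 2 - ReRank and Choose 20
After the previous step far fewer items remain, and 20 of them can then be chosen with a ranker model or with handcrafted rules: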

Ranker Model
Handcrafted Rules

What is the co-visitation matrix, really?

It is very interesting to think of modern techniques in the context of their roots.

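“Radek is a _”. To predict the word in the blank, a trigram model looks at "Radek", "is", and "a", and counts which word most often follows these three words.

But the exact sequence "Radek", "is", "a" may not appear often enough in the data.
This is where RNNs (or word2vec) come in: they can also learn from similar examples such as “Radek was an _” or "Tommy is a __".

So what does this have to do with the co-visitation matrix?
The co-visitation matrix counts how often two actions occur close together.
If a user bought A and shortly afterwards bought B, we store the two together.
We accumulate these counts and use recent history to estimate the probability of future actions.
It is very important to understand what is happening in the co-visitation matrix approach…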
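
To make the counting idea concrete, here is a minimal sketch in plain Python. It is not the notebook's implementation: the function name `build_covisitation`, the toy sessions, and the choice to count each ordered pair at most once per session are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations

def build_covisitation(sessions, max_gap=24 * 60 * 60):
    """Count how often two different items occur close together in a session.

    sessions maps a session id to a list of (aid, ts) events, ts in seconds.
    """
    counts = defaultdict(Counter)
    for events in sessions.values():
        seen = set()  # count each ordered pair at most once per session
        for (aid_x, ts_x), (aid_y, ts_y) in permutations(events, 2):
            if aid_x == aid_y or abs(ts_x - ts_y) >= max_gap:
                continue
            if (aid_x, aid_y) not in seen:
                seen.add((aid_x, aid_y))
                counts[aid_x][aid_y] += 1
    return counts

toy_sessions = {
    1: [(10, 0), (20, 60), (30, 120)],
    2: [(10, 0), (20, 30)],
    3: [(10, 0), (40, 200_000)],  # more than a day apart, so ignored
}
covisit = build_covisitation(toy_sessions)
# Items most often seen together with item 10, best first
top_for_10 = [aid for aid, _ in covisit[10].most_common(2)]
```

Looking up `covisit[aid]` for the items in a user's recent history and taking the most common partners is the "estimate future actions from recent history" step described above.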

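Candidate ReRank Model - [LB 0.575]

Step 1 - Generate Candidates
Candidate items are generated for each user from five sources:

The user's own history of clicks, carts, and orders
The 20 most popular clicks, carts, and orders during the test week
A co-visitation matrix from click/cart/order to cart/order, with type weighting
A co-visitation matrix from cart/order to cart/order, called buy2buy
A co-visitation matrix from click/cart/order to clicks, with time weighting

Step 2 - ReRank and Choose 20
Twenty items are chosen from the candidate list above as the final prediction, in this order of priority:

Most recently visited items
Items visited multiple times before
Items previously in a cart or order
Co-visitation matrix of cart/order to cart/order
Currently popular items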

Candidate generation

“Carts Orders” Co-visitation Matrix - Type Weighted

"""
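How the matrix is built:
Sort the events by session and time
Number the rows of each session with df.groupby('session').cumcount()
Keep the 30 most recent events of each session
Pair every event with every other event of the same session via df.merge(df, on='session') (this produces item-to-item relations)
Keep pairs that are less than 1 day apart and have different item IDs, and drop duplicates
Weight each pair by the event type of the second item
Sum the weights to get one weight per item pair
"""
# Assumes df has the columns session, aid, ts (in seconds) and type,
# and that PART and SIZE are set by an outer loop over item-ID ranges.
type_weight = {0:1, 1:6, 2:3}
df = df.sort_values(['session', 'ts'], ascending=[True, False])  # sort by session ascending, then ts descending
# Use the tail of each session
df = df.reset_index(drop=True)  # reset the index and drop the old one
df['n'] = df.groupby('session').cumcount()  # row number within each session
df = df.loc[df.n < 30].drop('n', axis=1)  # keep the first 30 rows per session (the 30 most recent events), then drop the counter
# Create pairs
df = df.merge(df, on='session')  # self-merge on session to pair up events
df = df.loc[((df.ts_x - df.ts_y).abs() < 24 * 60 * 60) & (df.aid_x != df.aid_y)]  # keep pairs less than 1 day apart with different aids
# Memory management: compute one part at a time
df = df.loc[(df.aid_x >= PART*SIZE) & (df.aid_x < (PART+1)*SIZE)]  # keep only the aid_x range of the current part
# Assign weights
df = df[['session', 'aid_x', 'aid_y', 'type_y']].drop_duplicates(['session', 'aid_x', 'aid_y'])  # one row per (session, aid_x, aid_y)
df['wgt'] = df.type_y.map(type_weight)  # weight by the type of the second event
df = df[['aid_x', 'aid_y', 'wgt']]  # keep only aid_x, aid_y, wgt
df.wgt = df.wgt.astype('float32')  # cast wgt to float32
df = df.groupby(['aid_x', 'aid_y']).wgt.sum()  # sum the weights per item pair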

“Buy2Buy” Co-visitation Matrix

"""
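How the matrix is built:
Sort the events by session and time
Number the rows of each session with df.groupby('session').cumcount()
Keep the 30 most recent events of each session
Pair every event with every other event of the same session via df.merge(df, on='session') (this produces item-to-item relations)
Keep pairs that are less than 14 days apart and have different item IDs, and drop duplicates
Every pair gets weight 1 (these are cart/order to cart/order relations); the weights are then summed per item pair
The result is one weight per item pair
"""
# Same assumptions as above: df has session, aid, ts, type; PART and SIZE come from an outer loop.
df = df.loc[df['type'].isin([1,2])] # keep carts and orders only
df = df.sort_values(['session','ts'],ascending=[True,False]) # sort by session ascending, then ts descending
# Use the tail of each session
df = df.reset_index(drop=True) # reset the index and drop the old one
df['n'] = df.groupby('session').cumcount() # row number within each session
# Keep the 30 most recent events of each session
df = df.loc[df.n<30].drop('n',axis=1) # keep the first 30 rows per session, then drop the counter
# Create pairs
df = df.merge(df, on='session') # self-merge on session to pair up events
df = df.loc[((df.ts_x - df.ts_y).abs() < 14 * 24 * 60 * 60) & (df.aid_x != df.aid_y)] # keep pairs less than 14 days apart with different aids
# Memory management: compute one part at a time
df = df.loc[(df.aid_x >= PART*SIZE) & (df.aid_x < (PART+1)*SIZE)] # keep only the aid_x range of the current part
# Assign weights
df = df[['session', 'aid_x', 'aid_y', 'type_y']].drop_duplicates(['session', 'aid_x', 'aid_y']) # one row per (session, aid_x, aid_y)
df['wgt'] = 1 # every pair counts once
df = df[['aid_x', 'aid_y', 'wgt']] # keep only aid_x, aid_y, wgt
df.wgt = df.wgt.astype('float32') # cast wgt to float32
df = df.groupby(['aid_x', 'aid_y']).wgt.sum() # sum the weights per item pair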

“Clicks” Co-visitation Matrix - Time Weighted

"""
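How the matrix is built:
Sort the events by session and time
Number the rows of each session with df.groupby('session').cumcount()
Keep the 30 most recent events of each session
Pair every event with every other event of the same session via df.merge(df, on='session') (this produces item-to-item relations)
Keep pairs that are less than 1 day apart and have different item IDs, and drop duplicates
Weight each pair by time: the more recent the event, the larger the weight
Sum the weights to get one weight per item pair

"""
# Same assumptions as above: df has session, aid, ts, type; PART and SIZE come from an outer loop.
df = df.sort_values(['session', 'ts'], ascending=[True, False])  # sort by session ascending, then ts descending
# Use the tail of each session
df = df.reset_index(drop=True)  # reset the index and drop the old one
df['n'] = df.groupby('session').cumcount()  # row number within each session
df = df.loc[df.n < 30].drop('n', axis=1)  # keep the first 30 rows per session, then drop the counter
# Create pairs
df = df.merge(df, on='session')  # self-merge on session to pair up events
df = df.loc[((df.ts_x - df.ts_y).abs() < 24 * 60 * 60) & (df.aid_x != df.aid_y)]  # keep pairs less than 1 day apart with different aids
# Memory management: compute one part at a time
df = df.loc[(df.aid_x >= PART*SIZE) & (df.aid_x < (PART+1)*SIZE)]  # keep only the aid_x range of the current part
# Assign weights
df = df[['session', 'aid_x', 'aid_y', 'ts_x']].drop_duplicates(['session', 'aid_x', 'aid_y'])  # one row per (session, aid_x, aid_y)
df['wgt'] = 1 + 3*(df.ts_x - 1659304800) / (1662328791 - 1659304800)  # weight grows linearly with ts_x between the two timestamp constants
df = df[['aid_x', 'aid_y', 'wgt']]  # keep only aid_x, aid_y, wgt
df.wgt = df.wgt.astype('float32')  # cast wgt to float32
df = df.groupby(['aid_x', 'aid_y']).wgt.sum()  # sum the weights per item pair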
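
The time-weight formula is easier to read with numbers plugged in. The sketch below only re-evaluates the expression from the snippet above; reading the two constants as the first and last timestamps of the data is my assumption, and `time_weight` is a hypothetical helper name.

```python
TS_START = 1659304800  # first constant from the snippet above (assumed: earliest timestamp)
TS_END = 1662328791    # second constant from the snippet above (assumed: latest timestamp)

def time_weight(ts):
    """Linear weight: 1 at TS_START, 4 at TS_END."""
    return 1 + 3 * (ts - TS_START) / (TS_END - TS_START)

w_start = time_weight(TS_START)
w_mid = time_weight((TS_START + TS_END) / 2)
w_end = time_weight(TS_END)
```

Under that reading, a pair observed at the very end of the data counts four times as much as one observed at the very beginning.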

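# Once any of the matrices has been built, the summed weights (tmp, the result of the groupby above) go through the following steps
tmp = tmp.reset_index()
# Sort by weight within each item
tmp = tmp.sort_values(['aid_x','wgt'],ascending=[True,False])
tmp = tmp.reset_index(drop=True)
# Keep the top 15 related items per item (the cutoff is not fixed)
tmp['n'] = tmp.groupby('aid_x').aid_y.cumcount()
tmp = tmp.loc[tmp.n<15].drop('n',axis=1)
# Convert to a dict of the form {aid_x: [aid_y, ...]}
tmp.groupby('aid_x').aid_y.apply(list).to_dict()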

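'''
Summary:
1. "Carts Orders" Co-visitation Matrix - Type Weighted
Meaning: item-to-item weights derived from all of the user's past events, weighted by event type; used as top_20_buys
2. "Buy2Buy" Co-visitation Matrix
Meaning: item-to-item weights derived from the user's cart and order events only; used as top_20_buy2buy
3. "Clicks" Co-visitation Matrix - Time Weighted
Meaning: item-to-item weights derived from when the events happened; used as top_20_clicks
'''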

Re-ranking and final selection

import itertools
from collections import Counter

import numpy as np

type_weight_multipliers = {0: 1, 1: 6, 2: 3}

def suggest_clicks(df):
    # USER HISTORY AIDS AND TYPES
    aids = df.aid.tolist()  # item IDs in the user's history
    ty = df.type.tolist()  # event types in the user's history
    unique_aids = list(dict.fromkeys(aids[::-1]))  # unique item IDs, most recent first
    # With 20 or more unique items in the history, rank the history itself
    if len(unique_aids) >= 20:
        # Recency weights: later events get larger weights
        weights = np.logspace(0.1, 1, len(aids), base=2, endpoint=True) - 1
        # Accumulated weight per item ID
        aids_temp = Counter()
        for aid, w, t in zip(aids, weights, ty):
            # item weight = recency weight * type weight
            aids_temp[aid] += w * type_weight_multipliers[t]
        # Return the 20 item IDs with the largest weight
        sorted_aids = [k for k,v in aids_temp.most_common(20)]
        return sorted_aids
    # Fewer than 20: look up the items most related to each history item in the clicks co-visitation matrix
    aids2 = list(itertools.chain(*[top_20_clicks[aid] for aid in unique_aids if aid in top_20_clicks]))
    # Keep the related items that occur most often and are not already in the history
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2).most_common(20) if aid2 not in unique_aids]
    # Append them to the user's history items
    result = unique_aids + top_aids2[:20 - len(unique_aids)]
    # If there are still fewer than 20, pad with the most-clicked items in the test data
    return result + list(top_clicks)[:20 - len(result)]

def suggest_buys(df):
    # USER HISTORY AIDS AND TYPES
    aids = df.aid.tolist()  # item IDs in the user's history
    ty = df.type.tolist()  # event types in the user's history
    unique_aids = list(dict.fromkeys(aids[::-1]))  # unique item IDs over all events, most recent first
    # Keep only the user's cart and order events
    df = df.loc[(df['type'] == 1) | (df['type'] == 2)]
    # Unique item IDs that were carted or ordered, most recent first
    unique_buys = list(dict.fromkeys(df.aid.tolist()[::-1]))
    # With 20 or more unique items in the history, rank the history itself
    if len(unique_aids) >= 20:
        # Recency weights
        weights = np.logspace(0.5, 1, len(aids), base=2, endpoint=True) - 1
        aids_temp = Counter()
        # item weight = recency weight * type weight
        for aid, w, t in zip(aids, weights, ty):
            aids_temp[aid] += w * type_weight_multipliers[t]
        # Items most related to the user's carted/ordered items according to buy2buy
        aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
        # Give each of these items a small bonus of 0.1
        for aid in aids3: aids_temp[aid] += 0.1
        # Return the 20 item IDs with the largest weight
        sorted_aids = [k for k, v in aids_temp.most_common(20)]
        return sorted_aids
    # Items most related to all of the user's history items
    aids2 = list(itertools.chain(*[top_20_buys[aid] for aid in unique_aids if aid in top_20_buys]))
    # Items most related to the user's carted/ordered items
    aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
    # Count how often each related item occurs and drop those already in the history
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2 + aids3).most_common(20) if aid2 not in unique_aids]
    # Final result: history items followed by the most related items
    result = unique_aids + top_aids2[:20 - len(unique_aids)]
    # If there are still fewer than 20, pad with the most-ordered items in the test data
    return result + list(top_orders)[:20 - len(result)]
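
The recency weights are the least obvious part of the two functions above, so here is a small worked sketch. `recency_weights` is a hypothetical pure-Python stand-in that should produce the same values as `np.logspace(start, stop, n, base=2) - 1`; the toy history is made up.

```python
from collections import Counter

def recency_weights(n, start=0.1, stop=1.0, base=2.0):
    """Exponentially growing weights; the last (most recent) event gets base**stop - 1."""
    if n == 1:
        return [base ** start - 1]
    step = (stop - start) / (n - 1)
    return [base ** (start + i * step) - 1 for i in range(n)]

TYPE_WEIGHT = {0: 1, 1: 6, 2: 3}  # same multipliers as type_weight_multipliers above

aids = [11, 22, 11, 33]   # toy history, oldest to newest
types = [0, 0, 1, 0]      # the second visit to item 11 is a cart event
weights = recency_weights(len(aids))

scores = Counter()
for aid, w, t in zip(aids, weights, types):
    scores[aid] += w * TYPE_WEIGHT[t]   # recency weight * type weight

ranking = [aid for aid, _ in scores.most_common()]
```

In this toy history item 11 ends up first even though item 33 is the most recent event, because its cart event is multiplied by 6.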

# Generate the predictions
pred_df_clicks = test_df.sort_values(["session", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_clicks(x)
)

pred_df_buys = test_df.sort_values(["session", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_buys(x)
)
clicks_pred_df = pd.DataFrame(pred_df_clicks.add_suffix("_clicks"), columns=["labels"]).reset_index()
orders_pred_df = pd.DataFrame(pred_df_buys.add_suffix("_orders"), columns=["labels"]).reset_index()
carts_pred_df = pd.DataFrame(pred_df_buys.add_suffix("_carts"), columns=["labels"]).reset_index()
pred_df = pd.concat([clicks_pred_df, orders_pred_df, carts_pred_df])
pred_df.columns = ["session_type", "labels"]
pred_df["labels"] = pred_df.labels.apply(lambda x: " ".join(map(str,x)))
pred_df.to_csv("submission.csv", index=False)
pred_df.head()

Co-visitation Matrix

Step 1 - Generate Candidates

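Some items are frequently clicked and bought together; this idea is used to build a co-visitation matrix.
- First, we look at all pairs of events within the same session that are close to each other in time (< 1 day). We compute the co-visitation matrix $M_{aid1,aid2}$ by counting each pair of events over all sessions.
- For each item ID, we take the 20 most frequent partners: $aid2 = argsort(M[aid])[-20:]$

Step 2 - ReRank and Choose 20

The 20 most frequent items from the candidate list above are chosen as the final prediction.

Candidate generation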

import gc
import glob
import multiprocessing
from collections import Counter

import numpy as np
import pandas as pd
from tqdm import tqdm

# SAMPLING and DEBUG are configuration constants assumed to be defined elsewhere in the notebook

def gen_pairs(df):
    # Sample the sessions and keep the last 30 events of each one
    df = df.query('session % @SAMPLING == 0').groupby('session', as_index=False, sort=False).apply(lambda g: g.tail(30)).reset_index(drop=True)
    df = pd.merge(df, df, on='session')  # self-join within each session
    # Keep pairs of different items less than 1 day apart (ts is in milliseconds here)
    pairs = df.query('abs(ts_x - ts_y) < 24 * 60 * 60 * 1000 and aid_x != aid_y')[['session', 'aid_x', 'aid_y']].drop_duplicates()
    return pairs[['aid_x', 'aid_y']].values  # array of (aid_x, aid_y) pairs

def gen_aid_pairs():
    all_pair_chunks = []  # pair arrays collected from every chunk file
    with tqdm(glob.glob('../input/otto-chunk-data-inparquet-format/*_parquet/*'), desc='Chunks') as prog:  # progress bar over the chunk files
        for idx, chunk_file in enumerate(prog):
            with multiprocessing.Pool() as p:  # worker pool for this chunk
                chunk = pd.read_parquet(chunk_file).drop(columns=['type'])  # read the chunk; the event type is not needed
                pair_chunks = p.map(gen_pairs, np.array_split(chunk, 120))  # split the chunk and generate pairs in parallel
                pair_chunks = np.concatenate(pair_chunks, axis=0)  # stack the results into one array
                all_pair_chunks.append(pair_chunks)

                if DEBUG and idx >= 3:  # in DEBUG mode stop after a few chunks
                    break
                del chunk, pair_chunks  # free the memory of this chunk
                gc.collect()

    df = pd.DataFrame(data=np.concatenate(all_pair_chunks), columns=['aid1', 'aid2'])  # all pairs in one dataframe
    top_aids = df.groupby('aid1').apply(lambda df: Counter(df.aid2).most_common(40)).to_dict()  # the 40 most common aid2 for every aid1
    return top_aids
#    top_aids has the following structure:
#    {aid1: [(aid2, count), (aid2, count), ...], aid1: [(aid2, count), (aid2, count), ...], ...}

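Re-ranking and final selection

import itertools
from collections import Counter

# top_40_cnt is assumed to be the dict returned by gen_aid_pairs() above

def suggest_aids(df):
    # Take the item IDs of the user's last 20 events
    aids = df.tail(20).aid.tolist()

    if len(aids) >= 20:
        return aids

    # Fewer than 20 events: look up the items most related to them in top_40_cnt
    aids = set(aids)
    new_aids = Counter()
    for aid in aids:
        # each entry is a list of (aid2, count) pairs, so convert it to a mapping before adding the counts
        new_aids.update(dict(top_40_cnt.get(aid, [])))
    # Pick the related items with the highest total count
    top_aids2 = [aid2 for aid2, cnt in new_aids.most_common(20) if aid2 not in aids]
    return list(aids) + top_aids2[:20 - len(aids)]

pred_df = test_df.sort_values(["session", "type", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_aids(x)
)
##################
# BELOW IS CODE ADDED BY CHRIS
# The click, order, and cart predictions are handled separately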

How To Build a GBT Ranker Model

🏆 Training an XGBoost Ranker on the GPU 🔥🔥🔥
💡 [polars] Proof of concept: LGBM Ranker🧪🧪🧪

Step 1 - Generate Candidates
Generate the candidate item list with the methods above.
Each row holds one session and one aid, with the following columns:

session (i.e. user)
aid (i.e. item)
user features
item features
user-item interaction features
click target (i.e 0 or 1)
cart target (i.e. 0 or 1)
order target (i.e. 0 or 1)

Step 2 - ReRank and Choose 20
Use a GBT model to score the candidates and pick the final 20 items.

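Step 1
Build the dataset for the model: the training data is the first three weeks of the public data, and the validation data is the fourth week.
The validation data is further split into valid A and valid B, where B is the ground truth.
For each session, first generate 50 candidate item IDs, which gives a dataframe of shape (number_of_session * 50, 2), for example:

session	aid
1	1234
1	9841
2	5845
2	8984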
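
As a rough illustration of the shape of that candidate table, the sketch below flattens per-session candidate lists into (session, aid) rows. It uses plain Python lists instead of a dataframe to stay dependency-free; `build_candidate_rows` and its input format are hypothetical.

```python
def build_candidate_rows(candidates_by_session, k=50):
    """Flatten {session: [aid, ...]} into (session, aid) rows, at most k per session."""
    rows = []
    for session, aids in sorted(candidates_by_session.items()):
        unique = list(dict.fromkeys(aids))  # drop duplicate aids, keep their order
        rows.extend((session, aid) for aid in unique[:k])
    return rows

rows = build_candidate_rows({1: [1234, 9841, 1234], 2: [5845, 8984]}, k=50)
```

With exactly k unique candidates per session the result has number_of_sessions * k rows, which is the shape mentioned above.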
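Step 2
Create item features from the training data and validation set A

item_features = train.groupby('aid').agg({'aid':'count','session':'nunique','type':'mean'})
item_features.columns = ['item_item_count','item_user_count','item_buy_ratio']
# item_item_count: number of events on the item; item_user_count: number of distinct sessions that touched it (popularity); item_buy_ratio: mean of the type column (higher means more carts/orders)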

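Step 3
Create user features from validation set A

user_features = train.groupby('session').agg({'session':'count','aid':'nunique','type':'mean'})
user_features.columns = ['user_user_count','user_item_count','user_buy_ratio']
# user_user_count: number of events in the session; user_item_count: number of distinct items in the session; user_buy_ratio: mean of the type column (higher means more carts/orders)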

Step 4
Create user-item interaction features from validation set A
There are many possible ideas and no specific ones are given. For example:
a user-item click interaction feature

session	aid	item_click

Step 5
Add the features to the candidate dataframe

candidates = candidates.merge(item_features, left_on='aid', right_index=True, how='left').fillna(-1)
candidates = candidates.merge(user_features, left_on='session', right_index=True, how='left').fillna(-1)

The candidate dataframe then looks like:

session	aid	item_feat1	item_feat2	user_feat1	user_feat2
1	1234	1	2	3	4
1	9841	5	6	7	8
2	5845	9	10	11	12
2	8984	13	14	15	16

Step 6
Build the ground truth, for example from test_labels.parquet:

| session | type | ground_truth |
| ------- | ----- | ---------------------- |
| 1 | carts | [5456,4545,98741,2355] |
| 2 | carts | [1257,8653,2547] |

Then convert it to one row per session and item:

session	aid	cart
1	5456	1
1	4545	1
1	98741	1

Then merge it into the candidate dataframe

candidates = candidates.merge(cart_target,on=['user','item'],how='left').fillna(0)

The candidates dataframe then looks like:

session	aid	item_feat1	item_feat2	user_feat1	user_feat2	cart
1	1234	1	2	3	4	0
1	9841	5	6	7	8	1
2	5845	9	10	11	12	0
2	8984	13	14	15	16	1
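
The labelling step boils down to a membership test: a candidate row is positive if its (session, aid) pair appears in the ground truth. Here is a minimal sketch in plain Python with made-up candidate rows; `add_target` is a hypothetical helper, not code from the original post.

```python
def add_target(candidates, ground_truth, name="cart"):
    """Attach a 0/1 target column to candidate rows.

    candidates: list of dict rows with 'session' and 'aid'.
    ground_truth: {session: [aid, ...]} for one event type.
    """
    positives = {(s, a) for s, aids in ground_truth.items() for a in aids}
    return [
        {**row, name: int((row["session"], row["aid"]) in positives)}
        for row in candidates
    ]

candidates = [
    {"session": 1, "aid": 1234},
    {"session": 1, "aid": 5456},
    {"session": 2, "aid": 8653},
    {"session": 2, "aid": 8984},
]
ground_truth = {1: [5456, 4545, 98741, 2355], 2: [1257, 8653, 2547]}
labeled = add_target(candidates, ground_truth)
cart_column = [row["cart"] for row in labeled]
```

Ground-truth items that were never generated as candidates (4545 or 1257 here) simply do not show up, which mirrors what the left merge with fillna(0) does.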

Step 7
Training, without using the user and aid columns as features:

import xgboost as xgb
from sklearn.model_selection import GroupKFold

skf = GroupKFold(n_splits=5)  # 5-fold cross-validation, split by group
# The groups argument names the column used for grouping. GroupKFold guarantees that rows of the same group never appear in both the training and the validation fold, which avoids leakage. Here candidates['user'] is the group, so all rows of a user stay in the same fold.
for fold, (train_idx, valid_idx) in enumerate(skf.split(candidates, candidates['click'], groups=candidates['user'])):

    X_train = candidates.loc[train_idx, FEATURES]  # training features
    y_train = candidates.loc[train_idx, 'click']  # training labels
    X_valid = candidates.loc[valid_idx, FEATURES]  # validation features
    y_valid = candidates.loc[valid_idx, 'click']  # validation labels

    # With 50 candidates per session, every ranking group has size 50.
    # xgb.DMatrix stores the features, the labels, and the group sizes;
    # the group sizes tell the ranking objective which rows belong to the same session.
    dtrain = xgb.DMatrix(X_train, y_train, group=[50] * (len(train_idx)//50) )
    dvalid = xgb.DMatrix(X_valid, y_valid, group=[50] * (len(valid_idx)//50) )

    xgb_parms = {'objective':'rank:pairwise', 'tree_method':'gpu_hist'}  # pairwise ranking objective, GPU histogram tree method
    model = xgb.train(xgb_parms,
        dtrain=dtrain,
        evals=[(dtrain,'train'),(dvalid,'valid')],
        num_boost_round=1000,
        verbose_eval=100)
    model.save_model(f'XGB_fold{fold}_click.xgb')  # save the model of this fold
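
The fixed `[50] * (n // 50)` group list only works when every session has exactly 50 candidate rows and the rows of a session are contiguous. If the number of candidates varies, the group sizes can be derived from the session column instead. A small sketch, assuming the rows are already sorted by session (`group_sizes` is a hypothetical helper):

```python
from itertools import groupby

def group_sizes(session_column):
    """Sizes of consecutive runs of equal session IDs (rows must be sorted by session)."""
    return [sum(1 for _ in rows) for _, rows in groupby(session_column)]

sessions = [1, 1, 1, 2, 2, 3]
sizes = group_sizes(sessions)
```

The resulting list could be passed as the `group` argument in place of the constant list; the sizes must add up to the number of rows.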

Step 8
Inference: we build a new candidate dataframe, this time from Kaggle's test data. Item features are computed from all 4 weeks of Kaggle train plus the 1 week of Kaggle test, and user features from Kaggle test. These features are merged into the candidates, the saved models predict click scores, and finally the predictions are sorted and the top 20 are selected.

import numpy as np

preds = np.zeros(len(test_candidates))  # accumulator for the averaged predictions
for fold in range(5):  # one saved model per fold
    model = xgb.Booster()  # empty booster to load the weights into
    model.load_model(f'XGB_fold{fold}_click.xgb')  # load the trained model of this fold
    model.set_param({'predictor': 'gpu_predictor'})  # predict on the GPU
    dtest = xgb.DMatrix(data=test_candidates[FEATURES])  # DMatrix for the test candidates
    # The result is one ranking score per candidate row (a score, not a calibrated probability)
    preds += model.predict(dtest)/5  # average over the 5 folds

predictions = test_candidates[['user','item']].copy()  # start from the user and item columns
predictions['pred'] = preds  # attach the averaged score

predictions = predictions.sort_values(['user','pred'], ascending=[True,False]).reset_index(drop=True)  # sort by user, best score first
predictions['n'] = predictions.groupby('user').item.cumcount().astype('int8')  # rank of each row within its user
predictions = predictions.loc[predictions.n<20]  # keep the top 20 rows per user
sub = predictions.groupby('user').item.apply(list)  # one list of items per user


sub = sub.to_frame().reset_index()  # back to a dataframe
sub.item = sub.item.apply(lambda x: " ".join(map(str,x)))  # join the item IDs into one space-separated string
sub.columns = ['session_type','labels']  # rename the columns to the submission format
sub.session_type = sub.session_type.astype('str')+ '_clicks'  # append the '_clicks' suffix to the session ID
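
The last part of the inference code is a "top k per group" selection followed by string formatting. The same logic in plain Python, on made-up scores, looks roughly like this (`top_k_per_user` is a hypothetical helper and k is set to 2 only to keep the example small):

```python
from collections import defaultdict

def top_k_per_user(rows, k=20):
    """rows: iterable of (user, item, score). Returns {user: [item, ...]}, best score first."""
    by_user = defaultdict(list)
    for user, item, score in rows:
        by_user[user].append((score, item))
    return {
        user: [item for _, item in sorted(scored, key=lambda p: -p[0])[:k]]
        for user, scored in by_user.items()
    }

rows = [(1, 10, 0.2), (1, 11, 0.9), (1, 12, 0.5), (2, 20, 0.1), (2, 21, 0.3)]
top2 = top_k_per_user(rows, k=2)
# Submission format: "<session>_clicks" -> space-separated item IDs
submission = {f"{user}_clicks": " ".join(map(str, items)) for user, items in top2.items()}
```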

💡 Word2Vec How-to [training and submission]🚀🚀🚀

A session where one action follows another action is very much like a sentence!

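Similarly, we can exploit the fact that item IDs appearing close together in a sequence probably have something in common.

So we train a word2vec model to learn an embedding vector for each item ID, and then use these vectors to measure the similarity between item IDs.

Given an item ID, we can then find the item IDs most similar to it.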
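
To show what "finding similar item IDs from embeddings" means mechanically, here is a toy brute-force lookup. The 2-dimensional vectors are invented, and `most_similar` is a hypothetical helper; it uses Euclidean distance, and an exact search instead of an approximate index.

```python
import math

def most_similar(aid, embeddings, k=2):
    """Return the k other aids whose vectors are closest (Euclidean) to aid's vector."""
    scored = [
        (math.dist(embeddings[aid], vec), other)
        for other, vec in embeddings.items()
        if other != aid
    ]
    return [other for _, other in sorted(scored)[:k]]

toy_embeddings = {
    101: [1.0, 0.0],
    102: [0.9, 0.1],
    103: [0.0, 1.0],
    104: [0.7, 0.7],
}
neighbors = most_similar(101, toy_embeddings, k=2)
```

With roughly 1.8 million items a brute-force scan per query becomes slow, which is presumably why an approximate nearest-neighbor index is the more practical choice at full scale.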

Getting the 20 items directly from word2vec


import pandas as pd
import polars as pl
from annoy import AnnoyIndex
from gensim.models import Word2Vec

# Turn the item IDs of each session into one "sentence": ['aid1','aid2','aid3','aid4']
sentences_df = pl.concat([train, test]).groupby('session').agg(
    pl.col('aid').alias('sentence')
)
sentences = sentences_df['sentence'].to_list()
# Train the word2vec model
w2vec = Word2Vec(sentences=sentences, vector_size=32, min_count=1, workers=4)


# Map each aid to its row index in the embedding matrix
aid2idx = {aid: i for i, aid in enumerate(w2vec.wv.index_to_key)}
# Annoy index over 32-dimensional vectors with Euclidean distance
index = AnnoyIndex(32, 'euclidean')

# Add every item vector to the index
for aid, idx in aid2idx.items():
    index.add_item(idx, w2vec.wv.vectors[idx])

index.build(10)  # build the index with 10 trees

# index maps row index -> vector
# aid2idx maps item ID -> row index
sample_sub = pd.read_csv('../input/otto-recommender-system//sample_submission.csv')  # load the sample submission

'''
Pick the 20 nearest items
'''

# Collect the list of aids and the list of event types of every test session
test_session_AIDs = test.to_pandas().reset_index(drop=True).groupby('session')['aid'].apply(list)
test_session_types = test.to_pandas().reset_index(drop=True).groupby('session')['type'].apply(list)

labels = []  # predicted labels, one list per session

type_weight_multipliers = {0: 1, 1: 6, 2: 3}  # weight per event type
for AIDs, types in zip(test_session_AIDs, test_session_types):
    if len(AIDs) >= 20:
        # With at least 20 events there is no need to look for candidates; reuse the earlier weighting logic
        weights=np.logspace(0.1,1,len(AIDs),base=2, endpoint=True)-1  # recency weights, one per event
        aids_temp=defaultdict(lambda: 0)  # accumulated score per aid
        for aid,w,t in zip(AIDs,weights,types):
            aids_temp[aid]+= w * type_weight_multipliers[t]  # recency weight * type weight

        sorted_aids=[k for k, v in sorted(aids_temp.items(), key=lambda item: -item[1])]  # aids sorted by score, best first
        labels.append(sorted_aids[:20])  # keep the top 20
    else:
        # Fewer than 20 events: use the word2vec embeddings to generate candidates
        AIDs = list(dict.fromkeys(AIDs[::-1]))  # unique aids, most recent first

        # The most recent aid
        most_recent_aid = AIDs[0]

        # aid2idx[most_recent_aid] is the row index of that item;
        # index.get_nns_by_item(..., 21) returns its 21 nearest rows, and [1:] skips the first one (the item itself)
        nns = [w2vec.wv.index_to_key[i] for i in index.get_nns_by_item(aid2idx[most_recent_aid], 21)[1:]]  # map the neighbor rows back to aids
        labels.append((AIDs+nns)[:20])  # history first, then neighbors, cut to 20

Using word2vec to generate candidates

As with the co-visitation matrix, word2vec can be used to find the most related items.

Suppose the data is:

session	aid	type
1	10	0
1	20	0
2	20	1
2	30	0


Suppose the code above has already given us the most related items:

| session | aid |
| --- | --- |
| 1 | 11 |
| 1 | 20 |
| 2 | 25 |
| 2 | 6 |


A few more steps follow.

1. Add ordering information to the candidates.
The word2vec model returns neighbors sorted by similarity, so we record that order as a rank:

| session | aid | rank |
| --- | --- | --- |
| 1 | 11 | 1 |
| 1 | 20 | 2 |
| 2 | 25 | 1 |
| 2 | 6 | 2 |


2. Merge the session history into the candidates:

| session | aid | rank | type |
| --- | --- | --- | --- |
| 1 | 11 | 1 | null |
| 1 | 20 | 2 | 0 |
| 1 | 10 | null | 0 |
| 2 | 25 | 1 | null |
| 2 | 6 | 2 | null |
| 2 | 20 | null | 1 |
| 2 | 30 | null | 0 |


3. Use a ranker model to make the predictions.
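
The merge in step 2 is an outer join on (session, aid). A minimal sketch in plain Python that reproduces the rows of the table above from the toy data (`merge_candidates_with_history` is a hypothetical helper; `None` plays the role of null):

```python
def merge_candidates_with_history(candidates, history):
    """Outer-join ranked candidates with the session history.

    candidates: {session: [aid, ...]} ordered by similarity, best first.
    history: list of (session, aid, type) rows.
    Returns rows (session, aid, rank, type); missing values are None.
    """
    merged = {}
    for session, aids in candidates.items():
        for rank, aid in enumerate(aids, start=1):
            merged[(session, aid)] = [rank, None]
    for session, aid, event_type in history:
        merged.setdefault((session, aid), [None, None])[1] = event_type
    return [(s, a, rank, t) for (s, a), (rank, t) in merged.items()]

candidates = {1: [11, 20], 2: [25, 6]}
history = [(1, 10, 0), (1, 20, 0), (2, 20, 1), (2, 30, 0)]
rows = merge_candidates_with_history(candidates, history)
```

The rank and type columns, nulls included, then serve as features for the ranker in step 3.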


### 💡 [2 methods] How-to ensemble predictions 🏅🏅🏅
[💡 [2 methods] How-to ensemble predictions 🏅🏅🏅](https://www.kaggle.com/code/radek1/2-methods-how-to-ensemble-predictions)


Ensembling the predictions:
- Voting ensemble
- Voting ensemble with weights: better submissions get a larger weight
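
Before the polars version, here is the voting idea in a few lines of plain Python. This is only a sketch on made-up submissions; `ensemble` is a hypothetical helper, and k is set to 2 to keep the example small.

```python
from collections import defaultdict

def ensemble(submissions, weights, k=20):
    """submissions: list of {session_type: [aid, ...]}; weights: one weight per submission."""
    votes = defaultdict(lambda: defaultdict(float))
    for sub, weight in zip(submissions, weights):
        for session_type, aids in sub.items():
            for aid in aids:
                votes[session_type][aid] += weight  # each listed aid receives the submission's weight
    return {
        st: [aid for aid, _ in sorted(v.items(), key=lambda p: -p[1])[:k]]
        for st, v in votes.items()
    }

sub_a = {"1_clicks": [1234, 9841]}
sub_b = {"1_clicks": [1234, 5555]}
sub_c = {"1_clicks": [5555, 1234]}
result = ensemble([sub_a, sub_b, sub_c], weights=[1, 0.55, 0.55], k=2)
```

Item 5555 (two votes of 0.55) beats item 9841 (one vote of 1), which is the kind of trade-off the weights control. Setting all weights to 1 gives the plain voting ensemble.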

```python
def read_sub(path, weight=1):
    '''a helper function for loading and preprocessing submissions'''
    return (
        pl.read_csv(path)  # read the submission CSV
            .with_column(pl.col('labels').str.split(by=' '))  # split 'labels' into a list on spaces
            .with_column(pl.lit(weight).alias('vote'))  # add a 'vote' column holding the weight
            .explode('labels')  # one row per predicted item
            .rename({'labels': 'aid'})  # rename 'labels' to 'aid'
            .with_column(pl.col('aid').cast(pl.UInt32))  # cast 'aid' to UInt32
            .with_column(pl.col('vote').cast(pl.Float32))  # keep 'vote' as a float so fractional weights such as 0.55 are not truncated to 0
    )
# Unweighted: every submission votes with weight 1
subs = [read_sub(path) for path in paths]
# Weighted: this alternative replaces the line above
subs = [read_sub(path, weight) for path, weight in zip(paths, [1, 0.55, 0.55])]
```

After loading, the data looks like:

session_type	aid	vote
1_clicks	1234	1
1_clicks	9841	1
2_clicks	5845	1

Because of memory limits, the submissions are combined with joins:

subs = subs[0].join(subs[1], how='outer', on=['session_type', 'aid']).join(subs[2], how='outer', on=['session_type', 'aid'], suffix='_right2')
subs.head()

The merged data:

session_type	aid	vote	vote_right	vote_right2
1_clicks	1234	1	1	1
1_clicks	9841	1	null	1
2_clicks	5845	1	1	null
2_clicks	8984	1	null	null

Fill the null values with 0, then sum the votes and sort:

subs = (subs
    .fill_null(0)
    .with_column((pl.col('vote') + pl.col('vote_right') + pl.col('vote_right2')).alias('vote_sum'))
    .drop(['vote', 'vote_right', 'vote_right2'])
    .sort(by='vote_sum')
    .reverse()
)

The data now looks like:

session_type	aid	vote_sum
1_clicks	1234	3
2_clicks	5845	2
1_clicks	9841	2

Then take the top 20 items for each session_type and aggregate them into a list:

preds = subs.groupby('session_type').agg([
    pl.col('aid').head(20).alias('labels')
])

preds = preds.with_column(pl.col('labels').apply(lambda lst: ' '.join([str(aid) for aid in lst])))

💡Matrix Factorization [PyTorch+Merlin Dataloader]

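As with word2vec, matrix factorization is used to learn embedding vectors for the items.

💡Matrix Factorization with GPU [PyTorch+Merlin Dataloader]

An embedding vector is trained for each item with PyTorch's Embedding layer, and these vectors are then used to measure the similarity between item IDs.

# Build the (aid, aid_next) pairs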
train_pairs = cudf.concat([train, test])[['session', 'aid']]
del train, test

train_pairs['aid_next'] = train_pairs.groupby('session').aid.shift(-1)
train_pairs = train_pairs[['aid', 'aid_next']].dropna().reset_index(drop=True)

aid	aid_next
1	2
2	3
3	4
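
The pair construction and the negative sampling used later in training can be sketched without any GPU library. Both helpers below are hypothetical: `next_item_pairs` mirrors the grouped shift(-1), and `shuffled_negatives` mimics pairing each item with a randomly permuted "next" item, as the training loop does per batch.

```python
import random

def next_item_pairs(sessions):
    """sessions: {session: [aid, ...]} in chronological order -> [(aid, aid_next), ...]."""
    return [
        (aid, aid_next)
        for aids in sessions.values()
        for aid, aid_next in zip(aids, aids[1:])
    ]

def shuffled_negatives(pairs, seed=0):
    """Keep each aid but pair it with a randomly permuted aid_next (mostly wrong partners)."""
    rng = random.Random(seed)
    nexts = [b for _, b in pairs]
    rng.shuffle(nexts)
    return [(a, b) for (a, _), b in zip(pairs, nexts)]

pairs = next_item_pairs({1: [1, 2, 3, 4], 2: [7, 8]})
negatives = shuffled_negatives(pairs)
```

No pair crosses a session boundary: the last item of session 1 is not paired with the first item of session 2. A shuffled negative can occasionally coincide with a true pair, which this kind of sampling simply tolerates.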

import torch
from torch import nn
# Embedding model: the score of an item pair is the dot product of the two item embeddings
class MatrixFactorization(nn.Module):
    def __init__(self, n_aids, n_factors):
        super().__init__()
        self.aid_factors = nn.Embedding(n_aids, n_factors, sparse=True)
      
    def forward(self, aid1, aid2):
        aid1 = self.aid_factors(aid1)
        aid2 = self.aid_factors(aid2)
      
        return (aid1 * aid2).sum(dim=1)
# Running-average helper used for the loss and the accuracy
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)

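# Training; model, criterion, optimizer, num_epochs and the Merlin dataloaders are assumed to be defined beforehand
model.to('cuda')
for epoch in range(num_epochs):
    model.train()
    losses = AverageMeter('Loss', ':.4e')  # created once per epoch so the printed loss is the epoch average
    for batch, _ in train_dl_merlin: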
          
        aid1, aid2 = batch['aid'], batch['aid_next']
        aid1 = aid1.to('cuda')
        aid2 = aid2.to('cuda')
        output_pos = model(aid1, aid2)
        output_neg = model(aid1, aid2[torch.randperm(aid2.shape[0])])
      
        output = torch.cat([output_pos, output_neg])
        targets = torch.cat([torch.ones_like(output_pos), torch.zeros_like(output_pos)])
        loss = criterion(output, targets)
        losses.update(loss.item())
      
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
      
    model.eval()
  
    with torch.no_grad():
        accuracy = AverageMeter('accuracy')
        for batch, _ in valid_dl_merlin:
            aid1, aid2 = batch['aid'], batch['aid_next']
            output_pos = model(aid1, aid2)
            output_neg = model(aid1, aid2[torch.randperm(aid2.shape[0])])
            accuracy_batch = torch.cat([output_pos.sigmoid() > 0.5, output_neg.sigmoid() < 0.5]).float().mean()
            accuracy.update(accuracy_batch, aid1.shape[0])
          
    print(f'{epoch+1:02d}: * TrainLoss {losses.avg:.3f}  * Accuracy {accuracy.avg:.3f}')

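# Extract the embeddings and find the nearest neighbors of every item
from sklearn.neighbors import NearestNeighbors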
embeddings = model.aid_factors.weight.detach().cpu().numpy()

knn = NearestNeighbors(n_neighbors=21, metric='euclidean')
knn.fit(embeddings)

_, aid_nns = knn.kneighbors(embeddings)

sample_sub = pd.read_csv('../input/otto-recommender-system//sample_submission.csv')
test = cudf.read_parquet('../input/otto-full-optimized-memory-footprint/test.parquet')

session_types = ['clicks', 'carts', 'orders']
gr = test.reset_index(drop=True).to_pandas().groupby('session')
test_session_AIDs = gr['aid'].apply(list)
test_session_types = gr['type'].apply(list)

labels = []

type_weight_multipliers = {0: 1, 1: 6, 2: 3}
for AIDs, types in zip(test_session_AIDs, test_session_types):
    if len(AIDs) >= 20:
        # if we have enough aids (at least 20) we don't need to look for candidates! we just use the old logic
        weights=np.logspace(0.1,1,len(AIDs),base=2, endpoint=True)-1
        aids_temp=defaultdict(lambda: 0)
        for aid,w,t in zip(AIDs,weights,types): 
            aids_temp[aid]+= w * type_weight_multipliers[t]
          
        sorted_aids=[k for k, v in sorted(aids_temp.items(), key=lambda item: -item[1])]
        labels.append(sorted_aids[:20])
    else:
        # here we don't have 20 aids to output -- we will use approximate nearest neighbor search and our embeddings
        # to generate candidates!
        AIDs = list(dict.fromkeys(AIDs[::-1]))
      
        # let's grab the most recent aid
        most_recent_aid = AIDs[0]
      
        # and look for some neighbors!
        nns = list(aid_nns[most_recent_aid][1:])  # skip the first neighbor, which is normally the item itself
                      
        labels.append((AIDs+nns)[:20])

labels_as_strings = [' '.join([str(l) for l in lls]) for lls in labels]

predictions = pd.DataFrame(data={'session_type': test_session_AIDs.index, 'labels': labels_as_strings})

prediction_dfs = []

for st in session_types:
    modified_predictions = predictions.copy()
    modified_predictions.session_type = modified_predictions.session_type.astype('str') + f'_{st}'
    prediction_dfs.append(modified_predictions)

submission = pd.concat(prediction_dfs).reset_index(drop=True)
submission.to_csv('submission.csv', index=False)

226th (?!) Place Solution & Two-cents from a First-timer

Item Co-visitation Matrix
entirely based on Chris’ notebook

Order matrix: Click/cart/order to click/cart/order with type weighting
Buy2buy matrix: Cart/order to cart/order
Click matrix: click/cart/order to clicks with time weighting

Feature Generation

Item features (for each aid)
- Count of events (click/cart/order)
- Sum of event weight
- Quarter of day (QoD) with most events (0-3)
- Day of week (DoW) with most events (0-6)
User features (for each session)
- Count of events (click/cart/order) and interacted items (aid)
- Sum of event weight
- QoD with most events (0-3)
- DoW with most events (0-6)
- Number of days with events
- Days from first to last events
User-item features (for each session-aid pair)
- Count of events (click/cart/order) and interacted items (aid)
- Sum of event weight
- QoD with most events in both categorical (0-3) and one-hot encoded (0/1 for each) format
- DoW with most events in both categorical (0-6) and one-hot encoded (0/1 for each) format
- last_n = item_chronological_rank / user_total_event_count
- last_ts = (user_item_last_timestamp - start_week_timestamp) / (end_week_timestamp - start_week_timestamp)
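
The two recency features at the end of the list can be computed as follows. This is a sketch under my own reading of the formulas: I take item_chronological_rank to be the position of the item's last event within the session, and the timestamps are made up; `recency_features` is a hypothetical helper.

```python
def recency_features(events, week_start, week_end):
    """events: chronological list of (aid, ts) for one session.

    Returns {aid: (last_n, last_ts)} using the two formulas above.
    """
    total = len(events)
    feats = {}
    for rank, (aid, ts) in enumerate(events, start=1):
        last_n = rank / total                                   # 1.0 for the session's final event
        last_ts = (ts - week_start) / (week_end - week_start)   # 0 at week start, 1 at week end
        feats[aid] = (last_n, last_ts)  # later events of the same aid overwrite earlier ones
    return feats

feats = recency_features([(5, 100), (7, 150), (5, 200)], week_start=0, week_end=400)
```

Both features are normalized to the range 0 to 1, so they are comparable across sessions of different lengths.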

Candidate Selection
partly based on Chris’ notebook
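For each session, I select the top X most relevant items for each event type, based on:

Number of times the session clicked/carted/ordered the item
Sum of co-visitation weights
Whether the item is among the most-clicked / most-ordered items of the week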

Ranker

Rule-based ranker in Chris’ notebook
XGBRanker with rank:pairwise objective
- the model is overfitting and requires a lot of hyperparameter tuning

Solution Ensemble

use their public LB scores as weights
Ensemble of XGBRanker above and public submissions
Ensemble of above two ranker methods and public submissions

6. What I have learned

read the discussion and notebooks forums
try as many ideas as possible
know every line of code you write

Permalink: https://gladdduck.github.io/2024/03/14/实习-Kaggle-OTTO比赛回顾/

Copyright notice: unless stated otherwise, all articles on this blog are licensed under CC BY 4.0 CN. Please credit the source when reposting!

GladdduckKB Master
