1. Competition Task
The goal of this competition is to predict e-commerce clicks, cart additions, and orders. You’ll build a multi-objective recommender system based on previous events in a user session.
The training data contains full e-commerce session information. For each session in the test data, your task is to predict the aid values for each session type that occur after the last timestamp ts in the test session. In other words, the test data contains sessions truncated by timestamp, and you are to predict what occurs after the point of truncation.
Summary: given the item ids (aid) a user interacted with at each point in time, via three action types (click, add-to-cart, order), predict for each action type the 20 item ids the user is most likely to interact with next.
2. Dataset
Data description:
12,899,779 sessions
1,855,603 items
216,716,096 events
194,720,954 clicks
16,896,191 carts
5,098,951 orders
Training data:
Sample submission file:
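For reference, a submission has one row per test session and event type, and the `labels` column holds up to 20 space-separated aid values. The session ids and aid values below are only illustrative:

```
session_type,labels
12899779_clicks,129004 126836 118524
12899779_carts,129004 126836 118524
12899779_orders,129004 126836 118524
```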
3. Approach
Candidate Generation
The Candidate Generation pipeline
Reason: with this amount of data it is impossible to feed everything directly into a model, so we follow the pipeline above step by step.
Step 1 - Generate Candidates
Some criteria used to select candidate items:
Items purchased previously
Items bought repeatedly
Most popular items overall
Similar items based on some clustering technique
Similar items based on a co-visitation matrix, etc.
Step 2 - ReRank and Choose 20
After the previous step the candidate set is much smaller, and we can then apply some rules to choose 20 items.
Ranker Model
Handcrafted Rules
What is the co-visitation matrix, really?
It is very interesting to think of modern techniques in the context of their roots.
“Radek is a _”. When predicting the word in the blank, a trigram model looks at "Radek", "is", and "a" and counts which word most often follows these three words.
But the exact sequence "Radek", "is", "a" may not have occurred very often.
This is why RNNs (or word2vec) appeared: they can also learn from examples such as "Radek was an _" and "Tommy is a __".
So what does this have to do with the co-visitation matrix?
The co-visitation matrix counts co-occurrences of two actions that happen very close to each other in time.
If a user bought item A shortly after buying item B, we store these values together.
We compute the counts and use recent history to estimate the probability of future actions.
It is very important to understand what is going on inside the co-visitation matrix approach…
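As a toy illustration of that idea (my own sketch, not the competition code), the counting can be done with a `Counter` over event pairs inside each session; `sessions` and the one-day window in seconds are assumptions here:

```python
from collections import Counter
from itertools import combinations

WINDOW = 24 * 60 * 60  # maximum time gap in seconds (assumption)
covisit = Counter()

# sessions: iterable of [(aid, ts), ...] event lists, one list per session (assumed input)
for session_events in sessions:
    for (a1, t1), (a2, t2) in combinations(session_events, 2):
        if a1 != a2 and abs(t1 - t2) < WINDOW:
            # treat the matrix as symmetric: count both directions
            covisit[(a1, a2)] += 1
            covisit[(a2, a1)] += 1

# covisit[(aid1, aid2)] now approximates the co-occurrence count described above.
```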
Candidate ReRank Model - [LB 0.575]
Step 1 - Generate Candidates
Generate candidate items for each user with five methods:
The user's own history of clicks, carts, and orders
The 20 most popular items for clicks, carts, and orders during the test-data week
A click/cart/order to cart/order co-visitation matrix (with type weighting)
A cart/order to cart/order co-visitation matrix, called buy2buy
A click/cart/order to clicks co-visitation matrix (with time weighting)
Step 2 - ReRank and Choose 20
Choose 20 items from the candidate list above as the final predictions, in the following priority order:
Most recently visited items
Items visited multiple times previously
Items previously added to the cart or ordered
Items from the cart/order to cart/order co-visitation matrix
Currently popular items
Candidate generation
“Carts Orders” Co-visitation Matrix - Type Weighted
```python
"""
Construction idea:
Sort by session and timestamp.
df.groupby('session').cumcount() numbers each user's events.
Keep each user's 30 most recent events.
df.merge(df, on='session') pairs every event with every other event of the same user
(the goal is to build item-to-item relations).
Keep pairs less than 1 day apart with different item ids, and drop duplicates.
Assign a weight to each pair according to the event type of the target item.
The result is a weight for every item pair.
"""
type_weight = {0: 1, 1: 6, 2: 3}

df = df.sort_values(['session', 'ts'], ascending=[True, False])
df = df.reset_index(drop=True)
df['n'] = df.groupby('session').cumcount()
df = df.loc[df.n < 30].drop('n', axis=1)
df = df.merge(df, on='session')
df = df.loc[((df.ts_x - df.ts_y).abs() < 24 * 60 * 60) & (df.aid_x != df.aid_y)]
# PART and SIZE process the items in chunks so everything fits in memory
df = df.loc[(df.aid_x >= PART * SIZE) & (df.aid_x < (PART + 1) * SIZE)]
df = df[['session', 'aid_x', 'aid_y', 'type_y']].drop_duplicates(['session', 'aid_x', 'aid_y'])
df['wgt'] = df.type_y.map(type_weight)
df = df[['aid_x', 'aid_y', 'wgt']]
df.wgt = df.wgt.astype('float32')
df = df.groupby(['aid_x', 'aid_y']).wgt.sum()
```
“Buy2Buy” Co-visitation Matrix
```python
"""
Construction idea:
Keep only cart and order events.
Sort by session and timestamp.
df.groupby('session').cumcount() numbers each user's events.
Keep each user's 30 most recent events.
df.merge(df, on='session') pairs every event with every other event of the same user
(the goal is to build item-to-item relations).
Keep pairs less than 14 days apart with different item ids, and drop duplicates.
All weights are 1 (cart/order to cart/order relations); group and sum over item pairs.
The result is a weight for every item pair.
"""
df = df.loc[df['type'].isin([1, 2])]
df = df.sort_values(['session', 'ts'], ascending=[True, False])
df = df.reset_index(drop=True)
df['n'] = df.groupby('session').cumcount()
df = df.loc[df.n < 30].drop('n', axis=1)
df = df.merge(df, on='session')
df = df.loc[((df.ts_x - df.ts_y).abs() < 14 * 24 * 60 * 60) & (df.aid_x != df.aid_y)]
df = df.loc[(df.aid_x >= PART * SIZE) & (df.aid_x < (PART + 1) * SIZE)]
df = df[['session', 'aid_x', 'aid_y', 'type_y']].drop_duplicates(['session', 'aid_x', 'aid_y'])
df['wgt'] = 1
df = df[['aid_x', 'aid_y', 'wgt']]
df.wgt = df.wgt.astype('float32')
df = df.groupby(['aid_x', 'aid_y']).wgt.sum()
```
“Clicks” Co-visitation Matrix - Time Weighted
```python
"""
Construction idea:
Sort by session and timestamp.
df.groupby('session').cumcount() numbers each user's events.
Keep each user's 30 most recent events.
df.merge(df, on='session') pairs every event with every other event of the same user
(the goal is to build item-to-item relations).
Keep pairs less than 1 day apart with different item ids, and drop duplicates.
Weight each pair by time: the more recent the event, the larger the weight.
The result is a weight for every item pair.
"""
df = df.sort_values(['session', 'ts'], ascending=[True, False])
df = df.reset_index(drop=True)
df['n'] = df.groupby('session').cumcount()
df = df.loc[df.n < 30].drop('n', axis=1)
df = df.merge(df, on='session')
df = df.loc[((df.ts_x - df.ts_y).abs() < 24 * 60 * 60) & (df.aid_x != df.aid_y)]
df = df.loc[(df.aid_x >= PART * SIZE) & (df.aid_x < (PART + 1) * SIZE)]
df = df[['session', 'aid_x', 'aid_y', 'ts_x']].drop_duplicates(['session', 'aid_x', 'aid_y'])
df['wgt'] = 1 + 3 * (df.ts_x - 1659304800) / (1662328791 - 1659304800)
df = df[['aid_x', 'aid_y', 'wgt']]
df.wgt = df.wgt.astype('float32')
df = df.groupby(['aid_x', 'aid_y']).wgt.sum()
```
```python
tmp = tmp.reset_index()
tmp = tmp.sort_values(['aid_x', 'wgt'], ascending=[True, False])
tmp = tmp.reset_index(drop=True)
tmp['n'] = tmp.groupby('aid_x').aid_y.cumcount()
tmp = tmp.loc[tmp.n < 15].drop('n', axis=1)
tmp.groupby('aid_x').aid_y.apply(list).to_dict()

'''
Summary:
1. "Carts Orders" Co-visitation Matrix - Type Weighted:
   item-to-item weights based on the type of the user's historical events -> top_20_buys
2. "Buy2Buy" Co-visitation Matrix:
   item-to-item weights based on the user's cart and order events -> top_20_buy2buy
3. "Clicks" Co-visitation Matrix - Time Weighted:
   item-to-item weights based on the time of the user's events -> top_20_clicks
'''
```
Re-ranking and final selection
```python
import itertools
from collections import Counter

import numpy as np

type_weight_multipliers = {0: 1, 1: 6, 2: 3}

def suggest_clicks(df):
    aids = df.aid.tolist()
    types = df.type.tolist()
    unique_aids = list(dict.fromkeys(aids[::-1]))
    # If the session already has >= 20 unique aids, re-rank its own history
    # with recency (logspace) and event-type weights.
    if len(unique_aids) >= 20:
        weights = np.logspace(0.1, 1, len(aids), base=2, endpoint=True) - 1
        aids_temp = Counter()
        for aid, w, t in zip(aids, weights, types):
            aids_temp[aid] += w * type_weight_multipliers[t]
        sorted_aids = [k for k, v in aids_temp.most_common(20)]
        return sorted_aids
    # Otherwise extend the history with the clicks co-visitation matrix,
    # and top up with the overall most popular clicks.
    aids2 = list(itertools.chain(*[top_20_clicks[aid] for aid in unique_aids if aid in top_20_clicks]))
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2).most_common(20) if aid2 not in unique_aids]
    result = unique_aids + top_aids2[:20 - len(unique_aids)]
    return result + list(top_clicks)[:20 - len(result)]

def suggest_buys(df):
    aids = df.aid.tolist()
    types = df.type.tolist()
    unique_aids = list(dict.fromkeys(aids[::-1]))
    df = df.loc[(df['type'] == 1) | (df['type'] == 2)]
    unique_buys = list(dict.fromkeys(df.aid.tolist()[::-1]))
    if len(unique_aids) >= 20:
        weights = np.logspace(0.5, 1, len(aids), base=2, endpoint=True) - 1
        aids_temp = Counter()
        for aid, w, t in zip(aids, weights, types):
            aids_temp[aid] += w * type_weight_multipliers[t]
        # Boost items related to previous carts/orders via the buy2buy matrix.
        aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
        for aid in aids3:
            aids_temp[aid] += 0.1
        sorted_aids = [k for k, v in aids_temp.most_common(20)]
        return sorted_aids
    # Otherwise extend with the carts/orders and buy2buy co-visitation matrices,
    # and top up with the overall most popular orders.
    aids2 = list(itertools.chain(*[top_20_buys[aid] for aid in unique_aids if aid in top_20_buys]))
    aids3 = list(itertools.chain(*[top_20_buy2buy[aid] for aid in unique_buys if aid in top_20_buy2buy]))
    top_aids2 = [aid2 for aid2, cnt in Counter(aids2 + aids3).most_common(20) if aid2 not in unique_aids]
    result = unique_aids + top_aids2[:20 - len(unique_aids)]
    return result + list(top_orders)[:20 - len(result)]
```
```python
pred_df_clicks = test_df.sort_values(["session", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_clicks(x)
)
pred_df_buys = test_df.sort_values(["session", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_buys(x)
)

clicks_pred_df = pd.DataFrame(pred_df_clicks.add_suffix("_clicks"), columns=["labels"]).reset_index()
orders_pred_df = pd.DataFrame(pred_df_buys.add_suffix("_orders"), columns=["labels"]).reset_index()
carts_pred_df = pd.DataFrame(pred_df_buys.add_suffix("_carts"), columns=["labels"]).reset_index()

pred_df = pd.concat([clicks_pred_df, orders_pred_df, carts_pred_df])
pred_df.columns = ["session_type", "labels"]
pred_df["labels"] = pred_df.labels.apply(lambda x: " ".join(map(str, x)))
pred_df.to_csv("submission.csv", index=False)
pred_df.head()
```
Co-visitation Matrix
Step 1 - Generate Candidates
Some items are frequently clicked and bought together; we use this idea to build a co-visitation matrix.
First, we look at all pairs of events within the same session that are close to each other in time (< 1 day). We compute the co-visitation matrix $M_{aid1, aid2}$ by counting, globally over all sessions, how many times each pair of events occurs.
For each item aid, we then find its 20 most frequent neighbours: $aid2 = \mathrm{argsort}(M[aid])[-20:]$.
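A minimal numpy sketch of that top-20 lookup (my own illustration; it assumes a dense aid-by-aid count matrix `M`, whereas in practice the counts are stored sparsely):

```python
import numpy as np

def top_20_neighbours(M, aid):
    # np.argsort is ascending, so the last 20 indices are the most co-visited aids;
    # reverse them so the most frequent neighbour comes first.
    return np.argsort(M[aid])[-20:][::-1]
```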
Step 2 - ReRank and Choose 20
From the candidate list above, choose the 20 most frequent items as the final predictions.
Candidate generation
```python
import gc
import glob
import multiprocessing
import sys
from collections import Counter

import numpy as np
import pandas as pd
from tqdm import tqdm

def gen_pairs(df):
    # Keep a sample of sessions, and only each session's 30 most recent events.
    df = df.query('session % @SAMPLING == 0').groupby('session', as_index=False, sort=False).apply(lambda g: g.tail(30)).reset_index(drop=True)
    # Pair every event with every other event of the same session.
    df = pd.merge(df, df, on='session')
    # Keep pairs of different aids that are less than one day apart (ts is in milliseconds here).
    pairs = df.query('abs(ts_x - ts_y) < 24 * 60 * 60 * 1000 and aid_x != aid_y')[['session', 'aid_x', 'aid_y']].drop_duplicates()
    return pairs[['aid_x', 'aid_y']].values

def gen_aid_pairs():
    all_pair_chunks = []
    with tqdm(glob.glob('../input/otto-chunk-data-inparquet-format/*_parquet/*'), desc='Chunks') as prog:
        for idx, chunk_file in enumerate(prog):
            with multiprocessing.Pool() as p:
                chunk = pd.read_parquet(chunk_file).drop(columns=['type'])
                pair_chunks = p.map(gen_pairs, np.array_split(chunk, 120))
                pair_chunks = np.concatenate(pair_chunks, axis=0)
                all_pair_chunks.append(pair_chunks)
            if DEBUG and idx >= 3:
                break
            del chunk, pair_chunks
            gc.collect()

    # For every aid, keep its 40 most frequently co-visited aids.
    df = pd.DataFrame(data=np.concatenate(all_pair_chunks), columns=['aid1', 'aid2'])
    top_aids = df.groupby('aid1').apply(lambda df: Counter(df.aid2).most_common(40)).to_dict()
    return top_aids
```
Re-ranking and final selection
```python
import itertools
from collections import Counter

def suggest_aids(df):
    # Use the session's 20 most recent aids as-is if there are enough of them.
    aids = df.tail(20).aid.tolist()
    if len(aids) >= 20:
        return aids
    # Otherwise extend with the most frequently co-visited aids;
    # top_40_cnt maps each aid to its most co-visited aids (built above).
    aids = set(aids)
    new_aids = Counter()
    for aid in aids:
        new_aids.update(top_40_cnt.get(aid, Counter()))
    top_aids2 = [aid2 for aid2, cnt in new_aids.most_common(20) if aid2 not in aids]
    return list(aids) + top_aids2[:20 - len(aids)]

pred_df = test_df.sort_values(["session", "type", "ts"]).groupby(["session"]).apply(
    lambda x: suggest_aids(x)
)
```
How To Build a GBT Ranker Model
🏆 Training an XGBoost Ranker on the GPU 🔥🔥🔥
💡 [polars] Proof of concept: LGBM Ranker🧪🧪🧪
Step 1 - Generate Candidates
Generate the candidate item list using the methods above.
Each row holds one session and one aid, with the following content:
session (i.e. user)
aid (i.e. item)
user features
item features
user-item interaction features
click target (i.e. 0 or 1)
cart target (i.e. 0 or 1)
order target (i.e. 0 or 1)
Step 2 - ReRank and Choose 20
Use the GBT model to re-rank the candidates and predict the final 20 items.
Step 1
Build the dataset for the model: the training data is the first three weeks of the public data, and the validation data is the fourth week.
The validation data is further split into valid A and valid B, where B is the ground truth.
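A minimal sketch of this split, assuming the events sit in a dataframe with columns `session`, `aid`, `ts`, `type` and that `ts` is in seconds; the cutoff logic is my own illustration, not the original validation script:

```python
import pandas as pd

events = pd.read_parquet('train.parquet')            # assumed layout: session, aid, ts, type

# Use the last 7 days as the validation week (hypothetical cutoff).
week4_start = events.ts.max() - 7 * 24 * 60 * 60
train_3w = events[events.ts < week4_start]           # weeks 1-3: used to build features / fit models
valid_week = events[events.ts >= week4_start]        # week 4: validation

# Truncate each validation session at a random event: the earlier part is "valid A"
# (what the model sees), the later part is "valid B" (the ground truth to predict).
cut_ts = valid_week.groupby('session').ts.transform(lambda s: s.sample(1).iloc[0])
valid_A = valid_week[valid_week.ts <= cut_ts]
valid_B = valid_week[valid_week.ts > cut_ts]
```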
First produce 50 candidate item ids for each session, which gives a dataframe of shape (number_of_sessions * 50, 2), like:
| session | aid |
| --- | --- |
| 1 | 1234 |
| 1 | 9841 |
| 2 | 5845 |
| 2 | 8984 |
Step 2
Create item features, using the training data and validation data A.
```python
item_features = train.groupby('aid').agg({'aid': 'count', 'session': 'nunique', 'type': 'mean'})
item_features.columns = ['item_item_count', 'item_user_count', 'item_buy_ratio']
```
Step 3
Create user features, using validation data A.
```python
user_features = train.groupby('session').agg({'session': 'count', 'aid': 'nunique', 'type': 'mean'})
user_features.columns = ['user_user_count', 'user_item_count', 'user_buy_ratio']
```
Step 4
Create user-item interaction features, using validation data A.
Many ideas are possible; no concrete implementation is given. For example:
Create user-item click interaction features (see the sketch below).
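A hedged sketch of one such feature, my own illustration rather than the original write-up's code (it assumes `train` holds the events of validation data A and that type 0 means click):

```python
# How many times each session clicked each aid.
user_item_clicks = (
    train.loc[train['type'] == 0]                    # type 0 = click (assumption)
         .groupby(['session', 'aid'])
         .size()
         .rename('user_item_click_count')
         .reset_index()
)

# Merge onto the candidate dataframe, which has one row per (session, aid) pair.
candidates = candidates.merge(user_item_clicks, on=['session', 'aid'], how='left').fillna(0)
```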
Step 5
Merge the features into the candidate dataframe.
```python
candidates = candidates.merge(item_features, left_on='aid', right_index=True, how='left').fillna(-1)
candidates = candidates.merge(user_features, left_on='session', right_index=True, how='left').fillna(-1)
```
The candidate dataframe then looks like:
| session | aid | item_feat1 | item_feat2 | user_feat1 | user_feat2 |
| --- | --- | --- | --- | --- | --- |
| 1 | 1234 | 1 | 2 | 3 | 4 |
| 1 | 9841 | 5 | 6 | 7 | 8 |
| 2 | 5845 | 9 | 10 | 11 | 12 |
| 2 | 8984 | 13 | 14 | 15 | 16 |
Step 6
Build the ground truth, e.g. from test_labels.parquet:
| session | type | ground_truth |
| --- | --- | --- |
| 1 | carts | [5456,4545,98741,2355] |
| 2 | carts | [1257,8653,2547] |
Then convert it into the following format:
| session | aid | cart |
| --- | --- | --- |
| 1 | 5456 | 1 |
| 1 | 4545 | 1 |
| 1 | 98741 | 1 |
Then merge it into the candidate dataframe:
```python
candidates = candidates.merge(cart_target, on=['user', 'item'], how='left').fillna(0)
```
The candidates dataframe then looks like:
| session | aid | item_feat1 | item_feat2 | user_feat1 | user_feat2 | cart |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1234 | 1 | 2 | 3 | 4 | 0 |
| 1 | 9841 | 5 | 6 | 7 | 8 | 1 |
| 2 | 5845 | 9 | 10 | 11 | 12 | 0 |
| 2 | 8984 | 13 | 14 | 15 | 16 | 1 |
Step 7
Training, without using the user and aid columns:
```python
import xgboost as xgb
from sklearn.model_selection import GroupKFold

skf = GroupKFold(n_splits=5)
for fold, (train_idx, valid_idx) in enumerate(skf.split(candidates, candidates['click'], groups=candidates['user'])):
    X_train = candidates.loc[train_idx, FEATURES]
    y_train = candidates.loc[train_idx, 'click']
    X_valid = candidates.loc[valid_idx, FEATURES]
    y_valid = candidates.loc[valid_idx, 'click']

    dtrain = xgb.DMatrix(X_train, y_train, group=[50] * (len(train_idx) // 50))
    dvalid = xgb.DMatrix(X_valid, y_valid, group=[50] * (len(valid_idx) // 50))

    xgb_parms = {'objective': 'rank:pairwise', 'tree_method': 'gpu_hist'}
    model = xgb.train(xgb_parms,
                      dtrain=dtrain,
                      evals=[(dtrain, 'train'), (dvalid, 'valid')],
                      num_boost_round=1000,
                      verbose_eval=100)
    model.save_model(f'XGB_fold{fold}_click.xgb')
```
Step 8
Inference: for inference we build a new candidate dataframe, this time from Kaggle's test data. We then build item features from all 4 weeks of Kaggle train plus the 1 week of Kaggle test, and user features from Kaggle test only. We merge these features into our candidates, use the saved models to predict click scores, and finally sort the predictions and choose the top 20.
```python
preds = np.zeros(len(test_candidates))
for fold in range(5):
    model = xgb.Booster()
    model.load_model(f'XGB_fold{fold}_click.xgb')
    model.set_param({'predictor': 'gpu_predictor'})
    dtest = xgb.DMatrix(data=test_candidates[FEATURES])
    preds += model.predict(dtest) / 5

predictions = test_candidates[['user', 'item']].copy()
predictions['pred'] = preds

predictions = predictions.sort_values(['user', 'pred'], ascending=[True, False]).reset_index(drop=True)
predictions['n'] = predictions.groupby('user').item.cumcount().astype('int8')
predictions = predictions.loc[predictions.n < 20]

sub = predictions.groupby('user').item.apply(list)
sub = sub.to_frame().reset_index()
sub.item = sub.item.apply(lambda x: " ".join(map(str, x)))
sub.columns = ['session_type', 'labels']
sub.session_type = sub.session_type.astype('str') + '_clicks'
```
💡 Word2Vec How-to [training and submission]🚀🚀🚀
A session where one action follows another action is very much like a sentence!
Similarly, here we can exploit the fact that item ids appearing close together in a sequence are likely to share some similarity.
So we train embedding vectors for the item ids with a word2vec model and use these vectors to compute similarities between item ids.
This way, given an item id, we can find similar item ids.
Use word2vec to directly obtain 20 items
```python
from collections import defaultdict

import numpy as np
import pandas as pd
import polars as pl
from annoy import AnnoyIndex
from gensim.models import Word2Vec

# Treat every session as a "sentence" of aids and train word2vec on it.
sentences_df = pl.concat([train, test]).groupby('session').agg(
    pl.col('aid').alias('sentence')
)
sentences = sentences_df['sentence'].to_list()

w2vec = Word2Vec(sentences=sentences, vector_size=32, min_count=1, workers=4)

# Build an Annoy index over the learned embeddings for fast nearest-neighbour lookup.
aid2idx = {aid: i for i, aid in enumerate(w2vec.wv.index_to_key)}
index = AnnoyIndex(32, 'euclidean')
for aid, idx in aid2idx.items():
    index.add_item(idx, w2vec.wv.vectors[idx])
index.build(10)

sample_sub = pd.read_csv('../input/otto-recommender-system/sample_submission.csv')

# Pick the 20 most relevant items for each test session.
test_session_AIDs = test.to_pandas().reset_index(drop=True).groupby('session')['aid'].apply(list)
test_session_types = test.to_pandas().reset_index(drop=True).groupby('session')['type'].apply(list)

labels = []
type_weight_multipliers = {0: 1, 1: 6, 2: 3}
for AIDs, types in zip(test_session_AIDs, test_session_types):
    if len(AIDs) >= 20:
        # Long sessions: re-rank the session's own aids with recency and type weights.
        weights = np.logspace(0.1, 1, len(AIDs), base=2, endpoint=True) - 1
        aids_temp = defaultdict(lambda: 0)
        for aid, w, t in zip(AIDs, weights, types):
            aids_temp[aid] += w * type_weight_multipliers[t]
        sorted_aids = [k for k, v in sorted(aids_temp.items(), key=lambda item: -item[1])]
        labels.append(sorted_aids[:20])
    else:
        # Short sessions: extend with the word2vec nearest neighbours of the most recent aid.
        AIDs = list(dict.fromkeys(AIDs[::-1]))
        most_recent_aid = AIDs[0]
        nns = [w2vec.wv.index_to_key[i] for i in index.get_nns_by_item(aid2idx[most_recent_aid], 21)[1:]]
        labels.append((AIDs + nns)[:20])
```
Use word2vec to obtain candidate items
Similar to the co-visitation matrix, word2vec can be used to retrieve the most related items.
The data looks like:
| session | aid | type |
| --- | --- | --- |
| 1 | 10 | 0 |
| 1 | 20 | 0 |
| 2 | 20 | 1 |
| 2 | 30 | 0 |
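One way to build such candidates, sketched under my own assumptions and reusing `w2vec`, `index`, `aid2idx`, and `test_session_AIDs` from the code above, is to take each session's most recent aid and turn its Annoy neighbours into a (session, aid, rank) frame:

```python
import pandas as pd

rows = []
for session, AIDs in test_session_AIDs.items():
    most_recent_aid = AIDs[-1]                                          # last event of the session
    nn_idxs = index.get_nns_by_item(aid2idx[most_recent_aid], 21)[1:]   # skip the item itself
    for rank, idx in enumerate(nn_idxs, start=1):
        rows.append((session, w2vec.wv.index_to_key[idx], rank))

word2vec_candidates = pd.DataFrame(rows, columns=['session', 'aid', 'rank'])
```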
Suppose that, with code like the above, we have already retrieved the most related items:

| session | aid |
| --- | --- |
| 1 | 11 |
| 1 | 20 |
| 2 | 25 |
| 2 | 6 |

A few more steps follow:

1. Add ordering information to our candidates. The word2vec model ranks items by similarity score, so we need to record that ranking:

| session | aid | rank |
| --- | --- | --- |
| 1 | 11 | 1 |
| 1 | 20 | 2 |
| 2 | 25 | 1 |
| 2 | 6 | 2 |

2. Merge this information into the candidates:

| session | aid | rank | type |
| --- | --- | --- | --- |
| 1 | 11 | 1 | null |
| 1 | 20 | 2 | 0 |
| 1 | 10 | null | 0 |
| 2 | 25 | 1 | null |
| 2 | 6 | 2 | null |
| 2 | 20 | null | 1 |
| 2 | 30 | null | 0 |

3. Use a Ranker model for prediction. [💡 [2 methods] How-to ensemble predictions 🏅🏅🏅](https://www.kaggle.com/code/radek1/2-methods-how-to-ensemble-predictions) ensembles the predictions in two ways:

- Voting ensemble
- Weighted voting ensemble, giving better submissions a larger weight

```python
def read_sub(path, weight=1):
    '''a helper function for loading and preprocessing submissions'''
    return (
        pl.read_csv(path)
          .with_column(pl.col('labels').str.split(by=' '))
          .with_column(pl.lit(weight).alias('vote'))
          .explode('labels')
          .rename({'labels': 'aid'})
          .with_column(pl.col('aid').cast(pl.UInt32))
          .with_column(pl.col('vote').cast(pl.UInt8))
    )

subs = [read_sub(path) for path in paths]                                        # unweighted voting
subs = [read_sub(path, weight) for path, weight in zip(paths, [1, 0.55, 0.55])]  # weighted voting
```
After reading, the data looks like:
| session_type | aid | vote |
| --- | --- | --- |
| 1_clicks | 1234 | 1 |
| 1_clicks | 9841 | 1 |
| 2_clicks | 5845 | 1 |
Due to memory constraints, we can only join the submissions:
```python
subs = subs[0].join(subs[1], how='outer', on=['session_type', 'aid']) \
              .join(subs[2], how='outer', on=['session_type', 'aid'], suffix='_right2')
subs.head()
```
The merged data:
| session_type | aid | vote | vote_right | vote_right2 |
| --- | --- | --- | --- | --- |
| 1_clicks | 1234 | 1 | 1 | 1 |
| 1_clicks | 9841 | 1 | null | 1 |
| 2_clicks | 5845 | 1 | 1 | null |
| 2_clicks | 8984 | 1 | null | null |
Fill the null values with 0, then sum the votes and sort:
```python
subs = (subs
    .fill_null(0)
    .with_column((pl.col('vote') + pl.col('vote_right') + pl.col('vote_right2')).alias('vote_sum'))
    .drop(['vote', 'vote_right', 'vote_right2'])
    .sort(by='vote_sum')
    .reverse()
)
```
The data now looks like:
| session_type | aid | vote_sum |
| --- | --- | --- |
| 1_clicks | 1234 | 3 |
| 2_clicks | 5845 | 2 |
| 1_clicks | 9841 | 2 |
Then select the top 20 items for each session_type and aggregate them into a list:
```python
preds = subs.groupby('session_type').agg([
    pl.col('aid').head(20).alias('labels')
])
preds = preds.with_column(pl.col('labels').apply(lambda lst: ' '.join([str(aid) for aid in lst])))
```
💡Matrix Factorization [PyTorch+Merlin Dataloader]
Similar to word2vec, use matrix factorization to obtain item embedding vectors.
💡Matrix Factorization with GPU [PyTorch+Merlin Dataloader]
Use PyTorch's Embedding layer to train item embedding vectors, then use these vectors to compute similarities between item ids.
```python
train_pairs = cudf.concat([train, test])[['session', 'aid']]
del train, test

train_pairs['aid_next'] = train_pairs.groupby('session').aid.shift(-1)
train_pairs = train_pairs[['aid', 'aid_next']].dropna().reset_index(drop=True)
```
```python
import torch
from torch import nn

class MatrixFactorization(nn.Module):
    def __init__(self, n_aids, n_factors):
        super().__init__()
        self.aid_factors = nn.Embedding(n_aids, n_factors, sparse=True)

    def forward(self, aid1, aid2):
        aid1 = self.aid_factors(aid1)
        aid2 = self.aid_factors(aid2)
        return (aid1 * aid2).sum(dim=1)

class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)
```
```python
model.to('cuda')

for epoch in range(num_epochs):
    for batch, _ in train_dl_merlin:
        model.train()
        losses = AverageMeter('Loss', ':.4e')

        aid1, aid2 = batch['aid'], batch['aid_next']
        aid1 = aid1.to('cuda')
        aid2 = aid2.to('cuda')
        # Positive pairs are consecutive aids; negatives come from shuffling aid_next.
        output_pos = model(aid1, aid2)
        output_neg = model(aid1, aid2[torch.randperm(aid2.shape[0])])

        output = torch.cat([output_pos, output_neg])
        targets = torch.cat([torch.ones_like(output_pos), torch.zeros_like(output_pos)])
        loss = criterion(output, targets)
        losses.update(loss.item())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        accuracy = AverageMeter('accuracy')
        for batch, _ in valid_dl_merlin:
            aid1, aid2 = batch['aid'], batch['aid_next']
            output_pos = model(aid1, aid2)
            output_neg = model(aid1, aid2[torch.randperm(aid2.shape[0])])
            accuracy_batch = torch.cat([output_pos.sigmoid() > 0.5, output_neg.sigmoid() < 0.5]).float().mean()
            accuracy.update(accuracy_batch, aid1.shape[0])

    print(f'{epoch+1:02d}: * TrainLoss {losses.avg:.3f}  * Accuracy {accuracy.avg:.3f}')
```
```python
from sklearn.neighbors import NearestNeighbors

embeddings = model.aid_factors.weight.detach().cpu().numpy()

knn = NearestNeighbors(n_neighbors=21, metric='euclidean')
knn.fit(embeddings)
_, aid_nns = knn.kneighbors(embeddings)

sample_sub = pd.read_csv('../input/otto-recommender-system/sample_submission.csv')
test = cudf.read_parquet('../input/otto-full-optimized-memory-footprint/test.parquet')

session_types = ['clicks', 'carts', 'orders']
gr = test.reset_index(drop=True).to_pandas().groupby('session')
test_session_AIDs = gr['aid'].apply(list)
test_session_types = gr['type'].apply(list)

labels = []
type_weight_multipliers = {0: 1, 1: 6, 2: 3}
for AIDs, types in zip(test_session_AIDs, test_session_types):
    if len(AIDs) >= 20:
        weights = np.logspace(0.1, 1, len(AIDs), base=2, endpoint=True) - 1
        aids_temp = defaultdict(lambda: 0)
        for aid, w, t in zip(AIDs, weights, types):
            aids_temp[aid] += w * type_weight_multipliers[t]
        sorted_aids = [k for k, v in sorted(aids_temp.items(), key=lambda item: -item[1])]
        labels.append(sorted_aids[:20])
    else:
        # For short sessions, extend with the nearest neighbours of the most recent aid.
        AIDs = list(dict.fromkeys(AIDs[::-1]))
        most_recent_aid = AIDs[0]
        nns = list(aid_nns[most_recent_aid])
        labels.append((AIDs + nns)[:20])

labels_as_strings = [' '.join([str(l) for l in lls]) for lls in labels]
predictions = pd.DataFrame(data={'session_type': test_session_AIDs.index, 'labels': labels_as_strings})

prediction_dfs = []
for st in session_types:
    modified_predictions = predictions.copy()
    modified_predictions.session_type = modified_predictions.session_type.astype('str') + f'_{st}'
    prediction_dfs.append(modified_predictions)

submission = pd.concat(prediction_dfs).reset_index(drop=True)
submission.to_csv('submission.csv', index=False)
```
226th (?!) Place Solution & Two-cents from a First-timer
Item Co-visitation Matrix
entirely based on Chris’ notebook
Order matrix: Click/cart/order to click/cart/order with type weighting
Buy2buy matrix: Cart/order to cart/order
Click matrix: click/cart/order to clicks with time weighting
Feature Generation
Item features (for each aid)
Count of events (click/cart/order)
Quarter of day (QoD) with most events (0-3)
Day of week (DoW) with most events (0-6)
User features (for each session)
Count of events (click/cart/order) and interacted items (aid)
QoD with most events (0-3)
DoW with most events (0-6)
Number of days with events
Days from first to last events
User-item features (for each session-aid pair)
Count of events (click/cart/order) and interacted items (aid)
QoD with most events in both categorical (0-3) and one-hot encoded (0/1 for each) format
DoW with most events in both categorical (0-6) and one-hot encoded (0/1 for each) format
last_n = item_chronological_rank / user_total_event_count
last_ts = (user_item_last_timestamp - start_week_timestamp) / (end_week_timestamp - start_week_timestamp)
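A hedged pandas sketch of the two features above; the dataframe layout (columns `session`, `aid`, `ts`) and the week bounds are my assumptions, not the original solution's code:

```python
# df: one row per event with columns session, aid, ts (assumed layout), sorted in time order.
df = df.sort_values(['session', 'ts'])

# last_n = item_chronological_rank / user_total_event_count
df['item_rank'] = df.groupby('session').cumcount() + 1
df['user_total_event_count'] = df.groupby('session')['ts'].transform('size')
df['last_n'] = df['item_rank'] / df['user_total_event_count']

# last_ts = (user_item_last_timestamp - start_week_timestamp) / (end_week_timestamp - start_week_timestamp)
start_week_ts, end_week_ts = df.ts.min(), df.ts.max()   # assumption: week bounds taken from the data
user_item_last_ts = df.groupby(['session', 'aid']).ts.max()
last_ts = (user_item_last_ts - start_week_ts) / (end_week_ts - start_week_ts)
```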
Candidate Selection
partly based on Chris’ notebook
For each session, I select the top X most relevant items for each event type, based on:
The number of times the session clicked/carted/ordered the item
The sum of co-visitation weights
Whether the item is among the most clicked/most bought items of the week
Ranker
Rule-based ranker in Chris’ notebook
XGBRanker with the rank:pairwise objective
The model overfits and requires a lot of hyperparameter tuning.
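Since the post only notes the overfitting, here is a hedged sketch of an XGBRanker with a few common regularisation knobs; all parameter values are illustrative, not the author's:

```python
import xgboost as xgb

ranker = xgb.XGBRanker(
    objective='rank:pairwise',
    tree_method='gpu_hist',
    learning_rate=0.05,       # smaller learning rate
    max_depth=6,              # limit tree depth
    subsample=0.8,            # row subsampling
    colsample_bytree=0.8,     # feature subsampling
    n_estimators=500,
)

# group sizes = number of candidate rows per session, e.g. [50, 50, ...] (hypothetical names)
ranker.fit(
    X_train, y_train, group=train_groups,
    eval_set=[(X_valid, y_valid)], eval_group=[valid_groups],
)
```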
Solution Ensemble
Ensemble of the XGBRanker above and public submissions, using their public LB scores as weights
Ensemble of the above two ranker methods and public submissions
6. What I have learned
read the discussion forums and public notebooks
try as many ideas as possible
know every line of code you write