A Brief Look Back at KDD Cup 2012


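After two months of hard work, we finished as runner-up in KDD Cup 2012 (Track 1)!
I think placing well in such a well-known data mining competition came down to skill, luck, and the will to hold on to the end.
The algorithms we used in the competition will be written up as a paper and published at this year's SIGKDD workshop.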

The final leaderboard screenshot follows, kept here as a memento. Our team name is Shanda Innovations, and our final score is 0.41874.



The algorithms used in the competition have been written up as a paper; its title and abstract follow.

Context-aware Ensemble of Multifaceted Factorization Models for Recommendation Prediction in Social Networks


Abstract

This paper describes the solution of the Shanda Innovations team to Task 1 of KDD-Cup 2012. A novel approach called Multifaceted Factorization Models is proposed to incorporate a great variety of features in social networks. Social relationships and actions between users are integrated as implicit feedback to improve recommendation accuracy. Keywords, tags, profiles, time, and some other features are also utilized for modeling user interests. In addition, user behaviors are modeled from the durations of recommendation records. A context-aware ensemble framework is then applied to combine multiple predictors and produce the final recommendation results. The proposed approach obtained 0.43959 (public score) / 0.41874 (private score) on the testing dataset, which achieved 2nd place in the KDD-Cup competition.

 

Introduction

Social Networking Services (SNS) have gained tremendous popularity in recent years, and voluminous information is generated from social networks every day. It is desirable to build an intelligent recommender system that efficiently identifies what interests users. The task of KDD-Cup 2012 Track 1 is to develop such a system, which aims to capture users' interests and find the items that fit users' tastes and are most likely to be followed. The dataset, provided by Tencent Weibo, one of the largest social networking websites in China, is made up of 2,320,895 users, 6,095 items, 73,209,277 training records, and 34,910,937 testing records, which is large relative to other publicly released datasets. Besides, it provides richer information in multiple domains, including user profiles, item categories, keywords, and the social graph. Timestamps for recommendations are also given for performing session analysis. For each user in the testing dataset, an ordered list of recommended items is required. Mean Average Precision (MAP) is used to evaluate the results submitted by 658 teams around the world.
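To make the evaluation metric concrete, here is a minimal Python sketch of Mean Average Precision over per-user ranked lists. The cutoff `k=3`, the normalization by `min(number of relevant items, k)`, and the function names are assumptions for illustration; the text above only says that MAP was used, so the official scorer may differ in these details.

```python
def average_precision_at_k(ranked_items, relevant_items, k=3):
    """Average precision of one ranked list, truncated at k (k is an assumed cutoff)."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    seen = set()
    hits = 0
    score = 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        # Count each relevant item once, at the rank where it first appears.
        if item in relevant and item not in seen:
            hits += 1
            score += hits / rank  # precision at this rank
        seen.add(item)
    # Assumed normalization: the best achievable number of hits within k.
    return score / min(len(relevant), k)


def mean_average_precision(predictions, truth, k=3):
    """Mean of per-user average precision over all users in `predictions`."""
    if not predictions:
        return 0.0
    total = sum(
        average_precision_at_k(ranked, truth.get(user, ()), k)
        for user, ranked in predictions.items()
    )
    return total / len(predictions)


# Hypothetical toy data: two users, three recommended items each.
predictions = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"]}
truth = {"u1": {"a", "c"}, "u2": {"y"}}
map_score = mean_average_precision(predictions, truth, k=3)
```

Because precision is accumulated only at the ranks where a hit occurs, placing a followed item first is rewarded much more than placing it third, which is why the ordering of the list matters and not just its contents.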

Compared to traditional recommender problems, e.g., the Netflix Prize, where the ratings users give to movies are predicted, the setting of KDD-Cup 2012 appears more complex. Firstly, there are much richer features between users on social networking websites. In the social graph, users can follow each other. Besides, three kinds of actions can be taken between users: "comment" (add comments to someone's tweet), "retweet" (repost a tweet and append some comments), and "at" (notify another user). User profiles contain rich information, such as gender, age, category, keywords, and tags. So models that are capable of integrating various features are required. Secondly, the items to be recommended are specific users, each of which can be a person, a group, or an organization. Compared to the items of traditional recommender systems, e.g., books on Amazon or movies on Netflix, items on social network sites not only have profiles but also have their own behaviors and social relations. As a result, item modeling turns out to be more complicated. Thirdly, the training data in social networks is quite noisy, and the cold-start problem also poses a severe challenge due to the very limited information available for a large number of users in the testing dataset. Effective preprocessing is needed to cope with this challenge.

 

 

In this paper we present a novel approach called Context-aware Ensemble of Multifaceted Factorization Models. Various features are extracted from the training data and integrated into the proposed models. A two-stage training framework and a context-aware ensemble method are introduced, which helped us achieve higher accuracy. We also give a brief introduction to the session analysis method and the supplement strategy that we used in the competition to improve the quality of the training data.
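As a rough illustration of how implicit feedback can enter a factorization model, the sketch below scores a user-item pair in the style of SVD++, a well-known technique in which the user's latent vector is augmented by vectors of the items the user has already followed or acted on. This is not the paper's Multifaceted Factorization Model, whose exact formulation is given in the full text; the function name, parameters, and toy numbers are all hypothetical.

```python
import math


def predict_score(user_factors, item_factors, implicit_factors, followed_items,
                  item_bias=0.0, global_mean=0.0):
    """SVD++-style score (illustrative only, not the authors' exact model).

    user_factors:     latent vector p_u of the user
    item_factors:     latent vector q_i of the candidate item
    implicit_factors: dict mapping item id -> implicit-feedback vector y_j
    followed_items:   ids of items the user already follows or acted on
    """
    profile = list(user_factors)
    if followed_items:
        # Scale by |N(u)|^(-1/2) so heavy users do not dominate the sum.
        norm = 1.0 / math.sqrt(len(followed_items))
        for j in followed_items:
            for f, value in enumerate(implicit_factors[j]):
                profile[f] += norm * value
    dot = sum(p * q for p, q in zip(profile, item_factors))
    return global_mean + item_bias + dot


# Hypothetical toy example with 2-dimensional factors.
implicit = {name: [0.0, 1.0] for name in ("a", "b", "c", "d")}
score_with_feedback = predict_score([1.0, 0.0], [0.5, 2.0], implicit, ["a", "b", "c", "d"])
score_without_feedback = predict_score([1.0, 0.0], [0.5, 2.0], implicit, [])
```

In this toy case the user's explicit vector alone gives a low score for the item, while the implicit-feedback term shifts the user profile toward the item's second dimension and raises the score, which is the intuition behind using follow and action history as extra signal.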

The rest of the paper is organized as follows. Section 2 introduces the preliminaries of our methods. Section 3 presents the preprocessing method we used. In Section 4, we propose Multifaceted Factorization Models, which are adopted in the final solution. A context-aware ensemble and user behavior modeling methods are proposed in Section 5. Experimental results are given in Section 6, and conclusions and future work are given in Section 7.

 

The full text of the paper is available at the following link:

https://kaggle2.blob.core.windows.net/competitions/kddcup2012/2748/media/Shanda3.pdf



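The whole competition lasted more than two months, with every kind of high and low along the way,

something only those who lived through it can really appreciate.

During KDD Cup I changed my status signature many times, and each one was a true reflection of my mood at the time. I have collected some of them below as a memento.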


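The journey

The easy road gets harder the further you go; the hard road gets easier

Country bumpkin

leaderboard's bug

Longmen Yufu (龙门鱼府)

Implicit feedback works really well

Rising step by step, like sesame flowers blooming up the stalk

pairwise

Karaoke master

Rowing against the current: if you do not advance, you fall back

Advance one pawn step every day

Converging endlessly to 0.39

The more we merge, the more hopeless it gets

Back to the source of the data

Recsys2012

Kaidun Xiaomei (凯顿小妹)

The Pursuit of Happyness

in the money!

Knife work

The Wailing Wall

Yesterday, today, and tomorrow

no.1!

Steffen's three signature moves

Perseverance is victory

Keeping watch on the last night

Restart 14

The last savior

Submit, in the final 5 minutes

Ashes to ashes, dust to dust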

