Webcollector + Spring + MVC 搭建应用初探(六)(Lenskit 推荐系统实例)
The recommender example in the previous post, Webcollector + Spring + MVC 搭建应用初探(五)(Crab推荐系统实例), was actually quite crude: it picked only 10 user samples from a single specified category. The sample was restricted by category because of data sparsity (multicollinearity): too many sparse samples can easily lead the recommender to empty results, while keeping only the category restriction and enlarging the sample (i.e. raising the 10-user limit) makes solving very slow. The slowness has several causes. On one hand, useless tags pollute the result; some tags may appear only once among the videos a given user has bookmarked. On the other hand, the Crab Python framework is immature, which shows both in the code fixes I had to make at install time (see Crab 初探(一)) and in published comparisons with other recommender frameworks. To address these problems, and in the hope of scaling to large samples, the work below revolves around two changes: dropping tags with frequency 1 when building the base data, and switching to a more mature framework (here the LensKit Java recommender framework). The processing below can therefore be seen as a trivial change to the data-preparation step, after which LensKit's API is called directly. The LensKit client code is the example code from https://github.com/lenskit/lenskit-hello.
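The frequency-1 filter is simple enough to state in a few lines. A minimal Python 3 sketch of the idea (`filter_rare_tags` is an illustrative name, not part of the project code below):

```python
from collections import Counter

def filter_rare_tags(tags, min_count=2):
    """Drop tags whose per-user frequency is below min_count
    (min_count=2 removes exactly the frequency-1 tags discussed above)."""
    counts = Counter(tags)
    return {tag: c for tag, c in counts.items() if c >= min_count}

print(filter_rare_tags(["美食视频", "美食视频", "吃播", "彩妆"]))
# → {'美食视频': 2}
```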
The Groovy model-configuration scripts are the same as in that example, so all that needs to be generated are the CSV files under the project's data folder. (In the original project those files are simple data derived from a well-known dataset; see that project's description for details.) The Python code that generates the files is given below:
```python
#coding: utf-8
import csv
import redis
from time import time
from scikits.crab.datasets.base import Bunch

redis = redis.StrictRedis(**{"host": "127.0.0.1", "port": 6379})

videosFile = file("videos.csv", "wb")
ratingsFile = file("ratings.csv", "wb")
tagsFile = file("tags.csv", "wb")

videosWriter = csv.writer(videosFile)
videosWriter.writerow(["videoId", "aid", "genres"])
#videosData = [(0, "", "")]
videosData = []

ratingsWriter = csv.writer(ratingsFile)
ratingsWriter.writerow(["userId", "videoId", "rating", "timestamp"])
#ratingsData = [(0, 0, 0.0, "")]
ratingsData = []

tagsWriter = csv.writer(tagsFile)
tagsWriter.writerow(["userId", "videoId", "tag", "timestamp"])
#tagsData = [(0, 0, "", "")]
tagsData = []

def generateDict(key):
    dictRequire = dict()
    for aidHash in redis.lrange(key, 0, -1):
        tagList = aidHash.split(",")[:-1]
        for tag in tagList:
            if dictRequire.get(tag):
                dictRequire[tag] += 1
            else:
                dictRequire[tag] = 1
    max_value = max(dictRequire.values())
    countDict = dict()
    for tag, count in dictRequire.items():
        if count != 1:
            countDict[tag] = int(float(count) / max_value * 10)
    return countDict

def appendDataTest(key, item_index, user_index, bunch, countDict):
    user_ori_id = key.split(":")[-1]
    user_dict = bunch.user_ids
    ori_user_id = user_index
    if user_ori_id not in user_dict.values():
        user_dict[user_index] = user_ori_id
        user_index += 1
    item_dict = bunch.item_ids
    for tag, count in countDict.items():
        if tag not in item_dict.values():
            ori_item_id = item_index
            item_dict[item_index] = tag
            item_index += 1
            videosData.append((ori_item_id, tag, tag))
        else:
            for kk, vv in item_dict.items():
                if tag == vv:
                    ori_item_id = kk
                    break
        ratingsData.append((ori_user_id, ori_item_id, count, str(int(time()))))
        tagsData.append((ori_user_id, ori_item_id, tag, str(int(time()))))
    return (item_index, user_index, bunch)

bunch = Bunch(DESCR = "", data = dict(), item_ids = dict(), user_ids = dict())
item_index = 1
user_index = 1
item_max_count = 2000
fanTagsAidListFormat = "Aid:BiliBili:%s"
for key in redis.keys(fanTagsAidListFormat % "*")[:item_max_count]:
    countDict = generateDict(key)
    item_index, user_index, bunch = appendDataTest(key, item_index, user_index, bunch, countDict)

videosWriter.writerows(videosData)
videosFile.close()
ratingsWriter.writerows(ratingsData)
ratingsFile.close()
tagsWriter.writerows(tagsData)
tagsFile.close()
```
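The normalization inside `generateDict` above is worth spelling out: each tag count is divided by the user's maximum tag count and scaled into [1, 10], so one obsessively-used tag cannot dwarf everything else. A standalone Python 3 sketch of the same arithmetic (`normalize_counts` is a hypothetical name for illustration):

```python
def normalize_counts(count_dict, scale=10):
    """Mimics generateDict: drop frequency-1 tags, then scale each count
    by the user's maximum count into [1, scale]."""
    max_value = max(count_dict.values())
    return {tag: int(float(count) / max_value * scale)
            for tag, count in count_dict.items() if count != 1}

print(normalize_counts({"美食视频": 40, "吃播": 20, "彩妆": 4, "日常": 1}))
# → {'美食视频': 10, '吃播': 5, '彩妆': 1}
```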
Python's strength here is its simple data structures, which make this kind of data munging convenient. One point worth mentioning: when producing the preference for each tag, generateDict normalizes it (dividing each tag frequency by the maximum frequency among all tags the user follows, then mapping into [1, 10]). This guards against extreme tag counts for some users (very large values) causing the recommender to return similar results (dominated by the extreme tags) for different users; this step was added after running into exactly that problem. Tags with frequency 1 are filtered out at the same time. appendDataTest plays the same role as the appendData used in the previous post. With these base data in place, the recommender can be run. The code is the hello example from the GitHub repository mentioned above; for completeness it is reproduced here:
```java
package BiliBiliRecommender;

/**
 * Created by ehang on 2017/2/5.
 */
import com.google.common.base.Throwables;
import org.lenskit.LenskitConfiguration;
import org.lenskit.LenskitRecommender;
import org.lenskit.LenskitRecommenderEngine;
import org.lenskit.api.ItemRecommender;
import org.lenskit.api.Result;
import org.lenskit.api.ResultList;
import org.lenskit.config.ConfigHelpers;
import org.lenskit.data.dao.DataAccessObject;
import org.lenskit.data.dao.file.StaticDataSource;
import org.lenskit.data.entities.CommonAttributes;
import org.lenskit.data.entities.CommonTypes;
import org.lenskit.data.entities.Entity;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class BiliBiliLenskit implements Runnable {
    private static final Logger logger = LoggerFactory.getLogger(BiliBiliLenskit.class);

    public static void main(String[] args) {
        BiliBiliLenskit hello = new BiliBiliLenskit(new String[] {"8"});
        hello.run();
    }

    private Path dataFile = Paths.get("data/videolens.yml");
    private List<Long> users;

    public BiliBiliLenskit(String[] args) {
        users = new ArrayList<>(args.length);
        for (String arg: args) {
            users.add(Long.parseLong(arg));
        }
    }

    @Override
    public void run() {
        DataAccessObject dao;
        try {
            StaticDataSource data = StaticDataSource.load(dataFile);
            dao = data.get();
        } catch (IOException e) {
            logger.error("cannot load data", e);
            throw Throwables.propagate(e);
        }
        LenskitConfiguration config = null;
        try {
            config = ConfigHelpers.load(new File("etc/item-item.groovy"));
        } catch (IOException e) {
            throw new RuntimeException("could not load configuration", e);
        }
        LenskitRecommenderEngine engine = LenskitRecommenderEngine.build(config, dao);
        logger.info("built recommender engine");
        try (LenskitRecommender rec = engine.createRecommender(dao)) {
            logger.info("obtained recommender from engine");
            ItemRecommender irec = rec.getItemRecommender();
            assert irec != null;
            for (long user: users) {
                ResultList recs = irec.recommendWithDetails(user, 10, null, null);
                System.out.format("Recommendations for user %d:\n", user);
                for (Result item: recs) {
                    Entity itemData = dao.lookupEntity(CommonTypes.ITEM, item.getId());
                    String name = null;
                    if (itemData != null) {
                        name = itemData.maybeGet(CommonAttributes.NAME);
                    }
                    System.out.format("\t%d (%s): %.2f\n", item.getId(), name, item.getScore());
                }
            }
        }
    }
}
```
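Before pointing videolens.yml at the generated files, it is cheap to sanity-check that each CSV carries the expected header and a consistent column count; a malformed row is much easier to diagnose here than inside LensKit's loader. A hedged Python 3 sketch (`check_csv` and `EXPECTED_HEADERS` are illustrative helpers, not part of lenskit-hello):

```python
import csv
import io

# Headers as written by the generation script above.
EXPECTED_HEADERS = {
    "videos.csv":  ["videoId", "aid", "genres"],
    "ratings.csv": ["userId", "videoId", "rating", "timestamp"],
    "tags.csv":    ["userId", "videoId", "tag", "timestamp"],
}

def check_csv(fileobj, expected_header):
    """Verify the header row and that every data row has the same width."""
    rows = list(csv.reader(fileobj))
    if not rows or rows[0] != expected_header:
        return False
    return all(len(row) == len(expected_header) for row in rows[1:])

sample = io.StringIO("userId,videoId,rating,timestamp\n8,87,10,1486272504\n")
print(check_csv(sample, EXPECTED_HEADERS["ratings.csv"]))
# → True
```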
Here recommendations are produced for the user with user_id 8. Unlike the earlier Crab setup, the model is built from 2000 samples (a large sample compared to 10), and after all the preprocessing above it runs no slower than Crab did on small samples; in my experiments Crab was already painfully slow at just 50 samples. The switch from Python to Java was motivated not only by speed but also by the bias a small sample can introduce, i.e. unreliable recommendations. User 8's existing tags are:
8,87,吃货,1486272504
8,153,原创,1486272504
8,154,彩妆,1486272504
8,103,YOUTUBE,1486272504
8,155,教程,1486272504
8,156,萌妹子,1486272504
8,157,化妆教程,1486272504
8,158,吃播,1486272504
8,58,美食视频,1486272504
8,159,美妆,1486272504
8,160,大胃王,1486272504
8,161,化妆,1486272504
8,162,大胃王密子君,1486272504
8,163,日常,1486272504
Evidently a girl who loves to eat. The tags LensKit recommends for her, with preference scores, are:
10408 (吱星人): 15.87
7425 (张副官): 14.94
1727 (大胃女王吃遍日本): 13.54
3388 (长征): 13.12
1133 (CHRISTINA AGUILERA): 13.07
681 (虎牙): 13.05
6605 (单兵口粮): 12.82
4345 (靖凰): 12.36
10361 (包花夫夫): 12.34
10343 (ESPAÑOL): 12.24

This matches reality fairly well. In large-sample settings, "denoising" becomes important. Here it means filtering out low-frequency tags while generating the data, for two reasons: first, to reduce the influence of meaningless results on the overall inference; second, data-processing speed. (Sometimes the latter matters more than the former: in my experiments on 25000 users, with only the frequency-1 filter above, generating the data files in Python took over 6 hours, while the LensKit recommendation itself finished in under a minute.) The example above only removed frequency-1 tags from 2000 users' data. Below, on the large sample of 25000 users, tags with frequency below 3 are filtered out (which shortens data generation to about 2 hours), and the recommender is run on the author's own bookmark tags from Webcollector + Spring + MVC 搭建应用初探(五)(Crab推荐系统实例). Because of the frequency filter, the bookmarked tags shrink to:

2683,40,TV动画,1486352133
2683,4541,堺雅人,1486352133
2683,802,BBC,1486352133
2683,122,治愈,1486352133
2683,52,剑网三,1486352133
2683,61,日剧,1486352133
2683,428,喜剧,1486352133

The recommender output is:

15821 (小林贤太郎): 12.65
26030 (松田优作): 11.86
25146 (睡前): 11.11
25147 (放松心情): 11.11
25148 (雨声): 11.11
26132 (LYINGMAN): 10.66
15817 (Richard Ayoade): 10.65
7943 (郑敬淏): 10.64
18097 (嘶吼): 10.30
10162 (翘课迟到): 10.14
1425 (伍迪艾伦): 10.11
18741 (羞辱2): 10.07
1188 (ONE OK ROCK): 9.93
24049 (琉辉LIUKI): 9.85
7979 (田英章): 9.81
5410 (鬼泣): 9.68
24497 (PELLEK): 9.64
953 (老梁): 9.61
19402 (丁日): 9.59
29704 (森次晃嗣): 9.54
29705 (七爷): 9.54
6557 (马特·波莫): 9.54
6562 (孔雀): 9.54
7226 (铃木このみ): 9.44
7227 (舞武器舞乱伎): 9.44
31226 (台球): 9.44
1550 (赵粤): 9.43
11144 (尹正): 9.40

This result can be compared with the one in Webcollector + Spring + MVC 搭建应用初探(五)(Crab 推荐系统实例), where recommendations were computed from the 10 users carrying the "历史" (history) tag; here the large sample is not filtered by category at all. The former result contained many broad category tags ("日剧", "手绘", etc.), whereas here no abstract category tags appear; they are replaced by concrete ones, such as film and TV related tags ("伍迪艾伦", "松田优作", etc.), and the former's recommended tags "铃木このみ" and "舞武器舞乱伎" appear near the tail. The latter result is clearly more precise and more comprehensive.

As for data-processing speed, a further optimization is to move the script's data structures (several lists) into Redis and use Redis sets for deduplication. This cuts the time for the frequency-3 filtering from about 2 hours to about 2 minutes; Python's speed really is something. The speedup has at least two sources: Redis data structures beat the corresponding Python structures, and the work is split across processes. On a dual-core CPU, for example, a default single-process Python script runs on one core, so its process occupies about 50% of the CPU (one core saturated); offloading part of the IO to redis-server adds roughly another 20% of CPU usage, and 70% of the CPU outperforms 50%. The revised data-processing code:

```python
#coding: utf-8
import csv
import redis
from time import time

start_time = time()
redis = redis.StrictRedis(**{"host": "127.0.0.1", "port": 6379})

videosDataKey = "videosData"
ratingsDataKey = "ratingsData"
tagsDataKey = "tagsData"
userSetKey = "userSet"
itemSetKey = "itemSet"
itemHashKey = "itemHash"

def generateDict(key):
    dictRequire = dict()
    for aidHash in redis.lrange(key, 0, -1):
        tagList = aidHash.split(",")[:-1]
        for tag in tagList:
            if dictRequire.get(tag):
                dictRequire[tag] += 1
            else:
                dictRequire[tag] = 1
    max_value = max(dictRequire.values())
    countDict = dict()
    for tag, count in dictRequire.items():
        if count >= filter_count:
            countDict[tag] = int(float(count) / max_value * 10)
    return countDict

def appendDataTest(key, item_index, user_index, countDict):
    user_ori_id = key.split(":")[-1]
    ori_user_id = user_index
    if redis.sadd(userSetKey, user_ori_id):
        user_index += 1
    for tag, count in countDict.items():
        if redis.sadd(itemSetKey, tag):
            ori_item_id = item_index
            item_index += 1
            # fixed: store the id actually written to videosData
            # (the original stored the already-incremented index)
            redis.hset(itemHashKey, tag, ori_item_id)
            redis.lpush(videosDataKey, (ori_item_id, tag.decode("utf-8"), tag.decode("utf-8")))
        else:
            ori_item_id = redis.hget(itemHashKey, tag)
        redis.lpush(ratingsDataKey, (ori_user_id, ori_item_id, count, str(int(time()))))
        redis.lpush(tagsDataKey, (ori_user_id, ori_item_id, tag.decode("utf-8"), str(int(time()))))
    return (item_index, user_index)

item_index = 1
user_index = 1
item_max_count = 30000
filter_count = 3
now_count = 1
fanTagsAidListFormat = "Aid:BiliBili:%s"
for key in redis.keys(fanTagsAidListFormat % "*")[:item_max_count]:
    countDict = generateDict(key)
    item_index, user_index = appendDataTest(key, item_index, user_index, countDict)
    if not bool(now_count % 100):
        print now_count
    now_count += 1

print "end :"
print time() - start_time

videosFile = file("videos.csv", "wb")
ratingsFile = file("ratings.csv", "wb")
tagsFile = file("tags.csv", "wb")
videosWriter = csv.writer(videosFile)
videosWriter.writerow(["videoId", "aid", "genres"])
ratingsWriter = csv.writer(ratingsFile)
ratingsWriter.writerow(["userId", "videoId", "rating", "timestamp"])
tagsWriter = csv.writer(tagsFile)
tagsWriter.writerow(["userId", "videoId", "tag", "timestamp"])

for Writer, DataKey, File in [[videosWriter, videosDataKey, videosFile],
                              [ratingsWriter, ratingsDataKey, ratingsFile],
                              [tagsWriter, tagsDataKey, tagsFile]]:
    for element in redis.lrange(DataKey, 0, -1):
        Writer.writerow(map(lambda x: x.encode("utf-8") if hasattr(x, "encode") else x, eval(element)))
    File.close()
```

With the basic use of the framework in hand, LensKit's internals can be explored. The results above use item-based collaborative filtering; for a comparison with the user-based variant see 推荐系统 Lenskit 初探(一). With the recommender's data available, the results can be displayed with simple front-end techniques (a plain Servlet/JSP/HTML5 stack). The code below runs the recommender when the Servlet starts (in init) and exchanges data through Redis. The Servlet class:

```java
package main.com.bilibili;

/**
 * Created by ehangzhou on 2017/2/18.
 */
import redis.clients.jedis.Jedis;

import java.io.IOException;
import java.util.*;

import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import main.Recommender.BiliBiliLenskit;

@WebServlet(
        urlPatterns = {"/recommend"}
)
public class RecommendPageServlet extends HttpServlet {
    private final String recommendLabelsHash = "Recommend:BiliBili";
    private Jedis jedis = new Jedis("127.0.0.1", 6379);
    private Map<String, String> recommendMap = new HashMap<>();

    @Override
    public void init() {
        String realPath = getServletContext().getRealPath("/");
        new BiliBiliLenskit(new String[]{"802"}, realPath).start();
    }

    @Override
    public void doGet(HttpServletRequest request, HttpServletResponse response) throws IOException, ServletException {
        recommendMap = jedis.hgetAll(recommendLabelsHash);
        request.setAttribute("recommendMap", recommendMap);
        RequestDispatcher rd = request.getRequestDispatcher("/recommendPage.jsp");
        rd.forward(request, response);
    }
}
```
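A quick local demonstration of why the Redis-set deduplication described above pays off: the original script decides whether a tag is new by scanning a list-like structure (`tag not in item_dict.values()`, O(n) per lookup), while Redis SADD answers a set-membership question in O(1). Plain Python 3 containers stand in for Redis here; this is a sketch of the complexity argument, not of the Redis client code:

```python
import timeit

n = 20000
tags = ["tag%d" % i for i in range(n)]
seen_list = list(tags)   # the O(n) scan, like `tag not in item_dict.values()`
seen_set = set(tags)     # the O(1) membership test, like Redis SADD / SISMEMBER

# Probe a tag near the end of the list, where the linear scan is slowest.
probe = "tag%d" % (n - 1)
list_time = timeit.timeit(lambda: probe in seen_list, number=200)
set_time = timeit.timeit(lambda: probe in seen_set, number=200)
print(set_time < list_time)
```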
The recommender class:

```java
package main.Recommender;

/**
 * Created by ehang on 2017/2/5.
 */
import com.google.common.base.Throwables;
import org.lenskit.LenskitConfiguration;
import org.lenskit.LenskitRecommender;
import org.lenskit.LenskitRecommenderEngine;
import org.lenskit.api.ItemRecommender;
import org.lenskit.api.Result;
import org.lenskit.api.ResultList;
import org.lenskit.config.ConfigHelpers;
import org.lenskit.data.dao.DataAccessObject;
import org.lenskit.data.dao.file.StaticDataSource;
import org.lenskit.data.entities.CommonAttributes;
import org.lenskit.data.entities.CommonTypes;
import org.lenskit.data.entities.Entity;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.HashMap;

import redis.clients.jedis.Jedis;

public class BiliBiliLenskit extends Thread {
    private static final Logger logger = LoggerFactory.getLogger(BiliBiliLenskit.class);
    private Jedis jedis = new Jedis("127.0.0.1", 6379);
    private final String recommendLabelsHash = "Recommend:BiliBili";
    private String realPath;
    private HashMap<String, String> scoreMap = new HashMap<>();
    private Path dataFile = null;
    private List<Long> users;

    public BiliBiliLenskit(String[] args, String realPath) {
        users = new ArrayList<>(args.length);
        for (String arg: args) {
            users.add(Long.parseLong(arg));
        }
        this.realPath = realPath;
        dataFile = Paths.get(realPath, "data/videolens.yml");
    }

    @Override
    public void run() {
        DataAccessObject dao;
        try {
            StaticDataSource data = StaticDataSource.load(dataFile);
            dao = data.get();
        } catch (IOException e) {
            logger.error("cannot load data", e);
            throw Throwables.propagate(e);
        }
        LenskitConfiguration config = null;
        try {
            //config = ConfigHelpers.load(new File("etc/item-item.groovy"));
            //config = ConfigHelpers.load(new File("etc/user-user.groovy"));
            config = ConfigHelpers.load(new File(realPath, "etc/user-user.groovy"));
        } catch (IOException e) {
            throw new RuntimeException("could not load configuration", e);
        }
        LenskitRecommenderEngine engine = LenskitRecommenderEngine.build(config, dao);
        logger.info("built recommender engine");
        try (LenskitRecommender rec = engine.createRecommender(dao)) {
            logger.info("obtained recommender from engine");
            ItemRecommender irec = rec.getItemRecommender();
            assert irec != null;
            for (long user: users) {
                ResultList recs = irec.recommendWithDetails(user, 10, null, null);
                System.out.format("Recommendations for user %d:\n", user);
                for (Result item: recs) {
                    Entity itemData = dao.lookupEntity(CommonTypes.ITEM, item.getId());
                    String name = null;
                    if (itemData != null) {
                        name = itemData.maybeGet(CommonAttributes.NAME);
                    }
                    System.out.format("\t%d (%s): %.2f\n", item.getId(), name, item.getScore());
                    scoreMap.put(name, String.valueOf(item.getScore()));
                }
            }
        }
        if (jedis.exists(recommendLabelsHash)) {
            jedis.del(recommendLabelsHash);
        }
        for (String category: scoreMap.keySet()) {
            jedis.hset(recommendLabelsHash, category, scoreMap.get(category));
        }
    }
}
```

The JSP file:

```jsp
<%@ page import="java.util.HashMap" %>
<%--
  Created by IntelliJ IDEA.
  User: ehangzhou
  Date: 2017/2/18
  Time: 17:37
  To change this template use File | Settings | File Templates.
--%>
<%@ page contentType="text/html;charset=UTF-8" language="java" isELIgnored="false" %>
<%@ taglib uri="http://java.sun.com/jsp/jstl/core" prefix="c" %>
<html>
<link rel="stylesheet" type="text/css" href="static/RecommendPageStyle.css" />
<meta http-equiv="refresh" content="0.5"/>
<head>
    <title>The RecommendPage of Application</title>
</head>
<script>
    function loadPage() {
        var canvasList = document.getElementsByTagName('canvas')
        for (var i = 0; i < canvasList.length; i++) {
            var id = canvasList.item(i).getAttribute("id");
            var width = canvasList.item(i).getAttribute("width");
            var height = canvasList.item(i).getAttribute("height");
            draw(id, width, height);
        }
    }
    function draw(id, width, height) {
        var canvas = document.getElementById(id);
        var context = canvas.getContext('2d');
        context.globalAlpha = 0.8;
        context.fillStyle = "white";
        context.fillRect(0, 0, width, height);
        for (var i = 1; i < 10; i++) {
            context.beginPath();
            context.arc(Math.random() * width, Math.random() * height,
                        i * 10 * Math.random(), 0, Math.PI * 2, true);
            context.closePath();
            context.fillStyle = "rgba(" + parseInt(255 * Math.random()).toString() + ","
                                        + parseInt(255 * Math.random()).toString() + ","
                                        + parseInt(255 * Math.random()).toString() + ","
                                        + "0.25)";
            context.fill();
        }
        var mult = width / 300;
        context.font = mult * 60 + "px Georgia";
        context.fillText(id, 10, 100);
    }
</script>
<body onLoad="loadPage();">
    <c:forEach items="${recommendMap}" var="Map" varStatus="status">
        <div id="recommend_area">
            <a href="http://search.bilibili.com/all?keyword=${Map.key}"><canvas id="${Map.key}" width="${300 * (Map.value - 9)}" height="${200 * (Map.value - 9)}"/></a>
            <br/>
        </div>
        <br/>
    </c:forEach>
</body>
</html>
```
Here the HTML5 canvas element is used to draw a rectangle for each recommended tag (the rectangle's size is proportional to the recommendation score), filled with random colors. For simplicity, each recommended tag links to the Bilibili search engine. The background logo is taken from 人类衰退之后 (Humanity Has Declined). The stylesheet:

```css
#recommend_area {
    float: left;
}
body {
    background: lightgoldenrodyellow url("/static/image/recommend.jpg") no-repeat;
    background-position: 0% 10%;
}
```

The rendered page looks like this: