Spark2.1.0+Hadoop2.7.3关联规则平台(三)
来源:互联网 发布:linux locale命令 编辑:程序博客网 时间:2024/05/23 18:31
第三部分主要是数据来源和web前端的开发
数据来源以及处理
数据来源于云音乐的爬虫,我们在分析的时候比较适合用id而不是歌曲的名字,另外用户在输入的时候也只会输入歌名而不会知道歌曲的id,所以我们首先要建立一个数据库负责处理歌曲与id之间的转换。另外一部分就是要存储歌单,每个歌单为一行,只存储歌曲的id。 所以首先我们需要能爬取云音乐的歌单,然后一部分存储id到一个文件中,另一部分获取歌曲名和id存储到数据库。爬虫部分感谢知乎用户路人甲的分享,可以直接得到需要发送的post请求以及url,头文件等。感谢用户路人甲https://www.zhihu.com/question/41505181?sort=created。我们现在只需要对获取到的数据进行保存,即处理格式、连接到数据库就可以了。代码如下:
#-*-encoding=utf-8-*-#317781import requestsfrom bs4 import BeautifulSoupimport sysreload(sys)sys.setdefaultencoding( "utf-8" )import reimport MySQLdbdef insert_sql(num): play_url_prefix = 'http://music.163.com/playlist?id=' url_list = num conn= MySQLdb.connect( host='localhost', port = 3306, user='root', passwd='12345678aa', db ='music', charset='utf8' ) cur = conn.cursor() headers = { 'Referer':'http://music.163.com/', 'Host':'music.163.com', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.3.0', 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' } play_url = play_url_prefix + str(url_list) s = requests.session() s = BeautifulSoup(s.get(play_url,headers = headers).content,'lxml') # print s.decode('utf-8') main = s.find('ul',{'class':'f-hide'}) # print main # for music in main.find_all('a'): # a = ('{} : {}'.format(music.text, music['href'])) musicname = music.text.encode("utf8") musicid = music['href'] # print a # print type(a) match = re.match(r'.*id=(\d+)',musicid) if match: ids = match.group(1) sql = "insert ignore music_copy values('"+ids+"','"+MySQLdb.escape_string(musicname)+"');" aa = cur.execute(sql) conn.commit() cur.close() conn.close()for i in range(0,100): temp = 317881 + i print i insert_sql(temp)
上面是将歌曲保存到数据库中的代码,歌曲id也是随机选了连续的100个,新用户可能需要加载BeautifulSoup和requests两个包。可以直接用pip install安装。歌曲的数据库一共两个字段,一个id,一个name。
#-*-encoding=utf-8-*-import requestsfrom bs4 import BeautifulSoupimport sysreload(sys)sys.setdefaultencoding( "utf-8" )import ref = file('d:/testfile.txt','a+')headers = { 'Referer':'http://music.163.com/', 'Host':'music.163.com', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.3.0', 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' }initialNum = 317781list_url_prefix = 'http://music.163.com/playlist?id='for i in range(0,100): print i list_url = list_url_prefix + str(initialNum) s = requests.session() s = BeautifulSoup(s.get(list_url, headers=headers).content, 'lxml') main = s.find('ul', {'class': 'f-hide'}) line_flag = 0 for music in main.find_all('a'): line_flag = 1 a = music['href'] match = re.match(r'.*id=(\d+)', a) if match: ids = match.group(1) f.write(ids + ',') if(line_flag == 1): f.write('\n') initialNum+=1f.close()
这是写入文件的代码,每个歌单为一行,共存储了100个歌单。若歌单为空那么就不换行,避免文件中出现太多的空行。
然后将文本文件给spark处理,处理完成以后输出文件我们也要读取存储到数据库中。
如图所示为得到的结果,我们需要存储三个部分,前面是关联基础,也就是我们后面查询到匹配输入,后面是需要返回去的结果。所有三个字段都在一个小括号中。所以我们可以先用正则匹配(),用split(‘,]’) 分开三个字段,然后将前面两个字段用lstrip(‘[‘)去除左边的中括号。然后导入数据库。
#-*-encoding=utf-8-*-import reimport MySQLdbconn = MySQLdb.connect( host='localhost', port=3306, user='root', passwd='12345678aa', db='music', charset='utf8' )cur = conn.cursor()f = file('f:/result/result6').read()pattern = re.compile(r'\((.*?)\)')match = pattern.findall(f)# print matchfor each in match: list = each.split('],') if(len(list)==3): temp0 = list[0].lstrip('[') temp1 = list[1].lstrip('[') temp2 = list[2] sql = "insert into result_copy1 values('" + MySQLdb.escape_string(temp0) + "','" + MySQLdb.escape_string(temp1) + "','"+MySQLdb.escape_string(temp2)+"');" aa = cur.execute(sql) conn.commit()cur.close()conn.close()
这步处理完了以后,我们会得到可以直接使用的数据库,后面只需要做一个web页面连接到就好了。
web页面
使用myeclipse开发。页面只需要两个,一个是输入页面,一个是输出页面。页面只用一个表单就行。
<form action="QueryServlet" method="post"> <center><input type="text" size="30" name="music" value=""></center><br> <center><input type=submit value="查询"></center> </form>
QueryServlet用来处理post请求,要将用户输入的中文歌单转为id然后排序后进入数据库查询,最后输出结果。
QueryServlet的post方法如下
protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException { ServletContext context = getServletContext(); request.setCharacterEncoding("utf-8"); response.setCharacterEncoding("utf-8"); System.out.println(request.getParameter("music")); String list = TransNameID.transNameToId(request.getParameter("music")); System.out.println(list); Query query = new Query(); String sql = "select music from result_copy1 where related="+"\"" +list+ "\";"; // System.out.println(sql); ArrayList<String> result = TransNameID.transIdToName(query.search(sql)); request.getSession(false).setAttribute("result", result); context.getRequestDispatcher("/result.jsp").forward(request, response); }
处理数据库的DBHelper
package myutil;import java.sql.Connection;import java.sql.DriverManager;public class DBHelper { private static final String driver="com.mysql.jdbc.Driver"; private static final String url="jdbc:mysql://localhost:3306/music?useUnicode=true&characterEncoding=utf-8"; private static final String username="root"; private static final String password="12345678aa"; private static Connection conn = null; static { try { Class.forName(driver); } catch (Exception e) { // TODO: handle exception } } public static Connection getConnection() throws Exception { if(conn==null) { conn = DriverManager.getConnection(url,username,password); return conn; } return conn; } public static void main(String[] args) { try{ Connection conn = DBHelper.getConnection(); if(conn!=null) System.out.println("成功"); else System.out.println("失败"); } catch(Exception e) { e.printStackTrace(); } }}
负责查询的Query类
package myutil;import myutil.DBHelper;import java.sql.Connection;import java.sql.PreparedStatement;import java.sql.ResultSet;import java.util.ArrayList;import java.util.Iterator;public class Query { private String sql = ""; public Query(){ } public Query(String s){ sql = s; } public String getSql() { return sql; } public void setSql(String sql) { this.sql = sql; } public ArrayList<String> search(String sql){ String str = ""; Connection conn = null; PreparedStatement stmt = null; ResultSet rs = null; ArrayList<String> arr = new ArrayList<String>(); try { conn = DBHelper.getConnection(); stmt = conn.prepareStatement(sql); rs = stmt.executeQuery(); while(rs.next()){ str = rs.getString("music"); arr.add(str); } } catch (Exception e) { e.printStackTrace(); } finally { if(rs!=null) try { rs.close(); rs = null; } catch (Exception ex) { ex.printStackTrace(); } } // 释放语句对象 if (stmt != null) { try { stmt.close(); stmt = null; } catch (Exception ex) { ex.printStackTrace(); } } System.out.println("这是在Query的serach"); for(String s : arr) System.out.println(s); return arr; } public static ArrayList<Integer> trans(String s){ String[] str = s.split(","); Connection conn = null; PreparedStatement stmt = null; ResultSet rs = null; ArrayList<Integer> arr = new ArrayList<Integer>(); String sql = ""; for(int i=0;i<str.length;i++){ try { conn = DBHelper.getConnection(); sql = "select id from music_copy where name="+"\""+str[i]+"\";"; System.out.println(sql); stmt = conn.prepareStatement(sql); rs = stmt.executeQuery(); int n = 0; if(rs.next()){ n = rs.getInt("id"); System.out.println(n); arr.add(n); } } catch (Exception e) { e.printStackTrace(); } } if(rs!=null) try { rs.close(); rs = null; } catch (Exception ex) { ex.printStackTrace(); } // 释放语句对象 if (stmt != null) { try { stmt.close(); stmt = null; } catch (Exception ex) { ex.printStackTrace(); } } return arr; } public static ArrayList<String> trans2(ArrayList<String> s){ Connection conn = null; PreparedStatement stmt = null; ResultSet rs = null; ArrayList<String> ret = new ArrayList<String>(); Iterator<String> iter = s.iterator(); String sql =""; while(iter.hasNext()){ try { conn = DBHelper.getConnection(); sql = "select name from music_copy where id="+Integer.parseInt(iter.next())+";"; System.out.println(sql); stmt = conn.prepareStatement(sql); rs = stmt.executeQuery(); String n = ""; while(rs.next()){ n = rs.getString("name"); System.out.println(n); ret.add(n); } } catch (Exception e) { e.printStackTrace(); } if(rs!=null) try { rs.close(); rs = null; } catch (Exception ex) { ex.printStackTrace(); } // 释放语句对象 if (stmt != null) { try { stmt.close(); stmt = null; } catch (Exception ex) { ex.printStackTrace(); } } } return ret; } public static void main(String[] args) { ArrayList<String> a = new ArrayList<String>(); ArrayList<String> b = new ArrayList<String>(); a.add("59970"); a.add("60238"); b = trans2(a); Iterator<String> it = b.iterator(); while(it.hasNext()){ System.out.println(it.next()); } }}
负责转换歌名id
package myutil;import java.util.ArrayList;import java.util.Iterator;import myutil.Query;import java.util.Collections;public class TransNameID { public static String transNameToId(String s){ String ret = ""; ArrayList<Integer> arr = Query.trans(s); Collections.sort(arr); int i = 0; System.out.println("这是在transNameToId中"); for(Integer in : arr) System.out.println(in); System.out.println("标记"); for(i=0;i<(arr.size()-1);i++){ ret = ret + Integer.toString(arr.get(i))+", "; } ret = ret + Integer.toString(arr.get(i)); return ret; } public static ArrayList<String> transIdToName(ArrayList<String> arr){ ArrayList<String> ret = Query.trans2(arr); return ret; } public static void main(String[] args) {// String a = "桂花香,坚持到底";// String b = transNameToId(a);// // System.out.println(b); ArrayList<String> a = new ArrayList<String>(); ArrayList<String> b = new ArrayList<String>(); a.add("59970"); a.add("60238"); b = transIdToName(a); Iterator<String> it = b.iterator(); while(it.hasNext()){ System.out.println(it.next()); } }}
结果的页面
<% ArrayList<String> arr = (ArrayList<String>)session.getAttribute("result"); for(String str : arr){ %> <div><%=str %></div> <% } %>
到此全部结果完成。
0 0
- Spark2.1.0+Hadoop2.7.3关联规则平台(三)
- Spark2.1.0+Hadoop2.7.3关联规则平台(二)
- Spark2.1.0+Hadoop2.7.3搭建关联规则分析算法(一)
- Hadoop2.7.3+Spark2.1.0 完全分布式环境
- Hadoop2.7.3+Spark2.1.0 完全分布式环境 搭建全过程
- Hadoop2.7.3+Spark2.1.0完全分布式集群搭建过程
- Hadoop2.7.3+Spark2.1.0完全分布式集群搭建过程
- Hadoop2.7.3和Spark2.1.0集群详细搭建教程
- spark2.1.0完全分布式集群搭建-hadoop2.7.3
- Hadoop2.7.3+Spark2.1.0 完全分布式环境 搭建全过程
- Hadoop2.7.3+Spark2.1.0完全分布式集群搭建过程
- hadoop2.7.3下spark2.1.0安装_yarn作业提交
- Hadoop2.7.3+Spark2.1.0 完全分布式环境 搭建全过程
- Hadoop2.7.3+Spark2.1.0 完全分布式环境 搭建全过程
- Hadoop2.7.3+Spark2.1.0 完全分布式环境 搭建全过程
- # hadoop2.7.3-spark2.0.2集群部署(备忘)
- 虚拟机安装spark2.2+hadoop2.7.3
- CentOS7下 Hadoop2.7.3+Spark2.1.0 集群环境搭建(1NN+2DN)
- Leetcode 506(Java)
- JAVA-13-IO流之字符流;设计模式讲解
- 深度学习这么火,但要小心不要被它忽悠了
- Android app接入百度地图:禁用手势、模式切换、根据经纬度获取位置信息
- CSS3中的动画功能
- Spark2.1.0+Hadoop2.7.3关联规则平台(三)
- webpack + vue 笔记
- Result Maps collection does not contain value for java.lang.Long
- Hibernate第十二篇【二级缓存介绍、缓存策略、查询缓存、集合缓存】
- 简约版.vimrc
- 浅谈CSRF攻击方式
- PHPStorm2016.3永久激活方法之本地破解激活
- 二叉搜索树转换为双向链表
- 架构师之DNS实战