Spark2.1.0+Hadoop2.7.3关联规则平台(三)

来源:互联网 发布:linux locale命令 编辑:程序博客网 时间:2024/05/23 18:31

第三部分主要是数据来源和web前端的开发

数据来源以及处理

   数据来源于云音乐的爬虫,我们在分析的时候比较适合用id而不是歌曲的名字,另外用户在输入的时候也只会输入歌名而不会知道歌曲的id,所以我们首先要建立一个数据库负责处理歌曲与id之间的转换。另外一部分就是要存储歌单,每个歌单为一行,只存储歌曲的id。   所以首先我们需要能爬取云音乐的歌单,然后一部分存储id到一个文件中,另一部分获取歌曲名和id存储到数据库。爬虫部分感谢知乎用户路人甲的分享,可以直接得到需要发送的post请求以及url,头文件等。感谢用户路人甲https://www.zhihu.com/question/41505181?sort=created。我们现在只需要对获取到的数据进行保存,即处理格式、连接到数据库就可以了。代码如下:
#-*-encoding=utf-8-*-#317781import requestsfrom bs4 import BeautifulSoupimport sysreload(sys)sys.setdefaultencoding( "utf-8" )import reimport MySQLdbdef insert_sql(num):    play_url_prefix = 'http://music.163.com/playlist?id='    url_list = num    conn= MySQLdb.connect(            host='localhost',            port = 3306,            user='root',            passwd='12345678aa',            db ='music',            charset='utf8'            )    cur = conn.cursor()    headers = {        'Referer':'http://music.163.com/',        'Host':'music.163.com',        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.3.0',        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'                }    play_url = play_url_prefix + str(url_list)    s = requests.session()    s = BeautifulSoup(s.get(play_url,headers = headers).content,'lxml')    # print s.decode('utf-8')    main = s.find('ul',{'class':'f-hide'})    # print main    #    for music in main.find_all('a'):        # a = ('{} : {}'.format(music.text, music['href']))        musicname = music.text.encode("utf8")        musicid = music['href']        # print a        # print type(a)        match = re.match(r'.*id=(\d+)',musicid)        if match:            ids = match.group(1)            sql = "insert ignore  music_copy values('"+ids+"','"+MySQLdb.escape_string(musicname)+"');"        aa = cur.execute(sql)    conn.commit()    cur.close()    conn.close()for i in range(0,100):    temp = 317881 + i    print i    insert_sql(temp)

上面是将歌曲保存到数据库中的代码,歌曲id也是随机选了连续的100个,新用户可能需要加载BeautifulSoup和requests两个包。可以直接用pip install安装。歌曲的数据库一共两个字段,一个id,一个name。

#-*-encoding=utf-8-*-import requestsfrom bs4 import BeautifulSoupimport sysreload(sys)sys.setdefaultencoding( "utf-8" )import ref = file('d:/testfile.txt','a+')headers = {            'Referer':'http://music.163.com/',            'Host':'music.163.com',            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Iceweasel/38.3.0',            'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'            }initialNum = 317781list_url_prefix = 'http://music.163.com/playlist?id='for i in range(0,100):    print i    list_url = list_url_prefix + str(initialNum)    s = requests.session()    s = BeautifulSoup(s.get(list_url, headers=headers).content, 'lxml')    main = s.find('ul', {'class': 'f-hide'})    line_flag = 0    for music in main.find_all('a'):        line_flag = 1        a = music['href']        match = re.match(r'.*id=(\d+)', a)        if match:            ids = match.group(1)            f.write(ids + ',')    if(line_flag == 1):        f.write('\n')    initialNum+=1f.close()

这是写入文件的代码,每个歌单为一行,共存储了100个歌单。若歌单为空那么就不换行,避免文件中出现太多的空行。
然后将文本文件给spark处理,处理完成以后输出文件我们也要读取存储到数据库中。这里写图片描述
如图所示为得到的结果,我们需要存储三个部分,前面是关联基础,也就是我们后面查询到匹配输入,后面是需要返回去的结果。所有三个字段都在一个小括号中。所以我们可以先用正则匹配(),用split(‘,]’) 分开三个字段,然后将前面两个字段用lstrip(‘[‘)去除左边的中括号。然后导入数据库。

#-*-encoding=utf-8-*-import reimport MySQLdbconn = MySQLdb.connect(        host='localhost',        port=3306,        user='root',        passwd='12345678aa',        db='music',        charset='utf8'    )cur = conn.cursor()f = file('f:/result/result6').read()pattern = re.compile(r'\((.*?)\)')match = pattern.findall(f)# print matchfor each in match:    list = each.split('],')    if(len(list)==3):        temp0 = list[0].lstrip('[')        temp1 = list[1].lstrip('[')        temp2 = list[2]        sql = "insert into  result_copy1 values('" + MySQLdb.escape_string(temp0) + "','" + MySQLdb.escape_string(temp1) + "','"+MySQLdb.escape_string(temp2)+"');"        aa = cur.execute(sql)    conn.commit()cur.close()conn.close()

这步处理完了以后,我们会得到可以直接使用的数据库,后面只需要做一个web页面连接到就好了。

web页面

使用myeclipse开发。页面只需要两个,一个是输入页面,一个是输出页面。页面只用一个表单就行。

<form action="QueryServlet" method="post">    <center><input type="text" size="30" name="music" value=""></center><br>    <center><input type=submit value="查询"></center>    </form>

QueryServlet用来处理post请求,要将用户输入的中文歌单转为id然后排序后进入数据库查询,最后输出结果。
QueryServlet的post方法如下

protected void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {        ServletContext context = getServletContext();        request.setCharacterEncoding("utf-8");        response.setCharacterEncoding("utf-8");        System.out.println(request.getParameter("music"));        String list = TransNameID.transNameToId(request.getParameter("music"));        System.out.println(list);        Query query = new Query();        String sql = "select music from result_copy1 where related="+"\"" +list+ "\";"; //      System.out.println(sql);        ArrayList<String> result = TransNameID.transIdToName(query.search(sql));        request.getSession(false).setAttribute("result", result);        context.getRequestDispatcher("/result.jsp").forward(request, response);    }

处理数据库的DBHelper

package myutil;import java.sql.Connection;import java.sql.DriverManager;public class DBHelper {    private static final String driver="com.mysql.jdbc.Driver";    private static final String url="jdbc:mysql://localhost:3306/music?useUnicode=true&characterEncoding=utf-8";    private static final String username="root";    private static final String password="12345678aa";    private static Connection conn = null;    static    {        try {            Class.forName(driver);        } catch (Exception e) {            // TODO: handle exception        }    }    public static Connection getConnection() throws Exception    {        if(conn==null)        {            conn = DriverManager.getConnection(url,username,password);            return conn;        }        return conn;    }    public static void main(String[] args) {        try{            Connection conn = DBHelper.getConnection();            if(conn!=null)                System.out.println("成功");            else                System.out.println("失败");            }        catch(Exception e)        {            e.printStackTrace();        }    }}

负责查询的Query类

package myutil;import myutil.DBHelper;import java.sql.Connection;import java.sql.PreparedStatement;import java.sql.ResultSet;import java.util.ArrayList;import java.util.Iterator;public class Query {    private String sql = "";    public Query(){    }    public Query(String  s){        sql  = s;    }    public String getSql() {        return sql;    }    public void setSql(String sql) {        this.sql = sql;    }    public ArrayList<String> search(String sql){        String str = "";        Connection conn = null;        PreparedStatement stmt = null;        ResultSet rs = null;        ArrayList<String> arr = new ArrayList<String>();        try {            conn = DBHelper.getConnection();            stmt = conn.prepareStatement(sql);            rs = stmt.executeQuery();            while(rs.next()){                str = rs.getString("music");                arr.add(str);                }            }        catch (Exception e) {            e.printStackTrace();        }        finally        {            if(rs!=null)                try {                    rs.close();                    rs = null;                } catch (Exception ex) {                    ex.printStackTrace();                }            }            // 释放语句对象            if (stmt != null) {                try {                    stmt.close();                    stmt = null;                } catch (Exception ex) {                    ex.printStackTrace();                }        }        System.out.println("这是在Query的serach");        for(String s : arr)            System.out.println(s);        return arr;    }        public static ArrayList<Integer> trans(String s){            String[] str = s.split(",");            Connection conn = null;            PreparedStatement stmt = null;            ResultSet rs = null;            ArrayList<Integer> arr = new ArrayList<Integer>();            String sql = "";            for(int i=0;i<str.length;i++){                try {                    conn = DBHelper.getConnection();                    sql = "select id from music_copy where name="+"\""+str[i]+"\";";                    System.out.println(sql);                    stmt = conn.prepareStatement(sql);                    rs = stmt.executeQuery();                    int n = 0;                    if(rs.next()){                        n = rs.getInt("id");                        System.out.println(n);                        arr.add(n);                        }                    }                catch (Exception e) {                    e.printStackTrace();                }            }            if(rs!=null)                try {                    rs.close();                    rs = null;                } catch (Exception ex) {                    ex.printStackTrace();                }                // 释放语句对象            if (stmt != null) {                try {                    stmt.close();                    stmt = null;                } catch (Exception ex) {                    ex.printStackTrace();                }            }            return arr;    }    public static ArrayList<String> trans2(ArrayList<String> s){        Connection conn = null;        PreparedStatement stmt = null;        ResultSet rs = null;        ArrayList<String> ret = new ArrayList<String>();        Iterator<String> iter = s.iterator();        String sql ="";        while(iter.hasNext()){            try {                conn = DBHelper.getConnection();                sql = "select name from music_copy where id="+Integer.parseInt(iter.next())+";";                System.out.println(sql);                stmt = conn.prepareStatement(sql);                rs = stmt.executeQuery();                String n = "";                while(rs.next()){                    n = rs.getString("name");                    System.out.println(n);                    ret.add(n);                    }                }            catch (Exception e) {                e.printStackTrace();            }            if(rs!=null)                try {                    rs.close();                    rs = null;                } catch (Exception ex) {                    ex.printStackTrace();                }                // 释放语句对象            if (stmt != null) {                try {                    stmt.close();                    stmt = null;                } catch (Exception ex) {                    ex.printStackTrace();                }            }        }                               return ret;    }    public static void main(String[] args) {        ArrayList<String> a = new ArrayList<String>();        ArrayList<String> b = new ArrayList<String>();        a.add("59970");        a.add("60238");        b = trans2(a);        Iterator<String> it = b.iterator();        while(it.hasNext()){            System.out.println(it.next());        }    }}

负责转换歌名id

package myutil;import java.util.ArrayList;import java.util.Iterator;import myutil.Query;import java.util.Collections;public class TransNameID {    public static String transNameToId(String s){        String ret = "";        ArrayList<Integer> arr = Query.trans(s);        Collections.sort(arr);        int i = 0;        System.out.println("这是在transNameToId中");        for(Integer in : arr)            System.out.println(in);        System.out.println("标记");        for(i=0;i<(arr.size()-1);i++){            ret = ret + Integer.toString(arr.get(i))+", ";        }        ret = ret +  Integer.toString(arr.get(i));        return ret;    }    public static ArrayList<String> transIdToName(ArrayList<String> arr){        ArrayList<String> ret = Query.trans2(arr);        return ret;    }    public static void main(String[] args) {//      String a = "桂花香,坚持到底";//      String b = transNameToId(a);//      //      System.out.println(b);        ArrayList<String> a = new ArrayList<String>();        ArrayList<String> b = new ArrayList<String>();        a.add("59970");        a.add("60238");        b = transIdToName(a);        Iterator<String> it = b.iterator();        while(it.hasNext()){            System.out.println(it.next());        }    }}

结果的页面

<%    ArrayList<String> arr = (ArrayList<String>)session.getAttribute("result");    for(String str : arr){ %>    <div><%=str %></div> <%  }  %>

到此全部结果完成。

0 0
原创粉丝点击