A brief summary of cleaning up the CBS Sports NBA live game data


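I forget exactly which day it was a few months ago, but I stumbled on the fact that the shot-location data in CBS's live game coverage can be read straight out of the HTML file. I was thrilled at the time, since I had not yet figured out how to handle ESPN's XML data.

Looking back now at the CBS shot data I processed, I have to say it is of limited value.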


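The processed files include a table mapping CBSplayerID to player name, shotdata for the eight seasons from 2003-04 through 2010-11, and a table explaining the shotType codes.

The CBS shot totals differ noticeably from the official season statistics, often by several hundred to a thousand shots in either direction. As a proportion, coverage is still above 98%, which is fairly good. At the single-game level, though, the shotdata has inaccurate timestamps and wrongly attributed shooters (judged mainly against the play-by-play timelines on the official NBA site and ESPN). Compared with the ESPN XML shot data I obtained later, that is a clear weakness.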


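One thing worth mentioning, though: the shot-type descriptions from CBS and the official NBA site are quite rich, while ESPN's categories are somewhat coarser.

I once came across an unusual putback dunk.


What I was originally curious about was whether the play counted as an assisted alley-oop or as a putback dunk off a missed shot. Unexpectedly, I found that only CBS described it as a dunk, while ESPN and the official NBA site recorded it as a layup.

So there are presumably other shots whose descriptions disagree across sources, though probably only a few. Given that the timelines do not match either, reconciling them would be fairly tedious, and I have not dealt with this yet.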


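A brief record of the basic crawling and processing steps:

1. For each of the eight seasons from 2003-04 to 2010-11, save the scoreboard page of one arbitrary day and extract every game day of that season from it.

For example: http://www.cbssports.com/nba/scoreboard/20110101.

The main step is to match "<a href=\"/nba/scoreboard/" in the page (written here as a Java string literal) and add the 8-digit string that follows it to the set of game days.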
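The matching described in step 1 can be sketched roughly as follows. This is only an illustration, not the author's code: the class name `ScoreboardDates` and the method `extractGameDays` are hypothetical, and it assumes the page has already been read into a string.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
final class ScoreboardDates {

    // The link prefix quoted in the text, followed by exactly eight digits.
    private static final Pattern DATE_LINK =
            Pattern.compile("<a href=\"/nba/scoreboard/([0-9]{8})");

    /** Returns the sorted, de-duplicated game days linked from one scoreboard page. */
    static Set<String> extractGameDays(String html) {
        Set<String> days = new TreeSet<>();
        Matcher m = DATE_LINK.matcher(html);
        while (m.find()) {
            days.add(m.group(1)); // group 1 is the 8-digit date
        }
        return days;
    }
}
```

Calling `extractGameDays` once per saved season page and merging the resulting sets would give the full seed list for step 2.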


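2. Use all the game-day links as seeds and configure a Heritrix job to fetch the shotchart page of every game.

The main step is to match "NBA_[0-9]+_[A-Z]*@[A-Z]*" and add the resulting links to the crawl queue.

In principle there is no need to write a simple Extractor subclass yourself. Without one, however, you have to configure link-filtering rules in the job, and the default link-extraction modules pull out many useless links that then have to be evaluated, so the crawl takes somewhat longer.

Alternatively, you can first fetch the list of game days with a download tool, extract each game's feature string with a regular expression (this requires some programming), and then fetch the shotchart pages from the resulting links. The downloading itself is easy with Xunlei (Thunder), and each file is named after the game's feature string.

For example: http://www.cbssports.com/nba/gametracker/shotchart/NBA_20110101_CLE@CHI is saved under the file name "NBA_20110101_CLE@CHI".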
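For the download-tool route, the regular-expression step might look something like the sketch below. It reuses the pattern and the URL prefix quoted in the text; the class `ShotchartLinks` and its method names are hypothetical, and the scoreboard HTML is assumed to be available as a string.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original post.
final class ShotchartLinks {

    // Per-game feature string, e.g. NBA_20110101_CLE@CHI
    private static final Pattern GAME_KEY =
            Pattern.compile("NBA_[0-9]+_[A-Z]*@[A-Z]*");

    private static final String SHOTCHART =
            "http://www.cbssports.com/nba/gametracker/shotchart/";

    /** Builds one shotchart URL per distinct game found on a scoreboard page. */
    static Set<String> shotchartUrls(String scoreboardHtml) {
        Set<String> urls = new LinkedHashSet<>(); // keeps page order, drops repeats
        Matcher m = GAME_KEY.matcher(scoreboardHtml);
        while (m.find()) {
            urls.add(SHOTCHART + m.group());
        }
        return urls;
    }

    /** The saved file name is simply the last path segment of the URL. */
    static String fileNameOf(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }
}
```

The resulting URL list can then be handed to any downloader; `fileNameOf` shows why the saved files end up named after the feature string.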

Still, I chose to do it in code:

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.httpclient.URIException;
import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Extractor;
import org.archive.crawler.extractor.Link;
import org.archive.io.ReplayCharSequence;
import org.archive.util.HttpRecorder;

public class CBSScoreboardExtractor extends Extractor {

    private static final long serialVersionUID = 5855731422080471017L;

    private static Logger logger =
            Logger.getLogger(CBSScoreboardExtractor.class.getName());

    public CBSScoreboardExtractor(String name) {
        this(name, "CBSSport Scoreboard Extractor");
    }

    public CBSScoreboardExtractor(String name, String description) {
        super(name, description);
    }

    // Pattern for the per-game feature string on a CBS scoreboard page
    private static final String CBS_FEATURE = "NBA_[0-9]+_[A-Z]*@[A-Z]*";
    private static final String SHOTCHART = "http://www.cbssports.com/nba/gametracker/shotchart/";

    protected void extract(CrawlURI curi) {
        // Get the response body of the current URI so its content can be analyzed
        ReplayCharSequence cs = null;
        try {
            HttpRecorder hr = curi.getHttpRecorder();
            if (hr == null) {
                throw new IOException("Why is recorder null here?");
            }
            cs = hr.getReplayCharSequence();
        } catch (IOException e) {
            curi.addLocalizedError(this.getName(), e,
                    "Failed get of replay char sequence " + curi.toString()
                            + " " + e.getMessage());
            logger.log(Level.SEVERE, "Failed get of replay char sequence in "
                    + Thread.currentThread().getName(), e);
        }
        if (cs == null) {
            return;
        }

        // Convert the response body to a string
        String content = cs.toString();

        try {
            // Run the regular expression over the content
            // and pull out the game feature strings
            Pattern pattern = Pattern.compile(CBS_FEATURE);
            Matcher matcher = pattern.matcher(content);
            // For every match found, queue the corresponding shotchart link
            while (matcher.find()) {
                int start = matcher.start();
                int end = matcher.end();
                String aShotchartLink = SHOTCHART + content.substring(start, end);
                addLinkFromString(curi, aShotchartLink, "", Link.NAVLINK_HOP);
            }
            curi.linkExtractorFinished();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Record the link so that later processors can schedule it
    private void addLinkFromString(CrawlURI curi, String uri,
            CharSequence context, char hopType) {
        try {
            curi.createAndAddLinkRelativeToBase(uri, context.toString(),
                    hopType);
        } catch (URIException e) {
            if (getController() != null) {
                getController().logUriError(e, curi.getUURI(), uri);
            } else {
                logger.info("Failed createAndAddLinkRelativeToBase "
                        + curi + ", " + uri + ", " + context + ", "
                        + hopType + ": " + e);
            }
        }
    }
}

In total this fetched shotchart data for more than 10,000 games.


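3. Manually gather each season's games into one folder and remove the All-Star games and the postponed games. About ten games were not fetched because of a bad link on one of the pages, so I saved those pages by hand.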


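4. From each shotchart page, extract the player information (CBSplayerID and player name) and the shot information, and write them to text files by season.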

package CBS;

import java.io.*;
import java.util.Iterator;
import java.util.TreeSet;

/*
 * The full shot data of each season from 2003 to 2011 is saved as one text file.
 * 20031028-20040615 1189 + 82
 * 20041102-20050623 1230 + 84
 * 20051101-20060620 1230 + 89
 * 20061031-20070614 1230 + 79
 * 20071030-20080617 1230 + 86
 * 20081028-20090618 1230 + 85
 * 20091027-20100617 1230 + 82
 * 20101026-20110612 1230 + 81
 * Damon Jones & Dwayne Jones 2007-08 Cavaliers
 * James Jones & Jumaine Jones 2006-07 Suns
 *
 * rescheduled game
 *
 * The source data contains erroneous player information:
 * same player with different IDs (Awvee Scorey); same ID with different names (e.g. Yao Ming, Ming Yao)
 */
public class CBSShotchartParser {

    public static void main(String[] args) throws Exception {
        File directory = new File("E:\\NBA\\data\\2003-2011CBSshotchart\\06-07\\");
        String[] shotcharts = directory.list();
        //FileWriter fr0304 = new FileWriter("E:\\2003-04shotdata.txt");
        //FileWriter fr0405 = new FileWriter("E:\\2004-05shotdata.txt");
        //FileWriter fr0506 = new FileWriter("E:\\2005-06shotdata.txt");
        FileWriter fr0607 = new FileWriter("E:\\2006-07shotdata.txt");
        //FileWriter fr0708 = new FileWriter("E:\\2007-08shotdata.txt");
        //FileWriter fr0809 = new FileWriter("E:\\2008-09shotdata.txt");
        //FileWriter fr0910 = new FileWriter("E:\\2009-10shotdata.txt");
        //FileWriter fr1011 = new FileWriter("E:\\2010-11shotdata.txt");
        // Rescheduled games, or games whose shot data is empty
        FileWriter frReschGames = new FileWriter("E:\\rescheduledGames.txt");
        // Pages whose player names contain an unusual space character
        FileWriter frSpecialName = new FileWriter("E:\\SpecialName.txt");
        TreeSet<CBSplayerInfo> playerInfoSet = new TreeSet<CBSplayerInfo>();
        //FileWriter frID = new FileWriter("E:\\CBSplayerInfo.txt");

        for (int i = 0; i < shotcharts.length; i++) {
            String pageFile = "E:\\NBA\\data\\2003-2011CBSshotchart\\06-07\\" + shotcharts[i];
            // "NBA_20110101_CLE@CHI" -> "20110101CLECHI"
            String gameKey = shotcharts[i].substring(4).replaceAll("_|@", "");
            String pageContent = "";
            BufferedReader br = new BufferedReader(new FileReader(pageFile));
            String aLine = br.readLine();
            while (aLine != null) {
                pageContent = pageContent + aLine;
                aLine = br.readLine();
            }
            br.close();

            // The shot data is the first quoted string after "currentShotData = new String"
            int cur = pageContent.indexOf("currentShotData = new String");
            int lcur = pageContent.indexOf("\"", cur);
            int rcur = pageContent.indexOf("\"", lcur + 1);
            String rawShotdata = pageContent.substring(lcur + 1, rcur);
            if (rawShotdata.equals("")) {
                // Possibly a rescheduled game (shot data is empty)
                frReschGames.append(shotcharts[i] + "\r\n");
                continue;
            }
            // Shots are separated by "~"; put each on its own line, prefixed with the game key
            String shotData = gameKey + "," + rawShotdata.replaceAll("~", "\r\n" + gameKey + ",");

            // Player index (keep only CBSplayerId, first name, last name)
            // e.g. from (240304:Tony Parker,9,PG,8-20,1-3,0-0,17|) keep (240304,Tony,Parker)
            cur = pageContent.indexOf("playerDataHomeString = new String", rcur);
            lcur = pageContent.indexOf("\"", cur);
            rcur = pageContent.indexOf("\"", lcur + 1);
            String homePlayers = pageContent.substring(lcur + 1, rcur);
            cur = pageContent.indexOf("playerDataAwayString = new String", rcur);
            lcur = pageContent.indexOf("\"", cur);
            rcur = pageContent.indexOf("\"", lcur + 1);
            String awayPlayers = pageContent.substring(lcur + 1, rcur);
            String players = homePlayers + "|" + awayPlayers;

            for (int j = 0; j < players.length(); j++) {
                CBSplayerInfo aPlayer = new CBSplayerInfo();
                int cur1 = players.indexOf(":", j);
                aPlayer.id = players.substring(j, cur1);
                // First and last name are normally separated by the 6-character
                // entity "&nbsp;" (hence SPACE_LEN = 6); the literal was restored
                // here on that basis.
                int cur2 = players.indexOf("&nbsp;", cur1);
                // Special cases: in 20071103DALSAC the separator is a plain " ";
                // in 20071211INDCLE it is mojibake caused by the character set
                // (recorded for now, not handled), so cur2 is -1.
                int SPACE_LEN = 6;
                if (cur2 == -1) {
                    frSpecialName.append(shotcharts[i] + "\r\n");
                    break;
                    //cur2 = players.indexOf(" ",cur1);
                    //SPACE_LEN = 1;
                }
                aPlayer.firstName = players.substring(cur1 + 1, cur2);
                int cur3 = players.indexOf(",", cur2);
                aPlayer.lastName = players.substring(cur2 + SPACE_LEN, cur3);
                playerInfoSet.add(aPlayer); // add to the player ID index
                j = players.indexOf("|", cur3);
                if (j == -1) break;
            }

            // Save the shotchart data to the file of the matching season
            if (gameKey.compareTo("200407") < 0) {
                //fr0304.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200507") < 0) {
                //fr0304.close();
                //fr0405.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200607") < 0) {
                //fr0405.close();
                //fr0506.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200707") < 0) {
                //fr0506.close();
                fr0607.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200807") < 0) {
                fr0607.close();
                //fr0708.append(shotData + "\r\n");
            } else if (gameKey.compareTo("200907") < 0) {
                //fr0708.close();
                //fr0809.append(shotData + "\r\n");
            } else if (gameKey.compareTo("201007") < 0) {
                //fr0809.close();
                //fr0910.append(shotData + "\r\n");
            } else if (gameKey.compareTo("201107") < 0) {
                //fr0910.close();
                //fr1011.append(shotData + "\r\n");
            }
            System.out.println(shotcharts[i]);
        }
        //fr1011.close();
        // Close the active season writer so buffered shots are flushed to disk
        fr0607.close();

        // Save the player ID data
        Iterator<CBSplayerInfo> it = playerInfoSet.iterator();
        while (it.hasNext()) {
            CBSplayerInfo nextPlayer = it.next();
            String playerInfo = nextPlayer.id + "\t" + nextPlayer.firstName + "\t" + nextPlayer.lastName;
            //frID.append(playerInfo + "\r\n");
        }
        frReschGames.close();
        frSpecialName.close();
        //frID.close();
    }
}

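The two string transformations at the heart of the parser (file name to game key, and the "~"-separated shot string to one line per shot) can be isolated in a small sketch. The class `ShotLines` and its method names are hypothetical; the sample records in the usage note are the two quoted in the comment of the period-fixing program in step 5.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
final class ShotLines {

    /** "NBA_20101026_HOU@LAL" -> "20101026HOULAL", as in the parser above. */
    static String gameKey(String fileName) {
        return fileName.substring(4).replaceAll("_|@", "");
    }

    /** Splits the raw "~"-separated shot string and prefixes each shot with the game key. */
    static List<String> toLines(String fileName, String rawShotData) {
        String key = gameKey(fileName);
        List<String> lines = new ArrayList<>();
        for (String shot : rawShotData.split("~")) {
            if (!shot.isEmpty()) { // empty data (e.g. a rescheduled game) yields no lines
                lines.add(key + "," + shot);
            }
        }
        return lines;
    }
}
```

For instance, `toLines("NBA_20101026_HOU@LAL", "0,5.0,3,1622542,1,0,25,40,25~0,11:41,3,1622542,5,1,0,42,0")` yields two lines, each starting with `20101026HOULAL,`.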
Some pages have encoding problems where the space character between names is inconsistent; these are handled separately.

package CBS;

import java.io.*;
import java.util.Iterator;
import java.util.TreeSet;

public class CBSspecialName {

    public static void main(String[] args) throws Exception {
        TreeSet<CBSplayerInfo> playerInfoSet = new TreeSet<CBSplayerInfo>();
        FileWriter frID = new FileWriter("E:\\CBSplayerInfo.txt");
        // File for pages whose player names contain an unusual space character
        FileWriter frSpecialName = new FileWriter("E:\\SpecialNameSpace.txt");
        BufferedReader br = new BufferedReader(new FileReader("E:\\NBA\\data\\SpecialName.txt"));
        String str = br.readLine();
        int cnt = 1;
        while (str != null) {
            String page = "E:\\NBA\\data\\2003-2011CBSshotchart\\" + str;
            BufferedReader br2 = new BufferedReader(new FileReader(page));
            String pageContent = "";
            String aLine = br2.readLine();
            while (aLine != null) {
                pageContent = pageContent + aLine;
                aLine = br2.readLine();
            }
            br2.close();

            int cur = pageContent.indexOf("playerDataHomeString = new String");
            int lcur = pageContent.indexOf("\"", cur);
            int rcur = pageContent.indexOf("\"", lcur + 1);
            String homePlayers = pageContent.substring(lcur + 1, rcur);
            cur = pageContent.indexOf("playerDataAwayString = new String", rcur);
            lcur = pageContent.indexOf("\"", cur);
            rcur = pageContent.indexOf("\"", lcur + 1);
            String awayPlayers = pageContent.substring(lcur + 1, rcur);
            String players = homePlayers + "|" + awayPlayers;
            // Round-trip through iso-8859-1 bytes; the odd space then shows up as "?"
            players = new String(players.getBytes("iso-8859-1"));

            for (int j = 0; j < players.length(); j++) {
                CBSplayerInfo aPlayer = new CBSplayerInfo();
                int cur1 = players.indexOf(":", j);
                aPlayer.id = players.substring(j, cur1);
                int cur2 = players.indexOf(" ", cur1);
                int cur2p = players.indexOf("|", cur1);
                // No plain space inside this player's entry: fall back to "?"
                if (cur2 == -1 || (cur2 > cur2p && cur2p != -1)) {
                    cur2 = players.indexOf("?", cur1); // the space as it appears under iso-8859-1
                }
                aPlayer.firstName = players.substring(cur1 + 1, cur2);
                int cur3 = players.indexOf(",", cur2);
                aPlayer.lastName = players.substring(cur2 + 1, cur3);
                playerInfoSet.add(aPlayer); // add to the player ID index
                System.out.println(str + ":" + aPlayer.display());
                j = players.indexOf("|", cur3);
                if (j == -1) break;
            }
            str = br.readLine();
        }
        frSpecialName.close();
        br.close();

        // Save the player ID data
        Iterator<CBSplayerInfo> it = playerInfoSet.iterator();
        while (it.hasNext()) {
            CBSplayerInfo nextPlayer = it.next();
            String playerInfo = nextPlayer.id + ";" + nextPlayer.firstName + ";" + nextPlayer.lastName;
            frID.append(playerInfo + "\r\n");
        }
        frID.close();
    }
}
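
The two parsing programs above differ mainly in which space character they search for between the first and last name. A minimal sketch of an alternative is to normalize the separator first and then split. This assumes the separator is always one of "&nbsp;", the no-break space U+00A0, or a plain space, which has not been verified against the actual CBS pages. `PlayerEntries` and `Player` are hypothetical names, and this is a different technique from the author's iso-8859-1 round trip.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
final class PlayerEntries {

    /** One (id, firstName, lastName) triple from the player string. */
    static final class Player {
        final String id;
        final String firstName;
        final String lastName;

        Player(String id, String firstName, String lastName) {
            this.id = id;
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    /** Parses entries such as "240304:Tony Parker,9,PG,8-20,1-3,0-0,17|". */
    static List<Player> parse(String players) {
        // Assumption: these are the only separator variants that occur.
        String normalized = players.replace("&nbsp;", " ").replace('\u00A0', ' ');
        List<Player> out = new ArrayList<>();
        for (String entry : normalized.split("\\|")) {
            int colon = entry.indexOf(':');
            int comma = entry.indexOf(',', colon + 1);
            if (colon < 0 || comma < 0) {
                continue; // skip empty or malformed entries
            }
            String id = entry.substring(0, colon);
            String name = entry.substring(colon + 1, comma).trim();
            int space = name.indexOf(' ');
            String first = space < 0 ? name : name.substring(0, space);
            String last = space < 0 ? "" : name.substring(space + 1).trim();
            out.add(new Player(id, first, last));
        }
        return out;
    }
}
```

Pages with the mojibake variant mentioned in the parser's comments would still need the separate treatment shown above, since a garbled byte sequence cannot be normalized this way.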

5. In CBS's default shotchart data, the fourth quarter and every overtime all carry period 3; this is corrected programmatically.

package CBS;

/*
 * By default, the fourth quarter and all overtimes in the CBS period data are 3;
 * this program renumbers them as 4, 5, 6, ...
 * 20101026HOULAL,0,5.0,3,1622542,1,0,25,40,25
 * 20101026HOULAL,0,11:41,3,1622542,5,1,0,42,0
 * Rule: when period >= 3 and, within the same gameID, the previous shot's time is
 * in seconds (contains ".") while the next one contains minutes (":"), period++
 */
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;

public class CBSTime {

    public static void main(String args[]) throws Exception {
        String directoryPath = "E:\\2006-07shotdata\\";
        File directory = new File(directoryPath);
        String[] shotdata = directory.list();
        for (int i = 0; i < shotdata.length; i++) {
            BufferedReader br = new BufferedReader(new FileReader(directoryPath + shotdata[i]));
            String aLine = br.readLine();
            FileWriter fr = new FileWriter(directoryPath + "CBS" + shotdata[i]);
            String[] lastShot = new String[]{"", "", "", "", "", "", "", "", "", ""};
            while (aLine != null) {
                String[] newShot = aLine.split(",");
                // Clock jumped from seconds back to minutes: a new period has started
                if (lastShot[0].equals(newShot[0]) && lastShot[3].compareTo("3") >= 0
                        && lastShot[2].contains(".") && newShot[2].contains(":")) {
                    Integer tmp = Integer.parseInt(lastShot[3]) + 1;
                    newShot[3] = tmp.toString();
                }
                // Within a game the period never decreases: carry the corrected value forward
                if (lastShot[0].equals(newShot[0]) && newShot[3].compareTo(lastShot[3]) < 0)
                    newShot[3] = lastShot[3];
                lastShot = newShot;
                String aShot = lastShot[0] + "," + lastShot[1] + "," + lastShot[2] + "," + lastShot[3] + ","
                        + lastShot[4] + "," + lastShot[5] + "," + lastShot[6] + "," + lastShot[7] + ","
                        + lastShot[8] + "," + lastShot[9];
                fr.append(aShot + "\r\n");
                System.out.println(aShot);
                aLine = br.readLine();
            }
            br.close();
            fr.close();
        }
    }
}
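
The renumbering rule can also be written as a pure function over a list of lines, which makes it easier to try on a few records. This is a sketch rather than a drop-in replacement: `PeriodFixer` and `fixPeriods` are hypothetical names, the period is compared as an integer instead of as a string, and each record is assumed to have the period in its fourth field and the game clock in its third, as in the sample lines quoted in the comment above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original post.
final class PeriodFixer {

    /** Returns the lines with periods after the third renumbered 4, 5, 6, ... per game. */
    static List<String> fixPeriods(List<String> lines) {
        List<String> out = new ArrayList<>();
        String lastGame = "";
        String lastClock = "";
        int lastPeriod = 0;
        for (String line : lines) {
            String[] f = line.split(",", -1); // -1 keeps trailing empty fields
            int period = Integer.parseInt(f[3]);
            if (f[0].equals(lastGame)) {
                if (lastPeriod >= 3 && lastClock.contains(".") && f[2].contains(":")) {
                    // Clock went from seconds back to minutes: next period
                    period = lastPeriod + 1;
                } else if (period < lastPeriod) {
                    // Still inside an already renumbered period
                    period = lastPeriod;
                }
            }
            f[3] = Integer.toString(period);
            out.add(String.join(",", f));
            lastGame = f[0];
            lastClock = f[2];
            lastPeriod = period;
        }
        return out;
    }
}
```

Like the original rule, this only detects a period change when the last shot of the previous period came inside the final minute, so a period that ends without such a shot would still be missed.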

6. Once the shotdata text files are imported into a database, simple queries can be run against them.
