Apache Hadoop Pig Source Code Analysis (2)


With Pig's core code separated out, we can now start working our way into the code itself.

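Most source-code analyses found online start from a few core classes and draw class diagrams, flow charts, and so on. Let's take a different approach and peel the code like an onion: start from the periphery and work step by step toward the innermost code. This gives the analysis a gentle slope and makes it less difficult.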

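Let's first look at the names of Pig's source files. For many of them, the name alone tells you what the file does: IsDouble.java evidently checks whether a value is a Double, and XPath.java evidently handles XPath queries against XML. Here is the code of the IsDouble and XPath classes: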

/**
 * This UDF is used to check whether the String input is a Double.
 * Note this function checks for Double range.
 * If range is not important, use IsNumeric instead if you would like to check if a String is numeric.
 * Also IsNumeric performs slightly better compared to this function.
 */
public class IsDouble extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return false;
        try {
            String str = (String)input.get(0);
            if (str == null || str.length() == 0) return false;
            Double.parseDouble(str);
        } catch (NumberFormatException nfe) {
            return false;
        } catch (ClassCastException e) {
            warn("Unable to cast input "+input.get(0)+" of class "+
                    input.get(0).getClass()+" to String", PigWarning.UDF_WARNING_1);
            return false;
        }
        return true;
    }
    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(null, DataType.BOOLEAN));
    }
}

/**
 * XPath is a function that allows for text extraction from xml
 */
public class XPath extends EvalFunc<String> {
    /** Hold onto last xpath & xml in case the next call to xpath() is feeding the same xml document
     * The reason for this is because creating an xpath object is costly. */
    private javax.xml.xpath.XPath xpath = null;
    private String xml = null;
    private Document document;
    private static boolean cache = true;
    /**
     * input should contain: 1) xml 2) xpath 3) optional cache xml doc flag
     *
     * Usage:
     * 1) XPath(xml, xpath)
     * 2) XPath(xml, xpath, false)
     *
     * @param 1st element should to be the xml
     *        2nd element should be the xpath
     *        3rd optional boolean cache flag (default true)
     *
     * This UDF will cache the last xml document. This is helpful when multiple consecutive xpath calls are made for the same xml document.
     * Caching can be turned off to ensure that the UDF's recreates the internal javax.xml.xpath.XPath for every call
     *
     * @return chararrary result or null if no match
     */
    @Override
    public String exec(final Tuple input) throws IOException {
        if (input == null || input.size() <= 1) {
            warn("Error processing input, not enough parameters or null input" + input,
                    PigWarning.UDF_WARNING_1);
            return null;
        }
        if (input.size() > 3) {
            warn("Error processing input, too many parameters" + input,
                    PigWarning.UDF_WARNING_1);
            return null;
        }
        try {
            final String xml = (String) input.get(0);
            if(input.size() > 2)
                cache = (Boolean) input.get(2);
            if(!cache || xpath == null || !xml.equals(this.xml))
            {
                final InputSource source = new InputSource(new StringReader(xml));
                this.xml = xml; //track the xml for subsequent calls to this udf
                final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
                final DocumentBuilder db = dbf.newDocumentBuilder();
                this.document = db.parse(source);
                final XPathFactory xpathFactory = XPathFactory.newInstance();
                this.xpath = xpathFactory.newXPath();
            }
            final String xpathString = (String) input.get(1);
            final String value = xpath.evaluate(xpathString, document);
            return value;
        } catch (Exception e) {
            warn("Error processing input " + input.getType(0),
                    PigWarning.UDF_WARNING_1);
            return null;
        }
    }
    @Override
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        final List<FuncSpec> funcList = new ArrayList<FuncSpec>();
        /*either two chararray arguments*/
        List<FieldSchema> fields = new ArrayList<FieldSchema>();
        fields.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        fields.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        Schema twoArgInSchema = new Schema(fields);
        funcList.add(new FuncSpec(this.getClass().getName(), twoArgInSchema));
        /*or two chararray and a boolean argument*/
        fields = new ArrayList<FieldSchema>();
        fields.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        fields.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
        fields.add(new Schema.FieldSchema(null, DataType.BOOLEAN));
        Schema threeArgInSchema = new Schema(fields);
        funcList.add(new FuncSpec(this.getClass().getName(), threeArgInSchema));
        return funcList;
    }
}

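As you can see, both classes extend a generic class called EvalFunc. Anyone who has used Pig knows that it supports UDFs (user-defined functions), which let users implement their own computation functions, and those functions must extend this generic class. Pig ships with many such functions; put plainly, they are UDFs that Pig has already written for you.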

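These UDF classes all have a fairly simple structure. They differ mainly in two ways:

1. The return type

As shown above, IsDouble's exec method returns a Boolean that indicates whether the input is a Double value, while XPath's exec returns a String, the value extracted from the XML.

2. The algorithm, that is, the implementation of exec()

Functions with different purposes naturally have different implementations.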
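The two differences listed above can be illustrated with a small dependency-free sketch. It does not use Pig itself: `MiniEvalFunc` is a hypothetical stand-in for EvalFunc, a `List<Object>` stands in for Pig's Tuple, and `IsDoubleSketch` and `UpperSketch` are illustrative names, not Pig classes.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for Pig's EvalFunc<T>; the real class takes a Tuple
// and declares IOException. T fixes the return type of exec().
abstract class MiniEvalFunc<T> {
    abstract T exec(List<Object> input);
}

// Same shape as IsDouble: the return type is Boolean.
class IsDoubleSketch extends MiniEvalFunc<Boolean> {
    @Override
    Boolean exec(List<Object> input) {
        if (input == null || input.isEmpty() || !(input.get(0) instanceof String)) {
            return false;
        }
        String str = (String) input.get(0);
        if (str.isEmpty()) {
            return false;
        }
        try {
            Double.parseDouble(str);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}

// Same base class, but a different return type and a different exec() body.
class UpperSketch extends MiniEvalFunc<String> {
    @Override
    String exec(List<Object> input) {
        if (input == null || input.isEmpty() || input.get(0) == null) {
            return null;
        }
        return input.get(0).toString().toUpperCase(Locale.ROOT);
    }
}
```

Only the type parameter and the body of exec() change between the two subclasses, which is the pattern the built-in UDFs follow.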

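These classes play a supporting role in Pig's source code. A quick look at their structure and algorithms is enough; after that they can be deleted.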

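Some readers asked how I picked Pig's UDF classes out of so many source files in order to delete them. Because I analyze the code on Windows, I used Visual Studio's Find in Files feature to search for Java files containing "extends EvalFunc". If you work on Linux, shell commands such as grep offer similar functionality, and so does Eclipse.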

Here is the result of the search in Visual Studio:

Matching lines: 319 Matching files: 248 Total files searched: 1157

As the result shows, Pig contains 248 such files. Some of the extending classes are themselves still generic, for example AccumulatorEvalFunc<T>.
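For readers without Visual Studio, the same text search can be sketched in plain Java. This is only an approximation of Find in Files under stated assumptions: `UdfFinder` and `findUdfSources` are hypothetical names, the source root is passed as the first argument, and files are read as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class UdfFinder {
    // The search string used in the article.
    static final String MARKER = "extends EvalFunc";

    // Returns every .java file under root whose text contains MARKER.
    static List<Path> findUdfSources(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".java"))
                    .filter(UdfFinder::containsMarker)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    private static boolean containsMarker(Path file) {
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            return text.contains(MARKER);
        } catch (IOException e) {
            return false; // unreadable files are skipped
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> hits = findUdfSources(root);
        hits.forEach(System.out::println);
        System.out.println("Matching files: " + hits.size());
    }
}
```

Like the original search, this is a plain substring match, so a class that extends a subclass such as AccumulatorEvalFunc<T> is not reported unless the literal text "extends EvalFunc" also appears in the file.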

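I divided these classes into the following groups according to their purpose:

1. Type-checking group

These classes are named IsXXX and check whether an input value is of a certain type, such as the IsDouble class mentioned above. The classes in this group are all very simple.

2. Format-conversion group

These classes are named XXXToYYY and convert an input of type XXX into an output of type YYY. One example is ISOToUnix, whose code is as follows: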

/**
 * <p>ISOToUnix converts ISO8601 datetime strings to Unix Time Longs</p>
 * <ul>
 * <li>Jodatime: http://joda-time.sourceforge.net/</li>
 * <li>ISO8601 Date Format: http://en.wikipedia.org/wiki/ISO_8601</li>
 * <li>Unix Time: http://en.wikipedia.org/wiki/Unix_time</li>
 * </ul>
 * <br />
 * <pre>
 * Example usage:
 *
 * REGISTER /Users/me/commiter/piggybank/java/piggybank.jar ;
 * REGISTER /Users/me/commiter/piggybank/java/lib/joda-time-1.6.jar ;
 *
 * DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
 *
 * ISOin = LOAD 'test.tsv' USING PigStorage('\t') AS (dt:chararray, dt2:chararray);
 *
 * DESCRIBE ISOin;
 * ISOin: {dt: chararray,dt2: chararray}
 *
 * DUMP ISOin;
 *
 * (2009-01-07T01:07:01.000Z,2008-02-01T00:00:00.000Z)
 * (2008-02-06T02:06:02.000Z,2008-02-01T00:00:00.000Z)
 * (2007-03-05T03:05:03.000Z,2008-02-01T00:00:00.000Z)
 * ...
 *
 * toUnix = FOREACH ISOin GENERATE ISOToUnix(dt) AS unixTime:long;
 *
 * DESCRIBE toUnix;
 * toUnix: {unixTime: long}
 *
 * DUMP toUnix;
 *
 * (1231290421000L)
 * (1202263562000L)
 * (1173063903000L)
 * ...
 *</pre>
 */
public class ISOToUnix extends EvalFunc<Long> {
    @Override
    public Long exec(Tuple input) throws IOException
    {
        if (input == null || input.size() < 1) {
            return null;
        }
        // Set the time to default or the output is in UTC
        DateTimeZone.setDefault(DateTimeZone.UTC);
        DateTime result = new DateTime(input.get(0).toString());
        return result.getMillis();
    }
    @Override
    public Schema outputSchema(Schema input) {
        return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass().getName().toLowerCase(), input), DataType.LONG));
    }
    @Override
    public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
        List<FuncSpec> funcList = new ArrayList<FuncSpec>();
        funcList.add(new FuncSpec(this.getClass().getName(), new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY))));
        return funcList;
    }
}


It converts an ISO 8601 date string into Unix time, expressed as a long integer in milliseconds.
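The conversion itself can be tried without Pig or Joda-Time. The sketch below swaps in the JDK's java.time API in place of Joda-Time, so it is not the piggybank implementation; `IsoToUnixSketch` and `isoToUnixMillis` are hypothetical names, and the input is assumed to carry an explicit offset such as "Z".

```java
import java.time.OffsetDateTime;

final class IsoToUnixSketch {
    // Parses an ISO 8601 datetime with an explicit offset and returns
    // milliseconds since the Unix epoch, or null for missing input.
    static Long isoToUnixMillis(String iso) {
        if (iso == null || iso.isEmpty()) {
            return null;
        }
        return OffsetDateTime.parse(iso).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        // First sample row from the ISOToUnix Javadoc; prints 1231290421000
        System.out.println(isoToUnixMillis("2009-01-07T01:07:01.000Z"));
    }
}
```

Unlike the Joda-Time constructor used in ISOToUnix, OffsetDateTime.parse rejects strings that have no offset, so no default time zone needs to be set.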

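3. Math group

These classes are named after mathematical operations, for example SIN (sine), ASIN (arcsine), MAX (maximum), IntAbs (integer absolute value), and RANDOM (random number). Their implementations are all very simple.

This group also includes big-number arithmetic whose output type is BigDecimal, such as BigDecimalAbs (absolute value of a big number) and BigDecimalAvg (average of big numbers).

Date arithmetic is a special case of this group; for example, ISODaysBetween computes the number of days between two dates.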
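To give a feel for how little code such functions need, here is a sketch of a big-number average and a days-between calculation using only the JDK. It is not Pig's code: `MathGroupSketch`, `average`, and `daysBetween` are hypothetical names, the 34-digit precision (MathContext.DECIMAL128) is an assumption, and java.time is used in place of Joda-Time.

```java
import java.math.BigDecimal;
import java.math.MathContext;
import java.time.OffsetDateTime;
import java.time.temporal.ChronoUnit;
import java.util.List;

final class MathGroupSketch {
    // Average of arbitrary-precision values; null for empty input.
    static BigDecimal average(List<BigDecimal> values) {
        if (values == null || values.isEmpty()) {
            return null;
        }
        BigDecimal sum = BigDecimal.ZERO;
        for (BigDecimal v : values) {
            sum = sum.add(v);
        }
        // Assumed precision: 34 significant digits.
        return sum.divide(BigDecimal.valueOf(values.size()), MathContext.DECIMAL128);
    }

    // Whole days from start to end; negative when end is before start.
    static long daysBetween(String startIso, String endIso) {
        return ChronoUnit.DAYS.between(
                OffsetDateTime.parse(startIso), OffsetDateTime.parse(endIso));
    }
}
```

The argument order of `daysBetween` here is a choice made for this sketch and may not match the order ISODaysBetween expects.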

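4. String-processing group

These classes take a String as input and are named after a string operation, for example UPPER (convert to upper case), Reverse (reverse a string), and Trim (strip leading and trailing whitespace).

Note that this group also contains classes such as HashFNV and HashFNV1, which compute a hash value from a string, and RegexMatch, which matches a string against a regular expression.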
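For readers unfamiliar with FNV hashing, the following sketch shows the standard 32-bit FNV-1 algorithm applied to a string. It is the textbook algorithm and not necessarily the exact variant or return type used by Pig's HashFNV classes; `Fnv1Sketch` and `fnv1_32` are hypothetical names, and UTF-8 encoding of the input is an assumption.

```java
import java.nio.charset.StandardCharsets;

final class Fnv1Sketch {
    private static final long OFFSET_BASIS = 0x811C9DC5L; // 32-bit FNV offset basis
    private static final long PRIME = 0x01000193L;        // 32-bit FNV prime
    private static final long MASK_32 = 0xFFFFFFFFL;

    // Returns the 32-bit FNV-1 hash of s as a non-negative long.
    static long fnv1_32(String s) {
        long hash = OFFSET_BASIS;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash = (hash * PRIME) & MASK_32; // FNV-1 multiplies first, kept to 32 bits
            hash ^= (b & 0xFF);              // then XORs in the next byte
        }
        return hash;
    }
}
```

Swapping the multiply and XOR steps gives the FNV-1a variant, which produces different values for the same input.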

5. Assertion group

The Assert class checks whether an expression is true and throws an exception if it is not. Here is the code:

public class Assert extends EvalFunc<Boolean>
{
  @Override
  public Boolean exec(Tuple tuple)
      throws IOException
  {
    if (!(Boolean) tuple.get(0)) {
      if (tuple.size() > 1) {
        throw new IOException("Assertion violated: " + tuple.get(1).toString());
      }
      else {
        throw new IOException("Assertion violated. ");
      }
    }
    else {
      return true;
    }
  }
}
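The behaviour of Assert can be tried outside Pig with a stand-alone sketch. `AssertSketch` and `check` are hypothetical names, and a `List<Object>` stands in for Pig's Tuple: the first element is the condition and the optional second element is the message.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

final class AssertSketch {
    // Returns true when the condition holds; otherwise throws IOException,
    // appending the optional message from the second element.
    static Boolean check(List<Object> tuple) throws IOException {
        if (!(Boolean) tuple.get(0)) {
            String detail = tuple.size() > 1 ? ": " + tuple.get(1) : ".";
            throw new IOException("Assertion violated" + detail);
        }
        return true;
    }

    public static void main(String[] args) {
        try {
            check(Arrays.<Object>asList(true));            // passes silently
            check(Arrays.<Object>asList(false, "x > 0"));  // throws
        } catch (IOException e) {
            System.out.println(e.getMessage()); // prints "Assertion violated: x > 0"
        }
    }
}
```

Because the failure surfaces as an IOException rather than a false return value, a violated assertion aborts the computation instead of being silently filtered out.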

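6. Script-execution group

This group can execute functions written in other scripting languages: for example, the Jruby class runs Ruby scripts and the JsFunction class runs JavaScript scripts.

With the analysis above done, the computation functions can be deleted, leaving roughly 820 Java files.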

0 0