A Simple Application of Akka to the Bagging Voting Algorithm


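Akka is an excellent concurrency framework, well suited to parallel computation and similar problems.

An earlier post, "Akka 初探（一）" (a first look at Akka, part 1), gave a brief introduction to its Java API, and "AkkaCrawler 翻译（一）" (AkkaCrawler translation, part 1) gave a brief introduction to its use in a concurrent crawler.

With this framework, fairly complex asynchronous and parallel logic can be written quite simply.

This post briefly describes its use in the Bagging voting algorithm. The goal is to make full use of the machine's resources and make the implementation run faster.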

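The Bagging voting algorithm is simple: it essentially consists of generating bootstrap samples and fitting a model on each of them, which makes it a typical producer-consumer problem.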
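To make the two halves concrete, here is a minimal sketch of Bagging in plain Python. It is not the author's script: the toy data set and the 1-nearest-neighbour base learner (`fit_nearest_neighbor`) are assumptions chosen only to keep the example free of third-party packages.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Producer side: draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def fit_nearest_neighbor(sample):
    """Hypothetical base learner: 1-nearest-neighbour on a single feature."""
    def predict(x):
        nearest = min(sample, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict


def bagging_predict(data, x, n_models, seed=0):
    """Consumer side: fit one model per bootstrap sample and take a majority vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        model = fit_nearest_neighbor(bootstrap_sample(data, rng))
        votes[model(x)] += 1
    return votes.most_common(1)[0][0], votes


# Toy (feature, label) pairs, invented for illustration.
data = [(0.1, "a"), (0.2, "a"), (0.3, "a"), (0.7, "b"), (0.8, "b"), (0.9, "b")]
label, votes = bagging_predict(data, 0.15, n_models=25)
print(label, dict(votes))
```

Each loop iteration is independent of the others, which is why sample generation and model fitting can be handed to separate workers.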

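Both sample generation and model estimation are done in Python. (I would have liked to use Java, but there was no convenient package.)

For sharing data between the steps, a Redis list is sufficient.

Jython generally cannot call the sklearn library, much of which relies on compiled C/C++ extension code, so the Java Akka API is limited here to launching Python processes.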

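As a first step, here is a brief description of the Python scripts involved and the task each one performs.

Generate.py

Generates the population data.

Produce.py

Draws bootstrap samples from the population data created by Generate.py and pushes each sample onto a Redis list.

Consume.py

Pops a sample from the Redis list, casts its Bagging vote, and stores the vote in a Redis hash.

Count.py

Analyzes the votes in the hash to obtain the final result of the Bagging algorithm.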

The conversion between Redis values and numpy ndarrays (or Python objects in general) mentioned above can be done with pickle.
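The division of labour between Produce.py, Consume.py and Count.py might look roughly like the sketch below. It is an illustration under stated assumptions, not the original scripts: `FakeRedis` is an in-memory stand-in that mimics only the `lpush`, `rpop`, `hincrby` and `hgetall` calls of a Redis client, the list key `BOOTSTRAP_SAMPLES` is hypothetical, and the "model" is simply the majority label of the sample. Only the hash key `FACTOR_DICT` comes from the Java code in this post.

```python
import pickle
import random
from collections import Counter, deque

SAMPLE_LIST = "BOOTSTRAP_SAMPLES"  # hypothetical list key
VOTE_HASH = "FACTOR_DICT"          # hash key read by the Manager class


class FakeRedis:
    """In-memory stand-in for the few Redis client methods used here."""

    def __init__(self):
        self.lists = {}
        self.hashes = {}

    def lpush(self, key, value):
        self.lists.setdefault(key, deque()).appendleft(value)

    def rpop(self, key):
        queue = self.lists.get(key)
        return queue.pop() if queue else None

    def hincrby(self, key, field, amount=1):
        table = self.hashes.setdefault(key, Counter())
        table[field] += amount
        return table[field]

    def hgetall(self, key):
        return dict(self.hashes.get(key, {}))


def produce(client, population, rng):
    """Produce.py role: pickle one bootstrap sample and push it onto the list."""
    sample = [rng.choice(population) for _ in population]
    client.lpush(SAMPLE_LIST, pickle.dumps(sample))


def consume(client):
    """Consume.py role: pop one sample, vote, and record the vote in the hash."""
    raw = client.rpop(SAMPLE_LIST)
    if raw is None:
        return False  # nothing left to consume
    sample = pickle.loads(raw)
    vote = Counter(sample).most_common(1)[0][0]  # stand-in for a fitted model
    client.hincrby(VOTE_HASH, vote, 1)
    return True


def count(client):
    """Count.py role: the label with the most votes wins."""
    votes = client.hgetall(VOTE_HASH)
    return max(votes, key=votes.get), votes


client = FakeRedis()
rng = random.Random(0)
population = ["up"] * 7 + ["down"] * 3  # invented toy population
for _ in range(20):
    produce(client, population, rng)
while consume(client):
    pass
winner, votes = count(client)
print(winner, votes)
```

With a real Redis server the same three functions should work once `FakeRedis()` is replaced by an actual client, keeping in mind that a real client typically returns hash fields and values as bytes.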

With the work divided among these Python scripts, all that remains is to schedule them from Java with Akka.

The code follows.

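First come the message classes. Messages carry information between actors, but the Bagging voting algorithm needs very little of it (the system is too simple): it is enough to know whether all the data has been produced and whether it has all been consumed. The message classes below therefore record, at most, the id of the bootstrap sample in use.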

package metaClasses;

/**
 * Created by ehangzhou on 2017/3/18.
 */
public class StartMessage {
}

package metaClasses;

/**
 * Created by ehangzhou on 2017/3/18.
 */
public class ProduceMessage {
    public int id;

    public ProduceMessage(int id) {
        this.id = id;
    }
}

package metaClasses;

/**
 * Created by ehangzhou on 2017/3/18.
 */
public class ConsumeMessage {
    public int id;

    public ConsumeMessage(int id) {
        this.id = id;
    }
}

package metaClasses;

/**
 * Created by ehangzhou on 2017/3/18.
 */
public class EndMessage {
    public int id;

    public EndMessage(int id) {
        this.id = id;
    }
}

package metaClasses;

/**
 * Created by ehangzhou on 2017/3/18.
 */
public class FinishMessage {
}

The actors do nothing more than launch the Python processes described above.

package actors;

/**
 * Created by ehangzhou on 2017/3/18.
 */
import akka.actor.UntypedActor;
import metaClasses.StartMessage;

public class BeginActor extends UntypedActor {
    @Override
    public void onReceive(Object o) throws Exception {
        if (o instanceof StartMessage) {
            // Generate the population data, blocking until the script exits.
            Process p = Runtime.getRuntime().exec("python Generate.py");
            p.waitFor();
            p.destroy();
            // Echo the StartMessage back so the sender knows the data is ready.
            getSender().tell(o, getSelf());
        }
    }
}

package actors;

/**
 * Created by ehangzhou on 2017/3/18.
 */
import akka.actor.UntypedActor;
import metaClasses.ConsumeMessage;
import metaClasses.ProduceMessage;

public class GenerateSampleActor extends UntypedActor {
    @Override
    public void onReceive(Object o) throws Exception {
        if (o instanceof ProduceMessage) {
            ProduceMessage produceMessage = (ProduceMessage) o;
            // Push one bootstrap sample onto the Redis list.
            Process p = Runtime.getRuntime().exec("python Produce.py");
            p.waitFor();
            p.destroy();
            // Report that the sample with this id is ready to be consumed.
            getSender().tell(new ConsumeMessage(produceMessage.id), getSelf());
        }
    }
}

package actors;

/**
 * Created by ehangzhou on 2017/3/18.
 */
import akka.actor.UntypedActor;
import metaClasses.ConsumeMessage;
import metaClasses.EndMessage;

public class EstimateActor extends UntypedActor {
    @Override
    public void onReceive(Object o) throws Exception {
        if (o instanceof ConsumeMessage) {
            ConsumeMessage consumeMessage = (ConsumeMessage) o;
            // Pop one sample from the Redis list and record its vote.
            Process p = Runtime.getRuntime().exec("python Consume.py");
            p.waitFor();
            p.destroy();
            // Report that the sample with this id has been processed.
            getSender().tell(new EndMessage(consumeMessage.id), getSelf());
        }
    }
}

package actors;

/**
 * Created by ehangzhou on 2017/3/18.
 */
import akka.actor.UntypedActor;
import metaClasses.FinishMessage;

public class EndActor extends UntypedActor {
    @Override
    public void onReceive(Object o) throws Exception {
        if (o instanceof FinishMessage) {
            // Tally the votes stored in the Redis hash.
            Process p = Runtime.getRuntime().exec("python Count.py");
            p.waitFor();
            p.destroy();
            // Echo the FinishMessage back so the sender can print the result.
            getSender().tell(o, getSelf());
        }
    }
}

An Akka router can create several routees, so that messages are handled concurrently.

package routers;

/**
 * Created by ehangzhou on 2017/3/18.
 */
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.routing.ActorRefRoutee;
import akka.routing.CustomRouterConfig;
import akka.routing.RoundRobinRoutingLogic;
import akka.routing.Routee;
import akka.routing.Router;

import java.util.ArrayList;
import java.util.List;

public class ConstructRouter extends CustomRouterConfig {
    private final int noOfInstances;
    private final Class<?> initClass;

    public ConstructRouter(int noOfInstances, Class<?> initClass) {
        this.noOfInstances = noOfInstances;
        this.initClass = initClass;
    }

    @Override
    public Router createRouter(ActorSystem system) {
        // Create noOfInstances actors of the given class and route to them round-robin.
        List<Routee> routees = new ArrayList<Routee>(noOfInstances);
        for (int i = 0; i < noOfInstances; i++) {
            routees.add(new ActorRefRoutee(system.actorOf(Props.create(initClass))));
        }
        return new Router(new RoundRobinRoutingLogic(), routees);
    }
}

Finally, a single Manager actor does the scheduling. The number of routees for both Produce.py and Consume.py is set to 5.

package manager;

/**
 * Created by ehangzhou on 2017/3/18.
 */
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.actor.UntypedActor;
import metaClasses.StartMessage;
import metaClasses.ProduceMessage;
import metaClasses.ConsumeMessage;
import metaClasses.EndMessage;
import metaClasses.FinishMessage;
import actors.BeginActor;
import actors.EndActor;
import actors.EstimateActor;
import actors.GenerateSampleActor;
import routers.ConstructRouter;

import java.util.Set;
import java.util.HashSet;
import java.util.Date;

import redis.clients.jedis.Jedis;

public class Manager extends UntypedActor {
    private final ActorRef estimator;
    private final ActorRef generator;
    private final ActorRef beginnor;
    private final ActorRef endor;
    private final Jedis redis = new Jedis("127.0.0.1");
    private final String factor_dict_final = "FACTOR_DICT";
    private final int sample_num = 100;
    // Ids of bootstrap samples that have not been requested yet.
    private Set<Integer> idsSet = new HashSet<>();
    // Ids of bootstrap samples whose vote has been recorded.
    private Set<Integer> finishSet = new HashSet<>();
    private long start_time;

    public Manager(int estimatorNum, int generatorNum) {
        generator = getContext().actorOf(Props.create(GenerateSampleActor.class)
                .withRouter(new ConstructRouter(generatorNum, GenerateSampleActor.class)));
        estimator = getContext().actorOf(Props.create(EstimateActor.class)
                .withRouter(new ConstructRouter(estimatorNum, EstimateActor.class)));
        beginnor = getContext().actorOf(Props.create(BeginActor.class));
        endor = getContext().actorOf(Props.create(EndActor.class));
        for (int i = 0; i < sample_num; i++) {
            idsSet.add(i);
        }
    }

    public ActorRef getEstimator() { return estimator; }
    public ActorRef getGenerator() { return generator; }
    public ActorRef getBeginnor() { return beginnor; }
    public ActorRef getEndor() { return endor; }

    @Override
    public void onReceive(Object o) {
        System.out.println(o);
        System.out.println("idsSet size :");
        System.out.println(idsSet.size());
        System.out.println("finishSet size :");
        System.out.println(finishSet.size());
        if (o instanceof String) {
            // Kick-off: ask BeginActor to generate the population data.
            getBeginnor().tell(new StartMessage(), getSelf());
        } else if (o instanceof StartMessage) {
            // Population is ready: request the first bootstrap sample.
            start_time = new Date().getTime();
            int id = idsSet.iterator().next();
            getGenerator().tell(new ProduceMessage(id), getSelf());
            idsSet.remove(id);
        } else if (o instanceof ConsumeMessage) {
            // A sample is ready: hand it to an estimator and request the next one.
            getEstimator().tell(o, getSelf());
            if (!idsSet.isEmpty()) {
                int id = idsSet.iterator().next();
                getGenerator().tell(new ProduceMessage(id), getSelf());
                idsSet.remove(id);
            }
        } else if (o instanceof EndMessage) {
            // One vote recorded; once all are in, ask EndActor to tally them.
            EndMessage endMessage = (EndMessage) o;
            finishSet.add(endMessage.id);
            if (finishSet.size() == sample_num) {
                getEndor().tell(new FinishMessage(), getSelf());
            }
        } else if (o instanceof FinishMessage) {
            System.out.println("all finish");
            System.out.println(redis.hgetAll(factor_dict_final));
            System.out.println(new Date().getTime() - start_time);
        }
    }

    public static void main(String[] args) {
        ActorSystem actorSystem = ActorSystem.create();
        ActorRef master = actorSystem.actorOf(Props.create(Manager.class, 5, 5));
        // The Manager never replies to the initial message, so no sender is needed.
        master.tell("start", ActorRef.noSender());
        actorSystem.awaitTermination();
    }
}
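The message flow through the Manager can be traced with a small single-threaded simulation. This is only a sketch of the dispatch order, assuming every worker replies immediately; the tuple-based messages and the function name `simulate_manager` are invented for illustration, and the real actors run asynchronously and in parallel.

```python
from collections import deque


def simulate_manager(sample_num):
    """Replay the Manager's message handling with an in-memory mailbox (sample_num >= 1)."""
    pending_ids = set(range(sample_num))  # plays the role of idsSet
    finished_ids = set()                  # plays the role of finishSet
    mailbox = deque([("string", None)])   # the initial "start" string
    handled = []                          # message kinds, in handling order

    while mailbox:
        kind, sample_id = mailbox.popleft()
        handled.append(kind)
        if kind == "string":
            # BeginActor runs Generate.py, then echoes the StartMessage back.
            mailbox.append(("start", None))
        elif kind == "start":
            # GenerateSampleActor answers a ProduceMessage with a ConsumeMessage.
            mailbox.append(("consume", pending_ids.pop()))
        elif kind == "consume":
            # EstimateActor answers the forwarded ConsumeMessage with an EndMessage.
            mailbox.append(("end", sample_id))
            if pending_ids:
                # Meanwhile the next sample is requested from the generator.
                mailbox.append(("consume", pending_ids.pop()))
        elif kind == "end":
            finished_ids.add(sample_id)
            if len(finished_ids) == sample_num:
                # EndActor runs Count.py, then echoes the FinishMessage back.
                mailbox.append(("finish", None))
        elif kind == "finish":
            break
    return finished_ids, handled


finished, handled = simulate_manager(100)
print(len(finished), handled.count("consume"), handled[-1])
```

The trace shows that a new sample is requested only when the previous one has been reported ready, so production proceeds one sample at a time while estimation of earlier samples overlaps with it.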

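For a machine-learning model of the factor-estimation kind, running the code above shows a clear efficiency gain when the data set is large enough.

Running the Bagging model estimation serially in a plain Python script (blocking) took

311.243s

Rewriting the model-estimation part of that script with Python parallel (blocking) took

359.842s

The non-blocking, parallel Akka version above took

206.756s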

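That is a reduction of about one third in running time compared with the serial script. This is only a single-machine measurement, of course; because the workers share a Redis queue, a distributed deployment looks promising.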
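The "one third" figure can be checked directly from the three timings reported above; the variable names below are just labels for this calculation.

```python
serial_s = 311.243           # plain serial Python script
python_parallel_s = 359.842  # Python parallel rewrite
akka_s = 206.756             # Akka-scheduled version

reduction = (serial_s - akka_s) / serial_s  # fraction of running time saved
speedup = serial_s / akka_s                 # how many times faster than serial

print(f"time saved vs serial: {reduction:.1%}")
print(f"speedup vs serial: {speedup:.2f}x")
print(f"Python parallel vs serial: {python_parallel_s / serial_s:.2f}x the running time")
```

The Python parallel rewrite was slower than the serial script in this measurement, so the comparison is made against the serial baseline.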

0 0