GraphLab Integration with Spark Open Source Release


Due to its ability to support a wide variety of data engineering tasks across a growing range of data sources, Apache Spark has become an integral part of the Hadoop ecosystem. In this post, we introduce the new spark-sframe package, which unites the data ingestion and processing capabilities of Apache Spark with the sophisticated machine learning tools of GraphLab Create, enabling simplified development of rich machine learning models on a wide variety of data sources.

Often the most challenging part of machine learning is getting the right data in the right form. Apache Spark provides rich Java, Scala, SQL, and Python APIs for bulk data processing and leverages fault-tolerant distributed processing to accelerate IO- and CPU-intensive operations. However, once the data has been cleaned and transformed, the process of training models is often most efficiently achieved using specialized ML tools that exploit the structure of ML algorithms.

Over the past several years we have been developing SFrame, a column-based data frame that is specifically optimized for ML algorithms. A few weeks ago, we announced the open source release of SFrame, and today we are excited to announce the open source release of the spark-sframe package. The spark-sframe package unifies the bulk data processing capabilities of Apache Spark with the optimized, open-source SFrame data structure by providing a simple and efficient API to move between SFrame and RDD representations of data.

The following code snippet shows how to use the new APIs defined in the spark-sframe package to easily convert an RDD to an SFrame and back in just a few lines of Python code.

from pyspark import SparkContext
from pyspark.sql import SQLContext
import graphlab

sc = SparkContext()
sql = SQLContext(sc)

# Build a small Spark DataFrame from an RDD of tuples
rdd = sc.parallelize([(x, str(x), "Hi") for x in range(0, 5)])
df = sql.createDataFrame(rdd)

# Convert the Spark DataFrame to an SFrame and back
sframe = graphlab.SFrame.from_rdd(df, sc)
df_back = sframe.to_spark_dataframe(sc, sql)

The spark-sframe package exposes both Python and Scala bindings, enabling users to pick their preferred language. The spark-sframe repository can be found on GitHub at https://github.com/dato-code/spark-sframe and is released under the BSD License, which means you have full freedom to use, extend, and share the code.

In the Using GraphLab Create with Apache Spark notebook, we demonstrate how GraphLab Create integrates with Spark using the Python bindings. Let's illustrate the power of the spark-sframe package by re-implementing the notebook using the Scala bindings, where the data engineering tasks are coded in Scala using the spark-shell.

Train a Topic Model on the Wikipedia Dataset

Let's imagine a scenario where my colleague Joey and I start a project. The goal is to learn a topic model on the Wikipedia dataset. The raw Wikipedia corpus is a very large dataset in a format that is difficult to use directly. One of the great applications of Apache Spark is as a funnel that ingests large amounts of text and semi-structured data and emits often substantially smaller cleaned datasets that we can then use for more advanced analytics. In this example we will use Spark to process the raw Wikipedia data and emit a substantially smaller training dataset as an SFrame that we can then process easily on a single machine to train models and apply sophisticated analytics.

Data Engineering using Scala Spark Shell

Joey is an expert Scala programmer. As an Apache Spark committer, he prefers writing code in Scala. This is one way he can prepare the Wikipedia dataset for further analysis:

Joey downloads or builds spark_unity.jar from the spark-sframe package and then starts the spark-shell as follows:

joey> $SPARK_HOME/bin/spark-shell --jars <path-to-spark_unity.jar>

This brings up a Scala spark-shell with the spark_unity.jar package loaded. Next he ingests the raw data into an RDD:

scala> import sys.process._
scala> import java.net.URL
scala> import java.io.File
scala> var url = "http://s3.amazonaws.com/dato-datasets/wikipedia/raw/w16"
scala> new URL(url) #> new File("wiki_file") !!
scala> // here we save/load to/from a local file; HDFS reads & writes are okay too
scala> var rawRdd = sc.textFile("<path to wiki_file>").zipWithIndex()

The above example uses just a single sample Wikipedia file; the full dataset includes 37 files {w0,...,w36}. After loading the data into an RDD, Joey runs a series of map, flatMap, reduceByKey, and groupByKey operations to generate a DataFrame encoding a bag-of-words representation:

scala> var splitRdd = rawRdd.map(t => (t._2, t._1.replaceAll("[ ]+", " ").trim.split(" ")))
scala> var zipRdd = splitRdd.flatMap(t => List.fill(t._2.length)(t._1) zip t._2)
scala> val add = (x: Int, y: Int) => x + y
scala> var wordRdd = zipRdd.map(comp_word => (comp_word, 1)).reduceByKey(add)
scala> var bagRdd = wordRdd.map(t => (t._1._1, (t._1._2, t._2))).groupByKey().map(t => (t._1, t._2.toMap))
scala> var df = bagRdd.toDF
scala> df.show()

Finally, he uses the spark_unity.jar API to convert the bag-of-words DataFrame to its corresponding SFrame. Note that the saved bag-of-words SFrame is much smaller than the original Wikipedia dataset.

scala> import org.graphlab.create.GraphLabUtil
scala> val outputDir = "/tmp/graphlab_testing"
scala> val prefix = "wiki"
scala> val sframeFileName = GraphLabUtil.toSFrame(df, outputDir, prefix)
scala> sframeFileName
res65: String = /tmp/graphlab_testing/wiki.frame_idx

Topic Model Training using GraphLab Create

I prefer programming in Python, and GraphLab Create is a great tool to quickly train a model in a few lines of code. First, using GraphLab Create, I load the saved SFrame from the previous phase, remove the stop words, and extract some basic statistics to gain insight into the dataset.

import graphlab as gl

data = gl.SFrame("/tmp/graphlab_testing/wiki.frame_idx")
data = data.rename({'_1': 'index', '_2': 'bag_of_words'})
data['bag_of_words'] = data['bag_of_words'].dict_trim_by_keys(gl.text_analytics.stopwords(), exclude=True)

gl.canvas.set_target('ipynb')
data.show()
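
If you are not running inside an IPython notebook, a couple of quick aggregates give a similar feel for the data. The snippet below is only a sketch; it assumes the column names created above and standard SFrame/SArray operations (len, apply, mean), and the exact numbers will depend on which Wikipedia files were processed.

# Quick corpus statistics without Canvas (sketch)
num_docs = len(data)  # number of documents in the SFrame
words_per_doc = data['bag_of_words'].apply(lambda bag: sum(bag.values()))
print("documents: %d" % num_docs)
print("mean words per document: %.1f" % words_per_doc.mean())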

Now I am ready to train a topic model on the Wikipedia dataset. In a single line of code, I train a model to learn thirty topics over fifty iterations.

model = gl.topic_model.create(data['bag_of_words'], num_topics=30, num_iterations=50)
model.show()

After training the model, I can extract which topics are assigned to each document as well as the topic frequencies:

pred = model.predict(data['bag_of_words'])
results = gl.SFrame({'doc_id': data['index'], 'topic_id': pred, 'bag_of_words': data['bag_of_words']})
results.print_rows(max_column_width=60)

results['topic_id'].show('Categorical')
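
Canvas makes the topic distribution easy to eyeball; if you prefer a plain table, a groupby over the predicted topic ids gives the same counts. This is just a sketch, assuming the standard gl.aggregate.COUNT aggregator and the results SFrame built above.

# Count how many documents fall into each topic (sketch)
topic_counts = results.groupby('topic_id', {'num_docs': gl.aggregate.COUNT()})
topic_counts = topic_counts.sort('num_docs', ascending=False)
topic_counts.print_rows(10)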

topics = model.get_topics()
result_sf = topics.groupby(['topic'], {'topic_words': gl.aggregate.CONCAT("word")})
result_sf.print_rows(max_column_width=80)

Topic 19 is the most frequent topic. What is topic 19 about? Let's find out which words are associated with each topic.
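
To zoom in on a single topic, get_topics can also be restricted to particular topic ids. The lines below are a sketch; I'm assuming the topic_ids and num_words arguments of the topic model's get_topics method, and topic 19 refers to the run above (your own run may number the topics differently).

# Inspect the top words for topic 19 only (sketch)
topic_19_words = model.get_topics(topic_ids=[19], num_words=5)
topic_19_words.print_rows()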

As you can see, topic 19 is associated with the words [music, album, band, song, released], which makes sense.

The purpose of this story was to show how the spark-sframe package works in a real scenario. Real-world applications often have multiple data analysis pipelines spanning a range of different tools and frameworks. It is not uncommon for data engineers and data scientists to collaborate across platforms like Spark and Kafka and optimized machine learning libraries in Python like GraphLab Create. By releasing spark-sframe as an open source package, we hope to build a bridge across these technologies, enabling the respective platforms to work together seamlessly in building truly intelligent applications and services.
