Mapreduce & Hadoop Algorithms in Academic Papers (3rd update)

Atbrox is a startup company providing technology and services for Search and Mapreduce/Hadoop. Our background is from Google, IBM and research. Contact us if you need help with algorithms for mapreduce.

This posting is the May 2010 update to the similar posting from February 2010. It adds 30 new papers compared to the prior posting; the new ones are marked with *.

Motivation
Learn from the academic literature how the mapreduce parallel model and its hadoop implementation are used to solve algorithmic problems.
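For readers new to the model, here is a minimal single-process sketch of the map, shuffle and reduce phases, using word counting as the example. It is only an illustration: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for this post and are not part of the Hadoop API, and a real Hadoop job would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and emit (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to the list of values for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run map, shuffle and reduce in sequence, in memory (hypothetical helper)."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Sum the counts collected for one word."""
    return sum(values)


if __name__ == "__main__":
    docs = ["hadoop runs mapreduce", "mapreduce scales"]
    print(run_job(docs, word_count_mapper, sum_reducer))
```

Many of the papers listed below follow this same shape, swapping in a different mapper and reducer and often chaining several such jobs.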

Which areas do the papers cover?

    Ads Analysis
    *Improving ad relevance in sponsored search
    *Predicting the Click-Through Rate for Rare/New Ads
    *Learning Influence Probabilities in Social Networks
    *Mining advertiser-specific user behavior using adfactors
    *Extracting user profiles from large scale data
    Large-Scale Behavioral Targeting (2009)
    Search Advertising using Web Relevance Feedback (2008)
    Predicting Ads’ ClickThrough Rate with Decision Rules (2008)

    Bioinformatics/Medical Informatics
    *A novel approach to multiple sequence alignment using hadoop data grids
    MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network (2009)
    MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees

    Machine Translation
    *Training Phrase-Based Machine Translation Models on the Cloud: Open Source Machine Translation Toolkit Chaski
    Grammar based statistical MT on Hadoop (2009)
    Large Language Models in Machine Translation (2008)

    Spatial Data Processing
    Experiences on Processing Spatial Data with MapReduce

    Information Extraction and Text Processing
    *Statistical Sentence Chunking Using Map Reduce
    Data-intensive text processing with MapReduce
    Web-Scale Distributional Similarity and Entity Set Expansion (2009)
    The infinite HMM for unsupervised PoS tagging (2009)

    Artificial Intelligence/Machine Learning/Data Mining
    *LogMaster: Mining Event Correlations in Logs of Large Scale Cluster Systems
    *Stateful Bulk Processing for Incremental Analytics
    *Mining dependency in distributed systems through unstructured logs analysis
    *Beyond online aggregation: parallel and incremental data mining with online mapreduce
    *Learning based opportunistic admission control algorithm for mapreduce as a service
    *OWL reasoning with WebPIE: calculating the closure of 100 billion triples
    *Scaling ECGA model building via data-intensive computing
    *SPARQL basic graph pattern processing with iterative mapreduce
    Residual Splash for Optimally Parallelizing Belief Propagation
    Stochastic gradient boosted distributed decision trees
    Distributed Algorithms for Topic Models
    When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing
    Cloud Computing Boosts Business Intelligence of Telecommunication Industry
    Parallel K-Means Clustering Based on MapReduce
    Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce
    Parallel algorithms for mining large-scale rich-media data
    Scaling Simple and Compact Genetic Algorithms using MapReduce
    Scalable Distributed Reasoning using Mapreduce
    Scaling Up Classifiers to Cloud Computers (2008)

      For an example of Parallel Machine Learning with Hadoop/Mapreduce, check out our previous blog post.

    Search Query Analysis
    *Parallelizing Random Walk with Restart for large-scale query recommendation
    BBM: Bayesian Browsing Model from Petabyte-scale Data (2009)
    AIDE: Ad-hoc Intents Detection Engine over Query Logs (2009)

    Information Retrieval (Search)
    *Automatically Incorporating New Sources in Keyword Search-Based Data Integration
    *Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
    *Learning URL patterns for webpage de-duplication
    *Information Seeking with Social Signals: Anatomy of a Social Tag-based EXploratory Search Browser
    *MIREX: Mapreduce Information Retrieval Experiments
    Efficient Clustering of Web Derived Data Sets
    The PageRank algorithm and application on searching of academic papers
    A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures
    On Single-Pass Indexing with MapReduce (2009)
    A Data Parallel Algorithm for XML DOM Parsing (2009)
    Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web (2008)

    Spam & Malware Detection
    Characterizing Botnets from Email Spam Records (2008)
    - Clustering of emails into spam campaigns
    - Finding the probability that 2 spam messages are sent from the same machine
    - Estimating the likelihood of botnets based on common senders in spam campaigns
    The Ghost In The Browser: Analysis of Web-based Malware (2007)

    Image and Video Processing
    *Font rendering on a GPU-based raster image processor
    MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
    - Video Stream Re-Rendering
    Map-Reduce Meets Wider Varieties of Applications (2008)
    - Location detection in images

    Networking
    Reducible Complexity in DNS

    Simulation
    Map-Reduce Meets Wider Varieties of Applications (2008)
    - Simulation of earthquakes (geology)

    Statistics
    *User-based collaborative filtering recommendation algorithms on hadoop
    Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce (2009)
    Fast Parallel Outlier Detection for Categorical Datasets using Mapreduce (2009)
    MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
    - Digg.com story recommendations
    Calculating the Jaccard Similarity Coefficient with Map Reduce for Entity Pairs in Wikipedia (2008)
    - Measuring Wikipedia Editor similarity
    Map-Reduce Meets Wider Varieties of Applications (2008)
    - Netflix video recommendation
    Large-scale Parallel Collaborative Filtering for the Netflix Prize (2008)

    Numerical Mathematics
    *Distributed non-negative matrix factorization for dyadic data analysis on mapreduce
    *A mapreduce algorithm for SC
    *Multi-GPU Volume Rendering using MapReduce
    Mapreduce for Integer Factorization

    Sets & Graphs
    *Towards scalable RDF graph analytics on MapReduce 
    *Efficient Parallel Set-Similarity Joins using Mapreduce
    *Max-cover algorithm in map-reduce
    Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework
    Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
    Graph Twiddling in a MapReduce World
    DOULION: Counting Triangles in Massive Graphs with a Coin (2009)
    Fast counting of triangles in real-world networks: proofs, algorithms and observations (2008)

Who wrote the above papers?
Companies: China Mobile, eBay, Google, Hewlett Packard and Intel, Microsoft, Wikipedia, Yahoo and Yandex.
Government Institutions and Universities: US National Security Agency (NSA), Carnegie Mellon University, TU Dresden, University of Pennsylvania, University of Central Florida, National University of Ireland, University of Missouri, University of Arizona, University of Glasgow, Berkeley University and National Tsing Hua University, University of California, Poznan University, Florida International University, Zhejiang University, Texas A&M University, University of California at Irvine, University of Illinois, Chinese Academy of Sciences, Vrije Universiteit, Engenharia University, State University of New York, Palacky University, University of Texas at Dallas.
