Big Data Course Notes 4: Synopsis Structures, Streaming Algorithms
This is the note of Lecture 4 in the Big Data Algorithms class. The author's own understanding is set in italics; its correctness remains to be verified.
1. Synopsis Structure
Most contents of the synopsis-structure section are collected from the lecture notes at http://www.cohenwang.com/edith/bigdataclass2013 .
1.1. Definition
A small summary of a large data set that (approximately) captures some statistics/properties we are interested in.
Examples: random samples, sketches/projections, histograms, …
1.2. Functionality
Some operations, such as insertion, deletion, query and merging databases, may be required.
1.3. Motivation/Why?
Data can be too large to
- Keep for long or even short term
- Transmit across the network
- Process queries over in reasonable time/computation
1.4. Limitation
The size of working memory can be very limited compared with the size of the data.
Only one or two passes over the data are affordable.
1.5. Applications
- Network traffic management
- I/O efficiency
- Real-time data
2. Frequent Elements: Misra Gries Algorithm
2.1. Purpose
In brief, the system reads the data in one pass and outputs (approximate) frequencies of the top-k most frequent elements.
2.2. Motivation/Application
Zipf's law: typical frequency distributions are highly skewed, with a few very frequent elements; say, the top 10% of elements account for 90% of all occurrences. We are interested in finding these heaviest elements.
According to Zipf's law, the most frequent elements go a long way toward representing the data.
Some applications:
- Networking: Find “elephant” flows
- Search: Find the most frequent queries
2.3. A simple algorithm
Simply create a counter for each distinct element and increment it on each occurrence.
However, this algorithm requires one counter per distinct element, i.e., space proportional to the number of distinct elements, which can be as large as the stream itself.
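For contrast, the exact one-pass solution can be sketched in a couple of lines (my own illustration, using the example stream from the next section); it is accurate but pays Θ(#distinct elements) space:

```python
from collections import Counter

# One counter per distinct element: exact, but the space grows with
# the number of distinct elements in the stream.
counts = Counter("abccbcbae")
top2 = counts.most_common(2)  # the two heaviest elements, e.g. b and c with count 3
```
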
2.4. Misra Gries Algorithm
2.4.1. Insert, Query
1. Place a counter on each of the first k distinct elements; increment an element's counter on each later occurrence.
2. When an arriving element has no counter and all k counters are occupied, reduce every counter by 1 and remove counters that reach zero.
3. On a query, report the element's counter value (0 if it has no counter).
2.4.2. One simple example for Insertion and Query
The input stream is "abccbcbae" (chosen arbitrarily), and the maximum number of counters is k = 2.
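The insertion steps can be sketched as follows (a minimal Python sketch, my own illustration rather than the lecture's code). On the stream "abccbcbae" with k = 2, all counters happen to cancel out by the end, so every frequency is reported as 0 — a legal underestimate within the error bound:

```python
def misra_gries(stream, k):
    """One-pass Misra-Gries summary with at most k counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1            # existing counter: just increment
        elif len(counters) < k:
            counters[x] = 1             # room for a new counter
        else:
            # no room: decrement every counter, drop those reaching zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Querying an element not in the returned dictionary simply reports 0.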
2.4.3. Analyze the output
As illustrated in the previous example, the output is an underestimate. However, the maximum error is bounded: each counter undercounts its element by at most m/(k+1), where m is the stream length — i.e., the error is proportional to the inverse of k + 1. When the Zipf/power-law property of the data holds, this error is acceptable for the heaviest elements.
2.4.4. Merge
The merge operation is similar to insertion, as follows:
1. Add together the counters of common elements; keep the counters of elements that appear in only one summary.
2. If more than k counters remain, subtract the (k+1)-st largest counter value from every counter, then remove counters of value zero or below, keeping at most k.
The estimation error is also at most m/(k+1), where m is now the total length of the two merged streams.
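The two merge steps above can be sketched directly on the dictionary representation used earlier (my own illustration; the subtract-the-(k+1)-st-largest step plays the role of the repeated decrements):

```python
def mg_merge(c1, c2, k):
    """Merge two Misra-Gries summaries, keeping at most k counters."""
    # Step 1: add counters of common elements, keep distinct ones.
    merged = dict(c1)
    for key, v in c2.items():
        merged[key] = merged.get(key, 0) + v
    # Step 2: if more than k counters survive, subtract the (k+1)-st
    # largest value from all and drop counters that fall to zero or below.
    if len(merged) > k:
        cut = sorted(merged.values(), reverse=True)[k]
        merged = {key: v - cut for key, v in merged.items() if v - cut > 0}
    return merged
```

For example, merging {a: 3, b: 2} with {a: 1, c: 2} at k = 2 first gives {a: 4, b: 2, c: 2}; the 3rd largest value 2 is subtracted, leaving only {a: 2}.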
3. Stream Counting
3.1. Purpose
The system reads the data in one pass and outputs the total number of elements. It is simply a counter.
3.2. A simple algorithm
One simple way is to keep a single number as the counter, increasing it by one for each element read.
The required space is O(log n) bits for a count up to n.
3.3. Morris’s idea
Instead of storing the count n itself, store only a value x ≈ log n.
It also uses the powerful tool of randomization, guaranteeing that the expectation of the output equals the correct count while saving space (about log log n bits).
The algorithm is as follows:
- Keep a counter x, whose value is roughly log n.
- On each arriving element, increase x by 1 with probability p = 2^(−x).
- On a query, return n̂ = 2^x − 1.
3.4. One example
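A minimal simulation of Morris's counter (my own illustration; the number of trials and the tolerance are arbitrary choices). Averaging many independent runs shows the estimate is unbiased around the true count:

```python
import random

def morris(n_events, rng):
    """Approximate counting: return the estimate n-hat = 2^x - 1."""
    x = 0
    for _ in range(n_events):
        # increment the exponent with probability 2^(-x)
        if rng.random() < 2.0 ** (-x):
            x += 1
    return 2 ** x - 1

rng = random.Random(42)
trials = [morris(1000, rng) for _ in range(2000)]
avg = sum(trials) / len(trials)
# E[n-hat] = n, so the average over many runs should be close to 1000,
# even though any single run only returns a power of two minus one.
```
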
3.5. Analyze: the expected return value
The claim: the expected return value after reading n elements is E[n̂] = E[2^x] − 1 = n.
Proof: using the inductive principle, the proof is simple.
- Base case: n = 0, x = 0, n̂ = 2^0 − 1 = 0 = n. The base case holds.
- Inductive hypothesis: assume the claim is true after reading n = k elements, i.e., E[n̂] = k, which means E[2^{x_k}] = k + 1.
- Inductive step: when n = k + 1, conditioned on x_k = x the counter becomes x + 1 with probability 2^(−x) and stays x otherwise, so
E[2^{x_{k+1}} | x_k = x] = 2^(−x) · 2^(x+1) + (1 − 2^(−x)) · 2^x = 2 + 2^x − 1 = 2^x + 1.
Taking expectation over x_k gives E[2^{x_{k+1}}] = E[2^{x_k}] + 1 = (k + 1) + 1 = k + 2, hence E[n̂] = k + 1.
According to the principle of induction, the claim is true for all n ≥ 0.
4. Count Distinct Items
4.1. Some definitions
Input sequence A = (a_1, a_2, …, a_m), where each a_i ∈ {1, 2, …, n}, and m_i denotes the number of occurrences of element i.
Define the frequency moments F_p = Σ_{i=1}^{n} m_i^p.
Then:
- F_0 is the number of distinct elements in the list, using the convention 0^0 = 0, so that m_i^0 = 1 iff m_i ≠ 0.
- F_1 is the length of the sequence, m.
- F_2 is called Gini's index of homogeneity.
- F_∞* = max_{1≤i≤n} m_i, the largest single frequency.
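These moments are easy to compute exactly when memory is not constrained; a small sketch makes the definitions concrete (my own illustration, reusing the earlier example stream "abccbcbae", where a:2, b:3, c:3, e:1):

```python
from collections import Counter

def frequency_moments(stream):
    """Return (F0, F1, F2, F_inf) of a stream, computed exactly."""
    m = Counter(stream)
    F0 = len(m)                          # number of distinct elements
    F1 = sum(m.values())                 # length of the sequence
    F2 = sum(v * v for v in m.values())  # Gini's index of homogeneity
    Finf = max(m.values())               # largest single frequency
    return F0, F1, F2, Finf
```

On "abccbcbae" this gives F0 = 4, F1 = 9, F2 = 2² + 3² + 3² + 1² = 23, F_∞* = 3. The streaming challenge is to approximate these without storing a counter per element.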
4.2. Purpose
The system reads the data in one pass and outputs (an approximation of) F_0, the number of distinct elements.
4.3. General Theorem
4.4. Improved Performance
The memory size is O(log n) bits.
The property of the output Y: for every constant c > 2, F_0/c ≤ Y ≤ c · F_0 holds with probability at least 1 − 2/c.
4.5. Algorithm
The basic idea is to first use a finite field to hash the elements, which reduces their size and makes them (pairwise independently) random, and then to estimate F_0 from the trailing-zero pattern of F_0 random values.
4.5.1. Basic Knowledge: Finite Field and Probability
Here, we only introduce the basic ideas and calculations of finite fields and probability. More details are available on the Internet.
4.5.1.1. Finite Field
Two websites that explain the idea of finite fields in detail: http://mathworld.wolfram.com/FiniteField.html, http://blog.csdn.net/luotuo44/article/details/41645597
4.5.1.2. Probability Inequalities
Markov's inequality: P[X ≥ d] ≤ E[X]/d for X ≥ 0, since
E[X] = ∫ x f(x) dx ≥ ∫_{x≥d} x f(x) dx ≥ d ∫_{x≥d} f(x) dx = d · P[X ≥ d], where f(x) is the PDF (probability density function) of X.
Chebyshev's inequality: P[|X − μ| ≥ kδ] ≤ 1/k², since
P[|X − μ| ≥ kδ] = P[(X − μ)² ≥ k²δ²] ≤ E[(X − μ)²]/(k²δ²) = δ²/(k²δ²) = 1/k², where μ = E[X], δ² = Var[X] (δ is the standard deviation), and k is a fixed constant.
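Chebyshev's bound is universal but often loose; a quick Monte Carlo check makes that concrete (my own illustration, using a sum of 12 uniforms as an approximately normal variable with mean 6 and variance 1):

```python
import random
import statistics

rng = random.Random(7)
# Irwin-Hall: sum of 12 U(0,1) variables has mean 6 and variance 1.
xs = [sum(rng.random() for _ in range(12)) for _ in range(100_000)]
mu = statistics.mean(xs)
sigma = statistics.pstdev(xs)

k = 2
# Empirical fraction of samples at least k standard deviations from the mean.
frac = sum(abs(x - mu) >= k * sigma for x in xs) / len(xs)
# Chebyshev guarantees frac <= 1/k^2 = 0.25; for this near-normal
# distribution the empirical value is much smaller (around 0.05).
```
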
4.5.2. The algorithm
1. Construct the finite field F = GF(2^d), where n < 2^d (so d ≈ log n).
2. Hash the input elements: choose a, b uniformly at random from F. For each a_i ∈ A (in the order of the input sequence), hash a_i to z_i = a × a_i + b (arithmetic in F), represented as a d-bit vector.
3. Estimate F_0 using properties of the F_0 random values:
- Define r_i = r(z_i) = max{j : 2^j divides z_i}, so r_i is the position of the rightmost 1 bit (the number of trailing zeros). For example, r(1010000) = 4 and r(1010010) = 1. Each r_i is less than d, which is roughly log n.
- Define R to be the largest r_i over all elements of A: R = max(r_1, r_2, ⋯, r_m).
- Output Y = 2^R.
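The whole pipeline can be sketched as follows. One hedge: instead of true GF(2^d) multiplication I use the simpler map z = (a·x + b) mod 2^d with a odd, which is only a stand-in for the pairwise-independent hash the analysis assumes; taking the median over independent repetitions tightens the single-run guarantee:

```python
import random
import statistics

def r(z, d):
    """r(z): number of trailing zeros, i.e. position of the rightmost 1 bit."""
    if z == 0:
        return d  # convention: the all-zero word gets the maximum rank
    t = 0
    while z & 1 == 0:
        z >>= 1
        t += 1
    return t

def fm_estimate(stream, d, seed=None):
    """One-pass distinct-count estimate Y = 2^R over d-bit hashed values."""
    rng = random.Random(seed)
    # Stand-in hash (NOT true GF(2^d) arithmetic): a odd makes the map
    # a bijection on d-bit words, which is enough for illustration.
    a = rng.randrange(1, 1 << d) | 1
    b = rng.randrange(0, 1 << d)
    R = 0
    for x in stream:
        z = (a * x + b) % (1 << d)
        R = max(R, r(z, d))
    return 2 ** R

# Median over independent repetitions: 1000 distinct elements, d = 16.
ests = [fm_estimate(range(1000), d=16, seed=s) for s in range(15)]
med = statistics.median(ests)
```

Each single run returns a power of two, so it is accurate only up to a constant factor; the median of the 15 runs lands well within a factor of 8 of the true F_0 = 1000.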
4.5.3. Some analysis of the algorithm
The use of the finite field is to hash the input elements to uniformly random, pairwise-independent d-bit values: duplicates hash to the same value, so exactly F_0 distinct random values are produced regardless of repetitions.
Now let's focus on the property of the output Y = 2^R.
As discussed above, each z_i is a uniformly random d-bit value, so P[r(z_i) ≥ t] = 2^(−t), and the expected number of distinct elements with r(z_i) ≥ t is F_0 · 2^(−t).
If the output overestimates, Y ≥ c · F_0: set t = log(c · F_0). By Markov's inequality (a union bound over the F_0 distinct values), P[R ≥ t] ≤ F_0 · 2^(−t) = 1/c.
If the output underestimates, Y ≤ F_0/c: set t = log(F_0/c) and let Z be the number of distinct elements with r(z_i) ≥ t, so E[Z] = F_0 · 2^(−t) = c and, by pairwise independence, Var[Z] ≤ E[Z] = c. Underestimation means Z = 0, and by Chebyshev's inequality P[Z = 0] ≤ P[|Z − E[Z]| ≥ c] ≤ Var[Z]/c² ≤ 1/c.
Together, the output is within a factor of c of F_0 with probability at least 1 − 2/c.