DataMining(2)_Mining Frequent Patterns
Source: Internet | Editor: 程序博客网 | Date: 2024/06/05 14:22
1.Basic Concepts
Frequent pattern: a pattern (a set of items, subsequences, substructures,etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the contextof frequent itemsets and association rule mining
Freq. pattern: an intrinsic and important property of datasets
Itemset
A collection of one or more items
Support count
Frequency of occurrence of an itemset
Support
Fraction of transactions that contain the itemset
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
Association Rule
An implication expression of the form X -> Y, where X and Y are itemsets
Rule Evaluation Metrics
Support (s)
Confidence (c)
Closed pattern: an itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
Max-pattern: an itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
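A minimal Python sketch of these definitions on a toy transaction set (the transactions and item names are illustrative, not from the text):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support_count(itemset, transactions):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Support: fraction of transactions containing the itemset."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """Confidence of rule X -> Y: support(X ∪ Y) / support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

s = support({"milk", "diapers", "beer"}, transactions)      # 2/5 = 0.4
c = confidence({"milk", "diapers"}, {"beer"}, transactions) # 2/3
```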
2.Frequent Itemset Mining Methods
1.Brute-force approach(computationally prohibitive)
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Complexity ~ O(NMw) => expensive, since the number of candidate itemsets is M = 2^d (d = number of items)
2.Two-step approach
Frequent Itemset Generation: generate all itemsets whose support >= minsup
Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
3.Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
3.1 Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent
The Apriori principle holds due to the following anti-monotone property of the **support measure**: for any itemsets X and Y, X ⊆ Y implies s(X) ≥ s(Y).
The Apriori Algorithm (Pseudo-Code)
Candidate-gen function
How to Count Supports of Candidates?
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a transaction
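The level-wise loop, candidate-gen step, and support counting above can be sketched as follows (a simplified illustration: the hash-tree is replaced by direct subset tests for brevity, and all names and the toy data are assumptions):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise Apriori: from frequent k-itemsets L_k, generate and
    count candidates C_{k+1}, pruning by the Apriori principle."""
    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            fs = frozenset([item])
            counts[fs] = counts.get(fs, 0) + 1
    Lk = {fs for fs, c in counts.items() if c >= minsup_count}
    frequent = {fs: counts[fs] for fs in Lk}

    k = 1
    while Lk:
        # Candidate generation: join L_k with itself, then prune any
        # candidate with an infrequent k-subset (Apriori principle).
        candidates = set()
        for a in Lk:
            for b in Lk:
                u = a | b
                if len(u) == k + 1 and all(
                        frozenset(s) in Lk for s in combinations(u, k)):
                    candidates.add(u)
        # Support counting by direct subset test (hash-tree in the notes).
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {c for c, n in counts.items() if n >= minsup_count}
        frequent.update((c, counts[c]) for c in Lk)
        k += 1
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(transactions, minsup_count=3)
```

On this toy data the result contains four frequent 1-itemsets and four frequent 2-itemsets; {bread, milk, diapers} is generated as a candidate but fails the support count.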
Generating rules from frequent itemsets
For each frequent itemset X
For each proper nonempty subset A of X,
Let B = X - A
A -> B is an association rule if confidence(A -> B) ≥ minconf, where
support(A -> B) = support(A ∪ B) = support(X)
confidence(A -> B) = support(A ∪ B) / support(A)
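The rule-generation loop above, sketched under illustrative assumptions (`frequent` maps each frequent itemset to its support count, downward closed as Apriori produces it):

```python
from itertools import combinations

def generate_rules(frequent, minconf):
    """For each frequent itemset X and each proper nonempty subset A,
    emit A -> B with B = X - A when sup(X) / sup(A) >= minconf."""
    rules = []
    for X, sup_X in frequent.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):
            for A in map(frozenset, combinations(X, r)):
                conf = sup_X / frequent[A]  # support(X) / support(A)
                if conf >= minconf:
                    rules.append((A, X - A, conf))
    return rules

frequent = {
    frozenset({"diapers"}): 4,
    frozenset({"beer"}): 3,
    frozenset({"diapers", "beer"}): 3,
}
rules = generate_rules(frequent, minconf=0.8)
```

Here {diapers} -> {beer} has confidence 3/4 and is pruned, while {beer} -> {diapers} has confidence 3/3 = 1.0 and survives.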
Reduce the number of transactions (N)
Reduce the number of comparisons (NM)
Improving the Efficiency of Apriori
Further Improvement of the Apriori Method
Major computational challenges
1) Multiple scans of transaction database
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
2) Huge number of candidates
DHP: Reduce the Number of Candidates
A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
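A rough sketch of the DHP bucket idea (the hash function and data are toy assumptions; real DHP fills this table as a side effect of the first scan):

```python
from itertools import combinations

def bucket_of(pair, n_buckets):
    # Toy deterministic hash of a sorted item pair (illustrative only).
    return sum(ord(ch) for item in pair for ch in item) % n_buckets

def dhp_bucket_counts(transactions, n_buckets):
    """Hash every 2-itemset of every transaction into a bucket and count.
    A pair whose bucket count is below minsup_count cannot be frequent,
    since its bucket count upper-bounds its true support."""
    buckets = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair, n_buckets)] += 1
    return buckets

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
buckets = dhp_bucket_counts(transactions, n_buckets=7)
```

Because collisions only inflate a bucket, the bucket count is always ≥ the true support of any pair hashing to it, which is what makes the pruning safe.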
3) Tedious workload of support counting for candidates
Improving Apriori: general ideas
1) Reduce passes of transaction database scans
2) Shrink number of candidates
3) Facilitate support counting of candidates
Sampling for Frequent Patterns
Select a sample of original database, mine frequent patterns within sample using Apriori
3.2 FP-Growth: A Frequent Pattern-Growth Approach
Depth-first search
Avoid explicit candidate generation
Step 1: Construct the FP-tree from the transaction database
Step 2: Partition patterns and databases
Step 3: Find patterns having P from P's conditional database
Step 4: From conditional pattern bases to conditional FP-trees
Step 5: Recursion: mine each conditional FP-tree
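The five steps above can be sketched compactly (a simplified illustration: node links are plain Python lists, the conditional pattern base is materialized as replicated prefix paths, and all names and the toy data are assumptions):

```python
from collections import defaultdict

class FPNode:
    """FP-tree node: item, parent pointer, count, children by item."""
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, minsup_count):
    # Step 1: count items, keep frequent ones, insert each transaction
    # with its items ordered by descending frequency.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    freq = {i: c for i, c in counts.items() if c >= minsup_count}
    order = {i: (-c, i) for i, c in freq.items()}   # f-list order
    root, node_links = FPNode(None, None), defaultdict(list)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=order.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                node_links[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return node_links, freq

def fp_growth(transactions, minsup_count, suffix=frozenset()):
    # Steps 2-5: for each frequent item, emit suffix + item, collect its
    # conditional pattern base (prefix paths), and recurse on it.
    node_links, freq = build_fp_tree(transactions, minsup_count)
    result = {}
    for item, count in freq.items():
        pattern = suffix | {item}
        result[pattern] = count
        cond_base = []
        for node in node_links[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_base.extend([set(path)] * node.count)
        result.update(fp_growth(cond_base, minsup_count, pattern))
    return result

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = fp_growth(transactions, 3)
```

On this toy data the result matches what Apriori finds, but no candidate set is ever generated: each recursion works only on the conditional pattern base of one item.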
3.3 ECLAT: Mining by Exploring the Vertical Data Format
Determine support of any k-itemset by intersecting tidlists of two of its (k-1) subsets.
Vertical format: t(AB) = {T11, T25, …}
tid-list: list of trans.-ids containing an itemset
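A sketch of the vertical-format idea (an illustrative depth-first Eclat; names and the toy data are assumptions):

```python
def eclat(transactions, minsup_count):
    """Each itemset carries its tid-list; the support of a (k+1)-itemset
    is the size of the intersection of the tid-lists of two k-subsets."""
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    frequent = {}

    def recurse(prefix, candidates):
        # candidates: list of (item, tid-list of prefix + item).
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            extensions = []
            for other, other_tids in candidates[i + 1:]:
                inter = tids & other_tids          # tid-list intersection
                if len(inter) >= minsup_count:
                    extensions.append((other, inter))
            recurse(itemset, extensions)

    items = [(i, tids) for i, tids in sorted(tidlists.items())
             if len(tids) >= minsup_count]
    recurse(set(), items)
    return frequent

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = eclat(transactions, minsup_count=3)
```

Support counting here never rescans the transactions: it is entirely set intersections on tid-lists.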
4.Mining Closed Frequent Patterns and Max-Patterns
Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X's descendants in the set enumeration tree can be pruned
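The closed/max definitions can be applied directly as a post-filter over the mined frequent itemsets (a naive illustrative sketch; real miners such as CLOSET prune during the search instead):

```python
def closed_and_max(frequent):
    """Filter a {frozenset: support} dict into closed and max patterns.
    X is closed if no proper superset has the same support;
    X is maximal if no proper superset is frequent at all."""
    closed, maximal = {}, {}
    for X, sup in frequent.items():
        supersets = [Y for Y in frequent if X < Y]
        if all(frequent[Y] != sup for Y in supersets):
            closed[X] = sup
        if not supersets:
            maximal[X] = sup
    return closed, maximal

# Toy frequent itemsets with support counts (illustrative).
frequent = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 3,
    frozenset({"a", "b"}): 3,
}
closed, maximal = closed_and_max(frequent)
```

Here {b} is not closed because its superset {a, b} has the same support 3, while {a} is closed but not maximal because {a, b} is still frequent.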
5.Sequential pattern mining
Motivation: association rule mining does not consider the order of transactions. In many applications such orderings are significant.
6.Mining class association rules (CAR)
Motivation: normal association rule mining does not have any target, but in some applications the user is interested in specific targets.
Let T be a transaction data set consisting of n transactions.
Each transaction is also labeled with a class y.
Let I be the set of all items in T and Y the set of all class labels, with I ∩ Y = ∅.
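A minimal sketch of the CAR setting (illustrative names and data; restricted to single-item rule conditions for brevity, whereas real CAR miners such as CBA handle arbitrary itemset conditions):

```python
from collections import defaultdict

def mine_car(data, minsup_count, minconf):
    """data: list of (itemset, class_label) pairs with labels disjoint
    from items. Mine rules item -> label meeting minsup and minconf."""
    cond_counts = defaultdict(int)   # occurrences of each item
    rule_counts = defaultdict(int)   # occurrences of (item, label)
    for items, label in data:
        for item in items:
            cond_counts[item] += 1
            rule_counts[(item, label)] += 1
    rules = []
    for (item, label), c in rule_counts.items():
        conf = c / cond_counts[item]
        if c >= minsup_count and conf >= minconf:
            rules.append((item, label, conf))
    return rules

data = [
    ({"student", "school"}, "buys_computer"),
    ({"student"}, "buys_computer"),
    ({"student"}, "no_computer"),
    ({"teacher"}, "no_computer"),
]
rules = mine_car(data, minsup_count=2, minconf=0.6)
```

Only `student -> buys_computer` survives here: it occurs twice (meets minsup) with confidence 2/3.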
7.Which Patterns Are Interesting? Pattern Evaluation Methods
Measure of dependent/correlated events: lift
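The lift measure in code, on an illustrative toy transaction set:

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def lift(X, Y, transactions):
    """lift(X -> Y) = P(X ∪ Y) / (P(X) * P(Y)).
    = 1: X and Y independent; > 1: positively correlated;
    < 1: negatively correlated."""
    n = len(transactions)
    p_xy = sum(1 for t in transactions if X | Y <= t) / n
    p_x = sum(1 for t in transactions if X <= t) / n
    p_y = sum(1 for t in transactions if Y <= t) / n
    return p_xy / (p_x * p_y)

l = lift({"beer"}, {"diapers"}, transactions)  # 0.6 / (0.6 * 0.8) = 1.25
```

A lift above 1, as here, indicates that buying beer and buying diapers occur together more often than independence would predict.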