Data Mining (2): Mining Frequent Patterns


1. Basic Concepts

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Frequent patterns are an intrinsic and important property of datasets
Itemset
A collection of one or more items
Support count
Frequency of occurrence of an itemset
Support
Fraction of transactions that contain the itemset
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold
Association Rule
An implication expression of the form X -> Y, where X and Y are itemsets
Rule Evaluation Metrics
Support (s)
Confidence (c)
Closed pattern: an itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
Max-pattern: an itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X

2. Frequent Itemset Mining Methods

1. Brute-force approach (computationally prohibitive)
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Complexity ~ O(NMw) => expensive, since M = 2^d (N transactions, M candidate itemsets, w maximum transaction width, d distinct items)
2. Two-step approach
Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

3. Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
3.1 Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent
The Apriori principle holds due to the following anti-monotone property of the **support measure**:
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
i.e., the support of an itemset never exceeds the support of any of its subsets.

The Apriori Algorithm (Pseudo-Code)
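Below is a minimal runnable sketch of the level-wise algorithm in Python, assuming transactions are given as collections of hashable items; the names `apriori` and `candidate_gen` and the plain-dictionary counting are illustrative choices, not a reference implementation:

```python
from itertools import combinations

def candidate_gen(freq_prev, k):
    """F(k-1) x F(k-1) candidate generation: join pairs of frequent
    (k-1)-itemsets, then prune candidates with an infrequent (k-1)-subset."""
    candidates = set()
    prev = list(freq_prev)
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            union = prev[i] | prev[j]
            if len(union) == k:              # join step
                candidates.add(union)
    return {c for c in candidates            # prune step (Apriori principle)
            if all(frozenset(s) in freq_prev for s in combinations(c, k - 1))}

def apriori(transactions, minsup):
    """Level-wise search: frequent k-itemsets seed the (k+1)-candidates,
    and each level costs one scan of the database."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    counts = {}
    for t in transactions:                   # level 1: count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    freq = {X: c / n for X, c in counts.items() if c / n >= minsup}
    result, k = dict(freq), 2
    while freq:
        candidates = candidate_gen(set(freq), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:               # one scan per level
            for c in candidates:
                if c <= t:                   # subset test
                    counts[c] += 1
        freq = {X: c / n for X, c in counts.items() if c / n >= minsup}
        result.update(freq)
        k += 1
    return result                            # itemset -> support (fraction)
```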
Candidate-gen function
Join step: merge two frequent (k-1)-itemsets that differ in only one item; Prune step: drop any candidate that has a (k-1)-subset which is not frequent
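As a quick illustration, here is the join + prune behavior of the `candidate_gen` function from the sketch above, applied to a made-up set of frequent 3-itemsets:

```python
# Reuses candidate_gen from the Apriori sketch above; F3 is toy data.
F3 = {frozenset(s) for s in [("a","b","c"), ("a","b","d"), ("a","c","d"),
                             ("b","c","d"), ("a","c","e")]}
for c in candidate_gen(F3, 4):
    print(sorted(c))
# Only ['a','b','c','d'] survives: e.g. the joined candidate {a,c,d,e}
# is pruned because its subset {a,d,e} is not in F3.
```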
How to Count Supports of Candidates?
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a transaction
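A minimal sketch of the subset-function idea, with a flat Python dict standing in for the hash-tree: enumerate each transaction's k-subsets and probe them against a hashed candidate table (a real hash-tree additionally avoids probing candidates that cannot match):

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Count candidate k-itemsets by hashing each transaction's
    k-subsets into the candidate table."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        if len(t) < k:
            continue
        for sub in combinations(sorted(t), k):   # subset function
            key = frozenset(sub)
            if key in counts:                    # hash-table probe
                counts[key] += 1
    return counts
```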
Generating rules from frequent itemsets
For each frequent itemset X
For each proper nonempty subset A of X,
Let B = X - A
A -> B is an association rule if confidence(A -> B) ≥ minconf, where
support(A -> B) = support(A ∪ B) = support(X)
confidence(A -> B) = support(A ∪ B) / support(A)
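A minimal sketch of this rule-generation loop, assuming `freq_supports` maps every frequent itemset (as a frozenset) to its support fraction, as produced by the `apriori` sketch above:

```python
from itertools import combinations

def gen_rules(freq_supports, minconf):
    """For each frequent itemset X and each proper nonempty subset A,
    emit A -> (X - A) when its confidence clears minconf."""
    rules = []
    for X, sup_X in freq_supports.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):
            for A in map(frozenset, combinations(X, r)):
                # A is frequent by anti-monotonicity, so its support is known
                conf = sup_X / freq_supports[A]   # s(X) / s(A)
                if conf >= minconf:
                    rules.append((A, X - A, sup_X, conf))
    return rules
```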

Reduce the number of transactions (N)
Reduce the number of comparisons (NM)

Improving the Efficiency of Apriori

Further Improvement of the Apriori Method

Major computational challenges
1) Multiple scans of transaction database
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB; contrapositively, an itemset infrequent in every partition cannot be globally frequent, so a second scan over the union of the locally frequent itemsets suffices (see the sketch after this list)
2) Huge number of candidates
DHP (direct hashing and pruning): Reduce the Number of Candidates
A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
3) Tedious workload of support counting for candidates
Improving Apriori: general ideas
1) Reduce passes of transaction database scans
2) Shrink number of candidates
3) Facilitate support counting of candidates
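A minimal sketch of the Partition idea referenced above (scan 1 mines each in-memory partition at the same relative minsup; scan 2 verifies the union of the local winners against the whole database); `local_miner` is an assumed stand-in for any in-memory miner, e.g. the `apriori` sketch earlier:

```python
def partition_mine(transactions, minsup, n_parts, local_miner):
    """Two-scan partitioning: local frequent itemsets are the only
    possible global ones, so they form the candidate set for scan 2."""
    n = len(transactions)
    size = -(-n // n_parts)                      # ceiling division
    candidates = set()
    for i in range(0, n, size):                  # scan 1: mine each partition
        part = transactions[i:i + size]
        candidates |= set(local_miner(part, minsup))
    counts = {c: 0 for c in candidates}
    for t in map(frozenset, transactions):       # scan 2: global verification
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
```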

Sampling for Frequent Patterns
Select a sample of the original database, then mine frequent patterns within the sample using Apriori

3.2 FP-Growth: A Frequent Pattern-Growth Approach
Depth-first search
Avoid explicit candidate generation
Step 1: Construct FP-tree from a Transaction Database
Step 2: Partition Patterns and Databases
Step 3: Find Patterns Having P From P-conditional Database
Step 4: From Conditional Pattern Bases to Conditional FP-trees
Step 5: Recursion: Mining Each Conditional FP-tree
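A compact sketch of Steps 1-3, assuming each transaction is an iterable of distinct items and minsup is given as an absolute count; the class and function names are illustrative. Steps 4-5 then recurse: each conditional pattern base is itself turned into a conditional FP-tree and mined the same way.

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, minsup_count):
    """Step 1: two scans. Scan 1 finds frequent items; scan 2 inserts each
    transaction with items reordered by descending frequency, so shared
    prefixes share tree paths."""
    item_counts = defaultdict(int)
    for t in transactions:                       # scan 1
        for item in t:
            item_counts[item] += 1
    freq = {i: c for i, c in item_counts.items() if c >= minsup_count}
    root = FPNode(None, None)
    header = defaultdict(list)                   # item -> node-links
    for t in transactions:                       # scan 2
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header

def conditional_pattern_base(header, item):
    """Steps 2-3: for each occurrence of `item`, collect its prefix path
    weighted by the node count -- the item's conditional pattern base."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base
```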

3.3 ECLAT: Mining by Exploring the Vertical Data Format
Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets.
Vertical format: t(AB) = {T11, T25, …}
tid-list: the list of transaction ids containing an itemset
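A minimal recursive ECLAT sketch over the vertical format, where each item carries its tid-list and support counting is pure set intersection; the function name and the toy data are illustrative:

```python
def eclat(prefix, items_with_tids, minsup_count, out):
    """Depth-first ECLAT: the tid-list of (prefix + item) is the
    intersection of tid-lists, so support counting needs no DB scan."""
    while items_with_tids:
        item, tids = items_with_tids.pop()
        if len(tids) >= minsup_count:
            itemset = frozenset(prefix) | {item}
            out[itemset] = len(tids)
            suffix = []
            for other, other_tids in items_with_tids:
                shared = tids & other_tids       # tid-list intersection
                if len(shared) >= minsup_count:
                    suffix.append((other, shared))
            eclat(itemset, suffix, minsup_count, out)
    return out

# usage: vertical format maps each item to the set of transaction ids
vertical = {"A": {1, 2, 5}, "B": {1, 2, 3, 5}, "C": {2, 3, 5}}
print(eclat(set(), list(vertical.items()), 2, {}))
```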

4. Mining Closed Frequent Patterns and Max-Patterns

Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), then X and all of X's descendants in the set enumeration tree can be pruned
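The definitions can also be checked by post-processing all frequent itemsets and their supports; this sketch applies the definitions directly (dedicated miners such as CLOSET apply the pruning above during the search instead):

```python
def closed_and_max(freq_supports):
    """X is closed if no frequent superset has the same support;
    X is maximal if no frequent superset exists at all."""
    itemsets = list(freq_supports)
    closed, maximal = set(), set()
    for X in itemsets:
        supersets = [Y for Y in itemsets if X < Y]   # proper supersets
        if not supersets:
            maximal.add(X)
        if not any(freq_supports[Y] == freq_supports[X] for Y in supersets):
            closed.add(X)
    return closed, maximal
```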

5. Sequential pattern mining

Motivation: association rule mining does not consider the order of transactions, but in many applications such orderings are significant.

6. Mining class association rules (CAR)

Motivation: normal association rule mining does not have any target, but in some applications the user is interested in specific targets.
Let T be a transaction data set consisting of n transactions.
Each transaction is also labeled with a class y.
Let I be the set of all items in T and Y be the set of all class labels, with I ∩ Y = ∅.
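A small sketch of this setting, assuming each transaction is stored as an (itemset, class label) pair; `car_metrics` is an illustrative helper that scores one class association rule X -> y:

```python
def car_metrics(labeled_transactions, X, y):
    """support = fraction of transactions containing X with label y;
    confidence = that count divided by the count of X alone."""
    X = frozenset(X)
    n = len(labeled_transactions)
    n_X = sum(1 for items, _ in labeled_transactions if X <= frozenset(items))
    n_Xy = sum(1 for items, label in labeled_transactions
               if X <= frozenset(items) and label == y)
    return n_Xy / n, (n_Xy / n_X if n_X else 0.0)

data = [({"student", "school"}, "education"),
        ({"teach", "school"}, "education"),
        ({"game", "team"}, "sport")]
print(car_metrics(data, {"school"}, "education"))  # (0.666..., 1.0)
```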

7. Which Patterns Are Interesting? Pattern Evaluation Methods

Measure of dependent/correlated events: lift
lift(A, B) = P(A ∪ B) / (P(A) × P(B)) = confidence(A -> B) / support(B)
lift > 1: A and B are positively correlated; lift = 1: A and B are independent; lift < 1: A and B are negatively correlated
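A one-line computation of lift from raw contingency counts, with made-up numbers:

```python
def lift(n, n_A, n_B, n_AB):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), computed from counts."""
    return (n_AB / n) / ((n_A / n) * (n_B / n))

# e.g. 1000 transactions, 600 contain A, 750 contain B, 400 contain both:
print(lift(1000, 600, 750, 400))  # 0.888... < 1 -> negatively correlated
```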
