Data Mining:Association Analysis

Association Analysis

An important problem in data mining is the discovery of association patterns [1] present in large databases. This problem was originally formulated in the context of market basket data, where the goal is to determine whether the occurrence of certain items in a transaction

can be used to infer the occurrence of other items. If such interesting relationships are found, then they can be put to various proﬁtable uses such as marketing promotions, shelf management, inventory management, etc.

To formalize the problem, let T = {t1, t2, ··· , tN } be the set of all transactions and I = {i1, i2, ··· , id} be the set of all items. Any subset of I is known as an itemset. The support count of an itemset C is deﬁned as the number of transactions in T that contain C, i.e.,

For example, consider the market basket transactions shown in Figure 61.6. The support for the rule {Diaper, Milk} −→ {Beer} is σ(Diaper, M ilk, Beer) / 5 = 2/5 = 40%, whereas its conﬁdence is σ(Diaper, M ilk, Beer)/σ(Diaper, M ilk) = 2/3 = 66%.

Support is useful because it reﬂects the signiﬁcance of a rule. Rules that have very low support are rarely observed, and thus, are more likely to occur by chance. Conﬁdence is useful because it reﬂects the reliability of the inference made by each rule. Given an association rule X −→ Y , the higher the conﬁdence, the more likely it is to ﬁnd Y in trans-actions that contain X. Thus, the goal of association analysis is to automatically discover association rules having relatively high support and high conﬁdence. More speciﬁcally, an association rule is considered to be interesting only if its support is greater than or equal to a minimum support threshold, minsup, and its conﬁdence is greater than or equal to a minimum conﬁdence threshold, minconf .

The association analysis problem is far from trivial because of the exponential number of ways in which items can be grouped together to form a rule. In addition, the rules are constrained by two completely diﬀerent conditions, stated in terms of the minsup and minconf thresholds. A standard way for generating association rules is to divide the process into two steps. The ﬁrst step is to ﬁnd all itemsets that satisfy the minimum support threshold. Such itemsets are known as frequent itemsets in the data mining literature. The second step is to generate high-conﬁdence rules only from those itemsets found to be frequent. The completeness of this two-step approach is guaranteed by the fact that any association rule X −→ Y that satisﬁes the minsup requirement can always be generated from a frequent itemset X ∪ Y .

Frequent itemset generation is the computationally most expensive step because there are 2d possible ways to enumerate all itemsets from I. Much research has therefore been devoted

to developing eﬃcient algorithms for this task. A key feature of these algorithms lies in their strategy for controlling the exponential complexity of enumerating candidate itemsets. Brieﬂy, the algorithms make use of the anti-monotone property of itemset support, which states that all subsets of a frequent itemset must be frequent. Put another way, if a candidate itemset is found to be infrequent, we can immediately prune the search space spanned by supersets of this itemset. The Apriori algorithm, developed by Agrawal et al. [2], pioneered the use of this property to systematically enumerate the candidate itemsets. During each iteration k, it generates only those candidate itemsets of length k whose (k − 1)-subsets are found to be frequent in the previous iteration. The support counts of these candidates are then determined by scanning the transaction database. After counting their supports, candidate k-itemsets that pass the minsup threshold are declared to be frequent.

Well-designed data structures are central to the eﬃcient mining of association rules. The Apriori algorithm, for example, employs a hash-tree structure to facilitate the support counting of candidate itemsets. On the other hand, algorithms such as FP-growth [31] and H-Miner [47] employ eﬃcient data structures to provide a compact representation of the transaction database. A brief description of the hash tree and FP-tree data structures is presented next.

Hash Tree Structure

Apriori is a level-wise algorithm that generates frequent itemsets one level at a time, from itemsets of size-1 up to the longest frequent itemsets. At each level, candidate itemsets are generated by extending the frequent itemsets found at the previous level. Once the candidate itemsets have been enumerated, the transaction database is scanned once to determine their actual support counts. This generate-and-count procedure is repeated until no new frequent itemsets are found.

Support counting of candidate itemsets is widely recognized as the key bottleneck of frequent itemset generation. This is because one has to determine the candidate itemsets contained in each transaction of the database. A naive way for doing this is to simply match each transaction against every candidate itemset. If the candidate is a subset of the transaction, its support count is incremented. This approach can be prohibitively expensive

if the number of candidates and number of transactions are large.

In the Apriori algorithm, candidate itemsets are hashed into diﬀerent buckets and stored in a hash tree structure. During support counting, each transaction is also hashed into its appropriate buckets. This way, instead of matching a transaction against all candidate itemsets, the transaction is matched only to those candidates that belong to the same bucket.

Figure 61.7 illustrates an example of a hash tree for storing candidate itemsets of size 3. Each internal node of the hash tree contains a hash function that determines which branch of the current node is to be followed next. The hash function used by the tree is also shown in this ﬁgure. Speciﬁcally, items 1, 4 and 7 are hashed to the left child of the node; items 2, 5, 8 are hashed to the middle child; and items 3, 6, 9 are hashed to the right child. Candidate itemsets are stored at the leaf nodes of the tree. The hash tree shown in Figure 61.7 contains 15 candidate itemsets, distributed across 9 leaf nodes.

We now illustrate how to enumerate candidate itemsets contained in a transaction. Con- sider a transaction t that contains ﬁve items, {1, 2, 3, 5, 6}. There are 5C3 = 10 distinct itemsets of size 3 contained in this transaction. Some of these itemsets may correspond to the candidate 3-itemsets under investigation, in which case, their support counts are incremented. Other subsets of t that do not correspond to any candidates can be ignored.

Figure 61.8 shows a systematic way for enumerating size-3 itemsets contained in the transaction t by specifying the items one-by-one. It is assumed that items in every 3- itemset are stored in increasing lexicographic order. Because of the ordering constraint, all itemsets of size-3 derived from t must begin with item 1, 2, or 3. No 3-itemset may begin with item 5 or 6 because there are only two items in this transaction that are greater than or equal to 5. This is illustrated by the level 1 structures depicted in Figure 61.8. For example, the structure 1 2 3 5 6 represents an itemset that begins with 1, followed by two

more items chosen from the set {2, 3, 5, 6}.

After identifying the ﬁrst item, the structures at level 2 denote the various ways to select the second item. For example, the structure 1 2 3 5 6 corresponds to itemsets with preﬁx (1 2), followed by either item 3, 5, or 6. Once the ﬁrst two items have been chosen, the structures at level 3 represent the complete set of 3-itemsets derived from transaction t. For example, the three itemsets beginning with the preﬁx {1 2} are shown in the leftmost box at level 3 of this ﬁgure.

The tree-like structure shown in Figure 61.8 is simply meant to demonstrate how subsets of a transaction can be enumerated, i.e., by specifying the items in the 3-itemsets one-by- one, from its left-most item to its right-most item. For support counting, we still have to match each subset to its corresponding candidate. If there is a match, then the support count for the corresponding candidate is incremented.

We now describe how a hash tree can be used to determine candidate itemsets contained in the transaction t = {1, 2, 3, 5, 6}. To do this, the hash tree must be traversed in such a way that all leaf nodes containing candidate itemsets that belong to t are visited. As previously noted, all size-3 candidate itemsets contained in t must begin with item 1, 2, or 3. Therefore, at the root node of the hash tree, we must hash on items 1, 2, and 3 separately. Item 1 is hashed to the left child of the root node; item 2 is hashed to the middle child of the root node; and item 3 is hashed to the right child of the root node. Once we reach a child of the root node, we need to hash on the second item of the level 2 structures given in Figure 61.8. For example, after hashing on item 1 at the root node, we need to hash on items 2, 3, and 5 at level 2. Hashing on items 2 or 5 will lead us to the middle child node while hashing on item 3 will lead us to the right child node, as depicted in Figure 61.9. This process of hashing on items that belong to the transaction continues until we reach the leaf nodes of the hash tree. Once a leaf node is reached, all candidate itemsets stored at the leaf are compared against the transaction. If a candidate belongs to the transaction, its support count is incremented. In this example, 6 out of the 9 leaf nodes are visited and 11 out of the 15 itemsets are matched against the transaction.

FP-Tree Structure

Recently, an interesting algorithm called FP-growth was proposed that takes a radically diﬀerent approach to discovering frequent itemsets. The algorithm does not subscribe to the generate-and count paradigm of Apriori. It encodes the database using a compact data structure called an FP-tree and infers frequent itemsets directly from this structure.

First, the algorithm scans the database once to ﬁnd the frequent singleton items. An order is then imposed on the items based on decreasing support counts.

Figure 61.10

illustrates an example of how to construct an FP-tree from a transaction database that contains ﬁve items, A, B, C, D, and E. Initially, the FP-tree contains only the root node, which is represented by a null symbol. Next, each transaction is used to create a path from the root node to some node in the FP-tree.

After reading the ﬁrst transaction, {A, B}, a path is formed from the root node to its child node, labeled as A, and subsequently, to another node labeled as B. Each node in the tree contains the symbol of the item along with a count of the transactions that reach the particular node. In this case, both nodes A and B would have a count equal to one. After reading the second transaction {B,C,D} a new path extending from null → B → C → D is created. Again, the nodes along this path have support counts equal to one. When the third transaction is read, the algorithm will discover that this transaction shares a common preﬁx A with the ﬁrst transaction. As a result, the path null → A → C → D is merged to the existing path null → A → B. The support count for node A is incremented to two, while the newly-created nodes, C and D, each have a support count equal to one. This process is repeated until all the transactions have been mapped into one of the paths in the FP-tree. For example, the state of the FP-tree after reading the ﬁrst ten transactions is shown at the bottom of Figure 61.10.

By looking at the way the tree is constructed, we can see why an FP-tree provides a compact representation of the database. If the database contains many transactions that share common items, then the size of an FP-tree will be considerably smaller than the size

of the database. The best-case scenario would be that the database contains the same set of items for all transactions. The resulting FP-tree would contain only a single branch of nodes. The worst-case scenario would be that each transaction contains a unique set of items. In this case, there is no sharing of transactions among the nodes and the size of the FP-tree is the same as the size of the database.

During tree construction, the FP-tree structure also creates a linked-list access mechanism for reaching every individual occurrence of each frequent item used to construct the tree. In the above example, there are ﬁve such linked lists, one for each item, A, B, C, D, and E. The algorithm used for generating frequent itemsets from an FP-tree is known as FP- growth. Given the FP-tree shown in Figure 61.10, the algorithm divides the problem into several subproblems, where each subproblem involves ﬁnding frequent itemsets having a particular suﬃx. In this example, the algorithm initially looks for frequent itemsets that end in E by following the linked list connecting the nodes for E. After all frequent itemsets ending in E are found, the algorithm looks for frequent itemsets that end in D by following the linked list for D, and so on.

How does FP-growth ﬁnd all the frequent itemsets ending in E? Recall that an FP-tree stores the support counts of every item along each path, and that these counts reﬂect the number of transactions that are collapsed onto that particular path. In our example, there are only three occurrences of the node E. By collecting the preﬁx paths of E, we can solve the subproblem of ﬁnding frequent itemsets ending in E. The preﬁx paths of E consist of all paths starting from the root node up to the parent nodes of E. These preﬁx paths can form a new FP-tree to which the FP-growth algorithm can be recursively applied.

Before creating a new FP-tree from the preﬁx paths, the support counts of items along each preﬁx path must be updated. This is because the initial preﬁx path may include several transactions that do not contain the item E. For this reason, the support count of each item along the preﬁx path must be adjusted to have the same count as node E for that particular path. After updating the counts along the preﬁx paths of E, some items may no longer be frequent, and thus, must be removed from further consideration (as far as our new subproblem is concerned). An FP-tree of the preﬁx paths is then constructed by removing the infrequent items. This recursive process of breaking the problem into smaller subproblems will continue until the subproblem involves only a single item. If the support count of this item is greater than the minimum support threshold, then the label of this item will be returned by the FP-growth algorithm. The returned label is appended as a preﬁx to the frequent itemset ending in E.

Search This Blog

algorithms

Data Mining:Association Analysis

Association Analysis

Comments

Post a Comment

Popular posts from this blog

0/1 Knapsack Problem Memory function.

Binary Space Partitioning Trees:BSP Tree as a Hierarchy of Regions.

Warshall’s Algorithm -to find TRANSITIVE CLOSURE