Data Mining:Association Analysis

Association Analysis

An important problem in data mining is the discovery of association patterns [1] present in large databases. This problem was originally formulated in the context of market basket data, where the goal is to determine whether the occurrence of certain items in a transaction

image

can be used to infer the occurrence of other items. If such interesting relationships are found, then they can be put to various profitable uses such as marketing promotions, shelf management, inventory management, etc.

To formalize the problem, let T = {t1, t2, ··· , tN } be the set of all transactions and I = {i1, i2, ··· , id} be the set of all items. Any subset of I is known as an itemset. The support count of an itemset C is defined as the number of transactions in T that contain C, i.e.,

image

For example, consider the market basket transactions shown in Figure 61.6. The support for the rule {Diaper, Milk} −→ {Beer} is σ(Diaper, M ilk, Beer) / 5 = 2/5 = 40%, whereas its confidence is σ(Diaper, M ilk, Beer)(Diaper, M ilk) = 2/3 = 66%.

Support is useful because it reflects the significance of a rule. Rules that have very low support are rarely observed, and thus, are more likely to occur by chance. Confidence is useful because it reflects the reliability of the inference made by each rule. Given an association rule X Y , the higher the confidence, the more likely it is to find Y in trans-actions that contain X. Thus, the goal of association analysis is to automatically discover association rules having relatively high support and high confidence. More specifically, an association rule is considered to be interesting only if its support is greater than or equal to a minimum support threshold, minsup, and its confidence is greater than or equal to a minimum confidence threshold, minconf .

The association analysis problem is far from trivial because of the exponential number of ways in which items can be grouped together to form a rule. In addition, the rules are constrained by two completely different conditions, stated in terms of the minsup and minconf thresholds. A standard way for generating association rules is to divide the process into two steps. The first step is to find all itemsets that satisfy the minimum support threshold. Such itemsets are known as frequent itemsets in the data mining literature. The second step is to generate high-confidence rules only from those itemsets found to be frequent. The completeness of this two-step approach is guaranteed by the fact that any association rule X Y that satisfies the minsup requirement can always be generated from a frequent itemset X Y .

Frequent itemset generation is the computationally most expensive step because there are 2d possible ways to enumerate all itemsets from I. Much research has therefore been devoted

to developing efficient algorithms for this task. A key feature of these algorithms lies in their strategy for controlling the exponential complexity of enumerating candidate itemsets. Briefly, the algorithms make use of the anti-monotone property of itemset support, which states that all subsets of a frequent itemset must be frequent. Put another way, if a candidate itemset is found to be infrequent, we can immediately prune the search space spanned by supersets of this itemset. The Apriori algorithm, developed by Agrawal et al. [2], pioneered the use of this property to systematically enumerate the candidate itemsets. During each iteration k, it generates only those candidate itemsets of length k whose (k 1)-subsets are found to be frequent in the previous iteration. The support counts of these candidates are then determined by scanning the transaction database. After counting their supports, candidate k-itemsets that pass the minsup threshold are declared to be frequent.

Well-designed data structures are central to the efficient mining of association rules. The Apriori algorithm, for example, employs a hash-tree structure to facilitate the support counting of candidate itemsets. On the other hand, algorithms such as FP-growth [31] and H-Miner [47] employ efficient data structures to provide a compact representation of the transaction database. A brief description of the hash tree and FP-tree data structures is presented next.

Hash Tree Structure

Apriori is a level-wise algorithm that generates frequent itemsets one level at a time, from itemsets of size-1 up to the longest frequent itemsets. At each level, candidate itemsets are generated by extending the frequent itemsets found at the previous level. Once the candidate itemsets have been enumerated, the transaction database is scanned once to determine their actual support counts. This generate-and-count procedure is repeated until no new frequent itemsets are found.

Support counting of candidate itemsets is widely recognized as the key bottleneck of frequent itemset generation. This is because one has to determine the candidate itemsets contained in each transaction of the database. A naive way for doing this is to simply match each transaction against every candidate itemset. If the candidate is a subset of the transaction, its support count is incremented. This approach can be prohibitively expensive

image

if the number of candidates and number of transactions are large.

In the Apriori algorithm, candidate itemsets are hashed into different buckets and stored in a hash tree structure. During support counting, each transaction is also hashed into its appropriate buckets. This way, instead of matching a transaction against all candidate itemsets, the transaction is matched only to those candidates that belong to the same bucket.

Figure 61.7 illustrates an example of a hash tree for storing candidate itemsets of size 3. Each internal node of the hash tree contains a hash function that determines which branch of the current node is to be followed next. The hash function used by the tree is also shown in this figure. Specifically, items 1, 4 and 7 are hashed to the left child of the node; items 2, 5, 8 are hashed to the middle child; and items 3, 6, 9 are hashed to the right child. Candidate itemsets are stored at the leaf nodes of the tree. The hash tree shown in Figure 61.7 contains 15 candidate itemsets, distributed across 9 leaf nodes.

We now illustrate how to enumerate candidate itemsets contained in a transaction. Con- sider a transaction t that contains five items, {1, 2, 3, 5, 6}. There are 5C3 = 10 distinct itemsets of size 3 contained in this transaction. Some of these itemsets may correspond to the candidate 3-itemsets under investigation, in which case, their support counts are incremented. Other subsets of t that do not correspond to any candidates can be ignored.

Figure 61.8 shows a systematic way for enumerating size-3 itemsets contained in the transaction t by specifying the items one-by-one. It is assumed that items in every 3- itemset are stored in increasing lexicographic order. Because of the ordering constraint, all itemsets of size-3 derived from t must begin with item 1, 2, or 3. No 3-itemset may begin with item 5 or 6 because there are only two items in this transaction that are greater than or equal to 5. This is illustrated by the level 1 structures depicted in Figure 61.8. For example, the structure 1 2 3 5 6 represents an itemset that begins with 1, followed by two

image

more items chosen from the set {2, 3, 5, 6}.

After identifying the first item, the structures at level 2 denote the various ways to select the second item. For example, the structure 1 2 3 5 6 corresponds to itemsets with prefix (1 2), followed by either item 3, 5, or 6. Once the first two items have been chosen, the structures at level 3 represent the complete set of 3-itemsets derived from transaction t. For example, the three itemsets beginning with the prefix {1 2} are shown in the leftmost box at level 3 of this figure.

The tree-like structure shown in Figure 61.8 is simply meant to demonstrate how subsets of a transaction can be enumerated, i.e., by specifying the items in the 3-itemsets one-by- one, from its left-most item to its right-most item. For support counting, we still have to match each subset to its corresponding candidate. If there is a match, then the support count for the corresponding candidate is incremented.

We now describe how a hash tree can be used to determine candidate itemsets contained in the transaction t = {1, 2, 3, 5, 6}. To do this, the hash tree must be traversed in such a way that all leaf nodes containing candidate itemsets that belong to t are visited. As previously noted, all size-3 candidate itemsets contained in t must begin with item 1, 2, or 3. Therefore, at the root node of the hash tree, we must hash on items 1, 2, and 3 separately. Item 1 is hashed to the left child of the root node; item 2 is hashed to the middle child of the root node; and item 3 is hashed to the right child of the root node. Once we reach a child of the root node, we need to hash on the second item of the level 2 structures given in Figure 61.8. For example, after hashing on item 1 at the root node, we need to hash on items 2, 3, and 5 at level 2. Hashing on items 2 or 5 will lead us to the middle child node while hashing on item 3 will lead us to the right child node, as depicted in Figure 61.9. This process of hashing on items that belong to the transaction continues until we reach the leaf nodes of the hash tree. Once a leaf node is reached, all candidate itemsets stored at the leaf are compared against the transaction. If a candidate belongs to the transaction, its support count is incremented. In this example, 6 out of the 9 leaf nodes are visited and 11 out of the 15 itemsets are matched against the transaction.

image

FP-Tree Structure

Recently, an interesting algorithm called FP-growth was proposed that takes a radically different approach to discovering frequent itemsets. The algorithm does not subscribe to the generate-and count paradigm of Apriori. It encodes the database using a compact data structure called an FP-tree and infers frequent itemsets directly from this structure.

First, the algorithm scans the database once to find the frequent singleton items. An order is then imposed on the items based on decreasing support counts.

Figure 61.10

illustrates an example of how to construct an FP-tree from a transaction database that contains five items, A, B, C, D, and E. Initially, the FP-tree contains only the root node, which is represented by a null symbol. Next, each transaction is used to create a path from the root node to some node in the FP-tree.

After reading the first transaction, {A, B}, a path is formed from the root node to its child node, labeled as A, and subsequently, to another node labeled as B. Each node in the tree contains the symbol of the item along with a count of the transactions that reach the particular node. In this case, both nodes A and B would have a count equal to one. After reading the second transaction {B,C,D} a new path extending from null B C D is created. Again, the nodes along this path have support counts equal to one. When the third transaction is read, the algorithm will discover that this transaction shares a common prefix A with the first transaction. As a result, the path null A C D is merged to the existing path null A B. The support count for node A is incremented to two, while the newly-created nodes, C and D, each have a support count equal to one. This process is repeated until all the transactions have been mapped into one of the paths in the FP-tree. For example, the state of the FP-tree after reading the first ten transactions is shown at the bottom of Figure 61.10.

By looking at the way the tree is constructed, we can see why an FP-tree provides a compact representation of the database. If the database contains many transactions that share common items, then the size of an FP-tree will be considerably smaller than the size

image

of the database. The best-case scenario would be that the database contains the same set of items for all transactions. The resulting FP-tree would contain only a single branch of nodes. The worst-case scenario would be that each transaction contains a unique set of items. In this case, there is no sharing of transactions among the nodes and the size of the FP-tree is the same as the size of the database.

During tree construction, the FP-tree structure also creates a linked-list access mechanism for reaching every individual occurrence of each frequent item used to construct the tree. In the above example, there are five such linked lists, one for each item, A, B, C, D, and E. The algorithm used for generating frequent itemsets from an FP-tree is known as FP- growth. Given the FP-tree shown in Figure 61.10, the algorithm divides the problem into several subproblems, where each subproblem involves finding frequent itemsets having a particular suffix. In this example, the algorithm initially looks for frequent itemsets that end in E by following the linked list connecting the nodes for E. After all frequent itemsets ending in E are found, the algorithm looks for frequent itemsets that end in D by following the linked list for D, and so on.

How does FP-growth find all the frequent itemsets ending in E? Recall that an FP-tree stores the support counts of every item along each path, and that these counts reflect the number of transactions that are collapsed onto that particular path. In our example, there are only three occurrences of the node E. By collecting the prex paths of E, we can solve the subproblem of finding frequent itemsets ending in E. The prefix paths of E consist of all paths starting from the root node up to the parent nodes of E. These prefix paths can form a new FP-tree to which the FP-growth algorithm can be recursively applied.

Before creating a new FP-tree from the prefix paths, the support counts of items along each prefix path must be updated. This is because the initial prefix path may include several transactions that do not contain the item E. For this reason, the support count of each item along the prefix path must be adjusted to have the same count as node E for that particular path. After updating the counts along the prefix paths of E, some items may no longer be frequent, and thus, must be removed from further consideration (as far as our new subproblem is concerned). An FP-tree of the prefix paths is then constructed by removing the infrequent items. This recursive process of breaking the problem into smaller subproblems will continue until the subproblem involves only a single item. If the support count of this item is greater than the minimum support threshold, then the label of this item will be returned by the FP-growth algorithm. The returned label is appended as a prefix to the frequent itemset ending in E.

Comments

Popular posts from this blog

Data Structure Visualization:Introduction and Value of Data Structure Rendering

Collision Detection:Penetration Depth Computation

Concurrent Data Structures:Linked Lists