





















































When presented with a new data set, a natural sequence of questions is:
We will start with a set of items I = {a1, ..., an}, which serves as the base for all the following concepts. A transaction T is just a set of items in I, and we say that T is a transaction of length l if it contains l item. A transaction database D is a database of transaction IDs and their corresponding transactions.
Transaction ID
|
Transaction
|
1
|
a, c, d, f, g, i, m, p
|
2
|
a, b, c, f, l, m, o
|
3
|
b, f, h, j, o
|
4
|
b, c, k, s, p
|
5
|
a, c, e, f, l, m, n, p
|
Table 1: A small shopping cart database with five transactions
Given the definition of a transaction database, a pattern P is a transaction contained in the transactions in D and the support, supp(P), of the pattern is the number of transactions for which this is true, divided or normalized by the number of transactions in D:
supp(s) = suppD(s) = |{ s' ∈ S | s < s'}| / |D|
We use the < symbol to denote s as a subpattern of s' or, conversely, call s' a superpattern of s. Note that in the literature, you will sometimes also find a slightly different version of support that does not normalize the value. For example, the pattern {a, c, f} can be found in transactions 1, 2, and 5. This means that {a, c, f} is a pattern of support 0.6 in our database D of five items.
Support is an important notion, as it gives us a first example of measuring the frequency of a pattern, which, in the end, is what we are after. In this context, for a given minimum support threshold t, we say P is a frequent pattern if and only if supp(P) is at least t. In our running example, the frequent patterns of length 1 and minimum support 0.6 are {a}, {b}, {c}, {p}, and {m} with support 0.6 and {f} with support 0.8. In what follows, we will often drop the brackets for items or patterns and write f instead of {f}, for instance.
To advance our general introduction of concepts, let's next turn to association rules, as first introduced in Mining Association Rules between Sets of Items in Large Databases, available at http:/ /arbor. ee. ntu. edu. tw/~chyun/ dmpaper/agrama93. pdf. In contrast to solely counting the occurrences of items in our database, we now want to understand the rules or implications of patterns. What I mean is, given a pattern P1 and another pattern P2, we want to know whether P2 is frequently present whenever P1 can be found in D, and we denote this by writing P1 ⇒ P2. To make this more precise, we need a concept for rule frequency similar to that of support for patterns, namely confidence. For a rule P1 ⇒ P2, confidence is defined as follows:
conf(P1 ⇒ P2) = supp(P1 ∪ P2) / supp(P1)
This can be interpreted as the conditional support of P2 given to P1; that is, if it were to restrict D to all the transactions supporting P1, the support of P2 in this restricted database would be equal to conf(P1 ⇒ P2). We call P1 ⇒ P2 a rule in D if it exceeds a minimum confidence threshold t, just as in the case of frequent patterns. Finding all the rules for a confidence threshold represents the formal answer to the second question, association rule mining. Moreover, in this situation, we call P1 the antecedent and P2 the consequent of the rule. In general, there is no restriction imposed on the structure of either the antecedent or the consequent. However, in what follows, we will assume that the consequent's length is 1, for simplicity.
{a} ⇒ {c}, {a} ⇒ {f}, {a} ⇒ {m}, {a} ⇒ {p}
{c} ⇒ {a}, {c} ⇒ {f}, {c} ⇒ {m}, {c} ⇒ {p}
{f} ⇒ {a}, {f} ⇒ {c}, {f} ⇒ {m}
{m} ⇒ {a}, {m} ⇒ {c}, {m} ⇒ {f}, {m} ⇒ {p}
{p} ⇒ {a}, {p} ⇒ {c}, {p} ⇒ {f}, {p} ⇒ {m}
Let's move on to formalizing, the third and last pattern matching question we tackle in this chapter. Let's look at sequences in more detail. A sequence is different from the transactions we looked at before in that the order now matters. For a given item set I, a sequence S in I of length l is defined as follows:
s = <s1, s2, ..., sl>
Here, each individual si is a concatenation of items, that is, si = (ai1 ... aim), where aij is an item in I. Note that we do care about the order of sequence items si but not about the internal ordering of the individual aij in si. A sequence database S consists of pairs of sequence IDs and sequences, analogous to what we had before. An example of such a database can be found in the following table, in which the letters represent the same items as in our previous shopping cart example:
Sequence ID
|
Sequence
|
1
|
<a(abc)(ac)d(cf)>
|
2
|
<(ad)c(bc)(ae)>
|
3
|
<(ef)(ab)(df)cb>
|
4
|
<eg(af)cbc>
|
Table 2: A small sequence database with four short sequences.
u1 < si1, ..., un < sin
Here, the < signs in the last line mean that uj is a subpattern of sij. Roughly speaking, u is a subsequence of s if all the elements of u are subpatterns of s in their given order. Equivalently, we call s a supersequence of u. In the preceding example, we see that <a(ab)ac> and a(cb)(ac)dc> are examples of subsequences of <a(abc)(ac)d(cf)> and that <(fa)c> is an example of a subsequence of <eg(af)cbc>.
With the help of the notion of supersequences, we can now define the support of a sequence s in a given sequence database S as follows:
suppS(s) = supp(s) = |{ s' ∈ S | s < s'}| / |S|