ApproxMAP : Sequence of sets
Exploratory Data Analysis of {A,B,D} {B} {B,C}
Does your data look like {A,B,D} {B} {B,C} ?
Would you like to see what your data looks like?
| Some examples |
| life path analysis |
{Graduation, Job1, Marriage} {Job1} {Job1, Birth} |
| welfare service analysis |
{Foster Care, Medicaid, Foodstamp} {Medicaid} {Medicaid, TANF} |
| market basket analysis
| {Bread, Milk, Jam} {Milk} {Milk, Banana} |
Table 1: Basic Terms - sequence of sets
| Items I |
{A B C D} |
Let items be a set of literals
I={i1 i2 i3 ... il}.
|
| Itemsets s13 |
{B,C} |
Then an itemset is a set of items (unordered list). |
| Sequence seq1 |
{A,B,D} {B} {B,C} |
And a sequence is an ordered list of itemsets. |
Sequential pattern mining is an important data mining task with broad applications. We
propose the theme of approximate sequential pattern mining roughly defined as
identifying patterns approximately shared by many sequences. We present an
efficient and effective algorithm, ApproxMAP (APPROXimate Multiple Alignment
Pattern mining), to find maximal approximate sequential patterns from large
databases. Our goal is to assist the data analyst in exploratory data analysis
through organization and data reduction. The method works in three steps.
- First, similar sequences are grouped together using kNN clustering.
- Then we organize and compress sequences within each cluster into weighted
sequences using multiple alignment.
- In the last step, the longest consensus pattern best fitting each
cluster are generated from the weighted
sequences .
Our extensive experimental results on synthetic and real data show that
ApproxMAP is very robust to noise and does well in mapping the high dimensional
noisy data into approximate sequential patterns.
An example :
Given below are 10 sequences lexically sorted.
- Step 1: ApproxMAP first divides the sequences into 2 clusters - cluster 1 and cluster 2 (k was set to 2 for kNN clustering).
| ID |
Full Sequence Database lexically sorted
|
Cluster |
Lseq |
Len |
| seq1 |
{A} |
{B, C, Y} |
{D} |
  |
  |
1 |
3 |
5 |
| seq2 |
{A} |
{X} |
{B, C} |
{A, E} |
{Z} |
1 |
5 |
7 |
| seq3 |
{A, I} |
{Z} |
{K} |
{L, M} |
  |
2 |
4 |
6 |
| seq4 |
{A, L} |
{D, E} |
  |
  |
  |
1 |
2 |
4 |
| seq5 |
{I, J} |
{B} |
{K} |
{L} |
  |
2 |
4 |
5 |
| seq6 |
{I, J} |
{L, M} |
  |
  |
  |
2 |
2 |
5 |
| seq7 |
{I, J} |
{K} |
{J, K} |
{L} |
{M} |
2 |
5 |
7 |
| seq8 |
{I, M} |
{K} |
{K, M} |
{L, M} |
  |
2 |
4 |
7 |
| seq9 |
{J} |
{K} |
{L, M} |
  |
  |
2 |
3 |
4 |
| seq10 |
{V} |
{K, W} |
{Z} |
  |
  |
2 |
3 |
4 |
- Step 2: Then ApproxMAP aligns the sequences in each cluster and compresses
each cluster into one weighted sequence as given in the following two tables.
- Step 3: Finally, ApproxMAP generates a dominant pattern for each cluster
using the weighted sequence as given in the following two tables.
Cluster strength was set to 40%. Thus, the dominant
pattern is
generated by selecting all items in the weighted
sequence that have a weight more than 40% of the cluster.
| Cluster 1
(cluster strength = 40% = 2 sequences, Min_DB_sup = 1 sequence) |
| seq1 |
{A} |
  |
{B, C, Y} |
{D} |
  |
  |
| seq4 |
{A, L} |
  |
  |
{D, E} |
  |
  |
| seq2 |
{A} |
{X} |
{B, C} |
{A, E} |
{Z} |
  |
| Weighted sequence |
{A:3, L:1}:3 |
{X:1}:1 |
{B:2, C:2, Y:1}:2 |
{A:1, D:2, E:2}:3 |
{Z:1}:1 |
3 |
| Consensus Pattern (w≥2) |
{A} |
|
{B,
C} |
{D,
E} |
|
|
Color scheme <100:
85:
70:
50:
35:
20
>
| Cluster 2 (cluster strength = 40% = 3
sequences, Min_DB_sup = 1 sequence) |
| seq9 |
{J} |
  |
{K} |
{L, M} |
  |
  |
| seq5 |
{I, J} |
{B} |
{K} |
{L} |
  |
  |
| seq3 |
{A, I} |
{Z} |
{K} |
{L, M} |
  |
  |
| seq7 |
{I, J} |
{K} |
{J, K} |
{L} |
{M} |
  |
| seq8 |
{I, M} |
{K} |
{K, M} |
{L, M} |
  |
  |
| seq6 |
{I, J} |
  |
  |
{L, M} |
  |
  |
| seq10 |
  |
{V} |
{K, W} |
  |
{Z} |
  |
| Weighted seq |
{A:1,I:5,J:4,M:1}:6 |
{B:1,K:2,V:1,Z:1}:5 |
{J:1,K:6,M:1,W:1}:6 |
{L:6,M:4}:6 |
{M:1,Z:1}:2 |
7 |
| Consensus Pat(w≥3) |
{I,
J} |
|
{K} |
{L,
M} |
|
|
Consensus Var(w≥2) |
{I,
J} |
{ K} |
{K} |
{L,
M} |
|
|
- Which would you prefer to look at, the first table of 10 sequences lexically sorted or the grouped and aligned sequences in the next two tables?
- Imagine what would happen in real applications where you have many more sequences (N >> 100). If you were to glance through the aligned sequences
then studied the dominant patterns generated for each cluster, wouldn't
you
be able to get a good understanding of what your data looked like?
- If you are interested to learn more about ApproxMAP email us at
kum@cs.unc.edu This webpage is still under construction.
- How can you get software that does ApproxMAP? We have a not-so user
friendly software. We will consider running the analysis for you and
we are open to discussing other possibilities. Email us at
kum@cs.unc.edu
Glossary
- Weighted sequences are an effective
method to compress a whole group of aligned sequences into one sequence.
| An example of weighted sequences |
| seq1 |
{A} |
{B} |
{C, D} |
|
| seq2 |
{A} |
  |
{C} |
|
| Weighted sequence |
{A:2}:2 |
{B:1}:1 |
{C:2, D:1}:2 |
2
|
A weighted itemset is defined as an itemset that has a weight associated with
each item in the itemset as well as the itemset itself. It is denoted as wsj
={i1:w1, i2:w2, ... , im:wm}:wj. Then a weighted sequence is an ordered list of weighted itemsets paired with a
separate weight for the whole sequence. The table above show the
aligned sequences and its weighted sequence. The weight associated with the
sequence is the total number of sequences, N, in the group. The weight
associated with the itemset sj, represents how many sequences have a non-empty
itemset in position j. And the weight associated with each item ik in itemset
sj represents the total number of the item ik present in all itemsets in
position j.
- Consensus Patterns : similar to consensus strings
in computational biology literature, it is the underlying sequential
pattern in a group of similar sequences. In social sciences, it is
also
called the ideal sequence, a hypothetical sequence that best captures
the patterns among all the sequences in a cluster. Unlike optimal
matching ApproxMAP generates the dominant pattern automatically
given a user input - the cluster strength.
- Strength (i, j) : the number of occurances of item i in position j in a group.
Thus, cluster strength means how many times an item i occured in the
cluster at position j. Strength can be specified as a % of the group (e.g. 40% of the
cluster) or as an absolute number (e.g. 10 sequences).
Return to my
homepage