# ApproxMAP: Sequence of Sets

## Exploratory Data Analysis of {A,B,D} {B} {B,C}

Does your data look like {A,B,D} {B} {B,C} ?
Would you like to see what your data looks like?

Some examples:

- life path analysis: {Graduation, Job1, Marriage} {Job1} {Job1, Birth}
- welfare service analysis: {Foster Care, Medicaid, Foodstamp} {Medicaid} {Medicaid, TANF}

### Table 1: Basic Terms - sequence of sets

| Term | Example | Definition |
|---|---|---|
| Items | I = {A, B, C, D} | Let items be a set of literals I = {i1, i2, i3, ..., il}. |
| Itemset | s13 = {B, C} | An itemset is a set of items (an unordered list). |
| Sequence | seq1 = {A, B, D} {B} {B, C} | A sequence is an ordered list of itemsets. |

Sequential pattern mining is an important data mining task with broad applications. We propose the theme of approximate sequential pattern mining roughly defined as identifying patterns approximately shared by many sequences. We present an efficient and effective algorithm, ApproxMAP (APPROXimate Multiple Alignment Pattern mining), to find maximal approximate sequential patterns from large databases. Our goal is to assist the data analyst in exploratory data analysis through organization and data reduction. The method works in three steps.

1. First, similar sequences are grouped together using kNN clustering.
2. Then we organize and compress sequences within each cluster into weighted sequences using multiple alignment.
3. In the last step, the longest consensus pattern best fitting each cluster is generated from its weighted sequence.

Our extensive experimental results on synthetic and real data show that ApproxMAP is very robust to noise and does well in mapping the high dimensional noisy data into approximate sequential patterns.

An example : Given below are 10 sequences lexically sorted.

• Step 1: ApproxMAP first divides the sequences into two clusters, cluster 1 and cluster 2 (k was set to 2 for the kNN clustering).

| ID | Full Sequence (database lexically sorted) | Cluster | Lseq (itemsets) | Len (items) |
|---|---|---|---|---|
| seq1 | {A} {B, C, Y} {D} | 1 | 3 | 5 |
| seq2 | {A} {X} {B, C} {A, E} {Z} | 1 | 5 | 7 |
| seq3 | {A, I} {Z} {K} {L, M} | 2 | 4 | 6 |
| seq4 | {A, L} {D, E} | 1 | 2 | 4 |
| seq5 | {I, J} {B} {K} {L} | 2 | 4 | 5 |
| seq6 | {I, J} {L, M} | 2 | 2 | 4 |
| seq7 | {I, J} {K} {J, K} {L} {M} | 2 | 5 | 7 |
| seq8 | {I, M} {K} {K, M} {L, M} | 2 | 4 | 7 |
| seq9 | {J} {K} {L, M} | 2 | 3 | 4 |
| seq10 | {V} {K, W} {Z} | 2 | 3 | 4 |
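Clustering in step 1 requires a distance between sequences of itemsets. Below is a minimal sketch under one plausible choice: a standard edit distance over itemsets with a normalized set-difference replacement cost, normalized by the length of the longer sequence. The function names `repl_cost` and `seq_dist` are ours, for illustration only.

```python
def repl_cost(x, y):
    """Cost of replacing itemset x with itemset y:
    0 when identical, 1 when disjoint."""
    x, y = set(x), set(y)
    return (len(x) + len(y) - 2 * len(x & y)) / (len(x) + len(y))

def seq_dist(s, t):
    """Edit distance between two sequences of itemsets (insert/delete
    cost 1, replace cost from repl_cost), normalized to [0, 1]."""
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)          # delete all of s[:i]
    for j in range(1, m + 1):
        d[0][j] = float(j)          # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,                       # delete
                          d[i][j - 1] + 1,                       # insert
                          d[i - 1][j - 1] + repl_cost(s[i - 1], t[j - 1]))
    return d[n][m] / max(n, m)
```

With such a distance in hand, each sequence's k nearest neighbors can be found and density-based clusters formed, as in step 1 above.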

• Step 2: Then ApproxMAP aligns the sequences in each cluster and compresses each cluster into one weighted sequence as given in the following two tables.

• Step 3: Finally, ApproxMAP generates a dominant pattern for each cluster using the weighted sequence, as given in the following two tables. Cluster strength was set to 40%. Thus, the dominant pattern is generated by selecting all items in the weighted sequence whose weight exceeds 40% of the number of sequences in the cluster.

**Cluster 1** (cluster strength = 40% = 2 sequences, Min_DB_sup = 1 sequence)

| ID | | | | | | n |
|---|---|---|---|---|---|---|
| seq1 | {A} | | {B, C, Y} | {D} | | |
| seq4 | {A, L} | | | {D, E} | | |
| seq2 | {A} | {X} | {B, C} | {A, E} | {Z} | |
| Weighted sequence | {A:3, L:1}:3 | {X:1}:1 | {B:2, C:2, Y:1}:2 | {A:1, D:2, E:2}:3 | {Z:1}:1 | 3 |
| Consensus pattern (w ≥ 2) | {A} | | {B, C} | {D, E} | | |
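The compression in step 2 can be sketched as follows. We assume the cluster's sequences have already been aligned into columns of equal length, with `None` marking gaps (the multiple alignment itself is not shown here); the function name `compress` is ours.

```python
def compress(aligned):
    """Compress aligned sequences into a weighted sequence:
    a list of (item weights, itemset weight) per alignment column."""
    weighted = []
    for column in zip(*aligned):
        item_weights = {}
        itemset_weight = 0
        for itemset in column:
            if itemset is None:
                continue  # gap: this sequence has no itemset in this column
            itemset_weight += 1
            for item in itemset:
                item_weights[item] = item_weights.get(item, 0) + 1
        weighted.append((item_weights, itemset_weight))
    return weighted

# The cluster 1 alignment from the table above, gaps as None
aligned = [
    [{"A"}, None, {"B", "C", "Y"}, {"D"}, None],       # seq1
    [{"A", "L"}, None, None, {"D", "E"}, None],        # seq4
    [{"A"}, {"X"}, {"B", "C"}, {"A", "E"}, {"Z"}],     # seq2
]
```

Running `compress(aligned)` yields the weighted sequence {A:3, L:1}:3 {X:1}:1 {B:2, C:2, Y:1}:2 {A:1, D:2, E:2}:3 {Z:1}:1 from the table.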


**Cluster 2** (cluster strength = 40% = 3 sequences, Min_DB_sup = 1 sequence)

| ID | | | | | | n |
|---|---|---|---|---|---|---|
| seq9 | {J} | | {K} | {L, M} | | |
| seq5 | {I, J} | {B} | {K} | {L} | | |
| seq3 | {A, I} | {Z} | {K} | {L, M} | | |
| seq7 | {I, J} | {K} | {J, K} | {L} | {M} | |
| seq8 | {I, M} | {K} | {K, M} | {L, M} | | |
| seq6 | {I, J} | | | {L, M} | | |
| seq10 | | {V} | {K, W} | | {Z} | |
| Weighted sequence | {A:1, I:5, J:4, M:1}:6 | {B:1, K:2, V:1, Z:1}:5 | {J:1, K:6, M:1, W:1}:6 | {L:6, M:4}:6 | {M:1, Z:1}:2 | 7 |
| Consensus pattern (w ≥ 3) | {I, J} | | {K} | {L, M} | | |
| Consensus variation (w ≥ 2) | {I, J} | {K} | {K} | {L, M} | | |
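The thresholding in step 3 is a simple filter over item weights. A minimal sketch (the function name `consensus` is ours), reproducing cluster 2's consensus pattern and variation from its weighted sequence:

```python
def consensus(weighted_seq, min_weight):
    """Keep every item whose weight meets min_weight; drop empty itemsets."""
    pattern = []
    for item_weights, _itemset_weight in weighted_seq:
        kept = sorted(i for i, w in item_weights.items() if w >= min_weight)
        if kept:
            pattern.append(kept)
    return pattern

# Cluster 2's weighted sequence from the table above
weighted = [
    ({"A": 1, "I": 5, "J": 4, "M": 1}, 6),
    ({"B": 1, "K": 2, "V": 1, "Z": 1}, 5),
    ({"J": 1, "K": 6, "M": 1, "W": 1}, 6),
    ({"L": 6, "M": 4}, 6),
    ({"M": 1, "Z": 1}, 2),
]

print(consensus(weighted, 3))  # → [['I', 'J'], ['K'], ['L', 'M']]
print(consensus(weighted, 2))  # → [['I', 'J'], ['K'], ['K'], ['L', 'M']]
```

Raising the threshold from w ≥ 2 to w ≥ 3 drops the weakly supported second {K} column, which is exactly the difference between the consensus variation and the consensus pattern in the table.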

• Which would you prefer to look at, the first table of 10 sequences lexically sorted or the grouped and aligned sequences in the next two tables?
• Imagine what would happen in real applications where you have many more sequences (N >> 100). If you were to glance through the aligned sequences and then study the dominant patterns generated for each cluster, wouldn't you be able to get a good understanding of what your data looks like?
• If you are interested in learning more about ApproxMAP, email us at kum@cs.unc.edu. This webpage is still under construction.
• How can you get software that runs ApproxMAP? We have software, though it is not very user friendly. We will consider running the analysis for you, and we are open to discussing other possibilities. Email us at kum@cs.unc.edu.

Glossary