Mining Frequent Sequential Patterns

Rajae KRIBII; Youssef FAKIR

doi:10.14302/issn.2768-0207.jbr-21-3455

Abstract

In recent times, the urge to collect data and analyze it has grown. Time stamping a data set is an important part of the analysis and data mining as it can give information that is more useful. Different mining techniques have been designed for mining time-series data, sequential patterns for example seeks relationships between occurrences of sequential events and finds if there exist any specific order of the occurrences. Many Algorithms has been proposed to study this data type based on the apriori approach. In this paper we compare two basic sequential algorithms which are General Sequential algorithm (GSP) and Sequential PAttern Discovery using Equivalence classes (SPADE). These two algorithms are based on the Apriori algorithms. Experimental results have shown that SPADE consumes less time than GSP algorithm.

Author Contributions

Received 24 Jun 2020; Accepted 10 Mar 2021; Published 15 Mar 2021;

Academic Editor: Hongwei Mo, Harbin Engineering University, Harbin 150001, China

Checked for plagiarism: Yes

Review by: Single-blind

License

This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Competing interests

The authors declare that they have no conflict of interest.

Citation:

Rajae KRIBII, Youssef FAKIR (2021) Mining Frequent Sequential Patterns . Journal of Big Data Research - 1(2):20-37. https://doi.org/10.14302/issn.2768-0207.jbr-21-3455

Download as RIS, BibTeX, Text (Include abstract )

DOI 10.14302/issn.2768-0207.jbr-21-3455

Data mining is a critical step in the field of data science. It involves sorting large data sets to identify patterns and build relationships to solve problems through data analysis. Data mining tools help companies predict future trends. Specifically, data mining benefits vary depending on the goal and the industry. Sales and marketing departments can mine customer data to improve lead conversion rates or to create one-to-one marketing campaigns. Data mining information on historical sales patterns and customer behaviors can be used to build prediction models for future sales, new products and services.

Data mining parameters include classification, sequence or path analysis, clustering and forecasting. Sequence or path parsing parameters look for patterns in which one event leads to another subsequent event. A sequence is an ordered list of element sets and is a common type of data structure in many databases.

A classification parameter searches for new models and may cause a change in the organization of the data. Classification algorithms predict variables based on other factors in the database. In data mining, association rules are created by analyzing the data to look for frequent "if / then" patterns, and then using the support and trust criteria to locate the most important relationships within data. The frequency with which items appear in the database is supported, while trust is the number of times the statements are accurate ^1,^2,^3,⁴.

Sequence Mining was first introduced in 1995 by Argwal and Srikant ⁶ and is defined by the discovery a set of attributes, shared in time between a large numbers of objects of a given database. Sequential pattern is a sequence of item sets that frequently occurred in a specific order, all items in the same item sets are supposed to have the same transaction time value or within a time gap. Sequential pattern mining is trying to find the relationships between occurrences of sequential events, to find if there exist any specific order of the occurrences ⁶. We can find the sequential patterns of specific individual items; also, we can find the sequential patterns cross-different items. Sequential pattern mining is widely used in analyzing of DNA sequence.

Various algorithms have been implemented to identify the frequent sequences from the sequence database. One of the approaches used to identify frequent sequences is the Apriori approach ⁷. The approach is introduced in the context of the exploration of association rules. This property indicates that if a pattern is not common, no pattern that contains a pattern can be common. Two of the most effective algorithms using this approach are GSP and SPADE. The main difference between the two is that GSP uses a horizontal data format ⁸, while SPADE uses a vertical format ⁹. These algorithms work well in a database of short frequent sequences. However, when exploiting databases composed of long, frequent sequences, e.g. stock values, DNA strings or machine monitoring data, their overall performance deteriorates by an order of magnitude ¹⁰.

This paper is organized as follows: in the second section, we describe the Generalized Sequential Models (GSP) algorithm, which use a horizontal format database. In the third section a Sequential Pattern Discovery using Equivalence classes (SPADE) which uses a vertical database is given. A case study is given in section four. The paper is ended by experimental results and conclusion.

Gsp Algorithm

The Generalized Sequential Models (GSP) algorithm is an apriori-based algorithm applied to sequential models. It integrates with time constraints and relaxes the definition of transaction. For time constraints, maximum gap and minimal gap are defined to specify the gap between any two adjacent transactions in the sequence. If the distance is not in the range between maximum gap and minimal gap then this two cannot be taken as two consecutive transactions in a sequence. The transactions are then applied to generate multiple level sequential patterns to find all sequences whose support is greater than the user-defined minimum support ⁵.

GSP uses the Apriori property ¹, that is to say, considering the minimal number of supports, if a sequence is not accepted; all its super sequence will be ignored. The features require several passes of the initial transaction data set, it uses the horizontal data format, at each pass, the entire candidate is generated by a self-join patterns found in the previous pass. In k-step, a sequence pattern is accepted only if all its submodules (k-1) are accepted in (k-1) ⁸.

The Pseudo code for GSP Algorithm is as follows:

F1 = the set of frequent 1-sequence

K = 2,

do while Fk-1 != Null;

Generate candidate sets Ck (set of candidate k-sequences);

For all input sequences s in the database D

Increment count of all a in C_k if s supports a

End do

F_k = {a ∈ Ck such that its frequency exceeds the threshold}

k = k+1;

End do

Result = Set of all frequent sequences is the union of all F_k' s

Spade Algorithm

SPADE is an algorithm proposed to find frequent sequences using efficient lattice search techniques and simple joins. SPADE decomposes the original problem into smaller sub-problems using equivalence classes on frequent sequences so that each class can be solved independently. SPADE usually makes only three database scans, first one for frequent 1-sequences, second for frequent 2-sequences, and one more for generating all other frequent sequences. If the support of 2-sequences is available then only one scan is required ⁵.

SPADE outperforms GSP algorithm, by a factor of two, and by an order of magnitude with precomputed support of 2-sequences. It also has excellent scale up properties with respect to a number of parameters such as the number of input-sequences, the number of events per input-sequence, the event size, and the size of potential maximal frequent events and sequences ⁹.

The Pseudo Code for SPADE Algorithm is as follows

SPADE (D,min_ supp):

F₁={Frequent items or 1-sequences }

F₂={ Frequent 2 - sequences}

ℇ={equivalence classes ^X_θ1}

for all ^X∈ ℇ do EnumerateFrequentSeq(^X)

EnumerateFrequentSeq(S) :

for all atoms A_i ∈ S do

T_i≠ Ø

for all atoms A_j ∈ S with j≥i do

R=A_i V A_j

if ( Prune(R)==false) then

L(R)= L(A_i ) ⋂ L(A_j)

if σ(R)≥min_supp then

T_i=T_i U {R}; F_|R| = F_|R| U {R}

end

if (DepthFirstSearch) then EnumerateFrequentSeq(T_i )

end

if(BreadthFirstSearch) then

for all T_i≠Ø do EnumerateFrequentSeq(T_i)

Case Study

To demonstrate an explanation of the algorithms GSP and SPADE an example is given (Table 1). For all the algorithms, we fix the minimum support to 2. Table 1.

Table 1. Original database of transactions

Client ID	Date	Items
1	5/1/2015	C, D
1	16/1/2016	A, B, C
1	3/1/2017	A, B, F
1	3/1/2017	A, C, D, F
2	8/1/2017	A, B, F
2	10/1/2016	E
3	5/1/2016	A, B, F
4	5/1/2016	D, H, G
4	3/1/2017	B, F
4	8/1/2017	A, G, H

The Candidate sets are denoted by C.

C_k specifies candidate sets having K items.

The Candidate sets that satisfy minimum support belongs to L_k .

K denotes number of items in sequence.

As a first step, the database needs to be grouped by Client ID and Date, and then we can visually notice the sequence flow. The column Client ID is denoted SID, which means the sequence identifier; EID is the element identifier in the sequence. Table 2.

Table 2. Database after grouping the instances by sequence id and time

SID	EID	ITEMS
1	1,2,3,4	<(CD)(ABC)(ABF)(ACDF)>
2	1,2	<(ABC)(E)>
3	1	<(ABF)>
4	1,2,3	<(DHG)(BF)(AGH)>

GSP Application

C₁={ A,B,C,D,E,F,G,H} are the initial candidates for the first round. For each item, we should extract the number of occurrences or the support that would eliminate the less frequent items (Table 3).

Table 3. Number of occurrences of items in C1

Items	N° d’occurrence
A	4
B	4
C	1
D	2
E	1
F	4
G	1
H	1

The items that satisfies the minimum support 2 are L₁={A,B,D,F} C₂ can be obtained by joining L₁×L₁. Table 4.

Table 4. Number of occurrences of items in C2

Items	N° d’occurrence
A →A	1
A→B	1
A→D	1
A→F	1
AB	3
AD	1
AF	3
B→A	2
B→B	1
B→D	1
B→F	1
BD	1
BF	4
D→A	2
D→B	2
D→D	2
D→F	2
DF	1
F→A	2
F→B	1
F→D	1
F→F	1

C₂={A→A, A→B, A→D, A→F, AB, AD, AF,B→A, B→B, B→D, B→F, BD, BF, D→A, D→B, D→D, D→F, DF, F→A, F→B, F→D, F→F}

The items that satisfies the minimal support:

L₂={AB, AF, B→A, BF, D→A, D→B, D→F, F→A}

In this step, C₃ can be obtained by joining L₂ X L₂Table 5.

Table 5. Number of occurrences of items in C3

Items	N° Occurrence
ABF	3
AB→A	1
AF →A	1
BF →A	2
D →B →A	2
D →F→ A	2
D →BF	2
F →AF	1
D →FA	1
D→AB	1
F→AB	0
B→AB	1
B→AF	1

C ₃ ={ABF, AB→A, AF→A,BF→A,D→B→A, D→F→A, D→BF,F→AF, D→FA, D→AB, F→AB, B→AB, B→AF}

L₃={ABF, BF→A, D→B→A, D→F→A, D→BF}

In this step C₄ can be obtained by joining L₃X L₃Table 6.

Table 6. Number of occurrences of items in C4

Items	N° occurrence
D→BF→A	2

L₄={D→BF→A}

Spade Application

Most of the sequential pattern mining algorithms assume horizontal database layout. SPADE uses vertical database format. We can generate a new mapping model. <SID, EID> the first identifies the identifier of the sequence, and the second the identifier of the element location in the sequence. Table 7.

Table 7. ID list for atoms

A		B		D		F
SID	EID	SID	EID	SID	EID	SID	EID
1	2	1	2	1	1	1	3
1	3	1	3	1	1	1	4
1	4	2	1	4	1	2	1
2	1	3	1			3	1
3	1	4	2			4	2
4	3

Generating 2-sequences

Then to generate the other sequence elements of size 2, we perform a join between the tables and compare the number of occurrences with the minimal support.

Joining A and B items makes two kind of elements, (AB) is the equality join which means they happen at the same time; thus, we’ll notice the same EID for both items. A → B denotes actions that happen consecutively, that means A’s EID has to be smaller than B’s EID. Table 8, Table 9.

Table 8. ID list for (AB) Non-temporal join

(AB)
SID	EID A	EID B
1	2	2
1	3	3
2	1	1
3	1	1

Table 9. ID list for A ->B temporal join

A→B
SID	EID A	EID B
1	3	4

The support for (AB) is 3, having two cases from the same sequence makes a redundancy problem in the model. For A → B, the support is 1. Table 10.

Table 10. ID list for AD non-temporal join

(AD)
SID	EID A	EID D
1	4	4

The support for (AD) is 1, it does not satisfy the minimum support. Table 11.

Table 11. ID list for AF non-temporal join

(AF)
SID	EID A	EID F
1	3	3
1	4	4
2	2	2
3	1	1

The support for (AF) is 3, it satisfies the minimum support, eve, if the number of occurrences is 4 in the database, there is two elements within the same sequence, so they will be considered as one. Table 12.

Table 12. ID list for BD non-temporal join

(BD)
SID	EID B	EID D
-	-	-

The number of occurrences is 0, it does not satisfy the minimum support. Table 13.

Table 13. ID list for BF non-temporal join

(BF)
SID	EID B	EID F
1	3	3
2	2	2
3	1	1
4	3	3

The number of occurrences is 4, it satisfies the minimum support. Table 14.

Table 14. ID list of DF non-temporal join

(DF)
SID	EID D	EID F
1	4	4

The number of occurrences is 1, it does not satisfy the minimum support. Table 15.

Table 15. ID list for A->A temporal join

A→A
SID	EID A	EID A
1	2	3
1	3	4

The number of occurrences is 2, both belongs to the same sequence, that makes the support 1, so it does not satisfy the min support. Table 16.

Table 16. ID list for A -> D temporal join

A→D
SID	EID A	EID D
1	2	4
1	3	4

The support is 1, because the occurrences are from the same sequence. Therefore, it does not satisfy the min support. Table 17.

Table 17. ID list for A->F temporal join

A→F
SID	EID A	EID F
1	2	3
1	2	4
1	3	4

The support is 1. In addition, it does not satisfy the min support. Table 18.

Table 18. ID list for B->A temporal join

B→A
SID	EID B	EID A
1	2	3
4	3	4

The number of occurrences is 2, it satisfies the minimum support. Table 19.

Table 19. ID list for B->B temporal join

B→B
SID	EID B	EID B
1	2	2
1	3	3

The support is 1, it does not satisfy the minimum support. Table 20.

Table 20. ID list for B->D temporal join

B→D
SID	EID B	EID D
1	2	4
1	3	4

The support is 1, it does not satisfy the minimum support. Table 21.

Table 21. ID list for B->F temporal join

B→F
SID	EID B	EID F
1	3	4

The support is 1, it does not satisfy the minimum support. Table 22.

Table 22. ID list for D->D temporal join

D→D
SID	EID D	EID D
1	1	4

The support is 1, it does not satisfy the minimum support. Table 23.

Table 23. ID list for D->A temporal join

D→A
SID	EID D	EID A
1	1	3
1	1	4
4	1	4

The support is 2, it satisfies the minimum support. Table 24.

Table 24. ID list for D->B temporal join

D→B
SID	EID D	EID B
1	1	2
1	1	3
4	1	3

The support is 2, it satisfies the minimum support. Tab 25.

Table 25. ID list for D->F temporal join

D→F
SID	EID D	EID F
1	1	3
1	1	4
4	1	3

The support is 2, it satisfies the minimum support. Table 26.

Table 26. ID list for F->F temporal join

F→F
SID	EID F	EID F
1	3	4

The support is 1, it does not satisfy the minimum support. Table 27.

Table 27. ID list for F->D temporal join

F→D
SID	EID F	EID D
1	3	4

The support is 1, it does not satisfy the minimum support. Table 28.

Table 28. ID list for F->B temporal join

F→B
SID	EID F	EID B
1	3	4

The support is 1, it does not satisfy the minimum support. Table 29.

Table 29. ID list for F->A temporal join

F→A
SID	EID F	EID A
1	3	4
4	3	4

The support is 2, it satisfies the minimum support. Table 30.

Table 30. Number of occurrences of items

Items	N° occurrence
AB	3
AD	1
AF	3
BD	0
BF	4
DF	1
A→A	1
A→B	1
A→D	1
A→F	1
B→A	2
B→B	1
B→D	1
B→F	1
D→D	1
D→A	2
D→B	2
D→F	2
F→F	1
F→D	1
F→B	1
F→A	1

From equality join and temporal joins, the table showing count of occurrence of sequences of items.

L₂={(AB),(AF),(BF),B→A,D→A,D→F,D→B,F→A}

Generating 3-sequence: Table 31.

Table 31. ID list for ABF non-temporal join

ABF
SID	EID A	EID B	EID F
1	3	3	3
2	2	2	2
3	1	1	1

The support is 3, it satisfies the minimum support. Table 32.

Table 32. ID list for AB->A non-temporal join

AB→A
SID	EID A	EID B	EID A
1	2	2	3
1	3	3	4
2	2	2	-
3	1	1	-

The support is 1, it does not satisfy the minimum support. Table 33.

Table 33. ID list for AF->A non-temporal join

AF→A
SID	EID A	EID F	EID A
1	3	3	3
1	4	4	-
2	2	2	-
3	1	1	-

The support is 1, it does not satisfy the minimum support. Table 34.

Table 34. ID list for BF->A non-temporal join

BF→A
SID	EID B	EID F	EID A
1	3	3	4
2	2	2	-
3	1	1	-
4	3	3	4

The support is 2, it satisfies the minimum support. Table 35.

Table 35. ID list for D->B->A non-temporal join

D→B→A
SID	EID D	EID B	EID A
1	1	2	3
4	1	3	4

The support is 2, it satisfies the minimum support. Table 36.

Table 36. ID list for D->F->A non-temporal join

D→F→A
SID	EID D	EID F	EID A
1	1	3	4
4	1	3	4

The support is 2, it satisfies the minimum support. Table 37.

Table 37. ID list for D->BF non-temporal join

D→BF
SID	EID D	EID B	EID F
1	1	3	3
2	-	2	2
3	-	1	1
4	1	3	3

The support is 2, it satisfies the minimum support. Table 38.

Table 38. ID list for F->AF non-temporal join

F→AF
SID	EID F	EID A	EID F
1	-	3	3
1	-	4	4
2	-	2	2
3	-	1	1

The support is 0, it does not satisfy the minimum support. Table 39.

Table 39. ID list for D->AF non-temporal join

D→AF
SID	EID D	EID A	EID F
1	1	3	3
1	1	4	4
2	-	2	2
3	-	1	1

The support is 2, it satisfies the minimum support. Table 40.

Table 40. ID list for D->AB non-temporal join

D→AB
SID	EID D	EID A	EID B
1	1	3	3
1	1	4	4
2	-	2	2
3	-	1	1

The support is 2, it satisfies the minimum support. Table 41.

Table 41. ID list for F->AB non-temporal join

F→AB
SID	EID F	EIDA	EID B
1	-	2	2
1	-	3	3
2	-	2	2
3	-	1	1

The support is 0, it does not satisfy the minimum support. Table 42.

Table 42. ID list for B->AB non-temporal join

B→AB
SID	EID B	EID A	EID B
1	-	2	2
1	2	3	3
2	-	2	2
3	-	1	1

The support is 1, it does not satisfy the minimum support. Table 43.

Table 43. ID list for B->AF non-temporal join

B→AF
SID	EID B	EID A	EID F
1	-	2	2
1	2	3	3
2	-	2	2
3	-	1	1

The support is 1, it does not satisfy the minimum support. Table 44.

Table 44. ID list for ABF->A temporal join

ABF→A
SID	EID A	EID B	EID F	EID A
1	3	3	3	4
2	2	2	2	-
3	1	1	1	-

So, the results are: L₃={(ABF), (BF)→A, D→B→A, D→F→A, D→(BF)}

Generate 4-sequence:

The support is 1, it does not satisfy the minimum support. Table 45.

Table 45. ID list for D->BF->A temporal join

D→BF→A
SID	EID D	EID B	EID F	EID A
1	1	3	3	4
2	-	2	2	-
3	-	1	1	-
4	1	3	3	4

The support is 2, it satisfies the minimum support.

The result: L₄={D→(BF)→A}

[1] 1.Chakrabarti S, Ester M, Fayyad U, Gehrke J, Han J et al. (2006) Data mining curriculum: A proposal (Version 1.0). Intensive Working. , Group of ACM SIGKDD Curriculum Committee 140.
Google Scholar

	GSP		SPADE
Input Sequence Count	CPU time	Sequence Count	CPU time	Sequence Count
5	~5 ms	222	~2 ms	222
10	~3 ms	357	~1 ms	357
100	~11 ms	35	~ 4 ms	35
500	~71 ms	20	~2 ms	20
1000	~80 ms	20	~1 ms	20
10000	~750 ms	20	~16 ms	20
100000	~6441 ms	20	~ 658 ms	20
500000	~31438 ms	20	~425 ms	20

Journal of Big Data Research

Journal of Big Data Research

Mining Frequent Sequential Patterns

Rajae KRIBII¹, Youssef FAKIR¹

Abstract

Author Contributions

Introduction

Gsp Algorithm

The Pseudo code for GSP Algorithm is as follows:

Spade Algorithm

The Pseudo Code for SPADE Algorithm is as follows

GSP Application

Spade Application

Generating 2-sequences

Experimental Results

Conclusion

References

A		B		D		F
SID	EID	SID	EID	SID	EID	SID	EID
1	2	1	2	1	1	1	3
1	3	1	3	1	1	1	4
1	4	2	1	4	1	2	1
2	1	3	1			3	1
3	1	4	2			4	2
4	3

A		B		D		F
SID	EID	SID	EID	SID	EID	SID	EID
1	2	1	2	1	1	1	3
1	3	1	3	1	1	1	4
1	4	2	1	4	1	2	1
2	1	3	1			3	1
3	1	4	2			4	2
4	3

Journal of Big Data Research

Journal of Big Data Research

Mining Frequent Sequential Patterns

Rajae KRIBII 1, Youssef FAKIR 1

Abstract

Author Contributions

Introduction

Gsp Algorithm

The Pseudo code for GSP Algorithm is as follows:

Spade Algorithm

The Pseudo Code for SPADE Algorithm is as follows

GSP Application

Spade Application

Generating 2-sequences

Experimental Results

Conclusion

References

Rajae KRIBII¹, Youssef FAKIR¹

A		B		D		F
SID	EID	SID	EID	SID	EID	SID	EID
1	2	1	2	1	1	1	3
1	3	1	3	1	1	1	4
1	4	2	1	4	1	2	1
2	1	3	1			3	1
3	1	4	2			4	2
4	3