The authors declare that they have no conflict of interest.
In recent times, the urge to collect data and analyze it has grown. Time stamping a data set is an important part of the analysis and data mining as it can give information that is more useful. Different mining techniques have been designed for mining timeseries data, sequential patterns for example seeks relationships between occurrences of sequential events and finds if there exist any specific order of the occurrences. Many Algorithms has been proposed to study this data type based on the apriori approach. In this paper we compare two basic sequential algorithms which are General Sequential algorithm (GSP) and Sequential PAttern Discovery using Equivalence classes (SPADE). These two algorithms are based on the Apriori algorithms. Experimental results have shown that SPADE consumes less time than GSP algorithm.
Data mining is a critical step in the field of data science. It involves sorting large data sets to identify patterns and build relationships to solve problems through data analysis. Data mining tools help companies predict future trends. Specifically, data mining benefits vary depending on the goal and the industry. Sales and marketing departments can mine customer data to improve lead conversion rates or to create onetoone marketing campaigns. Data mining information on historical sales patterns and customer behaviors can be used to build prediction models for future sales, new products and services.
Data mining parameters include classification, sequence or path analysis, clustering and forecasting. Sequence or path parsing parameters look for patterns in which one event leads to another subsequent event. A sequence is an ordered list of element sets and is a common type of data structure in many databases.
A classification parameter searches for new models and may cause a change in the organization of the data. Classification algorithms predict variables based on other factors in the database. In data mining, association rules are created by analyzing the data to look for frequent "if / then" patterns, and then using the
Sequence Mining was first introduced in 1995 by Argwal and Srikant
Various algorithms have been implemented to identify the frequent sequences from the sequence database. One of the approaches used to identify frequent sequences is the Apriori approach
This paper is organized as follows: in the second section, we describe the Generalized Sequential Models (GSP) algorithm, which use a horizontal format database. In the third section a Sequential Pattern Discovery using Equivalence classes (SPADE) which uses a vertical database is given. A case study is given in section four. The paper is ended by experimental results and conclusion.
The Generalized Sequential Models (GSP) algorithm is an aprioribased algorithm applied to sequential models. It integrates with time constraints and relaxes the definition of transaction. For time constraints, maximum gap and minimal gap are defined to specify the gap between any two adjacent transactions in the sequence. If the distance is not in the range between maximum gap and minimal gap then this two cannot be taken as two consecutive transactions in a sequence. The transactions are then applied to generate multiple level sequential patterns to find all sequences whose support is greater than the userdefined minimum support
GSP uses the Apriori property
F1 = the set of frequent 1sequence
K = 2,
do while Fk1 != Null;
Generate candidate sets Ck (set of candidate ksequences);
For all input sequences s in the database D
do
Increment count of all a in C_{k} if s supports a
End do
F_{k} = {a ∈ Ck such that its frequency exceeds the threshold}
k = k+1;
End do
Result = Set of all frequent sequences is the union of all F_{k}' s
SPADE is an algorithm proposed to find frequent sequences using efficient lattice search techniques and simple joins. SPADE decomposes the original problem into smaller subproblems using equivalence classes on frequent sequences so that each class can be solved independently. SPADE usually makes only three database scans, first one for frequent 1sequences, second for frequent 2sequences, and one more for generating all other frequent sequences. If the support of 2sequences is available then only one scan is required
SPADE outperforms GSP algorithm, by a factor of two, and by an order of magnitude with precomputed support of 2sequences. It also has excellent scale up properties with respect to a number of parameters such as the number of inputsequences, the number of events per inputsequence, the event size, and the size of potential maximal frequent events and sequences
SPADE (D,min_ supp):
F_{1}={Frequent items or 1sequences }
F_{2}={ Frequent 2  sequences}
ℇ={equivalence classes ^{X}_{θ1}}
for all ^{X}∈ ℇ do EnumerateFrequentSeq(^{X})
EnumerateFrequentSeq(S) :
for all atoms A_{i} ∈ S do
T_{i}≠ Ø
for all atoms A_{j} ∈ S with j≥i do
R=A_{i} V A_{j}
if ( Prune(R)==false) then
L(R)= L(A_{i} ) ⋂ L(A_{j})
if σ(R)≥min_supp then
T_{i}=T_{i} U {R}; F_{R} = F_{R} U {R}
end
if (DepthFirstSearch) then EnumerateFrequentSeq(T_i )
end
if(BreadthFirstSearch) then
for all T_{i}≠Ø do EnumerateFrequentSeq(T_{i})
To demonstrate an explanation of the algorithms GSP and SPADE an example is given (
Client ID  Date  Items 
1  5/1/2015  C, D 
1  16/1/2016  A, B, C 
1  3/1/2017  A, B, F 
1  3/1/2017  A, C, D, F 
2  8/1/2017  A, B, F 
2  10/1/2016  E 
3  5/1/2016  A, B, F 
4  5/1/2016  D, H, G 
4  3/1/2017  B, F 
4  8/1/2017  A, G, H 
The Candidate sets are denoted by C.
C_{k} specifies candidate sets having K items.
The Candidate sets that satisfy minimum support belongs to L_{k} .
K denotes number of items in sequence.
As a first step, the database needs to be grouped by
SID  EID  ITEMS 
1  1,2,3,4  <(CD)(ABC)(ABF)(ACDF)> 
2  1,2  <(ABC)(E)> 
3  1  <(ABF)> 
4  1,2,3  <(DHG)(BF)(AGH)> 
C_{1}={ A,B,C,D,E,F,G,H} are the initial candidates for the first round. For each item, we should extract the number of occurrences or the support that would eliminate the less frequent items (
Items  N° d’occurrence 
A  4 
B  4 
C  1 
D  2 
E  1 
F  4 
G  1 
H  1 
The items that satisfies the minimum support 2 are L_{1}={A,B,D,F} C_{2} can be obtained by joining L_{1}×L_{1}.
Items  N° d’occurrence 
A →A  1 
A→B  1 
A→D  1 
A→F  1 
AB  3 
AD  1 
AF  3 
B→A  2 
B→B  1 
B→D  1 
B→F  1 
BD  1 
BF  4 
D→A  2 
D→B  2 
D→D  2 
D→F  2 
DF  1 
F→A  2 
F→B  1 
F→D  1 
F→F  1 
C_{2}={A→A, A→B, A→D, A→F, AB, AD, AF,B→A, B→B, B→D, B→F, BD, BF, D→A, D→B, D→D, D→F, DF, F→A, F→B, F→D, F→F}
The items that satisfies the minimal support:
L_{2}={AB, AF, B→A, BF, D→A, D→B, D→F, F→A}
In this step, C_{3} can be obtained by joining L_{2} X L_{2 }



ABF  3 
AB→A  1 
AF →A  1 
BF →A  2 
D →B →A  2 
D →F→ A  2 
D →BF  2 
F →AF  1 
D →FA  1 
D→AB  1 
F→AB  0 
B→AB  1 
B→AF  1 
L_{3}={ABF, BF→A, D→B→A, D→F→A, D→BF}
In this step C_{4} can be obtained by joining L_{3 }X L_{3}
Items  N° occurrence 
D→BF→A  2 
L_{4}={D→BF→A}
Most of the sequential pattern mining algorithms assume horizontal database layout. SPADE uses vertical database format. We can generate a new mapping model. <SID, EID> the first identifies the identifier of the sequence, and the second the identifier of the element location in the sequence.
A  B  D  F  
SID  EID  SID  EID  SID  EID  SID  EID 
1  2  1  2  1  1  1  3 
1  3  1  3  1  1  1  4 
1  4  2  1  4  1  2  1 
2  1  3  1  3  1  
3  1  4  2  4  2  
4  3 
Then to generate the other sequence elements of size 2, we perform a join between the tables and compare the number of occurrences with the minimal support.
Joining A and B items makes two kind of elements, (AB) is the equality join which means they happen at the same time; thus, we’ll notice the same
(AB)  
SID  EID A  EID B 
1  2  2 
1  3  3 
2  1  1 
3  1  1 
A→B  
SID  EID A  EID B 
1  3  4 
The support for (AB) is 3, having two cases from the same sequence makes a redundancy problem in the model. For A → B, the support is 1.
(AD)  
SID  EID A  EID D 
1  4  4 
The support for (AD) is 1, it does not satisfy the minimum support.
(AF)  
SID  EID A  EID F 
1  3  3 
1  4  4 
2  2  2 
3  1  1 
The support for (AF) is 3, it satisfies the minimum support, eve, if the number of occurrences is 4 in the database, there is two elements within the same sequence, so they will be considered as one.
(BD)  
SID  EID B  EID D 
     
The number of occurrences is 0, it does not satisfy the minimum support.
(BF)  
SID  EID B  EID F 
1  3  3 
2  2  2 
3  1  1 
4  3  3 
The number of occurrences is 4, it satisfies the minimum support.
(DF)  
SID  EID D  EID F 
1  4  4 
The number of occurrences is 1, it does not satisfy the minimum support.
A→A  
SID  EID A  EID A 
1  2  3 
1  3  4 
The number of occurrences is 2, both belongs to the same sequence, that makes the support 1, so it does not satisfy the min support.
A→D  
SID  EID A  EID D 
1  2  4 
1  3  4 
The support is 1, because the occurrences are from the same sequence. Therefore, it does not satisfy the min support.
A→F  
SID  EID A  EID F 
1  2  3 
1  2  4 
1  3  4 
The support is 1. In addition, it does not satisfy the min support.
B→A  
SID  EID B  EID A 
1  2  3 
4  3  4 
The number of occurrences is 2, it satisfies the minimum support.
B→B  
SID  EID B  EID B 
1  2  2 
1  3  3 
The support is 1, it does not satisfy the minimum support.
B→D  
SID  EID B  EID D 
1  2  4 
1  3  4 
The support is 1, it does not satisfy the minimum support.
B→F  
SID  EID B  EID F 
1  3  4 
The support is 1, it does not satisfy the minimum support.
D→D  
SID  EID D  EID D 
1  1  4 
The support is 1, it does not satisfy the minimum support.
D→A  
SID  EID D  EID A 
1  1  3 
1  1  4 
4  1  4 
The support is 2, it satisfies the minimum support.
D→B  
SID  EID D  EID B 
1  1  2 
1  1  3 
4  1  3 
The support is 2, it satisfies the minimum support. Tab 25.
D→F  
SID  EID D  EID F 
1  1  3 
1  1  4 
4  1  3 
The support is 2, it satisfies the minimum support.
F→F  
SID  EID F  EID F 
1  3  4 
The support is 1, it does not satisfy the minimum support.
F→D  
SID  EID F  EID D 
1  3  4 
The support is 1, it does not satisfy the minimum support.
F→B  
SID  EID F  EID B 
1  3  4 
The support is 1, it does not satisfy the minimum support.
F→A  
SID  EID F  EID A 
1  3  4 
4  3  4 
The support is 2, it satisfies the minimum support.
Items  N° occurrence 
AB  3 
AD  1 
AF  3 
BD  0 
BF  4 
DF  1 
A→A  1 
A→B  1 
A→D  1 
A→F  1 
B→A  2 
B→B  1 
B→D  1 
B→F  1 
D→D  1 
D→A  2 
D→B  2 
D→F  2 
F→F  1 
F→D  1 
F→B  1 
F→A  1 
From equality join and temporal joins, the table showing count of occurrence of sequences of items.
L_{2}={(AB),(AF),(BF),B→A,D→A,D→F,D→B,F→A}
Generating 3sequence:
ABF  
SID  EID A  EID B  EID F 
1  3  3  3 
2  2  2  2 
3  1  1  1 
The support is 3, it satisfies the minimum support.
AB→A  
SID  EID A  EID B  EID A 
1  2  2  3 
1  3  3  4 
2  2  2   
3  1  1   
The support is 1, it does not satisfy the minimum support.
AF→A  
SID  EID A  EID F  EID A 
1  3  3  3 
1  4  4   
2  2  2   
3  1  1   
The support is 1, it does not satisfy the minimum support.
BF→A  
SID  EID B  EID F  EID A 
1  3  3  4 
2  2  2   
3  1  1   
4  3  3  4 
The support is 2, it satisfies the minimum support.
D→B→A  
SID  EID D  EID B  EID A 
1  1  2  3 
4  1  3  4 
The support is 2, it satisfies the minimum support.
D→F→A  
SID  EID D  EID F  EID A 
1  1  3  4 
4  1  3  4 
The support is 2, it satisfies the minimum support.
D→BF  
SID  EID D  EID B  EID F 
1  1  3  3 
2    2  2 
3    1  1 
4  1  3  3 
The support is 2, it satisfies the minimum support.
F→AF  
SID  EID F  EID A  EID F 
1    3  3 
1    4  4 
2    2  2 
3    1  1 
The support is 0, it does not satisfy the minimum support.
D→AF  
SID  EID D  EID A  EID F 
1  1  3  3 
1  1  4  4 
2    2  2 
3    1  1 
The support is 2, it satisfies the minimum support.
D→AB  
SID  EID D  EID A  EID B 
1  1  3  3 
1  1  4  4 
2    2  2 
3    1  1 
The support is 2, it satisfies the minimum support.
F→AB  
SID  EID F  EIDA  EID B 
1    2  2 
1    3  3 
2    2  2 
3    1  1 
The support is 0, it does not satisfy the minimum support.
B→AB  
SID  EID B  EID A  EID B 
1    2  2 
1  2  3  3 
2    2  2 
3    1  1 
The support is 1, it does not satisfy the minimum support.
B→AF  
SID  EID B  EID A  EID F 
1    2  2 
1  2  3  3 
2    2  2 
3    1  1 
The support is 1, it does not satisfy the minimum support.
ABF→A  
SID  EID A  EID B  EID F  EID A 
1  3  3  3  4 
2  2  2  2   
3  1  1  1   
So, the results are: L_{3}={(ABF), (BF)→A, D→B→A, D→F→A, D→(BF)}
Generate 4sequence:
The support is 1, it does not satisfy the minimum support.
D→BF→A  
SID  EID D  EID B  EID F  EID A 
1  1  3  3  4 
2    2  2   
3    1  1   
4  1  3  3  4 
The support is 2, it satisfies the minimum support.
The result: L_{4}={D→(BF)→A}
In this section, we have carried out some experiments in order to evaluate the performance of SPADE and GSP. The implementation of these algorithms has been carried out using JAVA as a coding language and a machine with an intel core i5 processor and 6GB of RAM. The experiment used the same dataset for both algorithms however; the input was varied according to a number of sequences each time. The results display the number of sequences, the CPU time, and the number of sequences found at a support of 30%.



Input Sequence Count  CPU time  Sequence Count  CPU time  Sequence Count 
5  ~5 ms  222  ~2 ms  222 
10  ~3 ms  357  ~1 ms  357 
100  ~11 ms  35  ~ 4 ms  35 
500  ~71 ms  20  ~2 ms  20 
1000  ~80 ms  20  ~1 ms  20 
10000  ~750 ms  20  ~16 ms  20 
100000  ~6441 ms  20  ~ 658 ms  20 
500000  ~31438 ms  20  ~425 ms  20 
This paper presented the concept explanation of two algorithms GSP and SPADE used for mining frequent sequential patterns and presented a case study of how the two algorithms work. The experiment results displayed a higher performance from the SPADE algorithm compared to GSP in consuming a minimal process time.