High Speed Data Streams
In this paper a novel approach has been proposed for finding frequent itemsets in high speed data streams by fuzzifying the support of Positive Border (PB+) itemsets using the concept of jumping windows. Fuzzification helps in preserving the information regarding the Positive Border (PB+) itemsets in the previous windows. Use of jumping window provides an added advantage of speeding up the computation process for high speed data streams. An extensive experimental study to validate the performance of the proposed new algorithm has been carried out.
Experimental results show that the proposed algorithm has performance advantage in terms of execution time in finding frequent itemsets in high speed data streams.Data stream is a fast arriving, massive unbounded input data that can only be handled once. In recent years, online mining of data stream is one of the important research areas. For handling data streams different models for processing data stream have been discussed in the literature viz. Landmark window, Damped window and Sliding window. Landmark window model [6, 7, 17, 18] mines all frequent itemsets over the entire history of stream data from a specific time point called landmark to the present.
Damped window model [8, 9, 11], also called as Time Fading model, mines frequent itemsets from stream data in which each transaction has a weight and their weight decreases with age. Sliding windows model [15] finds frequent itemsets and maintains them in sliding window. In this model, only part of the data streams within the sliding window are stored and processed at the time when the data flows in. This model has been used successfully in many research applications [5, 12, 16]. Jumping window model [14] has equal window size and they do not overlap.
Unlike Landmark, Damped and Sliding window, all the values in the data stream will reside only once in the window.Different models have their own advantages and disadvantages. These models are used in various data mining techniques that have been explored in the literature for finding frequent patterns in stream data mining. One of the most important data mining problems is mining Positive Border (PB+) itemset from a data stream. Positive Border (PB+) itemsets are nothing but maximal frequent itemsets. From the Positive Border (PB+) itemsets the complete list of frequent itemsets can be found.This paper proposes the use of jumping window model and the concept of fuzzification of Positive Border (PB+) itemsets in order to find frequent itemsets in the data stream.
Jumping window tries to preserve the advantages of both Landmark and Sliding window approaches. This has been achieved by fuzzification of the support of the Positive Border (PB+) itemsets. Fuzzification of support of Positive Border itemsets (PB+) has the following advantages:The concise information regarding the support of the Positive Border (PB+) itemsets in the previous windows is preserved instead of maintaining all the transactions.Using fuzzification concept, overlapping of windows is maintained by preserving the fuzzy membership value of the Positive Border (PB+) itemsets of the earlier windows.Fuzzy membership value is being maintained from the beginning till the end of the data stream that is clubbed as more and more transactions come in and hence it gives the same advantage as landmark window.*Corresponding authorEmail addresses: [email protected] (B.Chandra), [email protected] (Shalini Bhaskar)Comparative performance evaluation of the proposed algorithm has been done with the MOMENT algorithm since this algorithm also finds the special kind of itemsets (viz. closed frequent itemsets) like the proposed algorithm that finds Positive Border (PB+) itemsets. The results show that the proposed algorithm reduces the computational time to a great extent.
2. Overview of the earlier work
Aggrawal et al [1] first introduced mining frequent itemsets from Market Basket Data in 1993. In 1998, Bayardo [2] proposed the problem of mining maximal frequent itemsets. In 1999, Pasquier et al [3] provided an improved theory for generating frequent closed itemset. Han et al [4] gave FP Tree approach without candidate generation in the year 2000, which was extended to many areas including data stream. Chi et al [5] proposed MOMENT algorithm in 2004 using sliding window concept in which Closed Enumerated Tree structure is used to store information of the closed frequent itemsets and the boundary nodes present in the transactions in the current sliding window of the continuous data stream.
Liu et al [6] proposed FP-CDS algorithm that can capture all frequent closed itemsets in a new storage structure FP-CDS tree over the landmark window.Li et al [7] proposed DSM-FI (Data Stream Mining for Frequent Itemsets) method to mine frequent itemsets over entire history using Landmark window and FP Tree approach was used for finding frequent itemsets in the top down manner. Giannelle et al [8, 9] proposed approximation algorithm for finding frequent itemset over arbitrary time intervals using FP_Stream to maintain itemset frequency count. Chang et al [11] proposed BTS based algorithm for finding frequent itemsets in data stream by using weight concept for transaction on the basis of age of transaction.In the year 2007, Woo et al [10] proposed the estMAX algorithm for finding maximal frequent itemsets using prefix tree without any additional superset /subset checking mechanism.
Li et al [16] proposed MFI-TransSW (Maximal Frequent Itemsets in Transactional Sliding Window) algorithm to mine the set of all frequent itemsets in data stream with a transaction sensitive sliding window by using bit sequence representation of items. Jin at el [18] proposed hash based approach to discover frequent pattern. Mao et al [19] proposed INSTANT algorithm for finding maximal frequent itemsets over data stream.In the various algorithms discussed above, work has been done on finding frequent itemsets, closed frequent itemsets and maximal frequent itemsets in data streams over landmark, damped and sliding window. The proposed approach of fuzzification of Positive Border (PB+) itemsets i.e. maximal frequent itemsets obtained using jumping window has not been attempted so far.
Positive Border (PB+) itemsets help in determining all possible frequent itemsets. The proposed approach is explained in Section 3.
3. Proposed Approach
In the proposed approach concept of jumping window has been used. For the set of transactions in each jumping window, FP-Tree is generated. Generating FP-Tree speeds up traversal of the transactions and at the same time uses less memory to store the transactions. From FP-Tree, Set Enumeration Tree is created that stores only frequent itemsets.
Each node in Set Enumeration Tree stores the following information: the itemset and the corresponding support. Hash based approach is used for implementing Set Enumeration Tree. The process of creating Set Enumeration Tree, identifying Positive Border (PB+) itemsets and fuzzification of Positive Border (PB+) itemsets can be performed in the background while in the foreground; FP-Trees are generated for the incoming high speed data streams.
Complexity of the algorithm depends upon the number of jumping windows, size of jumping window, minimum support and the number of nodes in the Set Enumeration Tree. The proposed algorithm using fuzzification of Positive Border (PB+) itemsets and jumping window is given in Figure 1. Figure 2, 3 and 4 gives the pseudocode to identify Positive Border itemsets from the data stream.
Algorithm
1: INPUT: A continuous data stream, a minimum support count ms, size of jumping window,2: OUTPUT: List of Positive Border itemsets3: Main:4: ;5: ;6: for first window7: {8:9:
Article name: High Speed Data Streams essay, research paper, dissertation