# The data mining tasks

**1. Introduction**

Clustering is one of the primary data mining tasks. All clustering algorithms aim at segmenting a collection of objects into subsets or clusters, such that objects within one cluster are more closely related to one another than to objects assigned to different clusters. Finding these clusters is important because they represent different classes of objects that were previously unknown, bringing additional insight which can be exploited for various purposes.

Many applications of clustering are characterized by high-dimensional data which poses two challenges : First, the so called curse of dimensionality which essentially means that distance measures become increasingly meaningless as the number of dimensions increases in the data set [1]. The second challenge is clusters may be only visible in certain, maybe even arbitrarily oriented subspaces of the feature space. Thus, traditional clustering methods considering all dimensions of the feature space often fail to detect meaningful clusters in high-dimensional data. Reduction of attributes, by various feature reduction techniques is often infeasible as they cannot account for local trends in the data [10].

To overcome these limitations of traditional clustering methods, recent research has focused on discovering clusters embedded in different subspaces of high-dimensional data sets, a paradigm commonly known as Subspace Clustering. Formally, a subspace cluster can be defined as a pair (subspace of the feature space, subset of data points). [12]

The goal of our research is to formulize or develop Data Management technique and Data clustering algorithm to efficiently and effectively mine Distributed, very High Dimensional data with equal efficiency and accuracy as that of traditional centralized high dimensional clustering techniques. The technique will offer the benefits of combining High Dimensional Data Clustering and Distributed Data Clustering.

§ Very High Dimensional Data Clustering : The aim of the High Dimensional Data Clustering is to find clusters hidden in subsets of the dimensions of the given data set, without loosing the deep insight into the database.

§ Distributed Data Clustering : The aim of the Distributed Data Clustering is to allow clustering of distributed data as if they were in one location, while minimizing the network traffic.

Thus this project brings the added complexity as distributed tables are having large feature space hiding useful clusters in small subsets of dimensions.

**2. Proposed Research**

Thus the goal of this project can be detailed as:

- To introduce an effective and efficient method to find clusters, at local sites, hidden in individual, relevant dimensions using the density based approach for Subspace Clustering, in contrast to standard, currently available grid based methods.
- The information of these clusters, in proper format, will be sent to central server which will combine all these clusters to form global level clusters without disturbing the corresponding subspaces.
- The global clusters will be sent back to local sites to update local clusters according to global pattern.

The remainder of this document is organized as follows. The related work on Subspace Clustering techniques is reviewed in the following section. In section 4, our algorithm ISC (Intelligent Subspace Clustering) is being formulized. Section 5 discusses the current status and future research direction followed by references.

**3. Related Work**

Clustering generates groups of similar objects while assigning dissimilar objects to different clusters [4]. In density based clustering, clusters are dense areas separated by sparsely populated areas as in DBSCAN [7]. It has shown to be capable of detecting arbitrarily shaped clusters even in noisy databases. Traditional clustering does not scale to high-dimensional spaces. As clusters do not show across all attributes, they are hidden by irrelevant attributes. Global dimensionality reduction techniques such as principal component analysis are typically not appropriate, as relevance is not globally uniform.

Subspace Clustering is a very important technique to seek clusters hidden in various subspaces (Dimensions) in a very high dimensional database. These algorithms search for clusters in any possible subset of the attributes [6, 13]. As the number of subspaces is exponential in the number of dimensions, this is a challenging task. There are very few approaches to Subspace Clustering. The pioneering one is CLIQUE (CLUstering in QUEst) [9]. CLIQUE is a grid-based algorithm, using an apriori-like method which recursively navigates through the set of possible subspaces. A slight modification of CLIQUE is the algorithm ENCLUS (Entropy based CLUStering) [2]. A more significant modification to CLIQUE is MAFIA (Merging of Adaptive Finite IntervAls), which is also a grid-based but uses adaptive, variable sized grids in each dimension [11].

The major disadvantage of all these techniques is caused by the use of grids. Grid-based approaches are based on positioning of grids. Thus clusters are always of fixed size and depend on orientation of grid.

Another approach to Subspace clustering is density based. SUBCLU extends non-discretized density-based clustering to subspaces [5]. Compared to the grid-based approaches, SUBCLU achieves a better clustering quality but requires a higher runtime. SURFING is one more effective and efficient algorithm for feature selection in high dimensional data [1]. It finds all subspaces interesting for clustering and sorts them by relevance. But it just gives relevant subspaces for further clustering. The only approach which can find subspace cluster hierarchies is HiSC [3]. However it uses the global density threshold and epsilon distance at different levels of dimensionalities while finding subspace clusters, its result is biased with respect to dimensionality.

To find clusters those are hidden in different subspaces, parameters like density threshold, has to be set depending upon number of dimensions considered for clustering. Additionally, we should be able to see the intermediate level clustering results so that to change parameters for in-depth analysis and interaction. To implement all these, an algorithm ISC (Intelligent Subspace Clustering) is being designed which will help solve this problem.

**4. ISC : Intelligent Subspace Clustering**

This new clustering algorithm ISC (Intelligent Subspace Clustering) tries to overcome three major limitations of the existing state of the art techniques.

- Global level density thresholds across all dimensionality levels generally lead to improper clusters. Various subspaces form clusters of different applications and hence need to define relative threshold. ISC applies the differential and adaptive thresholds, which helps in finding meaningful clusters.
- The uniform parameters approach is not suitable for different kinds of databases. ISC implements dynamic and adaptive determination of meaningful clustering parameters based on hierarchical filtering approach.
- The most important feature of ISC is the ability of incremental learning and dynamic inclusion and exclusions of subspaces which lead to better cluster formation.

Algorithm ISC is based on the density notion of hierarchical subspace clustering. The concept of hierarchy will be used in ISC at dimension level. It first starts finding low dimensional subspace clusters and then tries to extend these low dimensional subspace clusters to form higher dimensional meaningful clusters. Objects will be assigned to subspace clusters using the density notion of clustering. As the number of possible subspaces is exponential in the number of dimensions, this is a challenging task both with respect to efficient runtime of the algorithm as well as to the typically enormous number of output clusters. To cope with this, we will compute the relevance of dimensions and provide rankings to the dimensions according to the interestingness of the clustering structure they exhibit. Thus we will start with highest ranked dimension, build 1-d clusters and then continue in the descending order of ranking.

As the dimensionality increases, the subspace clusters become sparser. Use of global density threshold to find a dense area doesn't give meaningful results. To eliminate this problem, we will check the cluster quality at each dimension level and adaptively change the values of parameters such as density threshold (µ), epsilon distance (e) and re-run the algorithm with new values at that dimensionality level, till we find the clusters of required quality.

Our work thus has following important improvements over the state of the art subspace clustering approaches -

- Use of density approach for clustering data set, even at each dimension level, builds / finds clusters of any size, shape and density.
- Detects subspace clusters at different dimensionality levels.
- Build a hierarchy of nested subspace clusters
- Users can interact for parameter settings at various dimension level to find meaningful clusters.

**4.1 General Idea**

Let D be a set of N feature vectors with dimensionality d. Let A = {a1, . . . , ad} be the set of all attributes ai of D. Any subset S of A is called a subspace. The projection of an object p onto a subspace S is denoted by ps. For any e R+ the e -neighborhood of an object p D is denoted by Ne(p). The parameter µ specifies the density threshold, initially as an input parameter.

The main idea of ISC is to apply Hierarchical Subspace Clustering to find clusters hidden at different dimension levels of high dimensional database, using density notion of clustering and by changing the necessary parameters adaptively at various dimension levels, to find meaningful subspace clusters. None of the previously proposed algorithms uses density based approach for subspace clustering to detect such hierarchies of nested subspace clusters using differential and adaptive input density parameters.

ISC can be divided into 4 major tasks.

- Application of algorithm RANK to find dimensions in the descending order of relevance / interestingness.
- Application of density based clustering on all 1-d data (starting with highest ranked attribute).
- Continue combining smaller dimensional clusters to form higher dimensional clusters by selecting next ranked dimensions to add with.

- In this case, measuring distance between two subspace clusters, is measuring distance among two sets of points.
- A Single-Link linkage method will be used to find the distance between two clusters.
- Again density approach will be used to detect clusters in that dimension level.
- These subspace clusters will be treated as data points, the distance between clusters (will be called as Subspace Distance) will be treated as e - distance.

- At each dimension level, the subspace clustering will be visualized such that, we can modify the parameter values and see the output again.

**4.2 Algorithm RANK**

Algorithm RANK measure the "interestingness" of a dimension with respect to no. of data points taking part in building subspace cluster along that dimension.

For each object, first Subspace Preference Vector will be computed. This vector will contain "1" for all those dimensions for which the point can become part of the cluster along that dimension and "0" otherwise. For this we search e neighborhood of the point in that dimension. If it contains no. of objects higher than µ, the point can participate in cluster along that dimension. So p_vector contains "1" in that dimension position. All these sets are then compared and the attributes with highest number of "1"s are ranked as most interesting. Subsequently in descending order we tag remaining dimensions with corresponding ranking.

**4.3 Hierarchical Subspace Clustering**

Hierarchical Subspace Clustering uses the concept that, several subspace clusters of low dimensionality may together form a subspace cluster of higher dimensionality [3].

The general idea is to evaluate whether two points are contained in a common Subspace Cluster. For example, two points that belong to two different 1-dimensional subspace clusters may also be contained together in a 2-dimensional cluster that consists of the two 1-dimensional projections. If this distance has to be measured between 2 subspace clusters (Subspace Dimensionality ? 1), we will use single-link linkage method [10] which tries to find the minimum distance between two clusters Ci and Cj as the distance between the two closest objects, we call it as Subspace Distance. The strategy will be to merge those points into common clusters which will have Subspace Distances smaller than e distance, adaptively calculated at that level. A hierarchy of subspace clusters will be build accordingly.

**4.4 Algorithm ISC**

On one dimensional dataset DBSCAN [7], the first but robust density based clustering algorithm will be applied with input parameters e (radius) and µ (density threshold). The first dimension to be considered will be the dimension which will be having highest rank given by algorithm RANK. DBSCAN is able to detect arbitrarily shaped clusters by one single pass over the data. DBSCAN checks the e-neighborhood of each point p in the database. If Ne(p) of an object p consists of at least µ objects, i.e., if p is a core object, a new cluster C containing all objects of Ne(p) is created. Then, the e - neighborhood of all points q C which has not yet been processed is checked. If object q is also a core object, the neighbors of q which are not already assigned to cluster C are added to C and their e -neighborhood is checked in the next step. This procedure is repeated until no new point can be added to the current cluster C. Then the algorithm continues with a point which has not yet been processed, trying to expand a new cluster.

The algorithm ISC (Fig. 3) will start at 1-dimension and will iterate till d-dimensions according to the ranks stored in vector p_vector. At each dimension level, we will check the clusters against quality, if the quality is satisfactorily accepted; the algorithm can be made to stop at that dimension level. Otherwise it will be continued to the higher level of dimensionality.

We will need to define Subspace Distance which will be the distance between two subspace clusters to be combined in higher dimensions. For this Single-Link method [10] will be used to find Subspace Distance between two subspace clusters. It is the distance between any two clusters Ci and Cj as the minimum distance between two closest objects.

To visualize subspace clusters, we will need at each dimension level - the no. of attributes, no. of clusters (overlapping) and no. of objects (overlapping). Using this we will be able to display graphically the subspace clusters on 2-dimesnional screen. ISC will further allow to change the parameters and view the results, or to continue to next higher level of dimension.

**4.5 Analysis**

ISC has two input parameters e and µ used to define :

- An e - neighborhood of a point p.
- A core objects (a point with a neighborhood consisting of more than µ objects).
- To find the interestingness of a subspace according to the number of points, greater than µ in e - neighborhood across that subspace.
- Subspace clusters to be merged at higher dimension level, if the distance between them is smaller than e.

First, the parameter e is rather critical because if it is chosen too large, the subspace image may be blurred by noise points, whereas if it is chosen too small, there may not be a clear subspace preference observable, though existing. Further, as the dimensions goes on increasing, the distance between data points already become larger. So we may need to increase it a little bit at higher level to accommodate high level subspace clusters.

Second, the parameter µ is important for specifying the density threshold at various levels of subspace hierarchy. Similar to e, µ also has to be determined adaptively as the data points become more and more sparse.

For determining Density Threshold 'µ' as well as Radius 'e', we will try to keep one parameter constant and check for subspace clusters against jitter. If subspace clusters have lot of jitter, that means we need to increase Density Threshold or Radius to accommodate more points within the cluster. This gives the ability of incremental learning and dynamic inclusion and exclusions of subspaces which lead to better cluster formation.

Thus our algorithm ISC, Intelligent Subspace Clustering algorithm, which uses the concept of Hierarchical Subspace Clustering, tries to overcome the limitations of existing subspace clustering approaches.

- It will be able to detect clusters of any shape and size as it will be based on density notion of clustering.
- It will be able to find clusters at various subspace hierarchies i.e. lower dimensional clusters embedded within higher dimensional clusters.
- ISC does not use a global clustering criterion attempting to form more meaningful clusters.
- ISC applies differential and adaptive level thresholds to find meaningful clusters depending upon the application need.
- Thus it extends the ability of incremental learning and dynamic inclusion and exclusions of subspaces which lead to better cluster formation.

**5 Current Status and Future Plan**

As described in Section 4.1 of the first Progress Report, the Research Project technique consists of six major steps as shown in Fig. 4 which gives the semester wise implementation plan of the project.

**5.1 Current Status**

In this semester, we first defined our approach towards Subspace Clustering to be applied at local sites as "Hierarchical Subspace Clustering" which tries to find subspace clusters hierarchically starting at 1-dimension and then going to wards higher dimension. We also adopted the adaptive parameter setting approach which will help to decide the quality of clustering at various dimension level. We also designed necessary algorithms (Algorithm ISC and Algorithm RANKING) to implement this approach.

Thus, as per the schedule, in this semester we could design the algorithms to implement Subspace Clustering at local sites which defines the proper further direction of our research. The implementation of this will surely have enormous efficiency and accuracy advantages to Data Mining.

**5.2 Future Plan**

In the immediate future the tasks to be completed would be -

- Start implementing these algorithms using C/C++
- Collect sample High Dimensional Database to apply these algorithms
- To test the results visually, learn some tool like MATLAB

**6. References**

- C. Baumgartner, C. Plant, K. Kailing, H.P. Kriegel, and P. Kröger, "Subspace Selection for Clustering High-Dimensional Data", In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM), Volume , Issue , 1-4 Nov. 2004 Page(s): 11 - 18
- C. H. Cheng, A. W.-C. Fu, and Y. Zhang, "Entropy-based subspace clustering for mining numerical data", In Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, pages 84{93, 1999.
- Elke Achtert, Christian Bohm, Hans-Peter Kriegel, Peer Kroger, Ina Muller-Gorman, Arthur Zimek, "Finding Hierarchies of Subspace Clusters", In : Proc. PKDD. (2006)
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco, CA, USA, 2001.
- K. Kailing, H.P. Kriegel, and P. Kroger, "Density-connected subspace clustering for high-dimensional data", In Proceedings of the 4th SIAM International Conference on Data Mining (SDM), Orlando, FL, 2004.
- L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. ACM SIGKDD Exploration Newsletter, 6(1):90{105, June 2004.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with Noise", In Proceedings of the 2nd ACM International Conference on Knowledge Discovery and Data Mining (KDD), Portland, OR, 1996.
- M. Steinbach, L. Ertöz and V. Kumar, "The Challenges of Clustering High Dimensional Data", [online]. Available : http://www-users.cs.umn.edu/~kumar/papers/high_dim_clustering_19.pdf
- R. Agrawal, J. Gehrke, D. Gunopulos, and. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications", In Proceedings of the SIGMOD Conference, Seattle, WA, 1998.
- R. Sibson. SLINK, "An optimally efficient algorithm for the single-link cluster method", The Computer Journal, 16(1):30{34,1973.
- S. Goil, H. Nagesh, and A. Choudhary, "MAFIA: Efficient and scalable subspace clustering for very large data sets", Technical Report CPDC-TR-9906-010, Northwestern University, 1999.
- Anne Patrikainen and Marina Meila, "Comparing Subspace Clusterings", IEEE Transactions on Knowledge and Data Engineering, Issue Date: July 2006, pp. 902-916
- C.C. Aggarwal , P.S. Yu; "Redefining Clustering for High-Dimensional Applications", IEEE Transactions on Knowledge and Data Engineering, Issue Date : March 2002, pp. 210-225

Article name: **The data mining tasks essay, research paper, dissertation**