Visual Interactive 3-Dimensional Clustering
This paper resulted from postgraduate study and collaborative research. The paper describes a new approach to clustering problem based on geometric modeling of clusters and visualization of the clustering process. Clusters are described as geometric solid objects with function representations. Our approach applies the concepts of geometric modelling to clustering and uses data density as clustering criteria. An implementation of the proposed geometric clustering model is discussed. Visual clustering allows the user to see the result of clustering on the screen and set the appropriate parameters. The user visually clusters the data through graphics interface accessing dynamically 3-dimensional projections of multidimensional points from database or files.
Clustering is one of the methods of unsupervised machine learning. Clustering techniques can be applied for similarity search, customer segmentation, pattern recognition, trend analysis, etc. Many methods have been implemented recently but there are still some limitations in the algorithms.
In the existed methods, cluster is usually defined as a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters [1]. This definition only reflects the nature inside clusters, but does not emphasize the difference between clusters and does not describe the granular and unseparated characteristics of clusters.
For we intend to apply the clustering method on multi various data sets, the shape of clusters can be arbitrary. Most of the existing methods are based on the following assumptions: the data sets should have the certain distribution or the number of clusters has to be known in advance. In probabilistic modeling clustering technique, data is considered to be a sample independently drawn from a mixture model of several probability distributions. To determine the distribution of data sets may be the difficult task in these methods. On the other hand, the traditional partitioning methods such as k-means [2] and k-medoids [3] face the problem to determine the number of clusters in advance. Moreover, both methods lack the ability to deal with the clusters of concave shape. Hierarchical methods can detect clusters with irregular shapes, but the linkage methods predispose to find clusters with convex shapes. The improved hierarchical algorithms such as CURE [4] still cannot detect clusters with very complex shape because of the limitation of representation of clusters. The density-based algorithms such as DBSCAN [5] and DENCLUE [6] can find arbitrary shape clusters relatively efficiently. But setting of the parameters in DBSCAN can be a problem, and the efficiency of checking the connectivity between attractors in DENCLUE could be improved.
Moreover, to get more reasonable results of clustering, the value of parameters should be varied according to different distributions of data sets. It means that the parameters should be local rather than global. But, due to the assumptions mentioned above and global parameter setting, few methods existed can deal well with such data set when the density of it changes a lot through the whole data space.
To give the user an easy understanding of both the data set and the results becomes more and more important for the modern clustering systems [7]. The visualization techniques [8] could improve interpretability and usability of the data and the clustering process. First, the user is interested in not only the results of clustering such as attributes of objects in the cluster, but also in the characteristics of cluster such as shape of cluster, and relationship between clusters. Second, since the data from different applications may be quite different in features, currently there is no clustering system that could deal with all possible types of data well. Therefore, the user should be involved into the clustering process to make it more efficient and accurate. Thus, visualization techniques could be used not only for the interpretation of the results but also for the interpretation of the whole process of clustering in order to help the user to make decisions or to set the values of parameters. For instance, the user could select the projection directions [9] for high dimension data set. To incorporate visualization techniques, the existing clustering algorithms use the result of clustering algorithm as the input for visualization system that is costly and inefficient. The better solution is to combine two processes together, which means to use the same model in clustering and visualization. Traditional clustering can be described as follows. Given a set of multidimensional points {x1,x2,…,xn} , find a partition of the points into clusters so that the points within each cluster are similar to one another. In this paper, we extend the definition of cluster by adding the boundary of the cluster, and we introduce geometric model of the cluster describing cluster as a solid. This allows us to propose a new clustering algorithm integrating computer graphics algorithms and visualization techniques.
Furthermore, the aim of clustering is not only to find clusters and attributes of objects in the cluster, but finding the clusters of the certain shape could be even more valuable in some applications. Then, we need to know the formula of every cluster. In this case, the search for similar shape clusters and comparison of cluster shapes could be possible [10].
The high dimensionality problem is usually approached by specifying the subspace with the lower dimensionality [12]. In this paper, we do not study the problem of choosing the projected dimensionality.
Clustering is a classical problem in databases and warehousing. Development of data mining query methods and graphical user interfaces is a new trend in data mining [1]. In this work, we propose the geometric model of clustering with function-based representation of solids that allows us to integrate visualization and querying of clusters [11] using the same model.
The paper is organized as follows. Section 2 introduces basic concepts of our solid-based clustering method. The algorithm’s implementation is described in Section 3. Visualization of clustering process and clustering results are shown in Section 4. In Section 5, conclusion and future work are discussed.
Our approach is based on the following basic concepts. Traditional clustering methods define clusters as subsets that consist of objects that are similar to one another. This definition only emphasizes the characteristics of objects inside the cluster, but not between clusters. For the boundary of a cluster could be a good description of the dissimilarity between clusters, we added it to our definition of the cluster. However, how to build the boundary of clusters should be carefully considered. Here, we use the idea of DENCLUE [6], where the distances between all objects can influence at the resulting clusters, and this influence can be simulated by functions. Then, the boundary of cluster can be described as the area consisted of the farthest places on which the objects of cluster can still influence. Therefore, the cluster could consist of the objects inside and on its boundary. From this point, to define cluster as a solid could be a more natural way. This definition of cluster also reflects the granule feature of cluster.
Visualization techniques play a significant role in clustering [7]. As it was shown in [15, 16], it is natural to integrate clustering technique and computer graphics technique. Then, we can have the advantages of both techniques such as an efficient detection of clusters and interpretability of the clustering process to the users. As we mentioned above, although we could use the results we have got from clustering systems as the inputs to the visualization systems, it would be costly and inconvenient. A better solution is to combine these two processes together, which means using the same model for both clustering and visualization. By comparison of formulas in computer graphics with ones in data mining, we have noticed that the formulae of the density-based clustering method—DENCLUE and the formulae of computer graphics solid model—BLOBBY [13] are quite similar. Moreover, the BLOBBY model could simplify the computation if we apply it to clustering. Based on these findings, we build a geometric model with function representation of solids based on implicit modeling for both clustering and visualization.
Function based shape modeling is becoming increasingly popular in computer graphics [13-14]. The idea that complex geometric objects could be produced from "small formulae" is applied in this work. Using the function representation, geometric shapes are defined with the inequality ¦(x,y,z)³0, where the function f is positive for the points inside the solid, equal to zero on its border and negative outside the solid. The operations are defined as function superpositions, and therefore the result of any operation is a functionally defined solid as well, which can be used as an argument to other operations. We build the geometric model by using function representation of solids. It provides the possibility of further analysis such as the comparison of shapes of clusters and querying of clusters. In our model, clusters can be expressed by a unique function. Let P be a set of multidimensional points P={[p1, p2,…, pn]} in the n-dimensional Euclidean space En. Then, a solid reconstructed on the points can be described with function-based representation as follows:
f(X,P)³0, where X belongs to En. The function can be defined analytically or by procedure. Such function defines closed n-dimensional geometric solid in En space under the following conditions:
f(X,P)>0 for the points inside the solid,
f(X,P)=0 for the points on the solid boundary (surface),
f(X,P)<0 for the points outside the solid,
where X= {x1,x2,…,xn} is a position vector of a point in En.
The zero set of these functions provides surfaces, and the values that are greater or equal to zero define multidimensional geometric solid objects. We regard each cluster as a different solid. To find out whether one object belongs to it, we use function-based representation of the solid. We build the solids by computing the density function of objects in the whole field, which means adding up all the influence functions of objects inside the field. The sum is a field density function that consists of all influence functions of objects. The field density function can make a complete description of the whole data space. Parabolic function, square wave function and Gaussian function are some examples of basic influence function. The functions such as meta-balls or soft objects can also serve as the basic influence function for efficiency. We build the solids on the objects (points) whose values of field function are greater than a threshold.
Let us introduce the formal mathematical specification of our solid-based clustering method.
Influence function. The influence of an object rÎRn can be simulated by function f r : Rn®R as follows: