Abstract
Clustering is the process of dividing data into groups such that objects in the same group are similar to one another but different from objects in other groups. Clustering plays an important role in many disciplines and has a wide range of applications. The objective of data clustering is to discover patterns and structure in high-dimensional data: it points out the sparse and the crowded regions of the data space and reveals the overall distribution patterns of the dataset. High-dimensional data are ubiquitous in many areas of machine learning, pattern recognition, computer vision, signal and image processing, bioinformatics, and related fields. The idea of data clustering is similar to the human way of thinking and is simple in nature. K-means is a partitioning clustering algorithm that divides the given data objects into k clusters iteratively, converging to a local minimum. In data mining, however, data sets often contain categorical values, which require special algorithms: K-means can process only numeric values, and this restriction limits its use in data mining.
Introduction
Clustering is the process of dividing data into groups such that objects in the same group are similar to one another but different from objects in other groups. In data mining, clustering is very useful for discovering distribution patterns in the data being processed. Data clustering has applications in many fields such as pattern recognition, medicine, marketing, statistics, and business. For large datasets, clustering is a very effective mining tool. Clustering techniques fall into families such as partitioning and hierarchical methods. Clustering is a data mining technique used to extract information from raw data. It is an unsupervised learning process, meaning that no labeled outcomes are supplied to the model during training. Clustering plays an important role in data mining applications such as information retrieval and text mining, web analysis, medical diagnosis, and marketing.
Clustering plays an important role in many disciplines and has a wide range of applications. Data clustering applications deal with large datasets, data with many attributes, and attributes of different types. Nowadays data clustering is applied in many fields such as marketing, earthquake studies, banking, databases (including privacy-preserving computation), city planning, and online sales. In marketing, clustering helps marketers discover distinct groups within their customer base.
The objective of data clustering is to discover patterns and structure in high-dimensional data. Clustering is applied for data summarization and for finding groups, or clusters, of similar data. As an unsupervised learning task, clustering deals with finding structure in a collection of unlabeled data. In large databases, multi-dimensional data must be clustered accurately and efficiently to determine patterns and extract useful information from them. In a large multi-dimensional data set, the data space is not uniformly occupied by the data points: some regions of the space are crowded while others are sparse. Data clustering points out the sparse and the crowded regions and reveals the overall distribution patterns of the dataset. Data clustering is studied in statistics, machine learning, and databases, each of which contributes different methods for analyzing and grouping large data to extract useful information.
Literature Review
The idea of data clustering is similar to the human way of thinking and is simple in nature. To analyze data, a large data set is first divided into a small number of groups or categories. For low-dimensional data this grouping is simple, but categorizing large and complex data is difficult for humans, which is why soft computing methods have been adopted to solve this kind of problem.
Data Clustering Techniques
Many data clustering techniques are in use; some of the most common are:
- K-means Clustering
- Mountain Clustering
- Fuzzy C-means Clustering
- Subtractive Clustering
The common objective of all the clustering techniques listed above is to find the cluster centers that represent each cluster. A cluster center indicates where the heart of each cluster is located, so that later, when presented with an input vector, the system can determine which cluster this vector belongs to by measuring a similarity metric between the input vector and all the cluster centers and selecting the nearest, i.e., most similar, one.
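To make the assignment step concrete, here is a minimal sketch, assuming Euclidean distance as the (dis)similarity metric; the function and variable names are illustrative, not taken from any particular library:

```python
import numpy as np

def nearest_center(x, centers):
    """Return the index of the cluster center closest to input vector x.

    Uses Euclidean distance; any other similarity metric could be
    substituted here.
    """
    distances = np.linalg.norm(centers - x, axis=1)  # distance to every center
    return int(np.argmin(distances))                 # index of the nearest one

# Example: three 2-D cluster centers and one query vector.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
print(nearest_center(np.array([4.2, 4.8]), centers))  # -> 1
```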
The K-means Clustering Algorithm
K-means is an unsupervised, numerical, iterative method. It is very fast and simple, so in many practical applications it has proved to be an effective way of finding good clustering results. Improved variants of this partitioning algorithm can handle data with symbolic attributes as well as data with numerical attributes. Such methods reduce the impact of noise, which enhances the efficiency of clustering, and they provide a systematic way to find the initial cluster centers, so that the centers obtained are consistent with the distribution of the data.
K-means is a partitioning clustering algorithm: it divides the given data objects into k clusters iteratively, converging to a local minimum, so that the resulting clusters are independent and compact. The algorithm consists of two phases. The first phase randomly selects k centers, where the value of k is fixed in advance. The second phase assigns each data object to its closest center, where the distance between a data object and a cluster center is usually the Euclidean distance.
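A minimal sketch of these two phases, assuming Euclidean distance and randomly chosen initial centers (the function name and parameters are illustrative, not a reference implementation):

```python
import numpy as np

def k_means(data, k, max_iter=100, seed=0):
    """Plain k-means: random center selection, then alternate assignment
    and center updates until the assignments stop changing."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Phase 1: randomly select k data objects as the initial centers.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # Phase 2: assign each object to its closest center (Euclidean).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stable: converged to a local minimum
        labels = new_labels
        # Move each center to the mean of its assigned objects.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return labels, centers
```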
Improved K-means Clustering Algorithm
The standard k-means algorithm computes the distance from each data object to all k cluster centers in every iteration, which takes a lot of execution time, especially for large databases. The main idea of the improved algorithm is to maintain two simple data structures that hold, for each data object, its distance to the nearest cluster and the label of that cluster, so that both can be reused in the next iteration: we compute the distance between the data object and the new center of its cluster, and if this distance is equal to or smaller than the distance to the old center, the object stays in the cluster it was assigned to in the previous iteration.
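The following sketch illustrates this caching idea; the two extra arrays (`labels` and `near_dist`) play the role of the two data structures described above, and all names are illustrative rather than taken from the cited paper's code:

```python
import numpy as np

def improved_k_means(data, k, max_iter=100, seed=0):
    """K-means variant caching, per object, its cluster label and its
    distance to that cluster's center. If an object is no farther from
    its (moved) center than before, the full distance computation to
    all k centers is skipped for that object."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)]
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)      # cached cluster label per object
    near_dist = dists.min(axis=1)      # cached distance to own center
    for _ in range(max_iter):
        for j in range(k):             # recompute the k centers
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
        changed = False
        for i, x in enumerate(data):
            d_own = np.linalg.norm(x - centers[labels[i]])
            if d_own <= near_dist[i]:  # no farther than before: stay put
                near_dist[i] = d_own
                continue
            d_all = np.linalg.norm(centers - x, axis=1)  # full recompute
            j = int(d_all.argmin())
            changed = changed or (j != labels[i])
            labels[i], near_dist[i] = j, d_all[j]
        if not changed:
            break
    return labels, centers
```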
Experimental Result
We tested the clustering algorithms on several well-known data sets, namely a synthetic data set, the iris data set, and the image segmentation data set. On all data sets we carried out experiments for the clustering problems using only the feature vectors and ignoring the class labels. The synthetic data set contains 250 two-dimensional data points, the iris data set has 150 four-dimensional data points, and for the image segmentation data set we use 210 six-dimensional data points obtained through PCA on the original 18-dimensional points. The quality of the resulting solutions was measured in terms of the final clustering error.
For each data set we carried out the following experiments:
- one run of the global k-means algorithm for M = 15.
- one run of the fast global k-means algorithm for M = 15.
- the k-means algorithm for k = 1, …, 15. For each value of k, the k-means algorithm was executed N times (where N is the number of data points) starting from random initial positions for the k centers, and we computed the minimum and average clustering error as well as its standard deviation; a sketch of this protocol follows the list.
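A rough sketch of the third experiment's protocol, assuming the `k_means` function sketched earlier is in scope and defining the clustering error as the sum of squared distances to the nearest center (the helper names are illustrative):

```python
import numpy as np

def clustering_error(data, centers):
    """Sum of squared distances from each point to its nearest center."""
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return float((d.min(axis=1) ** 2).sum())

def random_restart_stats(data, k_max=15):
    """For k = 1..k_max, run k-means from N random initializations
    (N = number of data points) and report min/mean/std of the error."""
    n = len(data)
    for k in range(1, k_max + 1):
        errors = [clustering_error(data, k_means(data, k, seed=s)[1])
                  for s in range(n)]
        print(k, min(errors), np.mean(errors), np.std(errors))
```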
High-dimensional data are ubiquitous in many areas of machine learning, pattern recognition, computer vision, signal and image processing, bioinformatics, and related fields. For example, images consist of billions of pixels, videos can have millions of frames, and text and web documents are described by hundreds of thousands of features. High dimensionality not only increases the memory requirements and the computational time of algorithms, but also degrades their performance because of noise and the small number of samples relative to the ambient space dimension, a problem commonly referred to as the "curse of dimensionality".
In data mining, data sets often contain categorical values, which need special algorithms, because K-means can process only numeric values, and working only on numeric values limits its use in data mining. The k-modes algorithm removes this limitation by extending k-means to categorical domains. K-modes is a fast clustering algorithm for categorical data and an extension of the popular and well-known k-means algorithm. Many data mining applications involve categorical data, and the common approach of converting categorical values into numeric ones does not necessarily produce meaningful results when the categorical domains are not ordered.
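To illustrate how k-modes replaces the numeric machinery of k-means, here is a minimal sketch of its two categorical counterparts, the simple matching dissimilarity and the per-attribute mode; the function names are illustrative:

```python
def matching_dissimilarity(x, y):
    """k-modes distance: the number of attributes on which two
    categorical objects disagree (simple matching)."""
    return sum(a != b for a, b in zip(x, y))

def mode_of_cluster(rows):
    """The 'mode' replaces the mean: per attribute, the most frequent
    category among the cluster's objects."""
    cols = list(zip(*rows))
    return tuple(max(set(col), key=col.count) for col in cols)

# Example with small categorical records:
print(matching_dissimilarity(("red", "small", "round"),
                             ("red", "large", "round")))   # -> 1
print(mode_of_cluster([("red", "small"), ("red", "large"),
                       ("blue", "large")]))                # -> ('red', 'large')
```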
Implementation and Result
In many fields, high-dimensional data is growing day by day. In high dimensions data becomes sparse, and as a result distance measures become nearly meaningless. Various data clustering techniques and algorithms are applied to address this problem, and to extract knowledge from terabytes of data these techniques and algorithms need continual improvement.
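The sparsity effect can be demonstrated numerically: as the dimension grows, the gap between the nearest and the farthest pairwise distance shrinks relative to the nearest one, so nearest-neighbor distinctions lose meaning. A small illustrative experiment:

```python
import numpy as np

# "Distance concentration": the relative contrast (max - min) / min of
# pairwise distances shrinks as the dimension of uniform data grows.
rng = np.random.default_rng(0)
for dim in (2, 10, 100, 1000):
    x = rng.random((500, dim))
    sq = (x ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    d = np.sqrt(d2[np.triu_indices(500, k=1)])   # all pairwise distances
    print(f"dim={dim:5d}  relative contrast={(d.max() - d.min()) / d.min():.2f}")
```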
One popular framework is MapReduce, which provides high speed and scalability for implementing clustering algorithms. Parallel clustering algorithms have improved the speed and scalability of clustering, but problems remain with processor distribution and memory. The MapReduce model, popularized by Google, is represented by the open-source Hadoop framework, shown in Fig 2.
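As an illustration of how one k-means iteration decomposes into map and reduce steps, here is a toy sketch in plain Python standing in for a real Hadoop job; a production mapper would pre-aggregate partial sums with a combiner, and all names here are illustrative:

```python
import numpy as np
from collections import defaultdict

def map_step(chunk, centers):
    """Mapper: for each point in a data chunk, emit
    (cluster_id, (point, count)) keyed by the nearest center."""
    out = []
    for x in chunk:
        j = int(np.linalg.norm(centers - x, axis=1).argmin())
        out.append((j, (x, 1)))
    return out

def reduce_step(pairs, centers):
    """Reducer: sum the emitted points per cluster and output new centers."""
    sums, counts = defaultdict(float), defaultdict(int)
    for j, (x, c) in pairs:
        sums[j] = sums[j] + x            # vector sum per cluster
        counts[j] += c
    return np.array([sums[j] / counts[j] if counts[j] else centers[j]
                     for j in range(len(centers))])

# One iteration: map over chunks (in parallel on a real cluster), then reduce.
data = np.random.default_rng(1).random((1000, 2))
centers = data[:3].copy()
pairs = [p for chunk in np.array_split(data, 4) for p in map_step(chunk, centers)]
centers = reduce_step(pairs, centers)
```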
The clustering problem occurs in applications where a partition of the data is required, and it has been studied for many years. Besides geometric procedures and the k-means algorithm, another widely used technique is the probabilistic approach, in which the partition obtained can be interpreted from a statistical point of view. In high-dimensional spaces, model-based methods show disappointing behavior because they are over-parametrized on high-dimensional data. To reduce the dimension of the data, dimension reduction methods are applied before the clustering step; however, dimension reduction has drawbacks of its own, and to avoid them specialized model-based methods are used to cluster high-dimensional data directly.
Modern information and communication technology produces massive amounts of data. The aim of pattern recognition is to extract hidden structure from data in order to build data representations and support symbolic data processing. Applications of clustering algorithms range from video data compression and audio signal processing to structure detection in machine learning. Here, a stochastic optimization technique for data clustering related to the maximum entropy principle is considered: clustering of pairwise proximity data is mathematically formulated as a minimization problem solved by deterministic annealing. Data clustering belongs to the unsupervised learning problems of pattern recognition and statistics. For vector quantization, the deterministic annealing procedure yields solutions under different rate constraints. Grouping data into clusters is essential for discovering structure; embedding and pairwise clustering are demonstrated on both real and artificial data.
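A minimal sketch of deterministic annealing for central clustering may help fix the idea; note that the pairwise-proximity formulation of the cited work is more involved, and this version simply anneals soft k-means assignments, with all parameter names and the cooling schedule chosen for illustration:

```python
import numpy as np

def deterministic_annealing(data, k, t_start=10.0, t_min=0.01, cool=0.9):
    """Soft clustering with a temperature parameter T: at high T every
    object is spread over all clusters; as T is lowered the assignments
    harden and the solution approaches a (robust) k-means partition."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(0)
    # Start all centers near the global mean, slightly perturbed.
    centers = data.mean(axis=0) + 1e-3 * rng.standard_normal((k, data.shape[1]))
    T = t_start
    while T > t_min:
        for _ in range(20):  # alternate assignment/update at fixed T
            d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            # Gibbs / maximum-entropy assignment probabilities at temperature T
            p = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / T)
            p /= p.sum(axis=1, keepdims=True)
            centers = (p.T @ data) / p.sum(axis=0)[:, None]
        T *= cool  # cooling schedule
    return centers
```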
Many techniques have previously been implemented and used to cluster gene expression data. Data clustering analysis has many applications and is important in the medical field: it can be applied to tumor tissues to differentiate tumor types on the basis of gene expression patterns in the data. A novel algorithmic technique has been proposed for the problem of clustering gene expression patterns; it does not build a tree of clusters, but constructs the clusters as unrelated entities. The large quantities of gene expression data now available create the need for systematic ways of discovering expression patterns. Several algorithms are presented, one of which is a heuristic with no formal time-complexity guarantee.
Data sets measured in terabytes are now common, and in text mining collections of a few million documents are normal. For such huge data it is preferable to use parallel computing rather than disk-based algorithms, which are much slower. Parallel data mining algorithms have been considered for classification and association rules. Clustering is used to analyze and label unstructured collections of text documents. Parallel clustering algorithms use shared-memory parallel machines to analyze large data in less time and extract information from it. The parallelization of the direct k-means algorithm for data clustering is also a focus here.
Conclusion
As discussed above, clustering is the process of dividing data into groups. In this paper several data clustering techniques were presented, including K-means clustering, Mountain clustering, and Subtractive clustering. The main focus was on data clustering analysis, for which many algorithms and techniques were covered, in particular the improved K-means algorithm, whose main idea is to maintain two simple data structures holding, for each data object, its distance to the nearest cluster and the label of that cluster across iterations. K-modes, an extension of the K-means algorithm for categorical data, was also presented.
References
- Joshi, A. and Kaur, R., 2013. A review: Comparative study of various clustering techniques in data mining. International Journal of Advanced Research in Computer Science and Software Engineering, 3(3).
- Gulati, H. and Singh, P., 2015. Clustering techniques in data mining: A comparison. IEEE, pp. 410-415.
- Berkhin, P., 2006. A survey of clustering data mining techniques. In Grouping Multidimensional Data (pp. 25-71). Springer.
- Alguliyev, R., Aliguliyev, R., Bagirov, A.M. and Karimov, R., 2016. Batch clustering algorithm for big data sets.
- Zhang, T., Ramakrishnan, R. and Livny, M., 2014. Method and system for data clustering for very large databases. Google Patents.
- Hammouda, K. and Karray, F., 2016. A comparative study of data clustering techniques. University of Waterloo, Ontario, Canada, p.1.
- Na, S., Xumin, L. and Yong, G., 2010. Research on k-means clustering algorithm: An improved k-means clustering algorithm. In 2010 Third International Symposium on Intelligent Information Technology and Security Informatics (pp. 63-67). IEEE.
- Likas, A., Vlassis, N. and Verbeek, J.J., 2003. The global k-means clustering algorithm. Pattern Recognition, 36(2), pp. 451-461.
- Elhamifar, E. and Vidal, R., 2013. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11), pp. 2765-2781.
- Steinbach, M., Karypis, G. and Kumar, V., 2000. A comparison of document clustering techniques. In TextMining Workshop at KDD 2000.
- Huang, Z., 1997. A fast clustering algorithm to cluster very large categorical data sets in data mining. DMKD, 3(8), pp. 34-39.
- Parsons, L., Haque, E. and Liu, H., 2004. Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter, 6(1), pp. 90-105.
- Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y. and Herawan, T., 2014, June. Big data clustering: a review. In International Conference on Computational Science and Its Applications (pp. 707-720). Springer, Cham.
- Bouveyron, C. and Brunet-Saumard, C., 2014. Model-based clustering of high-dimensional data: A review. Computational Statistics & Data Analysis, 71, pp.52-78.
- Hofmann, T. and Buhmann, J.M., 1997. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1), pp. 1-14.
- Ben-Dor, A., Shamir, R. and Yakhini, Z., 1999. Clustering gene expression patterns. Journal of Computational Biology, 6(3-4), pp. 281-297.
- Bouveyron, C., Girard, S. and Schmid, C., 2007. High-dimensional data clustering. Computational Statistics & Data Analysis, 52(1), pp.502-519.
- Dhillon, I.S. and Modha, D.S., 2002. A data-clustering algorithm on distributed memory multiprocessors. In Large-scale parallel data mining (pp. 245-260). Springer, Berlin, Heidelberg.