Investigation and Development of a Novel Clustering Algorithm for Big Data Applications
Advances in digital technology, including sensors, communications, cloud computing, and storage, have allowed people to generate large volumes of data, which in turn strain current computational capabilities. This data can take any format, such as text, geometry, images, video, sound, or a combination of these. As many organizations have become data-driven, developing supervised and unsupervised learning algorithms that process big data scalably and efficiently has attracted researchers from application domains such as computer vision, data mining, computational biology, and the social sciences. One feasible way to deal with this massive data is to group it into categories or clusters, where objects within a cluster are similar to one another and dissimilar to objects in other clusters. This indispensable technique can often help to find models, discover useful knowledge and insights, and uncover hidden features or characteristics that naturally divide the cases. In this work, we extend previous efforts to develop a hash-based clustering algorithm that employs a double Golay encoding technique. We first study the structures and properties of error-correction codes and then introduce our improved version of a hash-based clustering algorithm, which is scalable, efficient, reliable, and much better suited to big data applications. In particular, our approach uses a single encoding technique, which improves the quality of the clustering and reduces the overall computational cost. This clustering algorithm can be applied effectively to various computational intelligence problems. Furthermore, we provide a fast and noise-robust pattern prediction and classification algorithm. This method classifies an input pattern into a specific category or class by using characteristics derived from the Golay clustering scheme.
Moreover, we investigate and discuss the factors that could increase the noise and illustrate how they can be controlled. We present empirical evidence to support our theoretical arguments. A number of similar algorithms are also reviewed and evaluated for their relative performance. In addition, we present a clustering ensemble approach that aims to improve clustering quality. This approach generates a set of clustering schemes from the same dataset and combines them into a final clustering. Hence, the produced clustering is a consensus of multiple clustering schemes.
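
As a rough illustration of the consensus idea described above, the following minimal sketch combines several clusterings of the same items through a co-association (evidence accumulation) rule. This is one common way to build a clustering ensemble and is not necessarily the combination scheme used in this work; the function names `co_association` and `consensus_clustering` and the `threshold` parameter are hypothetical and chosen only for this example.

```python
from itertools import combinations


def co_association(labelings):
    """Map each item pair (i, j), i < j, to the fraction of base
    clusterings that place both items in the same cluster.
    Pairs that never share a cluster are omitted."""
    n = len(labelings[0])
    counts = {}
    for labels in labelings:
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                counts[(i, j)] = counts.get((i, j), 0) + 1
    return {pair: c / len(labelings) for pair, c in counts.items()}


def consensus_clustering(labelings, threshold=0.5):
    """Assumed combination rule: link two items when they are
    co-clustered in more than `threshold` of the base clusterings,
    then take connected components as the consensus clusters."""
    n = len(labelings[0])
    parent = list(range(n))  # union-find forest over the items

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), frac in co_association(labelings).items():
        if frac > threshold:
            parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three base clusterings of the same five items (made-up labels).
base = [
    [0, 0, 1, 1, 2],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
print(consensus_clustering(base))  # [0, 0, 1, 1, 1]
```

Label values need not agree across the base clusterings, since only pairwise co-membership is counted. Raising `threshold` demands stronger agreement before two items are merged, which tends to produce more, smaller consensus clusters.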