Biologically Inspired Data Mining Framework and Algorithms
We live in a world of "big data." With every website browsed, e-mail sent, status shared, photo uploaded, tweet posted, or search query issued, we leave behind digital imprints that add to the massive and rapidly growing amount of data travelling over cables and airwaves. The high volume, velocity, and variety of these data make it challenging to structure, understand, search, mine, and visualize them with existing data-mining algorithms. One of the traditional means of understanding this avalanche of data is to classify or cluster large and ambiguous data elements into categories. Data clustering has been the subject of active research in several fields such as statistics, data mining, and machine learning. Research on clustering is extensive, yet the limitations of existing clustering algorithms have largely been ignored. This dissertation introduces a novel biologically inspired data-clustering model that is based on the natural phenomenon of bird flocking. Bird flocking behavior has inspired researchers at various institutions, including the United States Department of Defense, where it became a sine qua non in most rigorous missions carried out by Unmanned Aerial Vehicles. In this dissertation, we define a multi-disciplinary Swarm Clustering Framework that encapsulates new flocking algorithms for different clustering settings: batch clustering, stream clustering, and outlier detection. The underlying idea of our approach can be summarized as follows: we model data (e.g., online social network activities, text documents, or gene expressions in microarray experiments) as flocks of birds flying in a fictional yet observable space, where data leaders are detected, followers flock with their leaders, and over the course of the flocking process the fittest leaders survive to lead their flocks (clusters).
We extend the rules that orchestrate flocking behavior in nature with new flocking rules suited to data clustering: (i) flock homogeneity and (ii) flock leadership. Large data sets are thereby represented by flocking birds that can be easily analyzed and visualized. This dissertation takes both a theoretical and a practical approach. The theoretical part studies the new flocking algorithms in terms of cost analysis, convergence analysis, and proximity measures. In practice, we present (1) the first biologically inspired framework in the literature for identifying communities and community leaders in dynamic social networks, based on the natural phenomenon of bird flocking, (2) an outlier detection approach for detecting outlier cancer samples in gene microarray experiments, and (3) a data clustering model entitled "Flock by leader," which was applied to information retrieval in text mining and to social network analysis. The overall goal is to provide a comprehensive biologically inspired data-mining framework that outperforms existing data-mining algorithms and can be applied to solve complex problems in bioinformatics, social network analysis, and text mining.
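The leader and follower idea summarized above can be illustrated with a small toy example. The sketch below is not the dissertation's algorithm: the function name `leader_follower_sketch`, the `radius` parameter, and the choice of neighbor count as a "fitness" score are all assumptions made for illustration, and no flight dynamics or flock homogeneity rule is modeled. Each point simply follows its fittest neighbor, and points that follow no one act as leaders of their flock (cluster).

```python
import math


def leader_follower_sketch(points, radius):
    """Toy leader/follower grouping (illustrative assumption, not the
    dissertation's method). Returns, for each point, the index of the
    leader whose flock it ends up in."""
    n = len(points)

    # Neighbors within `radius`: a hypothetical stand-in for a bird's
    # field of view.
    neighbors = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= radius]
        for i in range(n)
    ]

    # Assumed fitness score: the number of neighbors a point has.
    fitness = [len(nb) for nb in neighbors]

    # Each point follows its fittest neighbor if that neighbor ranks
    # higher than itself; ties are broken in favor of the lower index,
    # which gives a strict ordering and rules out cycles.
    follows = list(range(n))
    for i in range(n):
        best = i
        for j in neighbors[i]:
            if (fitness[j], -j) > (fitness[best], -best):
                best = j
        follows[i] = best

    def leader_of(i):
        # Walk up the chain of followed points until reaching a leader.
        while follows[i] != i:
            i = follows[i]
        return i

    return [leader_of(i) for i in range(n)]


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(leader_follower_sketch(data, radius=1.5))
```

In this toy setting, a point with no neighbors leads a flock of one, which is one simple way a flocking view could flag outlier candidates; the batch, stream, and outlier-detection algorithms of the framework itself are defined in the dissertation.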