Methods of Enriching Domain Knowledge with Universal Semantics for Higher Text Mining Performance
Language models are either trained only on repository data or post-trained on the repository after having been pre-trained on a huge dataset such as Wikipedia. Either way, since the distribution of the repository data (usually a domain-specific corpus) and the real-world distribution of concepts (such as classes in a classification application) are rarely equivalent, the accuracy of the language model is reduced. This is usually due to the Inadequacy of Knowledge (IoK) of the domain-specific corpus (referred to as the local knowledge) relative to the real-world, universal knowledge. To address this IoK issue, this dissertation proposes different methods, depending on the language modeling technique (whether traditional or recent), to efficiently combine the local knowledge with universal semantics and improve the performance of various text mining tasks. For traditional language modeling such as bag of words, two novel techniques are proposed to combine the two sources of knowledge: one for document classification and one for document clustering. For classification, a novel feature weighting function is proposed that calculates the weight of each feature using the feature's discriminating power as derived from the local and universal sources of knowledge. For document clustering, where no labels are available, a different technique is introduced that combines the similarities of each pair of documents, where the similarities are derived from the local and the universal knowledge. The performance of the proposed methods for document classification and clustering is evaluated with several widely used classification and clustering algorithms on a number of standard datasets. The evaluation results show that both document classification and clustering performance are significantly improved by the proposed methods.
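The abstract does not specify the exact rule used to combine the pairwise similarities derived from the two sources of knowledge. The sketch below illustrates the general idea only, assuming a simple convex combination with an illustrative weight `alpha`; the function names and the choice of cosine similarity are assumptions, not the dissertation's actual method.

```python
# Illustrative sketch (not the dissertation's exact method): blending
# document-pair similarities computed from a local (domain-specific)
# representation and a universal representation of the same documents.
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def combined_similarity(doc_a_local, doc_b_local,
                        doc_a_univ, doc_b_univ, alpha=0.5):
    """Convex combination of local and universal similarities.

    alpha=1.0 uses only the local knowledge; alpha=0.0 only the
    universal knowledge. The blended score can then feed any standard
    similarity-based clustering algorithm.
    """
    sim_local = cosine_sim(doc_a_local, doc_b_local)
    sim_univ = cosine_sim(doc_a_univ, doc_b_univ)
    return alpha * sim_local + (1.0 - alpha) * sim_univ

# Toy vectors: e.g., bag-of-words features from the domain corpus
# (local) and features from a general-purpose corpus (universal).
a_local, b_local = np.array([1.0, 0.0, 1.0]), np.array([1.0, 1.0, 0.0])
a_univ, b_univ = np.array([0.2, 0.9]), np.array([0.3, 0.8])
print(combined_similarity(a_local, b_local, a_univ, b_univ))
```

The point of the blend is that two documents that look dissimilar under sparse domain-specific features can still be recognized as related through the universal representation, and vice versa.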
We also consider recent language modeling (or rather word embedding) techniques such as Word2Vec, GloVe, USE, and BERT, which introduce new kinds of feature vectors and exhibit a new, additional kind of IoK, namely the out-of-vocabulary (OOV) problem: certain words do not appear in the repository data but may appear in the test data. In this thesis, the OOV issue in GloVe is addressed by changing the form of the training data into character n-grams. We show that this version of GloVe, which we call C-GloVe, addresses the OOV problem quite effectively and generally outperforms GloVe and FastText, especially on smaller training datasets. The IoK of the local knowledge (i.e., the domain-specific training corpus) relative to the universal knowledge is also addressed for the case where the feature vectors are embedding vectors. Specifically, we propose a method to integrate local and universal sources of knowledge and to combine different word embedding algorithms. Our experimental results on three different text mining tasks show that the proposed methods yield higher performance than a standalone source of knowledge or a standalone word embedding algorithm, especially when one embedder is trained on the local source and a different embedder is trained on the universal source of knowledge. Experimental results on the classification task also show that the proposed integrated method achieves the same or higher F1-score than the state of the art (i.e., BERT) in nearly all the document classification experiments performed.
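To make the character n-gram idea behind C-GloVe concrete, the sketch below decomposes words into character n-grams with boundary markers and composes a vector for an unseen word from the embeddings of its n-grams. The n-gram size, boundary symbols, and FastText-style averaging are illustrative assumptions; the thesis's exact C-GloVe configuration may differ.

```python
# Illustrative sketch of why character n-grams mitigate the OOV
# problem: an unseen word still shares n-grams with words observed
# during training, so it can be assigned a composed vector instead of
# no vector at all. Details (n, markers, averaging) are assumptions.
import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams of a word, with '<' and '>' boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def oov_vector(word, ngram_embeddings, n=3, dim=2):
    """Compose a vector for an (unseen) word from its n-gram vectors.

    ngram_embeddings is a hypothetical dict mapping n-grams to vectors
    learned during training; unknown n-grams are simply skipped.
    """
    grams = [g for g in char_ngrams(word, n) if g in ngram_embeddings]
    if not grams:
        return np.zeros(dim)
    return np.mean([ngram_embeddings[g] for g in grams], axis=0)

print(char_ngrams("mining"))  # ['<mi', 'min', 'ini', 'nin', 'ing', 'ng>']
```

Training on n-grams of characters rather than whole words means the model's effective vocabulary covers any word composable from seen n-grams, which is why the gain over word-level GloVe is largest on small training corpora.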