From Preprocessing to Named Entity Recognition, Linking and Clustering in Multilingual, Cross-Lingual, High-Low Resources Settings
In this information age, we are inundated with tremendous amounts of data coming from multiple diverse sources. Efficiently extracting information from such sources is a top priority for industries, governments, and policy makers, to name a few. However, extracting information that is relevant to specific domains from textual genres poses several challenges given the variation across genres, i.e., building tools for one genre doesn't guarantee quality performance on other genres. In addition, because social media now plays a significant role as an information provider, new sets of challenges have been imposed on text processing tasks, such as awareness of language variants and dialects, and de-noising text in order to handle unstructured data. Additionally, given globalization, information needs to transcend language barriers, hence the need for 'cheaper' multilingual and multi-genre solutions that benefit from the currently available resources.
In this dissertation, we address the problem of information extraction, focusing on the specific subproblems of named entity recognition and resolution. We first address the preprocessing steps required for any information extraction system in an innovative way that requires minimal dependence on external resources. Then, we describe our exploration of named entity recognition systems where resources are limited. We present solutions that handle multiple language variants and dialects, and introduce new concepts of embeddings and abstraction that are used in the preprocessing steps as well as in the named entity recognition process. We propose transfer learning models that aim to convey knowledge from resource-rich languages to other languages or genres that suffer from minimal resource availability, where the goal is to create systems that are more efficient in time and money and require minimal annotation effort.
Finally, we address the tasks of named entity aliasing resolution and named entity linking. In named entity aliasing resolution, we investigate the problem of identifying variants of the same name in social and traditional media, where a variant could be a string variation of the same name (George W. Bush vs. GW Bush) or a complete alias (Abu Mazen vs. Mahmoud Abbas). Furthermore, we address the named entity linking task, in which we map identified named entities to reference entries in a knowledge base. We devise novel language-independent techniques and apply our approaches to both English and Arabic, with extensions of some of our algorithms to other languages such as Spanish and Chinese.