Graph-based Algorithms for Keyphrase Extraction in Social Text Open Access
Downloadable ContentDownload PDF
The sheer volume of text in the web mandates automated approaches for identifying keyphrases that distinguish documents and help recognize important topics within documents. Automatic extraction of keyphrases is accomplished by designing algorithms capable of quantifying saliency in text. Measuring saliency score of textual units has traditionally used bag-of-words (BOW) approaches where the ranking is measured without considering the context. Such approach has several limitations as in polysemy and synonymy, where it is hard to detect these natural characteristics of text without understanding the context. In contrast, graph-based approaches model the relation between textual units that alleviate the aforementioned problems. In this dissertation, we introduce a collection of novel approaches for graph-based keyphrase ranking.First, we propose a novel random walk extension for graph-based ranking that can leverage weights on both vertices and edges, called NE-Rank. The ranking algorithm combines additional ranking methods that enhance existing graph-based approaches. Specifically, we combine a discriminative ranking approach as in tf-idf to the co-occurrence ranking in graphs. Moreover, the ranking model uses social tags, as in Twitter's hashtags, and explores leveraging them by boosting their weights for the task of keyphrase extraction in social microposts. Additionally, we propose a lexical graph expansion through social tags for keyphrase extraction. After modeling the textual content of microposts in a lexical graph, we expand the graph by finding more similar content linked by tags. We show a number of different approaches to lexical graph expansion through Twitter hashtags, and show a significant improvement over using the textual content alone. Second, we propose a new approach for measuring saliency in short documents. We model the textual units in a hypergraph by modeling words as vertices and short documents as hyperedges, and we study a high-order co-occurrence relation that is beyond the pair-wise relation in graphs. Therefore, we propose a novel probabilistic random walk over hypergraphs that captures weights on vertices and hyperedges to rank vertices. We compare our proposed random walk with different random walk approaches for hypergraphs and show the validity of the approach. Finally, we propose a complete ranking framework for extracting keyphrases from short documents using the hypergraph proposed random walk. The ranking takes into account temporal and social attributes that are important for a dynamic genre such as Twitter.