Organization of Information Retrieval Procedures for Approximate Attribute Matching
An objective of information systems is to effectively retrieve data items similar to the specifications of given queries. One of the major data structures for information retrieval is the bit-attribute matrix, in which items are described by binary vectors whose values of 0 and 1 signify the absence or presence of attributes. In this dissertation, we have developed a technique that provides efficient searching procedures for operating on bit-attribute matrices. A query specifies which attributes to search for and a similarity threshold that matching items have to exceed. Each attribute in a query carries a weight that characterizes its relative importance for that particular request. An information item is selected when the weighted sum of its matching attributes exceeds the established threshold. The suggested technique for weighted search in bit-attribute matrices employs what is called the "vertical" representation of a bit-attribute matrix. Ordinarily, in a straightforward "horizontal" representation of a bit-attribute matrix, the attributes of a particular item are grouped together in a collection of machine words; by contrast, in the "vertical" representation the identical attributes of different items are stored as bit-slices, that is, in the same machine word. Thus, on a 64-bit machine, a bit-slice contains the identical attribute of 64 different items. This technique offers a number of operational advantages that have been thoroughly investigated in this work to ensure the efficient organization of information retrieval procedures. Advances in computing environments, such as the move from 32-bit to 64-bit machine words, translate directly into efficiency improvements. Approximate search incurs no extra cost beyond that of exact matching. The technique is easily parallelized, so that multiple processors can work together to speed up the search with little coordination.
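As an illustration of the vertical representation described above, the following is a minimal Python sketch, not the dissertation's implementation. It makes several assumptions that the abstract does not state: weights and the threshold are non-negative integers, a single arbitrary-precision Python integer stands in for a sequence of machine words (bit i of a slice belongs to item i), and the per-item weighted sums are accumulated in bit-plane counters so that every bitwise operation updates all items at once. The function names `to_vertical`, `add_slice`, `greater_than`, and `weighted_search` are hypothetical.

```python
from typing import Dict, List


def to_vertical(items: List[List[int]]) -> List[int]:
    """Transpose horizontal item rows into one bit-slice per attribute.

    Bit i of slices[j] is 1 when item i has attribute j.
    """
    n_attrs = len(items[0]) if items else 0
    slices = [0] * n_attrs
    for i, row in enumerate(items):
        for j, bit in enumerate(row):
            if bit:
                slices[j] |= 1 << i
    return slices


def add_slice(planes: List[int], slice_bits: int, start: int) -> None:
    """Add 2**start to the counter of every item whose bit is set in slice_bits.

    planes[k] holds binary digit k of every item's counter (assumed layout).
    """
    carry, k = slice_bits, start
    while carry:
        while len(planes) <= k:
            planes.append(0)
        # Half adder applied to all items in parallel: sum bit and carry bit.
        planes[k], carry = planes[k] ^ carry, planes[k] & carry
        k += 1


def greater_than(planes: List[int], threshold: int, mask: int) -> int:
    """Return a bit-slice with bit i set where item i's counter > threshold."""
    if threshold >> len(planes):
        return 0  # threshold is larger than any value the counters can hold
    gt, eq = 0, mask
    for k in reversed(range(len(planes))):
        if (threshold >> k) & 1:
            eq &= planes[k]          # still equal only where the item also has a 1
        else:
            gt |= eq & planes[k]     # item has 1 where the threshold has 0
            eq &= ~planes[k]
    return gt


def weighted_search(slices: List[int], weights: Dict[int, int],
                    threshold: int, n_items: int) -> List[int]:
    """Indices of items whose weighted sum of present query attributes exceeds threshold."""
    planes: List[int] = []
    for attr, weight in weights.items():
        k = 0
        while weight:
            if weight & 1:
                add_slice(planes, slices[attr], k)
            weight >>= 1
            k += 1
    hits = greater_than(planes, threshold, (1 << n_items) - 1)
    return [i for i in range(n_items) if (hits >> i) & 1]


# Four items, three attributes (made-up data for illustration only).
items = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [1, 1, 1]]
query_weights = {0: 3, 1: 2, 2: 1}  # attribute index -> weight
selected = weighted_search(to_vertical(items), query_weights, threshold=3, n_items=4)
```

In this sketch the inner loops never inspect individual items; with 64-bit words in place of Python integers, each bitwise operation would process 64 items at a time, which is the property the abstract attributes to the vertical layout.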
Experiments were run to test the technique beyond the range covered by the analytical estimates. The efficiency improvement observed in the experiments shows that the suggested scheme provides an eightfold speedup over the ordinary organization. A privacy-preserving clinical-notes search program was built to demonstrate an application of the technique.