Development and Analysis of Query Generation Methods for Automated Search

Ryota Watanabe (AY 2019)

To present appropriate search results to each user from a huge amount of data, information retrieval research has produced many studies on query generation and search result improvement. However, the technologies proposed in existing studies support only users who are able to perform information retrieval on their own. With this in mind, we set two research objectives. The first is to develop an automatic retrieval system that removes the step of "thinking of a query and searching" from the usual flow in which a user wants to know something, thinks of a query, and then searches. The second is to verify which query generation methods are effective in this automatic retrieval system.

To achieve these objectives, we generated queries from three layers (base layer, context layer, and user layer) and judged how well the queries and their search results matched the input sentences. In the context layer, the 10 words immediately preceding each word in the input sentence were used as context information. In the user layer, we assumed that the user is interested in a certain word in the input sentence, and built the user profile from that word's context information and from words extracted from the lead sentence of the document retrieved with that word. Ten input sentences were used, and every method was applied to all of them: one method in the base layer, two in the context layer (word2vec and doc2vec), and three each for word2vec and doc2vec in the user layer.
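
The context-layer step described above (taking the 10 words immediately preceding each word) can be illustrated with a minimal sketch. This is not the implementation used in the study: the function names `preceding_context` and `context_for_each_word` are hypothetical, and whitespace tokenization is an assumption made only to keep the example self-contained.

```python
from typing import List, Tuple


def preceding_context(tokens: List[str], index: int, window: int = 10) -> List[str]:
    """Return up to `window` tokens immediately preceding tokens[index]."""
    start = max(0, index - window)  # clamp at the start of the sentence
    return tokens[start:index]


def context_for_each_word(sentence: str, window: int = 10) -> List[Tuple[str, List[str]]]:
    """Pair every word in the sentence with its preceding-context words."""
    tokens = sentence.split()  # assumption: simple whitespace tokenization
    return [
        (token, preceding_context(tokens, i, window))
        for i, token in enumerate(tokens)
    ]


if __name__ == "__main__":
    example = "the user thinks of a query and then the user searches"
    for word, context in context_for_each_word(example, window=3):
        print(word, context)
```

Words near the start of a sentence simply receive fewer than 10 context words in this sketch; how the study handled that case is not stated in the abstract.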

The main result was that, when queries are generated automatically, the base and context layers alone are not sufficient, and the queries generated by the user layer and their search results are often useful. In some cases, however, queries generated in the base layer were enough to produce adequate search results. In addition, search results were more often inadequate when the distributed representation of documents (doc2vec) was used than when the distributed representation of words (word2vec) was used. This may be because word order was not properly preserved when generating the pseudo-documents used to obtain the document representation.

These results indicate that user profiles are important in the query generation process of automatic retrieval systems, and that, for the method used in this study, the distributed representation of words is more appropriate than that of documents. Future work includes experimenting with methods such as re-ranking search results, which could not be used in this study; building user profiles that take the passage of time into account; investigating how queries and search results change when parameters such as the width of the context window and the weight of each word are varied; and having a large, unspecified group of people evaluate the results in order to draw more objective conclusions.

(Translated by DeepL)
