Natural Language & XML Query Logs

This page distributes two logs of database queries posed against two different data sources: the first on American geography, the second a collection of job listings in the state of Texas. Together the two logs contain 1,520 unique queries (880 and 640, respectively), presented here in both natural language and XQuery formats. We also provide derived XML schemas for the two datasets. We found the data and queries useful in evaluating the design and performance of natural language and XML-based query interfaces, and we expect they can be a useful resource for other researchers working in this area.

The data was originally compiled by the Machine Learning Research Group at the University of Texas, and the queries were posed by real users (students at the university). Both the data and the queries can be downloaded in their native Prolog format here. The following paper describes the purpose and use of these logs for natural language query answering.

  • Lappoon R. Tang and Raymond J. Mooney, Using Multiple Clause Constructors in Inductive Logic Programming for Semantic Parsing. Proceedings of the 12th European Conference on Machine Learning (ECML-2001), pp. 466-477, Freiburg, Germany, September 2001.

The translations of the data into XML and of the queries into XQuery were done by Yunyao Li and Magesh Jayapandian of the Database Research Group at the University of Michigan. The natural language queries were used in the following papers.

  • Magesh Jayapandian and H. V. Jagadish, Automated Creation of a Forms-based Database Query Interface. Proceedings of the 34th International Conference on Very Large Data Bases (VLDB 2008), Auckland, New Zealand, August 2008. (PDF) (BibTeX)
  • Yunyao Li, Ishan Chaudhuri, Huahai Yang, Satinder Singh and H. V. Jagadish, Enabling Domain-Awareness for a Generic Natural Language Interface. Proceedings of the 22nd Conference on Artificial Intelligence (AAAI 2007), Vancouver, British Columbia, Canada, July 2007. (PDF) (BibTeX)

If you use our versions of these datasets and queries in your research, we would appreciate a brief note on how and why you used them. We will then add you to a running bibliography maintained on this site. Please send email to jmagesh@umich.edu. Thanks!


Datasets and Queries

Geoquery: Data and queries pertaining to the geography of the United States.

(1) geobase.xsd (4.57 KB): XML Schema Definition of the Geobase dataset.

(2) geobase.xml (123 KB): XML version of the Geobase dataset (conforms to geobase.xsd).

(3) geoqueries880.txt (37.6 KB): 880 natural language queries to the Geobase dataset.

(4) geoqueries880.xquery.txt (175 KB): 880 queries in XQuery (translations of the natural language queries above).

NOTE: Some queries did not conform exactly to the schema of the dataset. In each such case, a minor modification was made to the natural language query prior to XQuery translation. The original query can be found next to each modified query (within parentheses).


Jobsquery: Data and queries pertaining to job announcements on the newsgroup austin.jobs.

(1) jobdata.xsd (1.66 KB): XML Schema Definition of the Jobdata dataset.

(2) jobdata.xml (1.38 MB): XML version of the Jobdata dataset (conforms to jobdata.xsd).

(3) jobqueries640.txt (31.1 KB): 640 natural language queries to the Jobdata dataset.

(4) jobqueries640.xquery.txt (103 KB): 640 queries in XQuery (translations of the natural language queries above).

NOTE: Some queries did not conform exactly to the schema of the dataset. In each such case, a minor modification was made to the natural language query prior to XQuery translation. The original query can be found next to each modified query (within parentheses).
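
For readers who want to separate modified queries from their original wording, the following is a minimal Python sketch. It assumes, without confirmation from the files themselves, that each query sits on its own line and that the original wording, when present, is the final parenthesized group on that line. The function name split_query_line is hypothetical; check the layout of geoqueries880.txt and jobqueries640.txt before relying on it.

```python
import re

# ASSUMPTION: one query per line, with the original (unmodified) wording,
# if any, given as a trailing parenthesized group. Verify against the files.
_TRAILING_PARENS = re.compile(
    r"^(?P<query>.*?)\s*\((?P<original>[^()]*)\)\s*$"
)


def split_query_line(line):
    """Split one line into (query, original).

    `original` is None when the line has no trailing parenthesized group,
    or when the line consists of nothing but a parenthesized group.
    """
    text = line.strip()
    match = _TRAILING_PARENS.match(text)
    if match and match.group("query"):
        return match.group("query"), match.group("original").strip()
    return text, None
```

Applied to every line of a query file, this would yield pairs from which the modified queries can be counted by checking for a non-None second element. The sketch deliberately treats only the last parenthesized group as the original wording, so parentheses elsewhere in a query are left untouched.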


If you have any questions or comments about the content of this page, please send email to Magesh Jayapandian (jmagesh@umich.edu).