This page distributes two logs of database queries. The logs were obtained from two different
data sources: the first covers American geography, and the second is a collection of job listings
in the state of Texas. A total of 1,520 unique queries were collected from the two logs
(880 geography queries and 640 job queries), and they are presented here in both natural
language and XQuery formats. We also provide
derived XML schemas for the two datasets. We found the data and queries useful in
evaluating the design and performance of natural language and XML-based query interfaces,
and feel that they can be a useful resource for other researchers working in this area.
The data
was originally compiled by the
Machine Learning Research Group at the
University of Texas, and the queries were posed by real users
(students at the university). Both the data and the queries can be downloaded in their native
Prolog format here. The following
paper describes the purpose and use of these logs for natural language query
answering.
- Lappoon R. Tang
and Raymond J. Mooney, Using Multiple Clause Constructors in Inductive
Logic Programming for Semantic Parsing. Proceedings of the 12th
European Conference on Machine Learning (ECML-2001), pp. 466-477, Freiburg,
Germany, September 2001.
The translations of the data into XML and of the queries into XQuery
were done by Yunyao Li and Magesh Jayapandian of the
Database Research Group at the
University of Michigan. The natural language
queries were used in the following papers.
- Magesh Jayapandian
and H. V. Jagadish,
Automated Creation of a Forms-based Database Query Interface.
Proceedings of the 34th International Conference on Very Large Data Bases
(VLDB 2008), Auckland, New Zealand, August 2008.
(PDF)
(BibTeX)
- Yunyao Li,
Ishan Chaudhuri, Huahai Yang, Satinder Singh and H. V. Jagadish,
Enabling Domain-Awareness for a Generic Natural Language Interface.
Proceedings of the 22nd Conference on Artificial Intelligence
(AAAI 2007), Vancouver, British Columbia, Canada, July 2007.
(PDF)
(BibTeX)
If you choose to use
our versions of these datasets and queries for your research, we would really appreciate
it if you could tell us briefly how and why you used them. We will then add you to a running
bibliography maintained on this site.
Please send email to jmagesh@umich.edu. Thanks!
Geoquery:
Data and queries pertaining
to the geography of the United States.
(1) geobase.xsd
(4.57 KB): XML Schema Definition of the Geobase dataset.
(2) geobase.xml
(123 KB): XML version of the Geobase dataset (conforms to geobase.xsd).
(3) geoqueries880.txt
(37.6 KB): 880 natural language queries to the Geobase dataset.
(4) geoqueries880.xquery.txt
(175 KB): 880 queries in XQuery (translations of the natural language queries above).
NOTE: Some queries did not conform exactly to the schema of the dataset. In
each such case, a minor modification was made to the natural language query prior to
XQuery translation. The original query can be found next to each modified query (within parentheses).
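The note above says that a modified query carries its original form in parentheses. A minimal Python sketch of separating the two is given below. It assumes one query per line with the original in a single trailing pair of parentheses; the exact layout of the query files is not specified on this page, so check the files before relying on it. The function name split_query and the sample line are hypothetical.

```python
import re

# Assumption (not confirmed by this page): a modified query is followed on the
# same line by its original wording in one trailing pair of parentheses,
# e.g. "modified text (original text)".
_TRAILING_PARENS = re.compile(
    r"^(?P<modified>.*?)\s*\((?P<original>[^()]*)\)\s*$"
)


def split_query(line):
    """Return (query, original); original is None when no trailing parenthetical exists."""
    text = line.strip()
    match = _TRAILING_PARENS.match(text)
    if match and match.group("modified"):
        return match.group("modified"), match.group("original").strip()
    # No parenthetical (or nothing before it): treat the whole line as the query.
    return text, None


if __name__ == "__main__":
    # Made-up line for illustration only; it is not taken from geoqueries880.txt.
    sample = "what is the capital of texas (what is the capital city of texas)"
    print(split_query(sample))
```

Lines without a trailing parenthetical are returned unchanged with None as the original, so the same function can be applied to every line of a query file.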
Jobsquery:
Data and queries pertaining
to job announcements on the newsgroup austin.jobs.
(1) jobdata.xsd
(1.66 KB): XML Schema Definition of the Jobdata dataset.
(2) jobdata.xml
(1.38 MB): XML version of the Jobdata dataset (conforms to jobdata.xsd).
(3) jobqueries640.txt
(31.1 KB): 640 natural language queries to the Jobdata dataset.
(4) jobqueries640.xquery.txt
(103 KB): 640 queries in XQuery (translations of the natural language queries above).
NOTE: Some queries did not conform exactly to the schema of the dataset. In
each such case, a minor modification was made to the natural language query prior to
XQuery translation. The original query can be found next to each modified query (within parentheses).
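After downloading one of the XML files listed above, a quick sanity check is to confirm that it parses and to see which element names it contains, which can then be compared by eye against the corresponding .xsd file. The Python sketch below uses only the standard library and does not perform XSD validation. The function name tally_elements is hypothetical, and the element names in the demonstration are invented rather than taken from jobdata.xml.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def tally_elements(source):
    """Count element tag names in an XML document given as a path or file object."""
    counts = Counter()
    # iterparse streams the document; it raises ET.ParseError if the XML is
    # not well-formed, so this also serves as a basic well-formedness check.
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # release text and children that are no longer needed
    return counts


if __name__ == "__main__":
    # Invented miniature document; the real element names are defined in the .xsd files.
    demo = io.BytesIO(b"<jobs><job><title>a</title></job><job><title>b</title></job></jobs>")
    for tag, count in sorted(tally_elements(demo).items()):
        print(tag, count)
```

Assuming the file has been saved to the working directory, calling tally_elements("jobdata.xml") would give the same kind of summary for the real dataset.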
If you have any questions or comments about the content of this page,
please send email to Magesh Jayapandian (jmagesh@umich.edu).