This page distributes intranet data for use in transactional search
experiments. Available here is a collection of intranet data sets,
search tasks, and queries for the evaluation of transactional search.
The data sets were introduced in the following paper (joint work of
the DB group at UMich and the Avatar project team at IBM Almaden).
- Yunyao Li, Rajasekar Krishnamurthy, Shivakumar Vaithyanathan, and
H.V. Jagadish. Getting Work Done on the Web: Supporting Transactional
Queries. To appear in Proceedings of SIGIR 2006, Seattle, WA,
August 2006. (pdf) (Bibtex)
An actively maintained bibliography on transactional search and
related topics is also included.
If you have results to report on these corpora, please send email to
Yunyao Li (yunyaol a_t umich d_o_t edu). Thanks!
Transactional Search Data Sets:
Data sets introduced in the above paper:
(1) S-DOC (1.11 GB): Pool of 434,203 unprocessed HTML files
Collected from umich.edu in November 2005 using the GNU Wget crawler:
given a single start point (www.umich.edu), the software recursively
collected textual documents with a small set of MIME types (e.g.,
html, php) within the umich.edu domain.
This
is the raw dataset from which the following transactional datasets (2)
S-SDC and (3) S-ANN-NE were derived.
(2) S-TDC (93 MB): Transactional page dataset
A subset of S-DOC comprising web pages that contain transactional
features, including form-entry pages and software download pages.
(3) S-ANN-NE (4 MB): Transactional feature dataset
Each
file contains all the transactional features from a web page in S-SDC.
The identifier of each file corresponds to the original document.
(4) 15 transactional search tasks:
Collected through an informal survey conducted among administrative
staff, graduate students, and recent graduates at the University of
Michigan and IBM Almaden Research Center.
(5) 394 unique user queries:
Collected
from 26 subjects from the University of Michigan and IBM Almaden Research
Center for the search tasks in (4), with redundant queries removed.
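
Two of the steps described in the list above, the crawl scope in (1) and the removal of redundant queries in (5), can be illustrated with a minimal Python sketch. It is not the procedure used to build the data sets: the actual crawl was done with GNU Wget, and this page does not say how redundant queries were identified. The function names, the MIME type set, and the normalization rule (case-insensitive, whitespace-collapsed exact match) are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# The crawl was restricted to this domain, as described in (1).
CRAWL_DOMAIN = "umich.edu"
# Assumption: the page names html and php only as examples, so this set
# is a guess at what the corresponding Content-Type values might be.
ALLOWED_MIME_TYPES = {"text/html", "application/x-httpd-php"}


def in_crawl_scope(url, content_type):
    """Return True if a page falls inside the assumed crawl scope."""
    host = urlparse(url).hostname or ""
    # Accept the domain itself and any subdomain, but not look-alikes
    # such as "notumich.edu".
    in_domain = host == CRAWL_DOMAIN or host.endswith("." + CRAWL_DOMAIN)
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return in_domain and mime in ALLOWED_MIME_TYPES


def dedupe_queries(queries):
    """Remove redundant queries, keeping the first occurrence of each.

    Assumption: two queries are redundant if they are equal after
    lowercasing and collapsing whitespace.
    """
    seen = set()
    unique = []
    for query in queries:
        key = " ".join(query.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(query.strip())
    return unique
```

A stricter or looser notion of redundancy (for example, ignoring word order or punctuation) would only require changing how `key` is computed.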
Data Sets not used in the above paper, but potentially useful:
Coming soon...
If you have any questions or comments regarding this site, please
send email to Yunyao Li (yunyaol a_t umich d_o_t edu).