This page distributes intranet data for use in transactional search
experiments. Available here is a collection of intranet data sets,
search tasks, and queries for the evaluation of transactional search.
The data sets were introduced in the following paper (joint work of
the DB group at UMich and the Avatar project team at IBM Almaden).
         
          
            - Yunyao Li, Rajasekar Krishnamurthy, Shivakumar Vaithyanathan, and
              H.V. Jagadish. Getting Work Done on the Web: Supporting
              Transactional Queries. To appear in Proceedings of SIGIR 2006,
              Seattle, WA, August 2006 (pdf) (Bibtex)
An actively maintained bibliography on transactional search and
related topics is also included.
         
         If 
          you have results to report on these corpora, please send email to Yunyao 
          Li (yunyaol a_t umich d_o_t edu). Thanks!
      
      
Transactional Search Data Sets:
        Data sets introduced in the above paper:
        (1) S-DOC (1.11 GB): Pool of 434,203 unprocessed HTML files
        Collected from umich.edu in November 2005 using the GNU Wget
          crawler: given a single start point (www.umich.edu), the software
          recursively collected textual documents with a small set of MIME
          types (e.g., html, php) within the umich.edu domain.
        This is the raw data set from which the transactional data sets
          (2) S-TDC and (3) S-ANN-NE were derived.
        (2) 
          S-TDC (93 MB): Transactional page 
          dataset
        A subset of S-DOC, comprising web pages that contain transactional
          features, including form-entry pages and software download pages.
        (3) 
          S-ANN-NE (4 MB): Transactional feature 
          dataset 
        Each file contains all the transactional features from one web page
          in S-TDC. The identifier of each file corresponds to that of the
          original document.
        (4) 
          15 transactional search tasks: 
        Collected through an informal survey conducted among administrative
          staff, graduate students, and recent graduates at the University of
          Michigan and the IBM Almaden Research Center.
        (5) 
          394 unique user queries: 
        Collected from 26 subjects at the University of Michigan and the IBM
          Almaden Research Center for the search tasks in (4), with redundant
          queries removed.
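
        For anyone who wants to pool additional queries with the set in (5),
          the removal of redundant queries might be sketched as below. This
          page does not say how redundancy was determined, so the rule used
          here (queries are redundant when they differ only in letter case or
          whitespace) is an assumption, and the function names are
          hypothetical.

```python
def normalize(query: str) -> str:
    """Lowercase a query and collapse runs of whitespace.

    Assumption: two queries are "redundant" when they are equal after
    this normalization. The original collection may have used another rule.
    """
    return " ".join(query.lower().split())


def unique_queries(queries):
    """Return queries in first-seen order with redundant ones dropped."""
    seen = set()
    unique = []
    for query in queries:
        key = normalize(query)
        # Skip empty lines and anything already seen in normalized form.
        if key and key not in seen:
            seen.add(key)
            unique.append(query.strip())
    return unique


if __name__ == "__main__":
    # Made-up sample queries, for illustration only.
    sample = ["Reset Password", "reset  password ", "", "parking permit"]
    for query in unique_queries(sample):
        print(query)
```

        The first spelling of each query is kept, so the output preserves
          the wording the subject actually typed.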
        Data 
          Sets not used in the above paper, but potentially useful: 
        Coming 
          soon...
        
        If you have any questions or comments regarding this site, please
          send email to Yunyao Li (yunyaol a_t umich d_o_t edu).