This page distributes labeled data for use in regular expression learning. It provides labeled data for four extraction tasks: course numbers, phone numbers, software names, and URLs, all extracted from an intranet transactional search dataset.
The data sets were introduced in the following paper (joint work of the DB group at UMich and the Avatar project team at IBM Almaden).
- Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and H.V. Jagadish. Regular Expression Learning for Information Extraction. In Proceedings of EMNLP 2008, Honolulu, HI, November 2008 (pdf) (Bibtex)
If you have results to report on these corpora, please send email to Yunyao Li (yunyaoli a_t us d_o_t ibm d_o_t com). Thanks!
Transactional Search Data Sets:
Data sets introduced in the above paper.
(1) Course Number (32 KB): Labeled data for course number candidates along with their left and right contexts.
(2) Phone Number (874 KB): Labeled data for phone number candidates along with their left and right contexts.
(3) Software Name (1.24 MB): Labeled data for software name candidates along with their left and right contexts.
(4) URLs (167 KB): Labeled data for URL candidates along with their left and right contexts.
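This page does not describe the file format of the data sets, so the following is only a rough sketch of how a labeled candidate with its left and right contexts might be held in memory and used to score a regular expression. The `Candidate` class, its field names, the `evaluate` helper, and the sample records are all hypothetical and are not taken from the actual files.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One labeled candidate (hypothetical layout, not the real file format)."""
    left: str    # text immediately before the candidate
    text: str    # the candidate string itself
    right: str   # text immediately after the candidate
    label: bool  # True if the candidate is a genuine instance of the entity


def evaluate(pattern: str, candidates: list[Candidate]) -> tuple[float, float]:
    """Return (precision, recall) of `pattern` over the labeled candidates.

    A candidate counts as predicted-positive when the whole candidate
    string matches the pattern.
    """
    regex = re.compile(pattern)
    tp = fp = fn = 0
    for c in candidates:
        predicted = regex.fullmatch(c.text) is not None
        if predicted and c.label:
            tp += 1
        elif predicted and not c.label:
            fp += 1
        elif not predicted and c.label:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up records for illustration only.
SAMPLE = [
    Candidate("call us at ", "734-555-0100", " today", True),
    Candidate("order no. ", "123-456-7890", " shipped", False),
    Candidate("fax: ", "555-0100", "", True),
]

if __name__ == "__main__":
    print(evaluate(r"\d{3}-\d{3}-\d{4}", SAMPLE))
```

The contexts are carried along but not used by this simple scorer; a learner could inspect `left` and `right` to tell apart candidates that look identical in isolation.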
If you have any questions or comments regarding this site, please send email to Yunyao Li (yunyaoli a_t us d_o_t ibm d_o_t com). Thanks!