Regular Expression Learning for Information Extraction


This page distributes labeled data for use in regular expression learning. It offers labeled data for four extraction tasks: course numbers, phone numbers, software names, and URLs, all extracted from an intranet transactional search dataset. The data sets were introduced in the following paper (joint work of the DB group at UMich and the Avatar project team at IBM Almaden).

  • Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and H.V. Jagadish. Regular Expression Learning for Information Extraction. In Proceedings of EMNLP 2008, Honolulu, HI, November 2008 (pdf) (Bibtex)

If you have results to report on these corpora, please send email to Yunyao Li (yunyaoli a_t us d_o_t ibm d_o_t com). Thanks!

Transactional Search Data Sets:

Data sets introduced in the above paper.

(1) Course Number (32 KB): Labeled data for course number candidates along with their left and right contexts.

(2) Phone Number (874 KB): Labeled data for phone number candidates along with their left and right contexts.

(3) Software Name (1.24 MB): Labeled data for software name candidates along with their left and right contexts.

(4) URLs (167 KB): Labeled data for URL candidates along with their left and right contexts.

If you have any questions or comments regarding this site, please send email to Yunyao Li (yunyaoli a_t us d_o_t ibm d_o_t com). Thanks!