Spreadsheet Datasets
We have introduced two spreadsheet datasets:
- SAUS:
The 2010 Statistical Abstract of the United
States (SAUS) consists of 1,369 spreadsheet files
totaling 70 MB. We downloaded the dataset from the U.S.
Census Bureau. It covers a variety of topics of general
public interest, such as state-level finances, educational
attainment, and levels of public health.
[Download]
- WEB:
Our Web dataset consists of 410,554 Microsoft
Excel files from 51,252 distinct Internet domains. They
total 101 GB. We found the spreadsheets by looking
for Excel-style file extensions among the roughly 10
billion URLs in the ClueWeb09 Web crawl.
[Download]
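
The URL filtering step described for the WEB dataset could be sketched roughly as follows. This is a minimal illustration, not the procedure actually used: the extension set (`.xls`, `.xlsx`) and the helper names `is_excel_url`, `filter_excel_urls`, and `count_domains` are assumptions made for the example, and the sample URLs are placeholders.

```python
from urllib.parse import urlparse

# Assumed extension set; the description only says "Excel-style" endings.
EXCEL_EXTENSIONS = (".xls", ".xlsx")


def is_excel_url(url):
    """Return True if the URL path ends with one of the assumed extensions."""
    # Use only the path so query strings and fragments are ignored.
    path = urlparse(url).path.lower()
    return path.endswith(EXCEL_EXTENSIONS)


def filter_excel_urls(urls):
    """Keep only the URLs that look like Excel files."""
    return [u for u in urls if is_excel_url(u)]


def count_domains(urls):
    """Count distinct host names among the given URLs."""
    return len({urlparse(u).netloc.lower() for u in urls})


# Placeholder URLs standing in for entries of a crawl's URL list.
sample_urls = [
    "http://example.com/data/report.XLS?year=2009",
    "http://example.com/index.html",
    "http://example.org/tables/budget.xlsx",
    "http://example.net/xls/readme.txt",
]

excel_urls = filter_excel_urls(sample_urls)
print(len(excel_urls), "candidate spreadsheets from",
      count_domains(excel_urls), "domains")
```

Matching on the parsed path rather than the raw URL string avoids missing files whose URLs carry query parameters, and avoids false hits where the extension text appears only in a directory name.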