The BRDC'12 dataset was collected from the Brazilian Web (.br domain) using the InWeb Crawler between September and November 2012. It consists of a fixed set of web pages that were crawled on a daily basis for approximately two months. Each line of the dataset represents a web page, and the 1 or 0 in the i-th column indicates whether the page changed in the i-th download. The following table summarizes the dataset.
| Statistic | Value |
|---|---|
| Monitoring period | 57 days |
| # web pages | 417,048 |
| # web sites | 7,171 |
| Min # web pages/site | 1 |
| Max # web pages/site | 2,336 |
| Average # web pages/site | 58.15 |
| % downloads with errors | 2.92 |
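The per-page change records described above can be read with a small sketch like the following. The exact on-disk encoding is an assumption here: we take each line to be a sequence of '0'/'1' characters, one per daily download.

```python
# Sketch: parse one line of the BRDC'12 change matrix.
# Assumption: each line is a string of '0'/'1' flags, one per daily
# download; any separators or extra fields are ignored.

def parse_change_history(line: str) -> list[int]:
    """Return 0/1 flags: flags[i] == 1 iff the page changed
    in the (i+1)-th download."""
    return [int(c) for c in line.strip() if c in "01"]

history = parse_change_history("0110001")
change_rate = sum(history) / len(history)  # fraction of days with a change
```

From such a list one can directly compute per-page statistics such as the change rate above.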
How was BRDC'12 built?
To build BRDC'12, we used as seeds approximately 15,000 URLs of the most popular Brazilian sites according to Alexa. Only sites under the .br domain were considered as seeds. A breadth-first crawl from these seeds downloaded around 200 million web pages, from which additional URLs were extracted. From these URLs, we then selected a set of 10,000 web sites using stratified random sampling, thus keeping the same distribution of the number of web pages per site as the complete dataset. Next, for each selected site, we chose the largest number of web pages that could be crawled in one day without violating politeness constraints. We selected, in total, 3,059,698 web pages, which were then monitored daily. The complete BRDC'12 has about 1 TB of data.
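The stratified sampling step can be sketched as follows. The stratification scheme here (bucketing sites by the order of magnitude of their page count, with proportional allocation per stratum) is our own illustrative assumption; the original strata are not specified.

```python
import random
from collections import defaultdict

# Sketch: stratified random sampling of sites by pages-per-site.
# Assumption: strata are formed by order of magnitude of page count;
# the original stratification is not published.

def stratified_sample(site_sizes: dict[str, int], k: int, seed: int = 0) -> list[str]:
    """Sample k sites, roughly preserving the pages-per-site distribution."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for site, n_pages in site_sizes.items():
        strata[len(str(n_pages))].append(site)  # bucket by digit count
    total = len(site_sizes)
    sample = []
    for bucket in strata.values():
        quota = round(k * len(bucket) / total)  # proportional allocation
        sample.extend(rng.sample(bucket, min(quota, len(bucket))))
    return sample
```

Proportional allocation keeps each stratum's share of the sample close to its share of the full site population, which is what preserves the pages-per-site distribution.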
During the monitoring period, our crawler ran from midnight to approximately 11 PM, recollecting each selected web page every day, which allowed us to determine when each page was modified. To detect changes in a page, we used the SimHash technique to create a fingerprint of the plain text extracted from the page. Accesses to web pages from the same site were equally spaced to avoid hitting a web site too often.
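A minimal version of such a SimHash fingerprint can be sketched as below. This is a simplified illustration, not the crawler's actual implementation: token weighting, hashing scheme, and fingerprint width are assumptions.

```python
import hashlib

# Minimal 64-bit SimHash sketch over a page's plain-text tokens.
# Assumptions: whitespace tokenization, unit weights, MD5 as the
# per-token hash; the original parameters are not published.

def simhash(text: str, bits: int = 64) -> int:
    weights = [0] * bits
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Unlike a cryptographic hash, SimHash maps similar texts to fingerprints with small Hamming distance, so a page is flagged as changed when its fingerprint moves beyond a chosen distance threshold from the previous day's fingerprint.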
We did observe download errors during the monitoring period. Such errors might be due to, for instance, the page being removed, the web site's access permissions (robots.txt) being changed, or the download time reaching a predefined limit (30 seconds). We removed from the BRDC'12 collection all web pages with more than two errors, keeping only pages with at most two download errors.
Note that, in case of an error, we cannot tell whether the page changed on that particular day. Thus, we estimate this information by analyzing the history of changes of that web page in the days that preceded the error. Specifically, suppose the download of a page p failed on day d. We then analyze the distribution of the number of days between successive changes of p in the first d-1 days, and use the most frequent period without change to decide whether we should consider that p changed on day d.
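This imputation heuristic can be sketched as follows. The function and the exact decision rule (predicting a change once the most frequent inter-change gap has elapsed since the last observed change) are our reading of the description above; the original implementation is not published.

```python
from collections import Counter

# Sketch of the error-imputation heuristic: guess whether page p
# changed on day d from its change history over days 1..d-1.
# Assumption: "most frequent period without change" is the modal gap
# between successive change days.

def impute_change(history: list[int], d: int) -> int:
    """history[i] == 1 iff the page changed on day i+1.
    Returns the guessed 0/1 change flag for day d."""
    change_days = [i + 1 for i, c in enumerate(history[: d - 1]) if c == 1]
    if len(change_days) < 2:
        return 0  # too little history: assume no change
    gaps = [b - a for a, b in zip(change_days, change_days[1:])]
    period = Counter(gaps).most_common(1)[0][0]  # modal inter-change gap
    # Predict a change if at least one full period has elapsed
    # since the last observed change.
    return 1 if d - change_days[-1] >= period else 0
```

For example, a page that changed on days 1, 3, and 5 has a modal gap of 2 days, so an error on day 7 would be imputed as a change.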
Note that the errors remaining in the BRDC'12 collection after this filtering represent only 2.92% of all downloads performed.