A WARC file name tells you what the WARC contains and how it was captured. WARC file names appear in the CDX and when you download WARCs, so understanding the attributes in the file name can help you determine whether a particular WARC is relevant to your needs.
Standard Crawls
Standard crawls started after September 9th, 2018 write WARCs per seed. They will have the following name attributes per WARC:
Standard crawls started prior to September 9th, 2018 wrote WARCs per crawl. Crawls run since March 2015 will have the following name attributes per WARC:
These attributes are described in detail in the table below:
Attribute | Description | Example |
service | The group within the Internet Archive that captured the data | ARCHIVEIT |
collection | The ID number of the collection that the crawl belongs to | 7310 |
crawl job frequency | The frequency assigned to the seeds at the time of the crawl, if it was a scheduled crawl. This will say “test” if it is a test crawl | BIMONTHLY |
crawl job id | Each crawl gets a unique identifier. These crawl IDs can be found in a seed/collection/account crawl list. | JOB457797 |
seed id | Each seed has a unique ID number associated with it. This tells you which seed URL was captured in a given WARC file | SEED1444539 |
timestamp | Time at which the WARC began writing, expressed as year-month-day-hour-minute-second-millisecond (17 digits). | 20171003231334202 |
serial number | Distinguishes the separate WARCs written during the same crawl | 00000 |
file format | This reflects the WARC file format. | .warc.gz |
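To show how these attributes can be read programmatically, here is a minimal Python sketch. It assumes that the attributes are joined by hyphens in the order listed in the table above, which this article does not confirm, and the sample file names are assembled from the table's example values rather than taken from a real crawl. The function name `parse_warc_name` and the field names are hypothetical. The 17-digit timestamp is read as year, month, day, hour, minute, and second followed by a three-digit fraction of a second.

```python
import re
from datetime import datetime

# ASSUMPTION: attributes are hyphen-separated and appear in the table's order.
# The seed id is optional so that per-crawl WARC names (no seed id) also match.
WARC_NAME = re.compile(
    r"^(?P<service>[A-Z]+)-"
    r"(?P<collection>\d+)-"
    r"(?P<frequency>[A-Za-z_]+)-"
    r"JOB(?P<job_id>\d+)-"
    r"(?:SEED(?P<seed_id>\d+)-)?"
    r"(?P<timestamp>\d{17})-"
    r"(?P<serial>\d{5})"
    r"(?P<file_format>\.warc\.gz)$"
)


def parse_warc_name(name):
    """Return a dict of name attributes, or None if the name does not fit
    the assumed layout."""
    match = WARC_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    # %f accepts 1 to 6 digits, so the trailing three digits are read as
    # milliseconds (202 becomes 202000 microseconds).
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%Y%m%d%H%M%S%f"
    )
    return fields


if __name__ == "__main__":
    # Hypothetical name built from the example column of the table.
    sample = ("ARCHIVEIT-7310-BIMONTHLY-JOB457797-"
              "SEED1444539-20171003231334202-00000.warc.gz")
    print(parse_warc_name(sample))
```

If your file names carry additional segments, the pattern returns `None` rather than guessing, so you can adjust it to the names you actually see in your CDX or download list.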
For information about the naming convention of WARCs created before March 2015, please contact us.
Brozzler Crawls
Brozzler WARC names are similar, but have two extra fields: seed id and crawl token.
These attributes are described in detail in the table below:
Attribute | Description | Example |
seed id | Because Brozzler writes separate WARC files for each seed within a crawl, the seed id helps identify which seed a Brozzler WARC is associated with. | SEED1542893 |
crawl token | This serves multiple purposes, among them avoiding name collisions | 8b0qk6nu |