Total data for a seed
Is there a way to determine how much data an individual seed collects over time? The Collection Overview Page shows Total Data Archived for a collection, but I want to see how much data is archived for each individual seed in that collection.
I know crawl reports will break down the amount of data by seed, but I don't see the total data for a seed anywhere on Archive-It. Going through every crawl report and adding up data would be laborious so I'm hoping there's an easier way.
-
I realize this is not quite what you're asking for, but I wrote a command line script in Python recently that does this by retrieving seed and crawl data from Archive-It's Partner API. This might work if running command line scripts is an option for you.
I could make the code available if there's community interest.
-
Okay, great! In its present form, the script takes a string as input and gets a total data count (all crawls, all time) for any seed URL matching the string, which could be specific (e.g., 'https://www.instagram.com/parks.canada/') or more general (e.g., 'https://www.instagram.com/', which would match any Instagram seeds). It works across collections, but could be retooled to look at all or some seeds within a single collection instead.
I want to put some effort into cleaning up the code and writing documentation, so it might be a little while before I get to this--I'll update this thread when I have news.
-
I've uploaded this Python reporting script to a GitHub repository along with documentation. There's the script file (seedstats.py) and a config file (seedstats_config.py). To run the script you'll need to update the config file with a valid Archive-It username and password (required by the API to retrieve account data).
The script takes a string as input and collects stats for seeds matching the string, so it's mainly intended for individual seeds or domains. In theory you could cast a wide net using ".com" or even "." as your string, but I haven't tested along these lines.
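For anyone who wants a feel for the matching-and-totaling idea before running the script, here is a minimal sketch. It is not the code in seedstats.py and it does not call the Partner API: it assumes the per-crawl, per-seed figures have already been retrieved into a list of dictionaries. The function name total_bytes_by_seed and the keys seed_url, crawl_id, and bytes are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def total_bytes_by_seed(crawl_rows, match):
    """Sum bytes across all crawls for every seed URL containing `match`.

    `crawl_rows` is assumed to be a list of dicts with the hypothetical
    keys "seed_url" and "bytes" (one row per seed per crawl).
    """
    totals = defaultdict(int)
    for row in crawl_rows:
        # Plain substring match, so a full URL matches one seed and a
        # domain prefix matches every seed under that domain.
        if match in row["seed_url"]:
            totals[row["seed_url"]] += row["bytes"]
    return dict(totals)


# Made-up sample rows, not real Archive-It data.
rows = [
    {"seed_url": "https://www.instagram.com/parks.canada/", "crawl_id": 1, "bytes": 1000},
    {"seed_url": "https://www.instagram.com/parks.canada/", "crawl_id": 2, "bytes": 2500},
    {"seed_url": "https://www.instagram.com/nasa/", "crawl_id": 1, "bytes": 400},
    {"seed_url": "https://example.org/", "crawl_id": 1, "bytes": 999},
]

instagram_totals = total_bytes_by_seed(rows, "https://www.instagram.com/")
all_totals = total_bytes_by_seed(rows, ".")
```

With these sample rows, instagram_totals holds two seeds (3500 and 400 bytes), while the "." string matches all three seeds, which is the "wide net" case mentioned above.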
Feedback welcome.