Total data for a seed

July 23, 2020 19:50

Is there a way to determine how much data an individual seed collects over time? The Collection Overview Page shows Total Data Archived for a collection, but I want to see how much data is archived for each individual seed in that collection.

I know crawl reports break down the amount of data by seed, but I don't see the total data for a seed anywhere on Archive-It. Going through every crawl report and adding up the data would be laborious, so I'm hoping there's an easier way.

Comments

Katie Fearer March 01, 2021 22:43

I am interested in this functionality too, if it exists.

Russell White March 05, 2021 21:54

I realize this is not quite what you're asking for, but I recently wrote a command-line script in Python that does this by retrieving seed and crawl data from Archive-It's Partner API. This might work if running command-line scripts is an option for you.

I could make the code available if there's community interest.

Katie Fearer March 06, 2021 17:46

Command-line scripts are an option for us, and the "new data" by seed from each crawl would be a useful statistic. I think this script would help us tremendously if you are willing to share. Thanks much!

Lorrie Chisholm March 08, 2021 13:13

Very interested in this command line script. Thanks so much!

Russell White March 09, 2021 20:42

Okay, great! In its present form, the script takes a string as input and gets a total data count (all crawls, all time) for any seed URL matching the string. The string can be specific (e.g., 'https://www.instagram.com/parks.canada/') or more general (e.g., 'https://www.instagram.com/', which would match any Instagram seeds). It works across collections, but could be retooled to look at all or some seeds within a single collection instead.
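For readers who want a feel for the matching-and-summing idea, here is a minimal sketch. It is not the actual script: it assumes the per-seed rows from each crawl report have already been retrieved and flattened into dictionaries, and the field names (`crawl_id`, `seed_url`, `new_bytes`), the helper names, and the sample numbers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical, hand-made rows standing in for per-seed lines from crawl
# reports. The field names are assumptions, not the Partner API's schema.
SAMPLE_ROWS = [
    {"crawl_id": 1, "seed_url": "https://www.instagram.com/parks.canada/", "new_bytes": 1000},
    {"crawl_id": 2, "seed_url": "https://www.instagram.com/parks.canada/", "new_bytes": 2500},
    {"crawl_id": 2, "seed_url": "https://www.instagram.com/example/", "new_bytes": 500},
    {"crawl_id": 3, "seed_url": "https://example.org/", "new_bytes": 700},
]


def total_bytes_by_seed(report_rows, match):
    """Sum bytes per seed URL over all rows whose seed URL contains `match`."""
    totals = defaultdict(int)
    for row in report_rows:
        if match in row["seed_url"]:  # plain substring match
            totals[row["seed_url"]] += row["new_bytes"]
    return dict(totals)


def human_size(num_bytes):
    """Format a byte count using binary (1024-based) units."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024 or unit == "TB":
            return f"{size:.1f} {unit}"
        size /= 1024


if __name__ == "__main__":
    totals = total_bytes_by_seed(SAMPLE_ROWS, "https://www.instagram.com/")
    for seed, total in sorted(totals.items()):
        print(f"{seed}\t{human_size(total)}")
```

Passing a full seed URL narrows the result to one seed, while a shorter string such as a domain collects every seed that contains it. Restricting the rows to a single collection before summing would give per-seed totals for that collection.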

I want to put some effort into cleaning up the code and writing documentation, so it might be a little while before I get to this. I'll update this thread when I have news.

Russell White May 03, 2021 21:44 (Edited May 03, 2021 21:45)

I've uploaded this Python reporting script to a GitHub repository along with documentation. There's the script file (seedstats.py) and a config file (seedstats_config.py). To run the script, you'll need to update the config file with a valid Archive-It username and password (required by the API to retrieve account data).

The script takes a string as input and collects stats for seeds matching the string, so it's mainly intended for individual seeds or domains. In theory you could cast a wide net using ".com" or even "." as your string, but I haven't tested along these lines.

Feedback welcome.

https://github.com/DigitalIntegration/seedstats.py
