Total data for a seed

Comments

  • Katie Fearer

    I am interested in this functionality too, if it exists.

     

  • Russell White

    I realize this is not quite what you're asking for, but I recently wrote a command-line script in Python that does this by retrieving seed and crawl data from Archive-It's Partner API. This might work if running command-line scripts is an option for you.

    I could make the code available if there's community interest.

     

  • Katie Fearer

    Command-line scripts are an option for us, and the "new data" by seed from each crawl would be a useful statistic. I think this script would help us tremendously if you are willing to share. Thanks much!

  • Lorrie Chisholm

    Very interested in this command-line script. Thanks so much!

  • Russell White

    Okay, great! In its present form, the script takes a string as input and gets a total data count (all crawls, all time) for any seed URL matching the string, which could be specific (e.g., 'https://www.instagram.com/parks.canada/') or more general (e.g., 'https://www.instagram.com/', which would match any Instagram seeds). It works across collections, but could be retooled to look at all/some seeds within a collection instead.

    I want to put some effort into cleaning up the code and writing documentation, so it might be a little while before I get to this. I'll update this thread when I have news.
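For anyone wondering what the matching and totaling described above might look like, here is a minimal sketch. It is not the script itself: the record shapes (`id`, `url`, `seed_id`, `bytes`) and the function name `total_data_by_seed` are hypothetical stand-ins for whatever the Partner API actually returns, and the sample data is made up for illustration. It assumes "matching" means a plain substring test on the seed URL.

```python
from collections import defaultdict


def total_data_by_seed(seeds, crawl_results, match):
    """Sum bytes over all crawls for every seed whose URL contains `match`.

    `seeds` and `crawl_results` are lists of dicts with hypothetical keys;
    the real API responses may be shaped differently.
    """
    # Map seed id -> URL for seeds whose URL contains the search string.
    matching = {s["id"]: s["url"] for s in seeds if match in s["url"]}

    totals = defaultdict(int)
    for row in crawl_results:
        url = matching.get(row["seed_id"])
        if url is not None:
            totals[url] += row["bytes"]

    # Report matching seeds with no crawl data as 0 rather than omitting them.
    # Seeds sharing a URL (e.g. across collections) are merged into one total.
    return {url: totals[url] for url in matching.values()}


# Made-up sample data for illustration only.
seeds = [
    {"id": 1, "url": "https://www.instagram.com/parks.canada/"},
    {"id": 2, "url": "https://www.instagram.com/example/"},
    {"id": 3, "url": "https://example.org/"},
]
crawl_results = [
    {"seed_id": 1, "bytes": 1000},
    {"seed_id": 1, "bytes": 250},
    {"seed_id": 2, "bytes": 40},
    {"seed_id": 3, "bytes": 7},
]

print(total_data_by_seed(seeds, crawl_results, "https://www.instagram.com/"))
```

With a specific string such as 'https://www.instagram.com/parks.canada/' only the first seed would be counted; the more general string picks up both Instagram seeds and skips the third.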

  • Russell White (edited)

    I've uploaded this Python reporting script to a GitHub repository along with documentation. There's the script file (seedstats.py) and a config file (seedstats_config.py). To run the script you'll need to update the config file with a valid Archive-It username and password (required by the API to retrieve account data).

    The script takes a string as input and collects stats for seeds matching the string, so it's mainly intended for individual seeds or domains. In theory you could cast a wide net using ".com" or even "." as your string, but I haven't tested along these lines.

    Feedback welcome.

    https://github.com/DigitalIntegration/seedstats.py
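For readers new to this kind of setup, the general pattern of keeping credentials in a separate config module and attaching them to an API request can be sketched as follows. This is an illustration only, not the contents of seedstats_config.py: the names `USERNAME`, `PASSWORD`, and `build_request` are hypothetical, the URL is a placeholder, and the sketch assumes the API accepts HTTP Basic authentication. Check the repository's documentation for the actual names and steps.

```python
import base64
import urllib.request

# Hypothetical config values; in practice these would live in a separate
# config module and be imported. The real config file may use other names.
USERNAME = "your-archive-it-username"
PASSWORD = "your-archive-it-password"


def build_request(url, username, password):
    """Return a Request carrying an HTTP Basic Authorization header.

    Assumes Basic auth; nothing is sent over the network here.
    """
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder URL for illustration; substitute the real API endpoint.
request = build_request("https://example.invalid/api", USERNAME, PASSWORD)
```

Keeping the credentials in their own file means the main script can be shared or updated without exposing or overwriting a username and password.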
