Blogspot (or Blogger) is a free content management system owned by Google. The platform allows users to create blogs and host them at a subdomain of blogspot.com. This guide provides an overview of how to properly format, scope, and crawl Blogspot seeds.
There are currently no known issues with archiving Blogspot. For a full list of known issues with archiving various platforms, please visit our Status of monitored platforms page.
How to select and format your Blogspot seeds
- Format seed URLs as a subdomain of blogspot.com, such as example.blogspot.com, replacing "example" with the subdomain of the blog you wish to collect.
- Use the Standard seed type for Blogspot seeds.
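If you are preparing many seeds at once, the formatting step can be scripted. The following is a minimal sketch, not part of the web application: the function name blogspot_seed is hypothetical, and the https scheme and trailing slash are assumptions, so adjust them to match the URLs of the blogs you are collecting.

```python
import re

# A single DNS label: letters, digits, and interior hyphens, up to 63 characters.
_LABEL = re.compile(r"^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$")


def blogspot_seed(subdomain, scheme="https"):
    """Build a seed URL for a blog hosted at <subdomain>.blogspot.com.

    The scheme and the trailing slash are assumptions made for this sketch.
    """
    label = subdomain.strip().lower()
    if not _LABEL.match(label):
        raise ValueError(f"not a valid subdomain: {subdomain!r}")
    return f"{scheme}://{label}.blogspot.com/"


if __name__ == "__main__":
    for name in ["example", "My-Blog"]:
        print(blogspot_seed(name))
```

The resulting URLs can then be added to your collection as Standard seeds.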
Scoping Blogspot seeds
Default scoping for Blogspot seeds
New Blogspot seeds will have the default scoping rules automatically applied at the seed level when they are added to a collection. To learn more, including how you can add default scoping rules to existing seeds, please visit Sites with automated scoping rules.
Blogspot sites use robots.txt to block crawlers from following pagination links between sections (e.g. Older Posts). To capture this functionality you will need to do one of the following:
- Ignore robots.txt at the seed level, -OR-
- Add a collection-level scoping rule to ignore robots.txt for the seed host. Please note that this is different from adding a rule for blogspot.com. Your host could be something like example.blogspot.com, which must be entered in its entirety for the rule to work.
Ignore Robots.txt is added automatically to all new Blogspot seeds.
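To see why the full seed host matters and how a robots.txt rule can block pagination, here is a small sketch using Python's standard library. The robots.txt text and the pagination URL below are hypothetical samples written for illustration, not a copy of what any particular blog serves, and the helper names seed_host and is_blocked are our own.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /search
Allow: /
"""


def seed_host(seed_url):
    """Return the full host of a seed URL, e.g. 'example.blogspot.com'."""
    return urlsplit(seed_url).hostname


def is_blocked(robots_text, url, agent="*"):
    """Return True if robots_text disallows the given agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    seed = "https://example.blogspot.com/"
    # The scoping rule needs this full host, not just blogspot.com.
    print(seed_host(seed))
    # Assumed shape of an "Older Posts" link; real links may differ.
    older_posts = seed + "search?updated-max=2020-01-01"
    print(is_blocked(SAMPLE_ROBOTS, older_posts))
    print(is_blocked(SAMPLE_ROBOTS, seed + "2020/01/post.html"))
```

Under these sample rules, the pagination link is disallowed while an individual post is not, which is the situation the ignore robots.txt rule is meant to address.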
Running your crawl
Once you have finished selecting your seeds and adding the recommended scoping rules, we suggest crawling your seeds using Brozzler.