WordPress and Squarespace are popular website hosting platforms that support many of the sites our partners wish to archive, from simple blogs to highly sophisticated subdomains of larger websites. In general, our crawler can reliably archive material hosted and served by these platforms without any special scope modifications; you may crawl and archive these sites as you would any other typical seed in your collections.
There are currently no known issues with archiving WordPress or Squarespace sites. For a full list of known issues with archiving various platforms, please visit our Social media and other platforms status page.
Scoping WordPress and Squarespace seeds
When reviewing your Hosts report for WordPress sites, you may notice many URLs containing directories like /wp-admin or /wp-login that were either blocked by a robots exclusion or deemed "out of scope" and therefore not archived. This is completely normal and appropriate, as those URLs refer to areas reserved for administrators of the targeted website rather than to publicly visible front-end material.
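As a rough illustration of the point above, the following Python sketch separates URLs that fall under WordPress administrative directories from those that may deserve a closer look. It assumes you have already copied the unarchived URLs out of your Hosts report into a plain list; the function names and the example.org URLs are hypothetical and are not part of any crawler or reporting tool.

```python
from urllib.parse import urlparse

# Directory markers treated as WordPress administrative areas in this sketch.
ADMIN_PATH_MARKERS = ("/wp-admin", "/wp-login")


def is_wordpress_admin_url(url: str) -> bool:
    """Return True if the URL's path contains a WordPress admin or login directory."""
    path = urlparse(url).path
    return any(marker in path for marker in ADMIN_PATH_MARKERS)


def split_unarchived_urls(urls):
    """Split URLs into (expected_exclusions, worth_reviewing).

    Admin and login URLs are expected to be blocked or out of scope;
    anything else may be worth a closer look.
    """
    expected, review = [], []
    for url in urls:
        if is_wordpress_admin_url(url):
            expected.append(url)
        else:
            review.append(url)
    return expected, review


if __name__ == "__main__":
    # Hypothetical URLs standing in for entries copied from a Hosts report.
    sample = [
        "https://example.org/wp-admin/admin-ajax.php",
        "https://example.org/wp-login.php",
        "https://example.org/2020/01/hello-world/",
    ]
    expected, review = split_unarchived_urls(sample)
    print("Expected exclusions:", expected)
    print("Worth reviewing:", review)
```

Only the URL path is inspected here, so a marker that appears solely in a query string is not flagged.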
The same applies to Squarespace hosts such as static.squarespace and related hosts. These hosts often have "out of scope" content that is not needed to render an archived page completely. However, if a page is missing elements, it can be helpful to investigate these hosts.
For specific guidance on archiving password-protected WordPress or Squarespace sites, see our guide: How to archive password protected sites.
Running your crawl
As with all such sites, we strongly recommend running a test crawl and reviewing its results to ensure that no special limitations, such as robots exclusions, prevent content from being archived fully.
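To get a feel for how a robots exclusion can block content, here is a minimal Python sketch that uses the standard library's urllib.robotparser to test URLs against a set of robots.txt rules. The rules, the example.org URLs, and the "example-crawler" user agent are all assumptions made for illustration; they are not the rules of any real site or the user agent of our crawler, and the check only approximates how a crawler interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, similar in shape to what a WordPress site might serve.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text supplied as a string (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def blocked_urls(parser: RobotFileParser, user_agent: str, urls):
    """Return the URLs that the given user agent may not fetch under these rules."""
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    parser = build_parser(SAMPLE_ROBOTS_TXT)
    # "example-crawler" is a placeholder user agent chosen for this sketch.
    candidates = [
        "https://example.org/wp-admin/",
        "https://example.org/private/report.pdf",
        "https://example.org/about/",
    ]
    for url in blocked_urls(parser, "example-crawler", candidates):
        print("Blocked by robots.txt:", url)
```

To check a live site instead, RobotFileParser also offers set_url() and read() to fetch the site's own robots.txt before calling can_fetch().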
Crawl your seeds using either Standard or Brozzler crawling technologies.