You can add Facebook pages, profiles, and groups to your collection and then crawl, archive, and replay them as you would any other seed site, so long as you format and scope them according to a few simple rules.
On this page:
- How to add Facebook seeds to your collection
- How to scope Facebook seeds
- Optional Facebook scoping rules
- What to expect from your archived Facebook seeds
How to add Facebook seeds to your collection
Follow our standard guidance for adding seeds to your collection, but keep the following principles in mind:
- Be specific! Only add seeds for specific users, groups, events, etc. Do not attempt to crawl and archive all of Facebook. Also remember to add an ending slash to your Facebook URL
- Use the HTTPS version of the URL. Since Facebook serves its content exclusively from HTTPS, you can avoid potential crawl problems related to redirects by formulating your seed using "https://" instead of "http://"
- Check a seed's availability on the live web. If you must be logged in to Facebook as a specific user to view a private page or group, then our crawler will need to be logged in as that user, too. Follow our directions to crawl password protected websites, if necessary
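The formatting principles above can be sketched as a small pre-check to run on a seed URL before you add it. This is only an illustration, not part of the web application: the function name `normalize_facebook_seed` and the example page name are hypothetical, and the choice to leave URLs with a query string unslashed is an assumption of this sketch.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_facebook_seed(url: str) -> str:
    """Return a seed URL that uses https:// and ends with a slash.

    Hypothetical helper for illustration only. It rejects URLs that are
    not on facebook.com and URLs that point at the site root, since a
    seed should name a specific user, group, event, etc.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host != "facebook.com" and not host.endswith(".facebook.com"):
        raise ValueError(f"not a Facebook URL: {url!r}")

    path = parts.path
    if path in ("", "/"):
        raise ValueError("seed is too broad: add a specific page, profile, or group")

    # Assumption: only append the ending slash when there is no query
    # string, so URLs such as /profile.php?id=... are left intact.
    if not parts.query and not path.endswith("/"):
        path += "/"

    # Always emit the HTTPS form to avoid redirect-related crawl problems.
    return urlunsplit(("https", parts.netloc, path, parts.query, ""))


if __name__ == "__main__":
    print(normalize_facebook_seed("http://www.facebook.com/ExamplePage"))
```

Checking a seed's availability on the live web still has to be done by hand (or with your own login), since the helper only inspects the URL string.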
How to scope Facebook seeds
The proper formatting above enables our crawler to access Facebook seeds. To ensure that it also archives all of the proper content it finds there, and to keep it from archiving too much material from remote areas of Facebook, always apply the following scope modifications:
- Limit the scope of each Facebook seed to archive no more than 1GB of data. If you are unsure how to add these rules, please consult our documentation on seed-level scoping rules
- To archive content from Facebook's dynamically scrolling pages, expand the scope of your crawl again to include URLs that match the following SURT, exactly as it appears here: +
Optional Facebook scoping rules
To limit the scope of your crawls to only archive Facebook content in English, add a rule to block URLs that match the following regular expression: ^https?://..-..\.facebook\.com.*
Apply this rule to all crawls in a given collection by adding it to the host facebook.com under the Collection Scope tab. Alternatively, you can block matching URLs under the Seed Scope tab for each individual Facebook seed.
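To see which URLs the language-blocking regular expression would catch, you can try it against a few sample addresses before adding it as a rule. The sketch below uses Python's `re` module; the sample URLs are made up for illustration, and the crawler's own regular expression engine may differ in details from Python's.

```python
import re

# The regular expression from the rule above, copied verbatim.
LANGUAGE_SUBDOMAIN = re.compile(r"^https?://..-..\.facebook\.com.*")


def is_blocked_by_language_rule(url: str) -> bool:
    """Return True if the URL is on a locale subdomain such as es-la or fr-fr."""
    return LANGUAGE_SUBDOMAIN.match(url) is not None


if __name__ == "__main__":
    samples = [
        "https://es-la.facebook.com/ExamplePage/",  # hypothetical URL
        "http://fr-fr.facebook.com/ExamplePage/",   # hypothetical URL
        "https://www.facebook.com/ExamplePage/",    # hypothetical URL
    ]
    for url in samples:
        verdict = "blocked" if is_blocked_by_language_rule(url) else "allowed"
        print(f"{verdict}: {url}")
```

The pattern matches any host whose first label is two characters, a hyphen, and two more characters, which is the shape of the locale subdomains; www.facebook.com does not fit that shape and so stays in scope.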
If you prefer to archive a Facebook page/feed as it appears in a language other than English, you will need to log our crawling technology into Facebook with an account that defaults to your preferred language. You can create a "dummy" account for this purpose, then add its login credentials to each Facebook seed to which they apply.
Exclude personal information
If adding this rule to the Collection Scope tab, enter facebook.com in the host field:
Exclude segmented video files
Facebook frequently serves videos in segmented files and the crawler will capture these segments in addition to files that contain the entire video.
If you run a test crawl and find that you are capturing a high volume of segmented video files, you may want to run a new test crawl with a seed-level scoping rule that excludes segmented video files. This can help to ensure that your data is being used to capture the complete unsegmented versions of your target videos.
Exclude segmented video files by limiting any relevant Facebook seeds to block URLs that include the text: bytestart=
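If you export a list of URLs from a test crawl, a quick count can show how many of them this rule would exclude. The following is a minimal sketch under that assumption; the function name and the sample URLs are hypothetical, and the only detail taken from the rule itself is the text `bytestart=`.

```python
from typing import Iterable, List, Tuple

SEGMENT_MARKER = "bytestart="  # the text named in the scoping rule above


def split_segmented(urls: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split URLs into (kept, excluded) using a plain substring test."""
    kept: List[str] = []
    excluded: List[str] = []
    for url in urls:
        (excluded if SEGMENT_MARKER in url else kept).append(url)
    return kept, excluded


if __name__ == "__main__":
    # Hypothetical URLs standing in for lines from a crawl report.
    crawled = [
        "https://video.example.invalid/v/clip.mp4?bytestart=0&byteend=1023",
        "https://video.example.invalid/v/clip.mp4",
    ]
    kept, excluded = split_segmented(crawled)
    print(f"{len(excluded)} segmented, {len(kept)} kept")
```

A high share of excluded URLs suggests the rule is worth adding before your next test crawl.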
What to expect from your archived Facebook seeds
Because of the way that our crawling technology must log in to Facebook in order to avoid being blocked from it entirely, you can anticipate two things:
- Multiple (no more than three) distinct capture dates may appear on the Wayback calendar page of your seed for the time period of your crawl, and one or more of these access points may lead to your desired page in its public form -- that is, how it appears to a person not logged in to Facebook.
- Unless you provided other login credentials, our technology will log in as (and therefore display the name of) our default user, “Charlie Archivist.”
Archived Facebook pages should appear the same as their live analogs do to logged-in Facebook users with the following exceptions:
- You will not be able to expand comments sections of posts in order to see any more comments than our crawling technology saw.
- Each post will display how many times it has been “liked”; however, you will not be able to browse through the specific names of Facebook users who “liked” any post.