Social media platforms update frequently. For current information on any known issues archiving Facebook content, please see our System Status page.
New Facebook seeds will have the default scoping rules automatically applied at the seed level when they are added to a collection. To learn more, including how you can add default scoping rules to existing seeds, please visit Sites with automated scoping rules.
You can add Facebook pages, profiles, and/or groups to your collection in order to crawl, archive, and replay them as you would any other seed site, just so long as you remember to format and scope them according to a few simple rules. We recommend running a test crawl after implementing any new rules.
On this page:
- How to add Facebook seeds to your collection
- Embedded Facebook feeds
- Required scoping for Facebook seeds
- What to expect from archived Facebook seeds
- Troubleshooting
How to add Facebook seeds to your collection
Follow our standard guidance for adding seeds to your collection, but keep the following principles in mind:
- Be specific! Only add seeds for specific users, groups, events, etc.; do not attempt to crawl and archive all of Facebook. Don't forget to add an ending slash to your Facebook URL.
- Use the HTTPS version of the URL. Since Facebook serves its content exclusively from HTTPS, you can avoid potential crawl problems related to redirects by formulating your seed using "https://" instead of "http://"
- Use the Standard seed type. We advise strongly against using the Standard Plus seed type when crawling Facebook.
- Crawl for at least one day. Anything less than a one day long crawl will be unlikely to capture the data necessary to display a Facebook page in Wayback.
- Check a seed's availability on the live web. Archive-It crawlers will not be able to access content from a Facebook page that requires a user to be logged in. It is possible to add user credentials to your Facebook seeds, but we advise against this: user credentials added to Facebook seeds have sometimes been flagged by Facebook's bot tracker, which may result in that user account being locked.
- Use helper seeds. To optimize in-page navigation between subpages of an archived Facebook seed, add each subpage of the Facebook page you are crawling as its own private helper seed (Standard seed type) and crawl all seeds together. Each helper seed will also have the default scoping rules added.
- E.g. for the public seed https://www.facebook.com/internetarchive/, add https://www.facebook.com/internetarchive/photos/ and https://www.facebook.com/internetarchive/videos/ and set them to private
Setting the helper seeds to private ensures that the homepage of your Facebook seed, e.g. https://www.facebook.com/internetarchive/, remains the primary point of entry for accessing that content on your public landing page.
N.B. Some Facebook subpages may look like this on the live web: https://www.facebook.com/pg/internetarchive/photos/?ref=page_internal. For the best possible capture, remove the "pg" in the middle of the URL and the ending string "?ref=page_internal" when adding these subpages as seeds.
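The normalization described above can be sketched in Python. This is an illustrative helper, not an Archive-It tool; it simply drops the "/pg" path segment and the query string before the URL is used as a seed:

```python
from urllib.parse import urlparse, urlunparse

def normalize_facebook_subpage(url):
    """Strip the '/pg' path segment and any query string
    (e.g. '?ref=page_internal') from a Facebook subpage URL."""
    parts = urlparse(url)
    path = parts.path
    if path.startswith("/pg/"):
        path = path[len("/pg"):]  # drop the leading '/pg' segment
    # Drop the query string and fragment entirely
    return urlunparse((parts.scheme, parts.netloc, path, "", "", ""))

print(normalize_facebook_subpage(
    "https://www.facebook.com/pg/internetarchive/photos/?ref=page_internal"))
# -> https://www.facebook.com/internetarchive/photos/
```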
Embedded Facebook feeds
If you want to capture an embedded Facebook feed, add the required scoping rules for Facebook (below) to the seed that contains the embedded feed.
Required scoping for Facebook seeds
New Facebook seeds added to collections will have the following default scoping rules applied automatically; older Facebook seeds can be updated by following these instructions.
1. Apply seed-level data limit
Facebook data can vary widely, from 1GB to 20GB depending on the type of page and its content. Start by limiting the scope of each Facebook seed to 3GB of data. If you do not get a complete capture, or feel that more data than necessary is being captured, try incrementally adjusting the seed-level data limit and run a new test crawl at each stage to determine the ideal limit for your seed.
To read about how to add these rules, please consult our documentation on seed-level scoping rules.
2. Ignore robots.txt
Ignore robots.txt for each Facebook seed at the seed level.
Alternatively, add collection-level scoping rules to ignore robots exclusions on the following hosts, exactly as they appear here:
- www.facebook.com - in order to archive Facebook-hosted content
- fbcdn.net - in order to archive important page styling elements
- akamaihd.net - in order to archive important page styling elements
3. Expand scope to capture scroll
To archive content from Facebook's dynamically scrolling pages, expand the scope of your crawl at the seed level to include URLs that match the following SURT, exactly as it appears here: http://(net,fbcdn,
Note that scroll on Facebook Groups' seeds (i.e. https://www.facebook.com/groups/) will be limited.
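To see why the SURT above brings fbcdn.net hosts into scope, note that a SURT reverses a host's dot-separated labels and joins them with commas, so any subdomain of fbcdn.net becomes a string beginning with "(net,fbcdn,". The sketch below is a simplified illustration of that prefix match, not Archive-It's full SURT canonicalization, and the example URL is hypothetical:

```python
from urllib.parse import urlparse

def host_to_surt(url):
    """Convert a URL's host to a simplified SURT form: reverse the
    dot-separated labels and join with commas, e.g.
    video.xx.fbcdn.net -> http://(net,fbcdn,xx,video,"""
    host = urlparse(url).netloc
    return "http://(" + ",".join(reversed(host.split("."))) + ","

surt_rule = "http://(net,fbcdn,"  # the expand-scope rule from step 3
url = "https://video.xx.fbcdn.net/v/t42.1790-2/clip.mp4"  # hypothetical CDN URL

print(host_to_surt(url))                        # http://(net,fbcdn,xx,video,
print(host_to_surt(url).startswith(surt_rule))  # True -> URL is in scope
```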
4. Exclude segmented video files
Facebook frequently serves videos in segmented files and the crawler will capture these segments in addition to files that contain the entire video. Exclude segmented video files at the seed level by adding a seed level block URL rule to all Facebook seeds on all URLs that include the text: bytestart=
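The block rule above is a plain substring match: any URL containing the text "bytestart=" is excluded, while the URL for the full video file is still captured. A minimal sketch, using hypothetical CDN URLs for illustration:

```python
def blocked_by_bytestart_rule(url):
    # The seed-level block rule excludes any URL containing 'bytestart='
    return "bytestart=" in url

# Hypothetical example URLs for a segmented vs. full video file
segment = "https://video.xx.fbcdn.net/v/clip.mp4?bytestart=0&byteend=524287"
full    = "https://video.xx.fbcdn.net/v/clip.mp4"

print(blocked_by_bytestart_rule(segment))  # True  -> segment is skipped
print(blocked_by_bytestart_rule(full))     # False -> full video is captured
```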
5. Exclude non-English interface
Limit the scope of your crawls to only archive the Facebook interface in English (this will not affect posts written in other languages) by adding a rule to block URLs that match the following regular expression: ^https?://..-..\.facebook\.com.*
If adding this rule to the Collection Scope tab, enter facebook.com in the host field.
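The regular expression in this step matches Facebook's locale-specific subdomains, which take the form of two letters, a hyphen, and two letters (for example fr-fr for French). A quick sketch of how it behaves, with illustrative URLs:

```python
import re

# Block rule from step 5: matches locale-specific hosts like fr-fr.facebook.com
NON_ENGLISH = re.compile(r"^https?://..-..\.facebook\.com.*")

print(bool(NON_ENGLISH.match("https://fr-fr.facebook.com/internetarchive/")))  # True  -> blocked
print(bool(NON_ENGLISH.match("https://www.facebook.com/internetarchive/")))    # False -> kept
```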
6. Exclude personal user profiles
Prevent the capture of individual Facebook users' personal profiles from archiving by limiting your crawl to block all URLs that match the following regular expression: ^https?://www\.facebook\.com/(profile\.php|people/).*$
If adding this rule to the Collection Scope tab, enter facebook.com in the host field.
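This regular expression blocks the two URL patterns Facebook uses for personal profiles, profile.php and /people/, while leaving page URLs untouched. A brief sketch, with hypothetical profile URLs for illustration:

```python
import re

# Block rule from step 6: personal profile URLs use profile.php or /people/
PROFILE_BLOCK = re.compile(r"^https?://www\.facebook\.com/(profile\.php|people/).*$")

print(bool(PROFILE_BLOCK.match("https://www.facebook.com/profile.php?id=12345")))       # True  -> blocked
print(bool(PROFILE_BLOCK.match("https://www.facebook.com/people/Some-Name/12345")))     # True  -> blocked
print(bool(PROFILE_BLOCK.match("https://www.facebook.com/internetarchive/")))           # False -> kept
```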
7. Crawl using Brozzler
What to expect from archived Facebook seeds
Because of the way that our crawling technology captures Facebook, you can anticipate the following:
- Multiple distinct capture dates may appear on the Wayback calendar page of your seed for the time period of your crawl, and one or more of these access points may lead to your desired page in its public form; that is, how it appears to a person who is not logged into Facebook.
- Archived Facebook pages should appear the same as their live analogs do to logged-in Facebook users with the following exceptions:
- You will not be able to expand comments sections of posts in order to see any more comments than our crawling technology saw.
- Posts will not expand when clicked.
- Each post will display how many times it has been “liked,” however you will not be able to browse through the specific names of Facebook users who “liked” any post.
- Scroll will be limited.
- There will be no scroll on Facebook Groups' seeds (i.e. https://www.facebook.com/groups/).
Troubleshooting
- If you have a 3GB data limit in place and do not get a complete capture with all the rules on this page, incrementally increase the seed-level data limit by 2GB and run a new test crawl at each stage to determine the ideal limit for your seed. If pages undergo large changes, recurring crawls may need one crawl with a higher data limit to capture the new content before returning to a lower limit.
- It is possible to use the Wayback QA tool on saved test crawls and production crawls to capture missing content. It is best to use this tool to run patch crawls as soon after a crawl ends as possible, because some URLs will be time sensitive, due to how Facebook serves up its content.