Overview
Facebook is an online social media and social networking service. This guide provides an overview of how to properly format, scope, and crawl Facebook seeds. Currently, there are some known issues with archiving Facebook, which you can read more about below.
Known issues
Social media platforms like Facebook can be difficult to archive. Currently, Facebook has the following issues which we continue to actively monitor:
- ❌ Facebook is blocking the capture of most organizational profile pages.
- ❌ Facebook is also blocking the capture of Facebook Groups pages.
You can find a full list of known issues for archiving various platforms on our Social media and other platforms status page.
On this page:
- How to select and format your Facebook seeds
- Scoping Facebook seeds
- Running your crawl
- What to expect from archived Facebook seeds
- Troubleshooting
How to select and format your Facebook seeds
Facebook seeds
You can add Facebook pages, profiles, and/or groups to your collection in order to crawl, archive, and replay them as you would any other seed site, just so long as you remember to format and scope them according to a few simple rules.
Follow our standard guidance for adding seeds to your collection, but keep the following principles in mind:
- Be specific! Only add seeds for specific users, groups, events, etc. Do not attempt to crawl and archive all of Facebook, and don't forget to add a trailing slash to the end of your Facebook URL.
- Use the HTTPS version of the URL. Since Facebook serves its content exclusively from HTTPS, you can avoid potential crawl problems related to redirects by formulating your seed using "https://" instead of "http://"
- Use the Standard seed type. We advise strongly against using the Standard Plus seed type when crawling Facebook.
- Crawl for at least one day. A crawl shorter than one day is unlikely to capture the data necessary to display a Facebook page in Wayback.
- Check a seed's availability on the live web. Archive-It crawlers will not be able to access content from a Facebook page that requires a user to be logged in. It is possible to add user credentials to your Facebook seeds; however, we advise against this. User credentials added to Facebook seeds have sometimes been flagged by Facebook's bot tracker, which may result in that user account being locked.
- Use helper seeds. To optimize in-page navigation between subpages of an archived Facebook seed, add each subpage of the Facebook page you are crawling as its own private helper seed (Standard seed type) and crawl all seeds together. Each helper seed will also have the default scoping rules added.
- E.g. for the public seed https://www.facebook.com/internetarchive/, add https://www.facebook.com/internetarchive/photos/ and https://www.facebook.com/internetarchive/videos/ and set them to private
Setting the helper seeds to private ensures that the homepage of your Facebook seed, e.g. https://www.facebook.com/internetarchive/, remains the primary point of entry for accessing that content on your public landing page.
Note: Some Facebook subpages may look like this on the live web: https://www.facebook.com/pg/internetarchive/photos/?ref=page_internal. For the best possible capture, remove the "pg" in the middle of the URL and the trailing string "?ref=page_internal" when adding these subpages as seeds.
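If you are preparing several subpage helper seeds at once, the sketch below shows one way to make those two fixes programmatically. It is a minimal illustration in Python; the clean_subpage_url helper and the printed URL are examples only and are not part of Archive-It.

```python
from urllib.parse import urlparse, urlunparse

def clean_subpage_url(url):
    """Hypothetical helper: drop the '/pg' path segment and any query string
    (e.g. '?ref=page_internal') from a Facebook subpage URL."""
    parts = urlparse(url)
    path = parts.path
    if path.startswith("/pg/"):
        path = path[len("/pg"):]  # remove the leading '/pg' segment
    # Rebuild the URL without the query string or fragment
    return urlunparse((parts.scheme, parts.netloc, path, "", "", ""))

print(clean_subpage_url("https://www.facebook.com/pg/internetarchive/photos/?ref=page_internal"))
# -> https://www.facebook.com/internetarchive/photos/
```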
Embedded Facebook feeds
If you want to capture an embedded Facebook feed, add the Facebook scoping rules below to the seeds that contain the embedded feed. Scoping rules for embedded feeds are not applied automatically and must be added manually.
Scoping Facebook seeds
Default scoping for Facebook seeds
New Facebook seeds added to collections will have the following default scoping rules applied automatically at the seed level; older Facebook seeds can be updated by adding the below scoping rules manually or following these instructions.
To learn more, please visit Sites with automated scoping rules.
1. Apply seed-level data limit
The amount of data captured from Facebook can vary widely, from 1 GB to 20 GB, depending on the type of page and its content. Start by limiting the scope of each Facebook seed to 3 GB of data. If you do not get a complete capture, or feel more data than necessary is being captured, try incrementally changing the seed-level data limit and run a new test crawl at each stage to determine the ideal limit for your seed.
To read about how to add these rules, please consult our documentation on seed-level scoping rules.
2. Ignore robots.txt
Ignore robots.txt for each Facebook seed at the seed level.
OR Add collection-level scoping rules to ignore robots exclusions on the following hosts, exactly as they appear here:
- www.facebook.com - in order to archive Facebook-hosted content
- fbcdn.net - in order to archive important page styling elements
- akamaihd.net - in order to archive important page styling elements
3. Expand scope to capture scroll
To archive content from Facebook's dynamically scrolling pages, expand the scope of your crawl at the seed level to include URLs that match the following SURT, exactly as it appears here: http://(net,fbcdn,
Note that scroll on Facebook Groups' seeds (i.e. https://www.facebook.com/groups/) will be limited.
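To see why this SURT covers Facebook's CDN content, it can help to remember that a SURT reverses a hostname's labels (and conventionally normalizes the scheme to http). The sketch below is a rough, illustrative transformation in Python, not the crawler's actual implementation, and the CDN URL shown is made up for the example.

```python
from urllib.parse import urlparse

def surt_prefix(url):
    """Rough SURT-style prefix: normalize the scheme to http and reverse the
    hostname's labels, e.g. 'video.xx.fbcdn.net' -> 'http://(net,fbcdn,xx,video,'."""
    host = urlparse(url).hostname
    return "http://(" + ",".join(reversed(host.split("."))) + ","

cdn_url = "https://video.xx.fbcdn.net/v/example.mp4"  # hypothetical CDN URL
print(surt_prefix(cdn_url))                                   # http://(net,fbcdn,xx,video,
print(surt_prefix(cdn_url).startswith("http://(net,fbcdn,"))  # True: within the expanded scope
```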
4. Exclude segmented video files
Facebook frequently serves videos in segmented files and the crawler will capture these segments in addition to files that contain the entire video. Exclude segmented video files at the seed level by adding a seed level block URL rule to all Facebook seeds on all URLs that include the text: bytestart=
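As a rough illustration of what this block rule does, any URL containing "bytestart=" would be skipped while the full video file is still captured. The helper and URLs below are hypothetical and only demonstrate the substring match.

```python
def blocked_as_video_segment(url):
    """Mimic the seed-level block rule: exclude URLs containing 'bytestart='."""
    return "bytestart=" in url

# Hypothetical URLs for illustration only
segment_url = "https://video.xx.fbcdn.net/v/example.mp4?bytestart=0&byteend=524287"
full_url = "https://video.xx.fbcdn.net/v/example.mp4"
print(blocked_as_video_segment(segment_url))  # True  -> excluded from the crawl
print(blocked_as_video_segment(full_url))     # False -> captured
```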
5. Exclude non-English interface
Limit the scope of your crawls to only archive the Facebook interface in English (this will not affect posts written in other languages) by adding a rule to block URLs that match the following regular expression: ^https?://..-..\.facebook\.com.*
If adding this rule to the Collection Scope tab, enter facebook.com in the host field.
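If you want to sanity-check this expression before adding it, the short Python sketch below shows how it matches locale subdomains such as es-la.facebook.com while leaving www.facebook.com alone; the sample URLs are illustrative only.

```python
import re

locale_block = re.compile(r"^https?://..-..\.facebook\.com.*")

print(bool(locale_block.match("https://es-la.facebook.com/internetarchive/")))  # True  -> blocked
print(bool(locale_block.match("https://fr-fr.facebook.com/internetarchive/")))  # True  -> blocked
print(bool(locale_block.match("https://www.facebook.com/internetarchive/")))    # False -> crawled
```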
6. Exclude personal user profiles
Prevent the capture of individual Facebook users' personal profiles from archiving by limiting your crawl to block all URLs that match the following regular expression: ^https?://www\.facebook\.com/(profile\.php|people/).*$
If adding this rule to the Collection Scope tab, enter facebook.com in the host field.
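As with the previous rule, you can sanity-check this expression against sample URLs; the sketch below uses made-up profile URLs purely for illustration.

```python
import re

profile_block = re.compile(r"^https?://www\.facebook\.com/(profile\.php|people/).*$")

print(bool(profile_block.match("https://www.facebook.com/profile.php?id=1234567890")))       # True  -> blocked
print(bool(profile_block.match("https://www.facebook.com/people/Example-Name/1234567890")))  # True  -> blocked
print(bool(profile_block.match("https://www.facebook.com/internetarchive/")))                # False -> crawled
```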
Running your crawl
Once you have finished selecting your seeds and adding recommended scoping rules, we highly recommend that you crawl your seeds using Brozzler.
We recommend running a test crawl after implementing any new rules and crawling for at least one day; a crawl shorter than one day is unlikely to capture the data necessary to display a Facebook page in Wayback.
What to expect from archived Facebook seeds
Because of the way that our crawling technology captures Facebook, you can anticipate the following:
- Multiple distinct capture dates may appear on the Wayback calendar page of your seed for the time period of your crawl, and one or more of these access points may lead to your desired page in its public form, that is, how it appears to a person not logged into Facebook.
- Archived Facebook pages should appear the same as their live analogs do to logged-in Facebook users with the following exceptions:
- You will not be able to expand comments sections of posts in order to see any more comments than our crawling technology saw.
- Posts will not expand when clicked.
- Each post will display how many times it has been “liked”; however, you will not be able to browse through the specific names of Facebook users who “liked” any post.
- Scroll will be limited.
- There will be no scroll on Facebook Groups' seeds (i.e. https://www.facebook.com/groups/).
Troubleshooting
- If you have a 3 GB data limit in place and do not get a complete capture with all the rules on this page, incrementally increase the seed-level data limit by 2 GB and run a new test crawl at each stage to determine the ideal limit for your seed. If a page undergoes large changes, a recurring crawl may need to run once with a higher data limit to capture the new content before returning to the lower limit.
- You can use the Wayback QA tool on saved test crawls and production crawls to capture missing content. Because of how Facebook serves its content, some URLs are time sensitive, so it is best to run patch crawls as soon as possible after a crawl ends.