To continue limiting Facebook seeds to 1GB of new data, partners must now apply all of the scoping rules that were previously “optional” to their Facebook crawls. The help center instructions have been updated to reflect that these rules are now “recommended”.
We recommend that all partners review the revised help center instructions on archiving Facebook and adjust their seeds accordingly. Please visit Archiving Facebook for the full instructions, including all options and seed formulation guidelines, or use the following abbreviated rule list:
- Set a seed-level data limit of at least 1GB for each Facebook seed.
- Ignore robots.txt for each Facebook seed at the seed level.
- Expand scope at the seed level to include URLs that match the following SURT, exactly as it appears here: +http://(net,fbcdn,
- Block URLs at the seed level that include the text: bytestart=
- Block URLs at the seed level that match the following regular expression: ^https?://..-..\.facebook\.com.*
- Block URLs at the seed level that match the following regular expression: ^https?://www\.facebook\.com/(profile\.php|people/).*$
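To sanity-check the block rules above before applying them to your seeds, you can test them against sample URLs. The sketch below is for illustration only, not part of any crawler: it expresses the three block rules (the bytestart= text match and the two regular expressions) in Python, and the sample URLs are hypothetical.

```python
import re

# The two seed-level block regexes, copied verbatim from the rule list above.
subdomain_block = re.compile(r"^https?://..-..\.facebook\.com.*")
profile_block = re.compile(r"^https?://www\.facebook\.com/(profile\.php|people/).*$")


def is_blocked(url: str) -> bool:
    """Return True if a URL would be excluded by the seed-level block rules."""
    return (
        "bytestart=" in url                      # text-match rule
        or bool(subdomain_block.match(url))      # two-letter edge subdomains
        or bool(profile_block.match(url))        # profile.php and /people/ pages
    )


# Hypothetical example URLs:
print(is_blocked("https://xx-yy.facebook.com/ajax/anything"))    # True
print(is_blocked("https://www.facebook.com/profile.php?id=1"))   # True
print(is_blocked("https://www.facebook.com/ExamplePage"))        # False
```

Note that the SURT expansion rule (+http://(net,fbcdn,) is a scope-inclusion rule rather than a block rule, so it is not represented here.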
Based on extensive testing, 1GB is now the minimum amount of data necessary to capture a logged-in page with scroll. Test higher seed-level data limits if you find that a 1GB limit is insufficient to meet your goals.
Security captchas encountered by recent Facebook crawls currently appear as entries on the calendar page. We are working on a solution that would block these entries from appearing on the calendar page and will update you here when one has been found.
Please submit a support ticket if you have any questions about your crawl results.