To continue limiting Facebook seeds to 1GB of new data, partners must now apply all of the scoping rules that were previously “optional” to their Facebook crawls. The help center instructions have been updated to reflect that these rules are now “recommended”.
We recommend that all partners review the revised help center instructions on archiving Facebook and adjust their seeds accordingly. Please visit Archiving Facebook for the full instructions, including all options and seed formulation guidelines, or use the following abbreviated rule list:
- Set a seed-level data limit of at least 1GB for each Facebook seed.
- Ignore robots.txt for each Facebook seed at the seed level.
- Expand scope at the seed level to include URLs that match the following SURT, exactly as it appears here: +http://(net,fbcdn,
- Block URLs at the seed level that include the text: bytestart=
- Block URLs at the seed level that match the following regular expression: ^https?://..-..\.facebook\.com.*
- Block URLs at the seed level that match the following regular expression: ^https?://www\.facebook\.com/(profile\.php|people/).*$
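To sanity-check the block rules above before applying them to your seeds, you can test them against sample URLs. The sketch below is for illustration only, not part of any crawler: it expresses the three block rules (the bytestart= text match and the two regular expressions) in Python, and the sample URLs are hypothetical.

```python
import re

# The two seed-level block regexes, copied verbatim from the rule list above.
subdomain_block = re.compile(r"^https?://..-..\.facebook\.com.*")
profile_block = re.compile(r"^https?://www\.facebook\.com/(profile\.php|people/).*$")


def is_blocked(url: str) -> bool:
    """Return True if a URL would be excluded by the seed-level block rules."""
    return (
        "bytestart=" in url                      # text-match rule
        or bool(subdomain_block.match(url))      # two-letter edge subdomains
        or bool(profile_block.match(url))        # profile.php and /people/ pages
    )


# Hypothetical example URLs:
print(is_blocked("https://xx-yy.facebook.com/ajax/anything"))    # True
print(is_blocked("https://www.facebook.com/profile.php?id=1"))   # True
print(is_blocked("https://www.facebook.com/ExamplePage"))        # False
```

Note that the SURT expansion rule (+http://(net,fbcdn,) is a scope-inclusion rule rather than a block rule, so it is not represented here.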
Based on extensive testing, 1GB is now the minimum amount of data necessary to capture a logged-in page with scroll. Test higher seed-level data limits if you find that a 1GB limit is insufficient to meet your goals.
Security captchas encountered by recent Facebook crawls currently appear as entries on the calendar page. We are working on a solution that would block these entries from appearing on the calendar page and will update you here when one has been found.
Please submit a support ticket if you have any questions about your crawl results.