Overview
Instagram is a photo and video-sharing application and social networking service. This guide provides an overview of how to properly format, scope, and crawl Instagram seeds. There are currently known issues with archiving Instagram, which are detailed below.
Known issues
Social media platforms like Instagram can be difficult to archive. Currently, Instagram has the following issues that we continue to actively monitor:
- ⚠️ Instagram is blocking collection and Wayback replay beyond the default 12-post page load for most organizational and personal profile pages. To view an individual post in replay, open it in a new tab. A page may take 10-20 seconds to load.
- As a potential workaround, we have seen some success crawling third-party Instagram viewer platforms. See the Troubleshooting section below.
For a full list of known issues for archiving various platforms, see Status of monitored platforms.
On this page:
- How to select and format your Instagram seeds
- Scoping Instagram seeds
- Running your crawl
- What to expect from your archived Instagram seeds
- Troubleshooting
How to select and format your Instagram seeds
- Be specific. Always include a specific user, followed by a trailing /. For example: https://www.instagram.com/internetarchive/
- Use the Standard seed type for Instagram seeds.
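The formatting rule above can be sketched as a small helper for preparing seed lists before adding them to a collection. This is a minimal illustration, not part of Archive-It: the function name format_instagram_seed is hypothetical, and the sketch assumes the first path segment of an Instagram URL is the username.

```python
from urllib.parse import urlparse


def format_instagram_seed(value: str) -> str:
    """Return a seed of the form https://www.instagram.com/<user>/.

    Hypothetical helper: assumes the first path segment is the username.
    """
    value = value.strip()
    if "/" not in value:
        # A bare handle such as "@internetarchive" or "internetarchive".
        username = value.lstrip("@")
    else:
        # Add a scheme if it is missing so urlparse can find the host.
        if "://" not in value:
            value = "https://" + value
        parsed = urlparse(value)
        if parsed.netloc.lower() not in ("instagram.com", "www.instagram.com"):
            raise ValueError(f"not an Instagram URL: {value!r}")
        segments = [s for s in parsed.path.split("/") if s]
        username = segments[0] if segments else ""
    if not username:
        raise ValueError("seed must include a specific user")
    # Always end with a trailing slash, as recommended above.
    return f"https://www.instagram.com/{username}/"


seed = format_instagram_seed("instagram.com/internetarchive")
```

A seed without a user, such as https://www.instagram.com/, raises a ValueError in this sketch rather than producing an overly broad seed.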
Scoping Instagram seeds
Default scoping for Instagram seeds
New Instagram seeds will have the default scoping rules automatically applied at the seed level when they are added to a collection. To learn more, including how you can add default scoping rules to existing seeds, visit Sites with automated scoping rules.
- At the seed level, add an Ignore robots.txt scoping rule. Note: Ignore robots.txt is automatically applied to all new Instagram seeds.
-OR-
- At the collection level, add a scoping rule to ignore robots.txt for the hosts www.instagram.com and fbcdn.net.
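When reviewing a list of hosts or URLs from a crawl, it can help to check which ones fall under the two hosts named in the collection-level rule. The sketch below is only an illustration: the function name and sample URLs are hypothetical, and it assumes a rule for a host also covers that host's subdomains, which may not match how the crawler applies the rule.

```python
from urllib.parse import urlparse

# Hosts named in the collection-level rule above.
RULE_HOSTS = ("www.instagram.com", "fbcdn.net")


def covered_by_rule(url: str, rule_hosts=RULE_HOSTS) -> bool:
    """Return True if the URL's host is a rule host or one of its subdomains.

    Subdomain matching is an assumption made for this sketch.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in rule_hosts)


# Hypothetical URLs, for illustration only.
checks = {
    url: covered_by_rule(url)
    for url in (
        "https://www.instagram.com/internetarchive/",
        "https://media.example.fbcdn.net/photo.jpg",
        "https://example.com/",
    )
}
```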
Running your crawl
Once you have finished selecting your seeds and adding recommended scoping rules, we highly recommend that you crawl your seeds using Brozzler.
What to expect from your archived Instagram seeds
Captures replay the default load (up to 12 images) of content on Instagram feeds. If a page initially appears blank, wait 10-20 seconds for the page to render. To view an individual post from a feed, right-click it and open it in a new tab. To play back videos and other media, use the media link in the Wayback banner.
Troubleshooting
While we work to address the known issues with archiving Instagram, we have seen some success crawling third-party Instagram viewer platforms such as Picuki.
To try Picuki:
- Search for your Instagram account on the platform.
- In your Archive-It account, add the Picuki URL as a seed. We recommend the following:
  - Use the Standard seed type.
  - Add a seed-level scoping rule to ignore robots.txt files.
  - Add a second seed-level scoping rule to 'Accept URLs if they contain the text media' (this helps collect more posts; however, scrolling is limited when replaying them).
- Crawl with Brozzler.
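The effect of a "contain the text" rule can be pictured as a substring check over the URLs a crawl discovers. The sketch below is only an illustration: the function name, the viewer.example host, and the sample URLs are hypothetical, and it assumes a plain, case-sensitive substring match, which may differ from how the crawler evaluates the rule.

```python
def accept_if_contains(urls, text="media"):
    """Return the URLs that contain `text` (assumed: plain substring match)."""
    return [u for u in urls if text in u]


# Hypothetical URLs, for illustration only.
discovered = [
    "https://viewer.example/profile/internetarchive",
    "https://viewer.example/media/1234567890",
    "https://viewer.example/about",
]
accepted = accept_if_contains(discovered)
```

In this sketch only the URL whose path includes media is kept, which mirrors why the rule helps bring individual posts into scope.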