Instagram Web Crawling

December 17, 2024 16:28

We have a request from a student group to capture their Instagram account. As a test, I tried to capture my personal account with no luck - just a blank page.

I first tried it without credentials; then I added my login credentials; then I set the crawl to ignore robots.txt. Each time I got nothing more than a blank page.

Does anyone have advice or tricks they use to capture Instagram? Thanks - Dan

Comments

Skip Kendall December 17, 2024 16:39

Hi Dan,

Instagram's a pain. We haven't been able to crawl it directly for more than a year; it just isn't possible. For a while, we used manual tools (Webrecorder, Conifer) and then uploaded the WARCs to Archive-It. That worked until something changed at Instagram and our WARCs no longer replayed at Archive-It. Archive-It suggests picuki.com, with which we had a lot of success. Picuki does its own replay of Instagram feeds. We crawled that through Archive-It until recently, when Picuki started blocking the Archive-It crawler. Then I did it manually for a short time, which was very tedious, and even that has become problematic in the last month or so. Picuki now has security software with a hair trigger: if I move too quickly in a manual crawl, it identifies me as a machine. I managed to get my IP address banned for a while because of that.

So, at present, the only way I know to get Instagram feeds is to capture Picuki manually and go very slowly. It's definitely not scalable, but it does work.

Skip

Dan Nooonan December 17, 2024 16:43

Skip Kendall Thanks, Skip! I tried Conifer too, and that was worse. I'm not willing to pay for Webrecorder just now (nor could I justify it as an additional subscription). I could give Picuki a shot, as the request we have has fewer than 250 posts. Thanks!
