web scraping - Java web scraper sees CAPTCHA
I have written a web scraper for Google Scholar in Java using jsoup. The scraper searches Scholar for a DOI and finds the citations of the paper. I need this data for research.
However, the scraper only works for the first few requests. After that, it encounters a CAPTCHA on the Scholar site.
Yet when I open the website in a browser (Chrome), Google Scholar opens normally.
How is this possible? All requests come from the same IP address! So far I have tried the following options:
- Choosing a random User-Agent for each request (from a list of 5 user agents)
- Adding a random delay of 5 to 50 seconds between requests
- Using a Tor proxy, but the end nodes have been blocked by Google
When I analyse the requests Chrome makes to Scholar, I see that a cookie with session IDs is used. Is that why the Chrome requests are not blocked? Is it possible to use this cookie for requests made with jsoup?
Thank you!
There are 3 things that spring to mind:
- You aren't saving cookies between requests. Your first request should save the cookies it receives and pass them back to the server on the next request (setting the Referer header wouldn't hurt either). There's an example here.
- If Google is being tricky, it can see that your first request didn't go on to load the CSS/JS/images on the page. That is a sure sign of a bot.
- JavaScript is doing something in the page once it has loaded.
I think the first option is the most likely. You should try to copy as many of the headers you see in the Chrome request as you can into your Java code.
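To illustrate the first option, here is a minimal sketch of the state a browser carries from one request to the next: the cookies it was given and the page it came from. It uses only the JDK's `java.net.CookieManager`, not jsoup, so it shows the mechanism rather than the scraper's actual code. The class name `BrowserLikeSession`, its two methods, the User-Agent string and the `NID` cookie in the usage note are all assumptions made up for illustration; copy the real header values from Chrome's developer tools.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: remembers cookies and the last URL between requests. */
public class BrowserLikeSession {
    // Assumption: placeholder value, replace with the string your own Chrome sends.
    static final String USER_AGENT =
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
        + "Chrome/120.0.0.0 Safari/537.36";

    // In-memory cookie jar; ACCEPT_ALL keeps every cookie the server sets.
    private final CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
    private String previousUrl; // sent as the Referer of the next request

    /** Call after each response: stores its Set-Cookie headers and remembers the URL. */
    public void recordResponse(String url, List<String> setCookieHeaders) {
        try {
            jar.put(URI.create(url), Map.of("Set-Cookie", setCookieHeaders));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        previousUrl = url;
    }

    /** Builds the headers to send with the next request to the given URL. */
    public Map<String, String> headersFor(String url) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", USER_AGENT);
        headers.put("Accept-Language", "en-US,en;q=0.9");
        if (previousUrl != null) {
            headers.put("Referer", previousUrl);
        }
        try {
            // The jar returns only the cookies that match this URL's host and path.
            List<String> cookies = jar.get(URI.create(url), Map.of()).get("Cookie");
            if (cookies != null && !cookies.isEmpty()) {
                headers.put("Cookie", String.join("; ", cookies));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return headers;
    }
}
```

For example, after `recordResponse("https://scholar.google.com/", List.of("NID=abc123; Path=/"))`, the map returned by `headersFor("https://scholar.google.com/scholar?q=test")` contains both a `Cookie` and a `Referer` entry. With jsoup itself you may not need a helper like this: `Connection.Response.cookies()` returns the cookies of a response as a map, and `Connection.cookies(Map)` attaches such a map to the next request.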