Using the robotparser package in Python
The robotparser module can help you understand the robots.txt file before scraping a website.
Introduction
During my daily reads on Medium, I bumped into this quick article where the author gives a couple of tips for those who want to get into web scraping.
One of the tips that caught my attention was about the robotparser package from urllib in Python. It is a small module that helps us understand the robots.txt file.
But what exactly is this file?
robots.txt is a file that many websites publish at their root to tell you which pages are allowed or disallowed to be scraped. Additionally, that file states the rules for scraping the website, such as the expected request rate.
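For illustration, here is what a small robots.txt might look like; the paths and values below are invented, not taken from any real site:

```
# Hypothetical robots.txt; paths and values are invented for illustration
User-agent: *
Disallow: /private/
Allow: /public/
# Non-standard extensions that urllib.robotparser understands:
Crawl-delay: 10
Request-rate: 1/10
```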
Usually, well-behaved crawlers are programmed to look for that file and follow its rules when scraping a website.
Let’s see how to use the module.
Parser
The page in the Reference section is very good documentation. Here is the basic use of robotparser.
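A minimal sketch of how it can be used follows; the URL and user-agent string are placeholders, not taken from the original article:

```python
from urllib import robotparser

# Point the parser at the site's robots.txt and download it
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# Ask whether a given user agent may fetch a specific URL
print(rp.can_fetch("*", "https://www.example.com/some/page"))

# The parser also exposes the non-standard Crawl-delay and Request-rate rules
print(rp.crawl_delay("*"))
rate = rp.request_rate("*")
if rate is not None:
    print(f"{rate.requests} requests every {rate.seconds} seconds")
```

can_fetch returns True or False, while crawl_delay and request_rate return None when the corresponding rule is not present in the file.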