Using the robotparser package in Python

Gustavo Santos
3 min read · Mar 5, 2024

Robot Parser can help you understand a site's robots.txt file before you scrape it.

Photo by Glenn Carstens-Peters on Unsplash

Introduction

During my daily reads on Medium, I bumped into this quick article, where the author gives a couple of tips for those who want to get into web scraping.

One of the tips that caught my attention was about the robotparser module from Python's urllib package. It is a small module that helps us read and interpret a site's robots.txt file.

But what exactly is this file?

robots.txt is a plain-text file that many websites publish at their root. It tells you which pages crawlers are allowed or disallowed to scrape. Additionally, it can state rules for scraping the website, such as the expected request rate.

Usually, well-behaved crawlers are programmed to look for that file and use it as their guideline when scraping a website.
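For illustration, here is what a hypothetical robots.txt might look like; the paths and values below are made up:

```
# Hypothetical robots.txt for an example site
User-agent: *        # the rules below apply to all crawlers
Disallow: /admin/    # do not crawl anything under /admin/
Disallow: /search    # do not crawl the search results page

Crawl-delay: 10      # wait 10 seconds between requests
Request-rate: 1/5    # at most 1 request every 5 seconds
```

The Disallow lines mark off-limits paths, while Crawl-delay and Request-rate are the kind of politeness rules mentioned above.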

Let’s see how to use the module.

Parser

The official Python documentation for urllib.robotparser is very good. Here is the basic use of robotparser.
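Below is a minimal sketch of that basic usage; the target site and the user agent string "MyCrawler" are illustrative, not taken from a real crawler:

```python
from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt and download it
rp = RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # illustrative URL
rp.read()

# Check whether a given user agent may fetch a given URL
if rp.can_fetch("MyCrawler", "https://www.example.com/some/page"):
    print("Allowed to scrape this page")
else:
    print("Disallowed by robots.txt")

# Politeness hints, if the file declares them
print(rp.crawl_delay("MyCrawler"))   # seconds from Crawl-delay, or None
print(rp.request_rate("MyCrawler"))  # RequestRate(requests, seconds), or None
```

can_fetch() returns a boolean, while crawl_delay() and request_rate() return None when the corresponding directive is absent from the file, so check for None before using those values to throttle your requests.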
