txtbot

txtbot is a specialized web crawler designed to access and analyze plain text files (such as robots.txt, security.txt, humans.txt, llms.txt, etc.) across the internet.

User Agent

You can identify txtbot by any of the following user agent strings:

txtbot (+https://txtbot.net)

or

Mozilla/5.0 (compatible; txtbot; +https://txtbot.net)
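
For illustration, here is a minimal sketch of how a server-side script could recognize txtbot from a request's User-Agent header. The simple substring check is an assumption on our part, not something txtbot requires; it happens to match both strings above.

# Sketch: detect txtbot from a User-Agent header value.
# The lowercase substring check is an assumption that matches
# both user agent strings listed above.
def is_txtbot(user_agent: str) -> bool:
    return "txtbot" in user_agent.lower()

print(is_txtbot("Mozilla/5.0 (compatible; txtbot; +https://txtbot.net)"))  # True
print(is_txtbot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))              # False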

Purpose

The primary purpose of txtbot is to crawl and analyze plain text files (such as robots.txt, security.txt, humans.txt, llms.txt, etc.). It is not designed to access HTML pages, although a website may point it toward one, for example by answering a request for a text file with a 301 or 302 redirect or with a 404 Not Found; in those cases the body of the response is discarded. If the website returns a 200 OK response with a body, txtbot parses the body and extracts what information it can.
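
To make the behavior described above concrete, the following is a rough sketch in Python using the requests library. It is not txtbot's actual implementation, only an illustration of the response handling just described.

# Rough illustration of the response handling described above;
# not txtbot's actual code.
import requests

def fetch_txt(url: str):
    # Do not follow redirects automatically, so redirect responses
    # can be recognized and their bodies discarded.
    response = requests.get(url, allow_redirects=False, timeout=10)
    if response.status_code == 200:
        # 200 OK with a body: keep the text for analysis.
        return response.text
    # Redirects (301/302), 404 Not Found, etc.: discard the body.
    return None

body = fetch_txt("https://example.com/robots.txt")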

Crawling Frequency

txtbot is designed to access individual plain text files (such as robots.txt, security.txt, humans.txt, llms.txt, etc.) no more than a few times per day.

Note: In some edge cases txtbot may crawl slightly more or less often than this. It is engineered to minimize its impact and should not place a noticeable load on most sites.

Blocking txtbot

txtbot respects robots.txt directives, so you can block it with either of the following methods:

Method 1: Using robots.txt (recommended)

Add the following lines to your robots.txt file:

User-agent: txtbot
Disallow: /

Important: This method will not prevent txtbot from accessing the robots.txt file itself, as defined in section 2.2.2 of RFC 9309.

Method 2: IP Blocking

Alternatively, block the following IP addresses at the server/firewall level:

Important: Blocking via robots.txt is the recommended and most reliable method, because these IP addresses may change over time: new addresses may be added, old ones may be removed, and temporary addresses may be used.
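
If you do opt for IP-based blocking, the check itself is simple. The sketch below uses Python and hypothetical placeholder addresses from the documentation range 203.0.113.0/24 (the actual addresses are not reproduced here); it shows an application-level check, and a firewall rule works on the same principle.

# Sketch of an application-level IP check. The addresses below are
# hypothetical placeholders (documentation range), not txtbot's real IPs.
from ipaddress import ip_address, ip_network

BLOCKED_NETWORKS = [ip_network("203.0.113.0/24")]

def is_blocked(client_ip: str) -> bool:
    addr = ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)

print(is_blocked("203.0.113.7"))   # True
print(is_blocked("198.51.100.1"))  # False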

Contact

If you have any questions or concerns regarding txtbot, please reach out at txtbot@txtbot.net.