Google has declared it is open sourcing its robots.txt parser software, in parallel with an effort to standardise robots.txt, the venerable protocol that allows website owners to tell “automated clients” what they can and can’t do on their sites.
Robots.txt was first proposed by Martijn Koster back in 1994 – five years after Tim Berners-Lee kicked off the web – as a way of giving site owners some control over what the then-nascent web crawlers did when they alighted on a site.
It became a de facto standard and was central to the orderly growth of early search giants such as Lycos and AltaVista, and eventually Google. But it was never an agreed standard, meaning there was room for site owners, developers, and search outfits to get things wrong.
Google declared today that “we’re spearheading the effort to make the REP an internet standard” (the REP being the Robots Exclusion Protocol), adding, “While this is an important step, it means extra work for developers who parse robots.txt files.”
Regarding the protocol, Google said, “Since its inception, the REP hasn’t been updated to cover today’s corner cases. This is a challenging problem for website owners because the ambiguous de-facto standard made it difficult to write the rules correctly.”
More importantly for DevClass readers, Google reckons developers have “interpreted the protocol somewhat differently over the years.”
So, it says, “Together with the original author of the protocol, webmasters, and other search engines, we’ve documented how the REP is used on the modern web, and submitted it to the IETF.”
The draft proposal “doesn’t change the rules created in 1994, but rather defines essentially all undefined scenarios for robots.txt parsing and matching, and extends it for the modern web,” says Google.
This means any URI-based transfer protocol can use robots.txt – so FTP or CoAP as well as HTTP. It also says “Developers must parse at least the first 500 kibibytes of a robots.txt.”
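That size limit is straightforward to honour: a crawler can simply stop reading after 500 KiB and parse whatever it has. The C++ sketch below illustrates the truncate-before-parse approach; the file name, helper function, and the parsing step it hands off to are all illustrative, not part of any particular library.

```cpp
#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>

// 500 kibibytes, the minimum a parser must handle per the draft.
constexpr std::size_t kMaxRobotsBytes = 500 * 1024;

// Read at most kMaxRobotsBytes from a local copy of robots.txt;
// anything beyond the limit is simply ignored.
std::string ReadTruncatedRobotsTxt(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  std::string body(kMaxRobotsBytes, '\0');
  in.read(&body[0], static_cast<std::streamsize>(body.size()));
  body.resize(static_cast<std::size_t>(in.gcount()));  // keep only what was read
  return body;
}

int main() {
  const std::string body = ReadTruncatedRobotsTxt("robots.txt");  // illustrative path
  std::cout << "Parsing " << body.size() << " bytes of robots.txt\n";
  // ... hand `body` to whichever robots.txt parser is in use ...
}
```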
It sets a maximum caching time of 24 hours, or the cache directive value if one is available, allowing site owners to update their robots.txt whenever they want while ensuring sites aren’t overloaded by crawlers fetching the file.
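How a crawler derives that cache lifetime is left to the implementer. One straightforward reading of the rule, sketched below in C++, is to fall back to 24 hours and use a Cache-Control max-age directive when the server provides one; the header parsing here is deliberately simplified.

```cpp
#include <chrono>
#include <iostream>
#include <optional>
#include <string>

using std::chrono::hours;
using std::chrono::seconds;

// Pull a max-age value out of a Cache-Control header, if present.
// Real header parsing is more involved; this is just for illustration.
std::optional<seconds> MaxAge(const std::string& cache_control) {
  const std::string key = "max-age=";
  const auto pos = cache_control.find(key);
  if (pos == std::string::npos) return std::nullopt;
  return seconds(std::stol(cache_control.substr(pos + key.size())));
}

// Cache lifetime for a fetched robots.txt: the server's directive when
// given, otherwise the 24-hour default described in the draft.
seconds RobotsCacheTtl(const std::string& cache_control) {
  const seconds fallback = hours(24);
  const auto directive = MaxAge(cache_control);
  return directive ? *directive : fallback;
}

int main() {
  std::cout << RobotsCacheTtl("public, max-age=3600").count() << "s\n";  // 3600
  std::cout << RobotsCacheTtl("").count() << "s\n";                      // 86400
}
```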
“As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right,” said Google.
At the same time, said Google, “We open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files.”
As Google puts it, “we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.”
More to the point, as it says in the GitHub repo, “The library is released open-source to help developers build tools that better reflect Google’s robots.txt parsing and matching.”
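For a feel of what the repository offers, here is a minimal sketch of calling the parser in the way the project’s README demonstrates, with a made-up robots.txt body, user agent, and URL; the Bazel and Abseil build setup the library requires is omitted.

```cpp
#include <iostream>
#include <string>

#include "robots.h"  // googlebot::RobotsMatcher, from github.com/google/robotstxt

int main() {
  // Illustrative inputs, not taken from any real site.
  const std::string robots_body =
      "User-agent: *\n"
      "Disallow: /private/\n";
  const std::string user_agent = "ExampleBot";
  const std::string url = "https://example.com/private/page.html";

  // Ask the matcher whether this user agent may fetch the URL
  // under the given robots.txt rules.
  googlebot::RobotsMatcher matcher;
  const bool allowed =
      matcher.OneAgentAllowedByRobots(robots_body, user_agent, url);

  std::cout << (allowed ? "allowed" : "disallowed") << "\n";
  return 0;
}
```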