Google on Monday released its robots.txt parsing and matching library as open source, in the hope that its now-public code will encourage web developers to agree on a standard way to spell out the proper etiquette for web crawlers.
The C++ library powers Googlebot, the company's crawler for indexing websites in accordance with the Robots Exclusion Protocol (REP), a scheme that allows website owners to declare how code that visits websites to index them should behave. REP specifies how directives can be included in a text file, robots.txt, to tell visiting crawlers like Googlebot which website resources can be visited and which can be indexed.
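A robots.txt file is just a plain-text list of such directives. A minimal example of the kind of rules the protocol covers might look like the following (the paths and site layout here are invented for illustration):

# Rules for Google's crawler
User-agent: Googlebot
Disallow: /private/
Allow: /private/annual-report.html

# Rules for all other crawlers
User-agent: *
Disallow: /search

Each group names a crawler via User-agent, and the Disallow and Allow lines tell that crawler which URL paths it may or may not fetch.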
In the 25 years since Martijn Koster, creator of the first web search engine, devised the rules, REP has been widely adopted by web publishers but never blessed as an official internet standard.
"[S]ince its inception, the REP hasn't been updated to cover today's corner cases," explained a trio of Googlers – Henner Zeller, Lizzi Harvey, and Gary Illyes – in a blog post. "This is a challenging problem for website owners because the ambiguous de-facto standard made it difficult to write the rules correctly."
For example, differences in the way text editors handle newline characters on different operating systems can prevent robots.txt files from working as expected.
Google's library goes out of its way to try to make such files less brittle. For example, it includes code to accept five different misspellings of the "disallow" directive in robots.txt.
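The general idea behind that kind of tolerance can be sketched in a few lines of C++. The snippet below is a rough illustration of lenient directive-key parsing, not Google's actual implementation, and the list of misspelled variants it accepts is hypothetical:

#include <algorithm>
#include <cctype>
#include <initializer_list>
#include <iostream>
#include <string>

// Hypothetical helper: lower-case a directive key and map common
// misspellings onto the canonical "disallow" token. Illustrative only.
std::string NormalizeKey(std::string key) {
  std::transform(key.begin(), key.end(), key.begin(),
                 [](unsigned char c) { return std::tolower(c); });
  // Misspellings a forgiving parser might choose to accept.
  for (const char* variant :
       {"disallow", "dissallow", "dissalow", "disalow", "diasllow"}) {
    if (key == variant) return "disallow";
  }
  return key;
}

int main() {
  std::cout << NormalizeKey("Dissallow") << "\n";  // prints "disallow"
}

A parser built this way still rejects unknown directives, but it stops a single typo from silently opening a site's private paths to crawlers.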
To make REP implementations more consistent, Google is pushing to make the REP an Internet Engineering Task Force standard. It has published a draft proposal in the hope that anyone concerned about such things will voice an opinion on what's needed.
The latest draft...