Google open sources its robots.txt parser to make Robots Exclusion Protocol an official internet standard

Yesterday, Google announced that it has teamed up with Martijn Koster, the creator of the Robots Exclusion Protocol (REP), and other webmasters to make the 25-year-old protocol an internet standard. The REP, better known as robots.txt, has now been submitted to the IETF (Internet Engineering Task Force). Google has also open-sourced its robots.txt parser and matcher as a C++ library.

The REP was created back in 1994 by Martijn Koster, a software engineer known for his contributions to internet search. Since its inception, it has been widely adopted by websites to indicate whether web crawlers and other automatic clients are allowed to access a site.

When an automatic client wants to visit a website, it first checks the site's robots.txt file, which contains rules like these:

User-agent: *
Disallow: /

The User-agent: * line means that the rule applies to all robots, and Disallow: / means that a robot is not allowed to visit any page of the site.
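Google's newly open-sourced parser lets a crawler answer exactly this question in C++. The following is a minimal sketch; the class and method names (googlebot::RobotsMatcher, OneAgentAllowedByRobots) are taken from the google/robotstxt repository, but treat the exact signatures as an assumption and check the library's headers before relying on them.

// Minimal sketch: asking Google's open-sourced C++ parser whether a URL may be
// crawled under the rules above. The class and method names follow the
// google/robotstxt repository; verify them against the library's robots.h.
#include <iostream>
#include <string>

#include "robots.h"  // from the google/robotstxt library

int main() {
  // The robots.txt body fetched from the site, matching the example above.
  const std::string robots_txt =
      "User-agent: *\n"
      "Disallow: /\n";

  googlebot::RobotsMatcher matcher;
  const bool allowed = matcher.OneAgentAllowedByRobots(
      robots_txt, "ExampleBot", "https://example.com/some/page.html");

  std::cout << (allowed ? "allowed" : "disallowed") << std::endl;  // prints "disallowed"
  return 0;
}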

Despite being widely used on the web, the REP has never become an internet standard. With no rules set in stone, developers have interpreted the “ambiguous de-facto protocol” differently over the years, and it has not been updated since its creation to address modern corner cases. The proposed draft is a standardized and extended version of the REP that gives publishers fine-grained control over what they would like to be crawled on their site and potentially shown to interested users.

The following are some of the important updates in the proposed REP:

  • It is no longer limited to HTTP and can be used by any URI-based transfer protocol, for instance, FTP or CoAP.
  • Crawlers must parse at least the first 500 kibibytes of a robots.txt file. Defining a maximum file size ensures that connections are not kept open for too long, avoiding unnecessary strain on servers.
  • It defines a new maximum caching time of 24 hours, after which crawlers must re-fetch robots.txt rather than reuse a stale copy. This lets website owners update their robots.txt whenever they want while keeping crawlers from overloading sites with robots.txt requests (a sketch of this rule and the size limit follows this list).
  • It also defines a provision for cases where a previously accessible robots.txt file becomes inaccessible because of server failures. In such cases, pages that are known to be disallowed are not crawled for a reasonably long period of time.
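To make the size and caching rules concrete, here is a small, self-contained sketch of how a crawler might enforce them. The cache structure, the fetch_robots_txt stand-in, and the constants are hypothetical illustrations of the draft's 500 KiB parse limit and 24-hour cache lifetime, not part of any published API.

#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical limits taken from the draft: parse at most the first
// 500 kibibytes and re-fetch robots.txt after at most 24 hours.
constexpr std::size_t kMaxParsedBytes = 500 * 1024;
constexpr std::chrono::hours kMaxCacheAge{24};

struct CachedRobots {
  std::string body;                                   // truncated robots.txt body
  std::chrono::steady_clock::time_point fetched_at;   // when it was downloaded
};

// Stand-in for a real HTTP fetch of https://<host>/robots.txt.
std::string fetch_robots_txt(const std::string& /*host*/) {
  return "User-agent: *\nDisallow: /private/\n";
}

std::map<std::string, CachedRobots> g_cache;

// Returns the robots.txt rules to apply for `host`, re-fetching only when the
// cached copy is older than the 24-hour maximum caching time.
const std::string& robots_for(const std::string& host) {
  const auto now = std::chrono::steady_clock::now();
  auto it = g_cache.find(host);
  if (it == g_cache.end() || now - it->second.fetched_at > kMaxCacheAge) {
    std::string body = fetch_robots_txt(host);
    if (body.size() > kMaxParsedBytes) {
      body.resize(kMaxParsedBytes);  // only the first 500 KiB must be parsed
    }
    it = g_cache.insert_or_assign(host, CachedRobots{std::move(body), now}).first;
  }
  return it->second.body;
}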

The updated REP standard is currently in its draft stage, and Google is seeking feedback from developers. It wrote, “we uploaded the draft to IETF to get feedback from developers who care about the basic building blocks of the internet. As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right.”

To learn more, check out the official announcement by Google and the proposed REP draft.

Read Next

Do Google Ads secretly track Stack Overflow users?

Curl’s lead developer announces Google’s “plan to reimplement curl in Libcrurl”

Google rejects all 13 shareholder proposals at its annual meeting, despite protesting workers

Bhagyashree R
