Google open sources its robots.txt parser to make Robots Exclusion Protocol an official internet standard

Yesterday, Google announced that it has teamed up with Martijn Koster, the creator of the Robots Exclusion Protocol (REP), and other webmasters to make the 25-year-old protocol an internet standard. The REP, better known as robots.txt, has now been submitted to the IETF (Internet Engineering Task Force). Google has also open-sourced its robots.txt parser and matcher as a C++ library.

The REP was created back in 1994 by Martijn Koster, a software engineer known for his contributions to internet search. Since its inception, it has been widely adopted by websites to indicate whether web crawlers and other automatic clients are allowed to access a site.

When an automatic client wants to visit a website, it first checks the site's robots.txt file, which contains rules like these:

User-agent: *
Disallow: /

The User-agent: * line means that the rule applies to all robots, and Disallow: / means that a robot is not allowed to visit any page of the site.
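Google's newly open-sourced parser lets a crawler answer exactly this question in C++. The following is a minimal sketch; the class and method names (googlebot::RobotsMatcher, OneAgentAllowedByRobots) are taken from the google/robotstxt repository, but treat the exact signatures as an assumption and check the library's headers before relying on them.

// Minimal sketch: asking Google's open-sourced C++ parser whether a URL may be
// crawled under the rules above. The class and method names follow the
// google/robotstxt repository; verify them against the library's robots.h.
#include <iostream>
#include <string>

#include "robots.h"  // from the google/robotstxt library

int main() {
  // The robots.txt body fetched from the site, matching the example above.
  const std::string robots_txt =
      "User-agent: *\n"
      "Disallow: /\n";

  googlebot::RobotsMatcher matcher;
  const bool allowed = matcher.OneAgentAllowedByRobots(
      robots_txt, "ExampleBot", "https://example.com/some/page.html");

  std::cout << (allowed ? "allowed" : "disallowed") << std::endl;  // prints "disallowed"
  return 0;
}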

Despite being widely used on the web, the REP has never become an internet standard. With no rules set in stone, developers have interpreted the “ambiguous de-facto protocol” differently over the years, and it has not been updated since its creation to address modern corner cases. The proposed draft is a standardized and extended version of the REP that gives publishers fine-grained control over what they would like to be crawled on their site and potentially shown to interested users.

The following are some of the important updates in the proposed REP:

  • It is no longer limited to HTTP and can be used by any URI-based transfer protocol, for instance, FTP or CoAP.
  • Crawlers must parse at least the first 500 kibibytes of a robots.txt file. Defining a maximum file size ensures that connections are not kept open for too long, avoiding unnecessary strain on servers.
  • It defines a new maximum caching time of 24 hours, after which crawlers must re-fetch robots.txt rather than reuse a stale copy. This lets website owners update their robots.txt whenever they want while keeping crawlers from overloading sites with robots.txt requests (a sketch of this rule and the size limit follows this list).
  • It also defines a provision for cases where a previously accessible robots.txt file becomes inaccessible because of server failures. In such cases, pages that are known to be disallowed are not crawled for a reasonably long period of time.
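To make the size and caching rules concrete, here is a small, self-contained sketch of how a crawler might enforce them. The cache structure, the fetch_robots_txt stand-in, and the constants are hypothetical illustrations of the draft's 500 KiB parse limit and 24-hour cache lifetime, not part of any published API.

#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical limits taken from the draft: parse at most the first
// 500 kibibytes and re-fetch robots.txt after at most 24 hours.
constexpr std::size_t kMaxParsedBytes = 500 * 1024;
constexpr std::chrono::hours kMaxCacheAge{24};

struct CachedRobots {
  std::string body;                                   // truncated robots.txt body
  std::chrono::steady_clock::time_point fetched_at;   // when it was downloaded
};

// Stand-in for a real HTTP fetch of https://<host>/robots.txt.
std::string fetch_robots_txt(const std::string& /*host*/) {
  return "User-agent: *\nDisallow: /private/\n";
}

std::map<std::string, CachedRobots> g_cache;

// Returns the robots.txt rules to apply for `host`, re-fetching only when the
// cached copy is older than the 24-hour maximum caching time.
const std::string& robots_for(const std::string& host) {
  const auto now = std::chrono::steady_clock::now();
  auto it = g_cache.find(host);
  if (it == g_cache.end() || now - it->second.fetched_at > kMaxCacheAge) {
    std::string body = fetch_robots_txt(host);
    if (body.size() > kMaxParsedBytes) {
      body.resize(kMaxParsedBytes);  // only the first 500 KiB must be parsed
    }
    it = g_cache.insert_or_assign(host, CachedRobots{std::move(body), now}).first;
  }
  return it->second.body;
}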

The updated REP standard is currently in its draft stage, and Google is seeking feedback from developers. It wrote, “we uploaded the draft to IETF to get feedback from developers who care about the basic building blocks of the internet. As we work to give web creators the controls they need to tell us how much information they want to make available to Googlebot, and by extension, eligible to appear in Search, we have to make sure we get this right.”

To learn more, check out the official announcement by Google and the proposed REP draft.

Read Next

Do Google Ads secretly track Stack Overflow users?

Curl’s lead developer announces Google’s “plan to reimplement curl in Libcrurl”

Google rejects all 13 shareholder proposals at its annual meeting, despite protesting workers

Bhagyashree R
