If you have ever been a webmaster, you know there’s more to a website than just purchasing the domain and setting up the hosting. There are also the various files and lines of code that webmasters have to copy and paste between their own website and search engines, analytics services, and so on. One of these files is called ‘robots.txt’. It implements the Robots Exclusion Protocol (REP), which allows webmasters to exclude crawlers and other automatic clients from accessing their site for crawling purposes.
REP has been around for 25 years, and although it is recognized and used by almost half a billion websites, it has yet to become a universal standard. Google is now taking steps to make it an internet standard, and the search engine giant says it is open to input on the proposal.
According to the extract published by 9to5Google, REP is widely recognized and used by developers and webmasters, but it doesn’t address modern use cases. One such case is when a webmaster’s text editor includes BOM characters in the robots.txt file. Developers of crawlers and tools, in turn, face uncertainty about how to handle a file that could grow as large as hundreds of megabytes.
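To make the BOM case concrete, here is a minimal sketch of how a parser might cope with a robots.txt file that an editor saved with a leading UTF-8 byte order mark. The function name `strip_utf8_bom` and the sample content are hypothetical, and dropping the BOM is an assumption about one reasonable behavior, not something the draft is confirmed here to prescribe.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 byte order mark, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Hypothetical file content as an editor might save it, BOM included.
sample = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /private/\n"

# Without stripping, the first line would start with an invisible character
# and might not be recognized as a "User-agent" line by a strict parser.
first_line = strip_utf8_bom(sample).decode("utf-8").splitlines()[0]
```

Python's built-in `utf-8-sig` codec does the same job when decoding, so `sample.decode("utf-8-sig")` would be an equivalent shortcut.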
REP was introduced in 1994. Google has now joined forces with the original author of the protocol and has suggested a few changes that could help make REP an internet standard. The proposal draws on real-world experience with robots.txt in order to update the protocol.
According to the draft, the protocol lets webmasters decide how much information is made available to Googlebot and other crawlers, which in turn influences what appears in Google Search. Search Engine Journal points out that REP can be used with any URL-based transfer protocol: it is not limited to HTTP, and CoAP and FTP can use it too.
Furthermore, developers must parse at least the first 500 kibibytes of a robots.txt file, if not the entire file. Webmasters also get the flexibility to update their robots.txt file whenever they need to, since the maximum caching time proposed by Google is 24 hours. Finally, if a server failure makes robots.txt inaccessible, pages that were known to be disallowed are not crawled for a long period of time. The proposed standard for REP is awaiting feedback from developers, and Google has stressed that it is open to input before the protocol is finalized.
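The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the constant and function names are hypothetical, the rules are represented as a plain list of disallowed paths, and the failure handling simply keeps the last known rules without modeling how long "a long period of time" is.

```python
from typing import List, Optional

MAX_PARSE_BYTES = 500 * 1024       # 500 kibibytes, per the proposal
MAX_CACHE_SECONDS = 24 * 60 * 60   # 24-hour maximum caching time


def bytes_to_parse(body: bytes) -> bytes:
    """Return the part of a robots.txt body a crawler is required to parse."""
    return body[:MAX_PARSE_BYTES]


def cache_is_fresh(fetched_at: float, now: float) -> bool:
    """True while a cached copy is younger than the 24-hour maximum."""
    return (now - fetched_at) < MAX_CACHE_SECONDS


def rules_after_fetch(
    new_rules: Optional[List[str]],
    cached_rules: Optional[List[str]],
    fetch_failed: bool,
) -> Optional[List[str]]:
    """On a server failure, keep honoring the last known disallow rules.

    Assumption: falling back to the cached list is one simple way to keep
    previously disallowed pages from being crawled.
    """
    if fetch_failed and cached_rules is not None:
        return cached_rules
    return new_rules
```

For example, `bytes_to_parse` applied to a 600 KiB body returns only its first 512,000 bytes, and `cache_is_fresh(0.0, 3600.0)` is true because only one hour has passed.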