Role of the robots.txt File
The robots.txt file provides necessary information to search engines for properly crawling and indexing a website. It is a protocol for instructing search engines to exclude certain content while indexing. Now this is important: the robots.txt file is not intended for security. Secure files, or anything that you do not want published online, must be behind password protection. The robots.txt protocol is typically used to block access to development sites, or to non-search content like scripting files or redundant, duplicate information. Remember, this is only a protocol. Not all bots will follow it. So do not use this for any type of security at all.
Search Engine User-agent
Now, the robots.txt file can be tricky. I've seen companies all over the world forget about this tiny little file, and when they forget about it, they end up blocking access to the search engines, which causes their site to disappear from the results, and they won't remember why. All because someone made a mistake with this little file. Now, if you want the search engines to spider your entire website, the format is this: User-agent: * (which means any user agent), and then Disallow: on the next line. And you'll notice that nothing follows the Disallow command.
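Written out as a file at the root of the site, that allow-everything rule looks like this:

```
User-agent: *
Disallow:
```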
That means that any user agent is able to access anything on the website. Now, suppose I want to allow access except for a directory that holds duplicate files, such as printer-friendly documents that are the same as the web pages; I don't want two versions of the same page in the search engines. So I'll add that directory to my Disallow command, like this: User-agent: * and then Disallow: /printfriendly/. This now makes the directory of printer-friendly documents disallowed, so it won't be crawled or included in the search results. Now, if I have a site in development that I don't want published yet, I can disallow the entire website like this: User-agent: * and then Disallow: /.
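Here are those two rules as they would appear in a robots.txt file (the /printfriendly/ directory name is just an example):

```
# Block only the directory of duplicate printer-friendly pages
User-agent: *
Disallow: /printfriendly/
```

```
# Block the entire site (development sites only!)
User-agent: *
Disallow: /
```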
Mistake to avoid
Now, when you add that forward slash, it disallows the entire website from being indexed at the root level; nothing is allowed to be spidered or accessed by search engines. This is where a lot of companies get into trouble: they forget to remove or change this file once they go live. And so, if they forget about it once they go live, the new website doesn't get indexed.
So if you use this as a method of keeping your development site from being indexed, make sure you add it to your list of go-live steps, because so much can go wrong with an improperly formatted robots.txt file. Google Webmaster Tools provides a robots.txt testing tool. Simply use the test to ensure that your formatting is correct, and that you are allowing access to the parts of the site that you desire.
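If you'd rather sanity-check your rules locally, Python's standard-library urllib.robotparser can parse a robots.txt file and report what a crawler that honors the protocol would be allowed to fetch. This is a minimal sketch; the rules and URL paths are just examples:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt rules: block the duplicate printer-friendly pages.
rules = """
User-agent: *
Disallow: /printfriendly/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A regular page is allowed; the printer-friendly copy is not.
print(parser.can_fetch("*", "/about.html"))                # True
print(parser.can_fetch("*", "/printfriendly/about.html"))  # False
```

A quick check like this catches the classic go-live mistake: if a page you expect to be indexed comes back as not fetchable, your Disallow lines are broader than you intended.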