Configure a robots.txt file for your Proxy

An important but often overlooked step in properly installing your proxy is to add a simple robots.txt file to your root directory. This file tells robots (the crawlers that search engines use to scan sites) which parts of your site you do and do not want crawled and indexed.

There are two reasons to take the time to do this. Firstly, by preventing robots from crawling proxified sites, you avoid spending bandwidth and server resources on crawls that will not directly benefit your ranking. Secondly, it prevents proxified pages from ranking under your URL. While at first this may sound counterintuitive (surely it’s better to rank for lots of content), in fact you risk copyright infringement and duplicate content penalties. The latter may mean your site is blacklisted from search engines altogether.

Creating a robots.txt file is simple and can be done in any text editor (I use Notepad). There are two elements to the code below. The first line, ‘User-agent’, specifies which robots/spiders should follow your instructions. This can be useful if, for instance, you wanted to allow the Google image bot access to your images folder but hide it from all other robots. For our purposes, though, we simply use an asterisk, which is computer-speak for ‘all robots’.
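
To make the image bot scenario concrete, here is a minimal sketch in Python that uses the standard library's robots.txt parser to show how per-robot rules are read. The /images/ folder, the logo.png file and the name SomeOtherBot are made up for illustration; Googlebot-Image is the user-agent name Google documents for its image crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: /images/ is an assumed folder name, not one from this article.
# An empty Disallow line means "nothing is off limits" for that robot.
RULES = """\
User-agent: Googlebot-Image
Disallow:

User-agent: *
Disallow: /images/
"""


def allowed(rules, user_agent, url):
    """Return True if the robots.txt text in `rules` lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URL only; logo.png is a made-up file name.
IMAGE_URL = "http://www.yourproxy.com/images/logo.png"

print("Googlebot-Image allowed:", allowed(RULES, "Googlebot-Image", IMAGE_URL))
print("SomeOtherBot allowed:", allowed(RULES, "SomeOtherBot", IMAGE_URL))
```

A robot follows the group that names it and falls back to the asterisk group only when no named group applies, which is why the image bot gets through here while everything else is kept out.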

The second line, ‘Disallow:’, tells the bot which URLs it is banned from crawling/indexing. Because the different proxy scripts all work slightly differently, simply copy the code below for whichever proxy script you are running.

CGI Proxy

User-agent: *
Disallow: /nph-proxy.pl/

PHProxy

User-agent: *
Disallow: /index.php?q*

Glype

User-agent: *
Disallow: /browse.php
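
If you would like to sanity-check rules like these before uploading them, the following rough sketch runs the CGI Proxy and Glype rules through Python's standard robots.txt parser. The proxified URL shapes and the name ExampleBot are assumptions made for illustration, so substitute URLs your own proxy actually produces. The PHProxy rule is left out because this parser does simple prefix matching and may not treat the trailing asterisk the way search engine crawlers do.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the snippets above, one list per proxy script.
CGIPROXY_RULES = ["User-agent: *", "Disallow: /nph-proxy.pl/"]
GLYPE_RULES = ["User-agent: *", "Disallow: /browse.php"]


def is_blocked(rule_lines, url, user_agent="ExampleBot"):
    """Return True if the rules forbid `user_agent` (a made-up name) from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(rule_lines)
    return not parser.can_fetch(user_agent, url)


# Assumed examples of proxified URLs; real ones depend on your script's settings.
CGIPROXY_PAGE = "http://www.yourproxy.com/nph-proxy.pl/en/00/http/example.com/"
GLYPE_PAGE = "http://www.yourproxy.com/browse.php?u=example"
HOME_PAGE = "http://www.yourproxy.com/"

print("CGI Proxy page blocked:", is_blocked(CGIPROXY_RULES, CGIPROXY_PAGE))
print("Glype page blocked:", is_blocked(GLYPE_RULES, GLYPE_PAGE))
print("Home page blocked:", is_blocked(GLYPE_RULES, HOME_PAGE))
```

The point of the check is that proxified pages come back as blocked while your own home page stays crawlable.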

Once you have uploaded the robots.txt file to your root directory, check that it is accessible by visiting http://www.yourproxy.com/robots.txt. If you can read the file, it has been uploaded correctly and spiders will check it the next time they crawl your site. If you cannot access the file, please check that it has been uploaded to the correct directory (usually something like public_html or htdocs). This is the same directory where your site's index page is located.
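
The same check can be scripted if you prefer not to use a browser. This is only a sketch using Python's standard library; the function names are my own, and it treats "accessible" as meaning the server answers with HTTP 200 and a non-empty file, which is an assumption rather than a formal rule.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def robots_url(site_root):
    """Build the robots.txt URL for a site, e.g. http://www.yourproxy.com/."""
    # A leading slash makes urljoin replace any path already on the address.
    return urljoin(site_root, "/robots.txt")


def robots_is_readable(site_root, timeout=10):
    """Return True if robots.txt is served with HTTP 200 and has some content."""
    try:
        with urlopen(robots_url(site_root), timeout=timeout) as response:
            return response.status == 200 and bool(response.read().strip())
    except (OSError, ValueError):
        # OSError covers HTTP errors such as 404, DNS failures and timeouts;
        # ValueError covers addresses that are not valid URLs.
        return False
```

Calling robots_is_readable("http://www.yourproxy.com/") with your own address should return True once the file is in the right directory.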

This simple step, which will take no more than a minute to complete, will save server resources and a lot of potential headaches in one fell swoop.
