Sitemap XML

Provided that your web hosting supports Python (or you are able to add support for it), here is a relatively simple way to add your own XML sitemap to your site, one that will automatically update itself on a regular basis. The end result is an XML sitemap service of your own that you would otherwise have to pay for. Just how much you would have to pay someone else to regenerate your sitemap at regular intervals depends on how big your site is and how frequently you update it.

The first thing that you need to do is to get the Python script that you will use to generate your sitemap from SourceForge.

There are four files in this download that we will upload to our site: the three .py files, which contain the Python code, and a modified version of the config.xml file, which will contain the instructions on what to add to the sitemap. I am not sure whether all three Python files are really required, but since I am about to suggest that you load them outside the public part of your site, the extra two files are not going to hurt anything by being there.

The idea is to run the sitemap generator on a regular basis from a cron job. This avoids the need to put the Python files on your public site and also means that you do not need command line access to your hosting in order to run it.

The first thing to do though is to rename the example_config.xml file to config.xml and update its content so that it will tell the Python script what entries to generate for the sitemap.

The first section to update is the entries in the site tag. For the purpose of what follows I am going to assume that your web site is and that you have your hosting on the server in a folder called home/example which contains a public web folder called public_html. Simply substitute the real names for your site and folder locations for those entries in all the following references.

<site base_url=""

Next we use a directory tag to identify what directory contains the site we want to generate the sitemap for.


Finally we add a number of filter tags to drop all the files and folders that we don't want in the sitemap. Here are a few examples to get you started since your exact requirements depend on what files you have on your site.

Remove all hidden files (those where the filename starts with a dot).

<filter  action="drop"  type="regexp"    pattern="/\.[^/]*"  />

Remove files with a specific extension such as .js (simply repeat and replace the extension to hide additional extensions).

<filter  action="drop"  type="regexp"    pattern="/.*?\.js"  />

We'll also want to remove all the folders that are not supposed to contain web pages that are accessed via that domain and folder combination. For example if all our images are in an img folder then we would exclude that (we would also exclude any folders containing add-on or sub-domains).

<filter  action="drop"  type="regexp"    pattern="/.*\/img\/.*"  />
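If you want to see what a set of filter patterns would drop before uploading anything, you can try them against a few sample addresses. The following is only a rough sketch: it assumes the patterns are matched anywhere within the full URL (search semantics), which may not be exactly how the generator applies them, and www.example.com and the sample paths are placeholders I have made up.

```python
import re

# The three drop patterns from the filter tags above.
DROP_PATTERNS = [
    r"/\.[^/]*",      # hidden files (name starts with a dot)
    r"/.*?\.js",      # files with a .js extension
    r"/.*\/img\/.*",  # anything inside an img folder
]


def is_dropped(url, patterns=DROP_PATTERNS):
    """Return True if any pattern matches somewhere in the URL.

    Assumption: search semantics on the full URL. The real generator
    may apply its filters differently.
    """
    return any(re.search(pattern, url) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    samples = [
        "http://www.example.com/index.html",
        "http://www.example.com/.htaccess",
        "http://www.example.com/js/menu.js",
        "http://www.example.com/img/logo.png",
    ]
    for url in samples:
        print("drop" if is_dropped(url) else "keep", url)
```

One thing this shows up: under that assumption the .js pattern is not anchored to the end of the address, so it would also drop a file such as data.json. Adding $ to the end of the pattern would narrow it to addresses that actually end in .js.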

Once you have created your config.xml file you are ready to upload it and the three .py files to your hosting. I suggest that you create a py folder on your account that is outside the public_html folder and upload the four files there.
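Before uploading, it can save a failed cron run to check that config.xml is well-formed and that each regexp filter pattern at least compiles. Here is a minimal sketch of such a check using only Python's standard library. It is not part of the sitemap generator: check_config is my own name for it, it assumes the filter tags sit inside the site tag, and the base_url in the sample is a placeholder.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample config; the base_url is a placeholder.
SAMPLE = r'''<site base_url="http://www.example.com/">
  <filter action="drop" type="regexp" pattern="/\.[^/]*" />
  <filter action="drop" type="regexp" pattern="/.*?\.js" />
</site>'''


def check_config(xml_text):
    """Return a list of problems found; an empty list means none detected.

    Only checks XML well-formedness and that regexp patterns compile.
    It does not know which tags or attributes the generator requires.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    problems = []
    for node in root.iter("filter"):
        if node.get("type") != "regexp":
            continue  # only regular expression filters are checked here
        pattern = node.get("pattern", "")
        try:
            re.compile(pattern)
        except re.error as exc:
            problems.append(f"bad pattern {pattern!r}: {exc}")
    return problems


if __name__ == "__main__":
    print(check_config(SAMPLE) or "no problems detected")
```

To check your real file, read its contents and pass the text to check_config in place of SAMPLE.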

Now all that remains is to create the cron job that runs the script on a regular basis. The following will run it weekly at ten past midnight each Saturday morning.

10 0 * * 6
python /home/example/py/
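If the five schedule fields are unfamiliar, they read as minute 10, hour 0, any day of the month, any month, and day of the week 6 (Saturday). As a rough illustration only, this Python sketch works out when that schedule would next fire. The name next_run is one I have made up, and it handles just this one schedule rather than cron syntax in general.

```python
from datetime import datetime, timedelta


def next_run(now):
    """Next time the schedule '10 0 * * 6' fires strictly after now."""
    # Start from 00:10 on the current day.
    candidate = now.replace(hour=0, minute=10, second=0, microsecond=0)
    # Cron numbers Saturday as 6; Python's weekday() numbers it as 5.
    days_ahead = (5 - candidate.weekday()) % 7
    candidate += timedelta(days=days_ahead)
    # If this week's slot has already passed, move on a week.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    print(next_run(datetime.now()))
```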

With the files uploaded and the cron job set up, you now have an XML sitemap for your site that will regularly update itself to pick up the new pages that you add, and that will handle as many pages as you add to your site.

That just leaves letting the search engines know about the existence of your sitemap. The script itself notifies Google every time it runs unless you add --testing to the end of the command to suppress that notification. You can also make certain Google knows about the sitemap by logging in to your Google Webmaster Tools account (if you have one) and telling Google about the sitemap there. You can also add a Sitemap entry to your robots.txt file, which tells all the search engines where to find your sitemap.
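The robots.txt entry is a single line that starts with Sitemap: followed by the full address of the sitemap file. If you would rather add it with a script than by hand, something along these lines should do it. The function name and the address in the example are placeholders of mine, so substitute wherever your own sitemap actually ends up.

```python
def add_sitemap_line(robots_text, sitemap_url):
    """Return robots.txt content with a Sitemap line for sitemap_url.

    The line is added at the end, and only if it is not already present.
    """
    entry = "Sitemap: " + sitemap_url
    if entry in (line.strip() for line in robots_text.splitlines()):
        return robots_text  # already listed, leave the content alone
    if robots_text and not robots_text.endswith("\n"):
        robots_text += "\n"
    return robots_text + entry + "\n"


if __name__ == "__main__":
    # Hypothetical existing robots.txt content and sitemap address.
    current = "User-agent: *\nDisallow:\n"
    print(add_sitemap_line(current, "http://www.example.com/sitemap.xml"))
```

The function only works on text, so reading and writing the actual robots.txt file is left to you. That keeps it easy to try out without touching the live site.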

