November 28, 2008 – 10:06 am
A couple of months ago, i completed a web site for a client who sells autographed photos of bands and musicians. It’s basically a gallery of photos with the option to buy them. I designed the software for this site with search engine optimisation in mind, but the requirement to have the photos in alphabetical order complicated navigation considerably, and the solution i came up with meant that the URLs of the individual photo pages changed whenever items were added to or deleted from the catalogue. So clicking on a Google search result for a particular band’s photo would most likely land you on a photo of a different band – and that’s not good for business!
I didn’t realise the seriousness of this until after the site had gone live – and, consequently, until after it had been fully indexed by Google. That meant that even after i’d fixed the problem, the indexes in Google and the other search engines would take some time to correct themselves. To try to speed up that process, i decided to provide the crawlers with a sitemap.
Sitemaps are only really necessary when some of the URLs you want indexed aren’t reachable, directly or indirectly, via links from the home page – pages like that will never be found by Google and the rest – so there hadn’t been any need to build a sitemap into the original design. But with a bit of luck, submitting a sitemap to Google might trigger a full re-indexing – particularly if it showed that pages had changed since Google last crawled the site.
If i was going to add a sitemap, it might as well be one that’s always available and up to date – in other words, automatically generated. There are two ways to do that: either generate a new static sitemap every time something changes, or generate the sitemap on the fly every time Google or another search engine asks for it.
Generating the sitemap on the fly is simpler, because it only requires writing a single standalone page of php rather than adding code to every place where the database gets changed. It’s hard to say which method would put more load on the server – that depends on how often the database changes and how often spiders crawl the site and ask for the sitemap – but it probably wouldn’t make much difference either way, so i decided to do it the simple way!
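In outline, that standalone page just sends an XML content type and prints the wrapper, with one <url> chunk per page in between. A minimal sketch of the skeleton – the file name sitemap.php is my own placeholder, not necessarily what the site uses:

<?php
// sitemap.php - regenerated from the database on every request
header('Content-Type: application/xml; charset=UTF-8');

echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

// ... one <url> chunk per page goes here (see below) ...

echo '</urlset>' . "\n";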
Even though the field is optional, a useful sitemap should include the date, and maybe the time, when each page was last modified. So, to start with, i needed to add a column to the database table that holds the information about each item in the catalogue. The simplest way to do this is with a mysql timestamp that automatically updates itself each time the record is altered. I used the following mysql data type declaration:
MTime TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
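For an existing table, that’s a one-off statement at the mysql console – something like this, using the Photos table from the update below:

ALTER TABLE Photos ADD COLUMN MTime TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;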
Then i had to set that column to the current time in all rows in the table. I did that at the mysql console too:
update Photos set MTime = NOW();
Then i consulted Google’s sitemap specifications to find out what format the sitemap should be in. This is Google’s example, from that page:
<?xml version="1.0" encoding="UTF-8"?> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> <url> <loc>http://www.example.com/</loc> <lastmod>2005-01-01</lastmod> <changefreq>monthly</changefreq> <priority>0.8</priority> </url> </urlset>
That’s fairly straightforward: all i needed to do was write a script that iterates through all the items in the catalogue, constructs each item’s URL for the <loc> tags, and inserts its modification time between the <lastmod> tags.
Before that, though, i coded up a small section for the static pages – “About Us”, “Contact”, etc. I did that by defining an array containing each of their URLs and iterating through it to output a <url> chunk for each one. I didn’t bother with a lastmod time for those pages.
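Something along these lines – the URLs here are stand-ins for the real ones, and the priority and changefreq values are the ones described in the next paragraph:

// Static pages: no lastmod, low priority
$staticPages = array(
    'http://www.example.com/about.php',
    'http://www.example.com/contact.php',
);
foreach ($staticPages as $url) {
    echo "<url>\n";
    echo '  <loc>' . htmlspecialchars($url) . "</loc>\n"; // sitemap URLs must be entity-escaped
    echo "  <changefreq>monthly</changefreq>\n";
    echo "  <priority>0.5</priority>\n";
    echo "</url>\n";
}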
After the static pages, i wrote a section of code to generate a chunk for each of the catalogue pages. Again i didn’t bother with a lastmod time, as i’m not worried about Google indexing those pages – it’s the individual photo pages that are important. I gave the static pages a priority of 0.5 and a monthly changefreq, the catalogue pages a priority of 0.7 and also a monthly changefreq, and the photo pages a priority of 0.9 and a weekly changefreq. This probably won’t make much difference, but it doesn’t do any harm.
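The catalogue section is much the same, just driven by a query instead of a hard-coded array. A sketch, assuming an open mysqli connection in $db – the Categories table and the URL scheme here are made up for illustration, not the site’s real ones:

// Catalogue pages: one per category ($db, Categories and the
// URL scheme are placeholders, not the site's real names)
$result = mysqli_query($db, 'SELECT CategoryID FROM Categories');
while ($row = mysqli_fetch_assoc($result)) {
    echo "<url>\n";
    echo '  <loc>http://www.example.com/catalogue.php?cat=' . (int)$row['CategoryID'] . "</loc>\n";
    echo "  <changefreq>monthly</changefreq>\n";
    echo "  <priority>0.7</priority>\n";
    echo "</url>\n";
}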
I had a problem when it came to the lastmod time for the photo pages, though. The date/time format that the sitemaps standard uses (the W3C Datetime format) is slightly different from mysql’s timestamp format. Mysql uses YYYY-MM-DD hh:mm:ss, but W3C wants YYYY-MM-DDThh:mm:ss – i.e., with a “T” between the date and the time, not a space. W3C’s standard also insists that a full date/time carries a timezone designator (TZD) – “Z” for UTC, or an offset like +hh:mm.
However, the date on its own (YYYY-MM-DD) is acceptable too, and that’s what i decided to use. The exact time doesn’t really matter, and the server’s in another country, so i couldn’t be bothered working the time zone into the equation. But i did need to separate the date from the time after retrieving it from the database record – a simple matter of using explode() on the space before inserting the date into the XML output.
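So the chunk for the photo pages ends up looking something like this – Photos and MTime are the real table and column, but PhotoID and the photo.php URL scheme are placeholders again:

// Photo pages: lastmod comes from the MTime column
$result = mysqli_query($db, 'SELECT PhotoID, MTime FROM Photos');
while ($row = mysqli_fetch_assoc($result)) {
    // MTime comes back as "YYYY-MM-DD hh:mm:ss" - keep just the date part
    list($lastmod) = explode(' ', $row['MTime']);
    echo "<url>\n";
    echo '  <loc>http://www.example.com/photo.php?id=' . (int)$row['PhotoID'] . "</loc>\n";
    echo '  <lastmod>' . $lastmod . "</lastmod>\n";
    echo "  <changefreq>weekly</changefreq>\n";
    echo "  <priority>0.9</priority>\n";
    echo "</url>\n";
}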
Well, it would have been much better to have got the design right in the first place, so that none of this was needed – but now the site’s got a dynamic sitemap that Google’s Webmaster Tools tells me it’s happy with. Hopefully it will trigger re-indexing a bit more quickly than would have happened otherwise. We’ll see…