Friday, February 10, 2006

The ReRight Way to Do It

I see it every day: someone proposing a url structure in the forums and asking for advice on whether it'll get crawled or if there are any visual flaws. So let this be an end-all post on the subject.

Out of all the experimenting with urls, I've found that
.com/ to
works the best at being crawled regularly and deeply. PERIOD.

Now let me go over the problems with the other structures that I see every day as I cruise the forums.

The Directory Structure & PR Dither Rewrite
This is like the Camaro Mullet (see 2nd image) of mod_rewrites: old school, and it looked cool back in the day.

.com/ (PR5)
/folder/file.html (PR4)
/folder/folder/file.html (PR3)
/folder/folder/folder/file.html (PR0)

In the directory tree structure the root links to the first folder, the file in the first folder links to the file in the 2nd folder, and so on. Once you get to 3 folders deep you lose your PR in most cases. This is because a puny PR 3 isn't strong enough to warrant a crawl that deep into a site. You could even see this in some of the deeper sections of the Yahoo! directory. The further you went the lower the PR, and some sections were so deep off the root they didn't even warrant a cache.

Of course the solution to the above would be to link to every page in the site on every tier, or to use the high PR from a root sitemap to feed spiders deeper. This also falls apart if you have a large site. A good example would be a country: USA (1 page) >> State (50) >> County (~3250) >> City (17,500*)

Obviously no single page could hold 20,000+ links and be crawled. Plus browsers would strain to render that much code, and visitors would still have to decipher all that navigation. It's just not logical.

Junk Rewrites
Like The Tron Guy, it should be avoided at all costs.
Nice Moose Knuckle by the way Jay!

These are the cases where the developer stuffs every variable that isn't needed into the url. The effect is a nonsensical jump from the root to a file 3-5 folders deep.

.com/ (PR 5) | (PR 3)
/folder/folder/folder/folder/folder/file.html (PR 2-3) | (PR 0)

The problem is that a site has to gain a significant amount of PR on the home page just to push the spiders into the rest of the site. This is why there are so many complaints from developers who have switched to mod rewrite static urls: "I can't get my new urls to get indexed". You could be waiting months or years, depending on how fast you can get inbound links.

It's not that they can't get indexed, it's just that your home/root pages aren't powerful enough to warrant a deep crawl. I see this a lot with shopping cart/cms add-ons for mambo & oscommerce. For windows servers I would suggest using ISAPI Rewrite. ISAPI Rewrite gives you the same functionality and control as mod_rewrite does for Apache.

The Tried and True Solution

Short, simple, and 1 step away from the root at all times. I've come to this because I did all of the above and learned the hard way.

For example, the site ~www.sbdpro.com was patterned after the Yahoo! Directory. It has consistently had a PR 4 home page for the last 2 years. But with that structure it could never get the spiders deeper than 2 categories, or 2 folders, deep. Since the deepest part of that directory is 8 tiers down, there really was no solution and no point in keeping this url structure.

It was changed about a year ago. All links from the root now go to /directory/file-name.htm. It didn't matter how many tiers down you went: all categories were now 1 tier away from the root, and all subcategories were one tier from the root. The crosslinking all stayed, and the site had no index problems at all.
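A rule for this kind of flat structure might look something like the sketch below. It assumes an .htaccess file in the site root. The script name category.php and the slug parameter are made up for illustration; they are not what the site actually runs.

```apache
# .htaccess in the site root (hypothetical example)
RewriteEngine On

# Send every /directory/file-name.htm request to one script,
# no matter how many tiers down the category really sits.
# category.php and "slug" are assumed names, not the real site's setup.
RewriteRule ^directory/([a-z0-9-]+)\.htm$ category.php?slug=$1 [L,QSA]
```

The point is that the tier a category lives on is handled by the script and the crosslinking, not by the number of folders in the url.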

The effect was astonishing. The PR still drops out in the 3rd tier with a PR 2, yet the spiders still followed the links through 5 more tiers of crosslinking & PR 0 pages to reach the 8th tier down.

My advice to all new mod rewrites.

1. Don't get married to your first try at mod rewritten urls. Change them, because you're in it for the long run (hopefully). The above site ranked on page 1 for "small business directory" consistently in Yahoo and MSN during the change. It even jumped to page 2 in Google for that term and stayed there for a stretch.

2. Keep your urls simple and close to the root. You will see more spider activity and have fewer headaches.

3. Make sure your RewriteRule syntax is optimal and you are not bogging down your server. See this thread here >>
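
As a rough idea of what lean rules can look like for point 3, here is a minimal sketch for an .htaccess file. Again, category.php and the slug parameter are hypothetical names used only for the example, and your own patterns will differ.

```apache
RewriteEngine On

# Leave real files and folders alone so images, CSS and the like
# never run through the rules below.
RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^ - [L]

# Anchor the pattern with ^ and $ so it only matches the urls you mean,
# and use [L] so no later rules are tried once this one matches.
# category.php and "slug" are assumed names for illustration.
RewriteRule ^directory/([a-z0-9-]+)\.htm$ category.php?slug=$1 [L]
```

If you have access to the main server config, putting the rules there instead of in .htaccess generally saves work too, since .htaccess files are read again on every request.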

Other Resources

Webforgers.net - Mod Rewrite Tutorials
Ilovejackdaniels.com - Mod Rewrite Cheat Sheets

~ = 3rd party database of counties I bought.
* = From spidering Yahoo directory for city names under the state sections.


