Mar 27 2007
Using Robots.txt to Keep Your Joomla Pages Under Control

The technical side of Joomla SEO can be summed up in one sentence: keep your URLs under control.


Joomla really is a powerful tool for creating content-rich websites, but it's also easy to end up with a whole lot of useless URLs.


In today's post, we'll use MosTree as an example of how to manage Joomla URLs using the wonderful-sounding robots.txt file.


In recent weeks, I've blogged about a few ways to make MosTree more Search Engine Friendly (first update, second update), and we've also talked about how having a few high-quality pages on your site is much better than having a lot of low-quality pages. This post is a follow-up to both of those.

A Little Background for this Example

Last year we launched JoomlaYellowPages.com. The site has done well, and now lists nearly 200 Joomla companies worldwide. It's also done well in SEO terms. Search for Africa Joomla, Asia Joomla, Europe Joomla or any other geographic region plus Joomla, and there's a good chance that JoomlaYellowPages.com will be high in the results.

What Was the Problem?


We noticed very early on that Google was also indexing multiple pages for each listing, including the contact form, recommend page and others. One company = 4 or 5 URLs.


For example:


joomlayellowpages.com/listings/north_america/united_states/georgia/alledia/details/
joomlayellowpages.com/listings/north_america/united_states/georgia/alledia/contact/
joomlayellowpages.com/listings/north_america/united_states/georgia/alledia/review/
joomlayellowpages.com/listings/north_america/united_states/georgia/alledia/claim/

What Was the Solution?

We used our robots.txt file, located in the root of the site, to stop Google from indexing all the extra pages.


Normally, if you have a component producing hundreds of extra URLs, you can simply block the whole script from being indexed:


Disallow: /badcomponentforseo/
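
If you'd like to check a simple rule like this before uploading it, here is a minimal sketch using Python's standard urllib.robotparser module. The `User-agent: *` line and the `allowed` helper are our own assumptions for illustration; a Disallow line needs to sit under a User-agent line to take effect, and your file may name a specific robot instead.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: the rule above, placed under a
# catch-all User-agent line (an assumption, not the site's real file).
rules = """\
User-agent: *
Disallow: /badcomponentforseo/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

def allowed(path: str, agent: str = "Googlebot") -> bool:
    """Return True if the given robot may fetch the path (hypothetical helper)."""
    return parser.can_fetch(agent, path)

print(allowed("/badcomponentforseo/page"))  # blocked by the rule
print(allowed("/listings/"))                # not covered by the rule
```

This only covers plain prefix rules such as the one above.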


However, in this case we needed a scalpel rather than a sledgehammer. We wanted Google to index only certain parts of MosTree and ignore the rest. So we used the wildcard * symbol to block all URLs with a specific beginning and ending, regardless of what was in the middle:


Disallow: /listings/*/*/*/review
Disallow: /listings/*/*/review
Disallow: /listings/*/*/*/Add_Listing
Disallow: /listings/*/*/Add_Listing
Disallow: /listings/*/Add_Listing
Disallow: /listings/*/*/*/Add_Category
Disallow: /listings/*/*/Add_Category
Disallow: /listings/*/Add_Category
Disallow: /listings/*/*/*/contact
Disallow: /listings/*/*/contact
Disallow: /listings/*/*/*/recommend
Disallow: /listings/*/*/recommend
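
To see which URLs rules like these catch, here is a rough sketch in Python. It assumes the wildcard behavior Google documents: * matches any run of characters (slashes included) and each rule is matched from the start of the path. Python's standard urllib.robotparser does plain prefix matching, so the sketch translates each rule into a regular expression instead. The function names and the shortened rule list are ours, for illustration only.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a Disallow path containing * wildcards into a regex."""
    anchored = rule.endswith("$")      # a trailing $ pins the end of the URL
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

# A shortened copy of the rules above (illustrative subset).
DISALLOW = [
    "/listings/*/*/*/review",
    "/listings/*/*/*/contact",
    "/listings/*/*/*/recommend",
]

def is_blocked(path: str, rules=DISALLOW) -> bool:
    """Return True if any rule matches the path from its beginning."""
    return any(rule_to_regex(rule).match(path) for rule in rules)

base = "/listings/north_america/united_states/georgia/alledia"
print(is_blocked(base + "/contact/"))  # True: matches the contact rule
print(is_blocked(base + "/details/"))  # False: details pages stay crawlable
```

Real crawlers may differ in the details, so treat this as a quick sanity check rather than a guarantee of how any search engine will behave.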

How Can I Apply This To My Site?

Regularly check what kind of pages Google is indexing on your site and look for patterns. If there are a lot of PDF pages, or dozens of useless links from a particular component, you can act quickly to block them out with robots.txt. Use the site:mydomain.com search function or a tool such as WebCEO.com.
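
One simple way to look for patterns is to group the indexed URLs by their final path segment. The sketch below assumes you have already collected the URLs by hand (for example, copied from a site: search); the function name and the sample list are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def count_by_last_segment(urls):
    """Count URLs by the last segment of their path, e.g. 'contact'."""
    counts = Counter()
    for url in urls:
        path = urlparse(url).path.strip("/")
        last = path.rsplit("/", 1)[-1] if path else "(root)"
        counts[last] += 1
    return counts

# Hypothetical sample of indexed URLs, modeled on the listing URLs above.
indexed = [
    "http://joomlayellowpages.com/listings/north_america/united_states/georgia/alledia/details/",
    "http://joomlayellowpages.com/listings/north_america/united_states/georgia/alledia/contact/",
    "http://joomlayellowpages.com/listings/north_america/united_states/georgia/alledia/review/",
    "http://joomlayellowpages.com/listings/north_america/united_states/georgia/alledia/claim/",
]

for segment, total in count_by_last_segment(indexed).most_common():
    print(segment, total)
```

A segment that shows up once per listing and adds nothing for visitors is a good candidate for a Disallow rule.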


Among the most important things you can do is check your pages that are in Google's supplemental index. This is where you'll find lots of your low-quality pages, ripe for removal by robots.txt. If the pages don't contain useful information, dump them.

Read More About Robots.txt

Originally the wildcard wasn't supported by robots.txt, but that has since changed. Both Google and Yahoo now recognize it.


Comments (6)
...
written by Zorro, March 27, 2007
Excellent tip, I didn't know about the wildcard. Thanks!
While you're at it, consider adding Joomla's index2.php to robots.txt as well ...

Kind regards from Germany.
...
written by steve, March 27, 2007
Hi Zorro

You're right - that's a great extra tip. Someone left a comment on this site that their PDF pages were ranking higher than their real pages. Adding index2.php to robots.txt is a great solution.
...
written by Hummerbie, March 27, 2007
Here is another tip: if your site is focussed on images, do what the SEF patch from Joomlatwork does, and remove the /disallow/ on the media directory.
Using descriptive image names will give you some extra traffic.

As for the PDF problems, you should disable the PDF function in the Global configuration.
The generated pages in PDF are dead ends, because your visitor has no way to click to the homepage or other pages on your site.
Besides that, how many people use the button for PDF? Mostly they hit the Print button.
...
written by steve, March 27, 2007
Hi Hummerbie - that's a great tip. You're thinking that this will bring extra traffic from Google Images, right?

...
written by Brian Teeman, March 27, 2007
The special searches I do to find new and exciting sites for Joomla Weekly News regularly find PDF and RSS links for sites before normal links.
...
written by Lever, February 06, 2008
Letting the SEs spider your RSS is always good, but how do you stop SEs from spidering search pages? Even with OpenSEF installed and running the SEs still manage to index the search pages.

All blog articles are licensed under a Creative Commons Attribution 3.0 United States License.