CakePHP and SEO

I have to admit that in the past I never paid much attention to SEO (Search Engine Optimization) or to how search engines treat my sites in general.
This can be quite a pitfall and cost you a fair bit of visibility these days, though.

In this post I will not go into too much detail, but pick out some major points that can really improve your page rank and visibility. I will also mainly focus on the technical aspects here (for content-related topics that have nothing to do with the framework, please consult your SEO expert).

robots.txt

This is a file for search engines (SE) to quickly find out which paths they are supposed to ignore – and more.

User-agent: *
Disallow: /outbound/
Disallow: /js/
Disallow: /css/
Disallow: /files/
...

List the static file paths and controllers here that you do not want to be indexed and/or followed.

sitemap.xml

Google especially, but also many other SEs, like to have such a pre-formatted way of being told how your site is structured and how it should be recognized. It also provides all the basic page URLs right away. You can read more about it here. There are even some validators out there that you can use to check that your sitemap is valid.

There are CakePHP plugins that are able to generate sitemaps for you. This way your sitemap.xml file will be created dynamically on demand and will always be up to date.
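
If you are curious what such a plugin has to produce, here is a minimal sketch of the sitemap format itself. It is not taken from any of those plugins: buildSitemap() and the example URLs are made up for illustration, and in a real app the entries would come from your models.

```php
<?php
/**
 * Build a minimal sitemap.xml document from a list of URL entries.
 * Each entry is array('loc' => absolute URL, 'lastmod' => optional Y-m-d date).
 * Hypothetical helper for illustration, not part of CakePHP.
 */
function buildSitemap(array $entries) {
    $lines = array(
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    );
    foreach ($entries as $entry) {
        $lines[] = '  <url>';
        // URLs must be entity-escaped (e.g. "&" becomes "&amp;").
        $lines[] = '    <loc>' . htmlspecialchars($entry['loc'], ENT_QUOTES, 'UTF-8') . '</loc>';
        if (!empty($entry['lastmod'])) {
            $lines[] = '    <lastmod>' . htmlspecialchars($entry['lastmod'], ENT_QUOTES, 'UTF-8') . '</lastmod>';
        }
        $lines[] = '  </url>';
    }
    $lines[] = '</urlset>';
    return implode("\n", $lines) . "\n";
}

// Example entries (made up).
echo buildSitemap(array(
    array('loc' => 'http://example.com/articles/view/My-Article', 'lastmod' => '2014-01-31'),
    array('loc' => 'http://example.com/users'),
));
```

Served from a controller action routed to /sitemap.xml, something like this would stay up to date without any file on disk.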

Avoid duplicate content

Slash or no slash: No slash!

As outlined in this article, it is very important to route to only one variant, either with or without a slash at the end of your actions. The trailing-slash variant is preferred by SEs. This would mean that /controller/action should 301 redirect to the correctly routed /controller/action/. But the no-slash variant is what Cake uses. And since it works out of the box with the current 2.x Router class, it might make sense to stick to it. Personally, I also favor the no-slash option since it makes your life so much easier, especially with Cake.

Just place this snippet in your htaccess:

# Redirect /controller/action/ to /controller/action (the root URL "/" is left alone)
RewriteCond %{REQUEST_URI} ^(.+)/$
RewriteRule ^ %1 [R=301,L]

If you must: Slash

But for the sake of completeness I will also outline how to get it to work the other way around. For the trailing-slash option we need to do a little bit more.
Several steps are necessary to achieve this site-wide.

Basically, we need to adjust the redirects in the AppController and the URL generation in the AppHelper. All custom Router::url() calls also need to be adjusted manually, of course.
I moved the complete documentation on that to the TrailingSlash GitHub repository.

For URLs generated in your controller via Router::url() you need to manually adjust the url:

$url = Router::url(array('some' => 'param')) . '/';
$this->Model->processOrStoreUrlInSomeWay($url);
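
Simply appending '/' goes wrong as soon as the URL carries a query string or a fragment, because the slash then ends up after those. A small helper can take care of that. This is only a sketch: withTrailingSlash() is a made-up name, not part of CakePHP or the TrailingSlash code, and it assumes you only feed it routed action URLs (not file URLs like /files/a.pdf).

```php
<?php
/**
 * Append a trailing slash to the path part of a URL, leaving the
 * query string and fragment untouched.
 * Hypothetical helper for illustration, not part of CakePHP.
 */
function withTrailingSlash($url) {
    // Split off "?query" and/or "#fragment" so the slash lands on the path.
    $suffix = '';
    $pos = strcspn($url, '?#');
    if ($pos < strlen($url)) {
        $suffix = substr($url, $pos);
        $url = substr($url, 0, $pos);
    }
    // Only add the slash if the path does not already end with one.
    if ($url === '' || substr($url, -1) !== '/') {
        $url .= '/';
    }
    return $url . $suffix;
}

echo withTrailingSlash('/users/index?page=2'), "\n";
```

In the controller example above you could then wrap the Router::url() result in withTrailingSlash() instead of concatenating the slash by hand.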

Last but not least you need to modify your virtual host setup or the htaccess and add the following snippet:

RewriteCond %{REQUEST_URI} !(.*)/$
RewriteRule ^(.*)$ %{REQUEST_URI}/ [R=301,L]

It will redirect (301, permanent) non-trailing-slash URLs to the ones with the trailing slash. Doing the redirect this early in the htaccess takes the overhead away from PHP and makes it very fast.

Now all internal linking (both links and redirects) should be addressed and working as expected.

Side note: Browsers will usually cache 301 redirects. If you debug your redirects, keep that in mind.

Multiple URLs to the same content

If there are pages that contain the same content, or if an action can be reached via multiple different URLs, we have a problem called "duplicate content" and can be penalized by SEs for it.

Use the canonical tag to tell the SE which one is the "real" page that should be indexed. All pages that have a canonical tag linked to the parent page (usually containing the same content) will not be regarded as duplicate content. Issue resolved.

<link rel="canonical" href="/controller/action/passed/" />

This should be in the <head> tag of your HTML layout.

You can safely add it to every page. If it equals the current page's URL it will just be ignored.
We usually use this to start off:

if (!isset($canonical)) {
    $canonical = $this->request->here();
}
if ($canonical !== false) {
    echo '<link rel="canonical" href="' . $this->Html->url($canonical) . '">';
}

This can still produce duplicate content, though, as /users/index/ would route to the same page as /users/.
You can avoid that by calling Router::url() instead, which would produce the second URL. Also make sure that you handle passed and named params as well as query strings here. We do that in the controller scope (usually in a component) and pass the cleaned canonical URL to the view, thus skipping the default and potentially incorrect $request->here().

Side note: This is also a good time to talk about passed and named params and/or query strings. People sometimes ask when to use which. As a guideline, passed params are more stable due to their fixed order and usually help build the URL. They will then probably be part of the canonical link as well. Named params and query strings are usually more versatile and therefore better suited for adjustment and filtering. In most cases it makes sense to filter them out altogether and set the canonical to the URL with only passed params. For pagination without sorting/filtering, the page param could be handled separately, though (see below), and whitelisted here.
There are many use cases where query strings might very well have a right to be in the canonical URL. A whitelist of some sort might help here as well.
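
To make the whitelist idea a bit more concrete, here is a rough sketch in plain PHP. It is not our actual component: canonicalUrl() is a made-up name, it only deals with query strings (Cake's named params would need their own handling), and it returns a path-relative URL.

```php
<?php
/**
 * Reduce a request URL to a canonical form: keep the path, drop every
 * query parameter that is not explicitly whitelisted.
 * Hypothetical helper for illustration, not part of CakePHP.
 */
function canonicalUrl($url, array $allowedParams = array()) {
    $parts = parse_url($url);
    $path = isset($parts['path']) ? $parts['path'] : '/';

    $query = array();
    if (isset($parts['query'])) {
        parse_str($parts['query'], $query);
    }
    // Keep only whitelisted keys, then sort them for a stable parameter order.
    $query = array_intersect_key($query, array_flip($allowedParams));
    ksort($query);

    return $query ? $path . '?' . http_build_query($query) : $path;
}

// Sorting is dropped, the whitelisted page param survives.
echo canonicalUrl('/users?sort=name&page=2', array('page')), "\n";
```

Sorting the remaining keys means that /users?b=1&a=2 and /users?a=2&b=1 end up with the same canonical URL.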

Title and meta tags

The title together with the "description" meta tag is very important. Both are used by SEs to find the page via search terms and to display a small excerpt.

<meta name="description" content="My Description of this page!" lang="en" />

For pagination it is recommended to add the current page after the title:

if (!empty($this->params['named']['page'])) {
    $title .= ' (' . __('Page %s', $this->params['named']['page']) . ')';
}

More meta tags can be quite important:

The content type tag should already be part of your layout:

<?php echo $this->Html->charset(); ?>

The content language meta tag is used to tell the SE which language this page is in:

<meta http-equiv="content-language" content="de" />

For more specific language definition you can add the regional part:

<meta http-equiv="content-language" content="de-DE" />

But as these meta tags are likely to become obsolete, the most future-proof approach seems to be to additionally set the lang attribute on the html tag:

<html lang="de">

Keywords can help the SE to decide if the page is relevant:

<meta name="keywords" content="Comma,Separated,Keywords" />

Don’t use too many keywords on one page, though. SEs are known to penalize or blacklist your site for abuse.

For telling SEs how to treat the current page, you can use the robots meta tag:

<meta name="robots" content="index,follow,noarchive" />

This would tell the SE to index the current page, follow all links and not to cache (archive => available offline) this page.

<meta name="robots" content="noindex,nofollow" />

This would make the SE disregard this page and its links completely.

Although still widely used, some SEs might ignore it. So more and more people are moving to other, more reliable options such as robots.txt or htaccess directly.
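
One such option on the HTTP level is the X-Robots-Tag response header, which carries the same directives as the robots meta tag. That header is my pick for this example, it is not something prescribed above. A small sketch of sending it from PHP, with robotsHeader() being a made-up helper and the list of accepted directives deliberately short:

```php
<?php
/**
 * Build an X-Robots-Tag HTTP header line from robots directives.
 * Only a small, explicitly listed set of directives is accepted here.
 * Hypothetical helper for illustration, not part of CakePHP.
 */
function robotsHeader(array $directives) {
    $known = array('index', 'noindex', 'follow', 'nofollow', 'noarchive', 'nosnippet');
    $unknown = array_diff($directives, $known);
    if ($unknown) {
        throw new InvalidArgumentException('Unknown robots directive: ' . implode(', ', $unknown));
    }
    return 'X-Robots-Tag: ' . implode(', ', $directives);
}

// Usage: send the header before any output, e.g. for a page that
// should stay out of the index:
// header(robotsHeader(array('noindex', 'nofollow')));
echo robotsHeader(array('noindex', 'nofollow')), "\n";
```

The advantage over the meta tag is that this also works for non-HTML responses such as PDFs or images.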

Speaking URLs

It is usually better for SEO to generate speaking URLs for your posts (without the id, but with a dashed slug).
Use the SluggedBehavior to automatically generate such a slug field from your title/name. Then all you need to do is to add this as passed param to your URLs:

echo $this->Html->link('My Article', array('action' => 'view', $article['Article']['slug'])); // links to something like /articles/view/My-Article

You can then use custom routes to shorten it even further to something like /article/My-Article.

Pro tip: You can keep the baked ids and use custom Route classes to map those ids to the slug in the route class. This way you don’t have to touch any view files.
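
In case you wonder what generating such a dashed slug boils down to, here is a deliberately simple sketch. It is not the algorithm SluggedBehavior uses: slugify() is a made-up function, it is ASCII-only, and it does not take care of uniqueness.

```php
<?php
/**
 * Turn a title into a dashed slug such as "My-Article".
 * ASCII-only sketch: every run of other characters becomes a single dash.
 * Hypothetical helper for illustration, not SluggedBehavior's implementation.
 */
function slugify($title) {
    $slug = preg_replace('/[^A-Za-z0-9]+/', '-', $title);
    // Drop dashes left over at the start or end.
    return trim($slug, '-');
}

echo slugify('My Article'), "\n";
```

A real behavior would additionally transliterate special characters and make sure that two records never end up with the same slug.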

Is there more?

Yes, there is. SEO is a topic where there will always be new developments.
One new element, for example, is the combination of hreflang and canonical tags. We tried it and failed at it (so far), so we had to revert the changes. The outcome was not what we expected.
If you succeeded in using it on a multilingual site, please share your approach and results.

Also, I have been thinking about modifying the core Router class and submitting this as a PR (Pull Request) and enhancement for future versions of CakePHP to handle this issue a little bit more gracefully. See my ticket for details.
This would allow us to do this in a clean way.

Appendix

This snippet helps to prevent double slashes from creating duplicate content:

RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . %1/%2 [R=301,L]

It removes additional slashes leaving only a single one (at the end as well as between):

/controller/action// becomes /controller/action/
/controller//action becomes /controller/action
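
If you want to double-check the expected redirect targets, for example in a test, the same normalization is easy to mirror in PHP. This is just a sketch: collapseSlashes() is a made-up name, and it is meant for URL paths only (a full URL with http:// in front would get mangled). Unlike the rewrite rule, which removes one extra slash per redirect, it collapses all of them in one go.

```php
<?php
/**
 * Collapse every run of two or more slashes in a URL path into one.
 * Hypothetical helper for illustration; pass paths only, not full URLs.
 */
function collapseSlashes($path) {
    return preg_replace('#/{2,}#', '/', $path);
}

echo collapseSlashes('/controller//action'), "\n";
```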

8 Comments

  1. Hello,

    Same question for me, what plugin do you use to generate a sitemap.xml ?

    Bye

  2. I’ve taken a while to discover how to write the right language code in the HTML tag, I’d like to share it

    <?php
    $o = new L10n;
    $langCode = $o->__l10nMap[$this->Session->read('Config.language')];
    ?>
    <html lang="<?php echo $langCode; ?>">
