Thursday 08th of December 2016 09:04:35 PM

Apache .htaccess Examples :: Rewriting and Redirecting with mod_rewrite (and mod_alias)

PURPOSE AND BACKGROUND

I heavily rely on Apache's mod_rewrite module in my site's .htaccess file to do a number of helpful things. In order to learn enough about mod_rewrite, I studied a number of examples on the web. Some of those examples were very hard to find. As the official mod_rewrite documentation will tell you, the use of these modules can be a black magic art. I'm hoping that these examples will help someone else acquire a little bit of that black magic.

First, be sure to see the official Apache documentation, which is mirrored all over the web. In particular, review these sites:

  • Apache 1.3 URL Rewriting Guide
    • There are a LOT of good examples here. Always start here for textbook examples of rewriting code, some of which are both complicated and useful.

  • Module mod_rewrite
    • This is an extremely important document. There are a lot of nuances built into this document that are often quickly overlooked because the author decided to only spend a quick sentence on something very important. Pay attention to every detail of this document.

  • Module mod_alias
    • mod_alias is not nearly as complex as mod_rewrite, but it's equally important. Much of my .htaccess file could be rewritten much more simply using mod_alias. The only reason I lean so much on mod_rewrite is that mod_alias recurses down every subdirectory of mine, which includes my subdomains. Thus, if I use mod_alias directives, redirections I want on my main site show up on all of my subdomains as well. This is not desirable. I solve this problem by rewriting all of my mod_alias statements with mod_rewrite directives; mod_rewrite directives do not recurse down subdirectories and subdomains. If it weren't for my subdomains, I'd use mod_alias much more. Everyone should have a solid understanding of this module.

  • Module mod_asis
    • This is an honorable mention. My .htaccess sends the 403 Forbidden for a number of different very specific reasons. I should have used mod_asis to send those custom 403 error messages. Combining mod_asis with mod_alias and/or mod_rewrite gives the ability to build CONDITIONAL ERROR DOCUMENTS. (I leave this as an exercise.)

Finally, note that this web page only scratches the surface. With directives like the ones involving chaining, passing through, and skipping, mod_rewrite can turn an .htaccess configuration file into a powerful scripting language.



Some Examples Similar to Lines in My .htaccess File

Order Matters to mod_rewrite Directives

  • It is important to note that the relative order of the mod_rewrite directives is important.

  • For example, if you are having a problem with a redirect rule that keeps putting information about the real filesystem location in the target URL, try moving that redirect rule earlier in the file.

  • In most cases, if there is no other easy way to determine ordering, it is best to order redirect rules to URLs with explicit hostnames FIRST. This sort of ordering is reflected in the examples given below.

  • The examples below are meant to be taken in order. If I were to put these into an .htaccess file, I would leave them in the same order as they appear on this page.

Spelling of Referrer is REFERER

  • Remember that it's HTTP_REFERER. This is NOT the correct spelling of the word referrer, but it IS the correct spelling of the server variable.

Difference Between Redirecting with mod_rewrite and mod_alias

  • These next two blocks may appear to be equivalent, but they have at least one major difference.
RewriteRule ^servo\.php$ http://www.tedpavlic.com/post_servo.php [R=permanent,L]
RewriteRule ^images($|/.*$) http://links.tedpavlic.com/images$1 [R=permanent,L]

Redirect permanent /servo.php http://www.tedpavlic.com/post_servo.php
RedirectMatch permanent ^/images($|/.*$) http://links.tedpavlic.com/images$1
  • The first block is implemented with mod_rewrite directives.

    Thus, the first block is NOT inherited by other .htaccess files that live in child directories underneath the main directory.

  • The second block is implemented with mod_alias directives.

    Thus, the second block IS INHERITED by other .htaccess files that live in child directories underneath the main directory.

  • In other words, suppose links.tedpavlic.com is a subdomain that is hosted out of a links folder that resides within the main www.tedpavlic.com document root. Suppose that links folder contains its own .htaccess file that makes no mention of either servo.php or images.

    When accessing http://links.tedpavlic.com/servo.php, the SECOND block will redirect this request back to http://www.tedpavlic.com/post_servo.php. However, the FIRST block will return a 404 File Not Found.

    When accessing http://links.tedpavlic.com/images, the SECOND block will redirect this request back to http://links.tedpavlic.com/images, which results in a redirect loop. However, the FIRST block will return a 404 File Not Found.

  • mod_alias rules ride along on top of the directory structure, regardless of the public structure of the web site and its subdomains. mod_rewrite rules from a parent .htaccess are set aside once a subdirectory's .htaccess supplies its own mod_rewrite configuration.

    For my site, because of my subdomains, that means that the mod_rewrite was best for me. This may not be the case with your site.
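
  • For completeness, mod_rewrite does offer a way to opt back in to the parent's rules. The following is only a minimal sketch and is not part of my own setup; the links/.htaccess file shown here is a hypothetical child file used for illustration.

```apache
# links/.htaccess (hypothetical child file, for illustration only)
RewriteEngine On
# Ask mod_rewrite to also apply the conditions and rules of the parent directory
RewriteOptions Inherit
```

    Inherited patterns are matched relative to the child directory, so a parent rule written for the main site may not behave the same way there. Test carefully before relying on it.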

Important Options

Options -Indexes +Includes +FollowSymLinks
  • -Indexes: I include this here to remind you that you are in control of your web site. If you don't like the way the webserver displays your very important content then change it. Rewrite it. Change how the webserver interprets requests. -Indexes to me is a symbol of control.

  • +Includes: This is more of a reminder to use .shtml files for your error documents (if you don't want to use error scripts). This will help you return good information to your users that they may be able to return to you in case they find a bug in your rules.

  • +FollowSymLinks: This is the important one. When using .htaccess mod_rewrite rewriting, this is required.

Turn the Engine On

RewriteEngine On
  • This is just a simple reminder that mod_rewrite needs to be turned on.
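
  • Putting the last two sections together, the top of an .htaccess file might look like the sketch below. The IfModule wrapper is an assumption on my part and is not in the lines shown above; it simply makes Apache skip the block when mod_rewrite is not loaded instead of failing on unknown directives.

```apache
# Assumed preamble for an .htaccess file that uses the rules on this page
Options -Indexes +Includes +FollowSymLinks

<IfModule mod_rewrite.c>
    # Nothing below is processed unless the engine is switched on
    RewriteEngine On
    # RewriteCond and RewriteRule lines go here, in the order shown on this page
</IfModule>
```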

Redirect to Most Desirable Hostname or Subdomain

RewriteCond %{HTTP_HOST} !^www\.tedpavlic\.com$ [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteCond %{REQUEST_URI} ^($|/.*$)
RewriteRule ^.* http://www.tedpavlic.com%1 [R=permanent,L]
  • My websites often have many aliases. These aliases are provided so I have some flexibility when I want to develop new content. These aliases are also provided so users have an easy way to remember my sites. However, regardless of user preference, I really want them to end up at one particular site. I also want search engines to only index ONE of those sites.

  • Note the use of the %1 rather than the typical $1. A %1 will match a group found in one of the RewriteCond statements. In this case, I'm picking off the whole REQUEST_URI so I can resubmit it to the subdomain. Note that I could have gotten rid of that third RewriteCond and done the match entirely in the RewriteRule line and used $1 instead. However, to keep consistency with my subdomains, I show it like this. This also avoids confusion with how the match works when the actual domain is found in the target. See mod_rewrite documentation and further information below for more details.

  • Notice the R=permanent. Not only does this rule rewrite the URL, but it issues a 301 permanent redirection. This should convince webbots to update their records to point to the central site.

  • Notice the L rewrite flag indicating that this is the last rule to be processed on this pass. Wait for the browser to continue the redirect. Then continue processing on the NEW URL. This simplifies rewriting rules later. This is the reason why I have this rule so early in my .htaccess file!!

  • Notice that the second line of this rule makes sure it does NOT apply when there is an empty HTTP_HOST variable. Browsers using older versions of the HTTP protocol may result in HTTP_HOST being empty. Let these users through without the redirect. Otherwise, you will put them in a deadly redirect loop. That's bad.

  • Note that when the explicit site hostname is given in the target URL, the RewriteRule is interpreted differently and matches against a slightly different string. See mod_rewrite documentation for more information about this. This distinction is not important in this rule because I chose to match on REQUEST_URI instead. I only chose to do this because it is necessary for me to do this within subdirectories that host my subdomains. (see below)

  • The following is the very similar RewriteRule block I use on each of my subdomains that lie inside subdirectories of my main site. Depending on what sort of redirect you are trying to do, this may be a better choice for you.
RewriteCond %{HTTP_HOST} !^links\.tedpavlic\.com$ [NC]
RewriteCond %{HTTP_HOST} !^$
RewriteCond %{REQUEST_URI} ^/links($|/.*$)
RewriteRule ^.* http://links.tedpavlic.com%1 [R=permanent,L]
  • Note the similarities and differences between these lines and the lines that I use in my main website. The purpose of this rule is to redirect any request for http://www.tedpavlic.com/links/.* directly to http://links.tedpavlic.com/.*.

  • One major difference is that in this case I'm only grabbing a portion of the REQUEST_URI to pass on to the subdomain. Note again that I use %1 here rather than $1.

  • Here, it is important that I match against the REQUEST_URI with a RewriteCond line because a request to http://www.tedpavlic.com/links/ will cause the RewriteRule line to match against the ABSOLUTE FILENAME from the FILE SYSTEM rather than just the relative filename from the document root. The RELATIVE FILENAME is ONLY USED WHEN the TARGET URL INCLUDES the web site host name.

  • The final important point to make here is that this rule COULD NOT have been placed in the main site's .htaccess file. This is because (UNLIKE mod_alias directives) the mod_rewrite rules do not recurse into subdomain subdirectories because each of my subdomains has its own special .htaccess file. As a consequence, if anyone requests a file from those directories directly under the main site, she will be redirected to the actual subdomain. Because of the existence of the subdomain's .htaccess file, any rules I make in the main .htaccess file to attempt to do the same redirections are disregarded. Thus, the rules must exist in the subdomain .htaccess file.
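
  • The main-site rule above mentions that the third RewriteCond could be dropped in favor of a capture in the RewriteRule itself. A minimal sketch of that $1 variant follows, assuming the .htaccess file sits in the document root, where the pattern sees the request path without its leading slash (hence the explicit / before $1).

```apache
# Variant of the main-site rule that captures in the RewriteRule instead of a RewriteCond
RewriteCond %{HTTP_HOST} !^www\.tedpavlic\.com$ [NC]
RewriteCond %{HTTP_HOST} !^$
# $1 refers to the group in this RewriteRule pattern; %1 would refer to a RewriteCond group
RewriteRule ^(.*)$ http://www.tedpavlic.com/$1 [R=permanent,L]
```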

Forbid Access to Bad Webbots (and others?)

RewriteCond %{HTTP_USER_AGENT} nhnbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} naver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NHN.Corp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} naverbot [NC,OR]
# Korean addresses that Naverbot might use
RewriteCond %{REMOTE_ADDR} ^61\.(7[89]|8[0-5])\. [OR]
# Korean addresses that Naverbot might use
RewriteCond %{REMOTE_ADDR} ^218\.(14[4-9]|15[0-9])\. [OR]
RewriteCond %{HTTP_USER_AGENT} Sleipnir [NC]
# Allow access to robots.txt and forbidden message
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/403\.shtml$
RewriteRule ^.* - [F,L]
  • Note the chaining of rewrite conditions. These condition lines implicitly have an "and" between each pair of lines unless an OR is used. The combined result is then "anded" with the match of the rewriting rule itself.

  • Note that this primarily applies to robots, so it would not have been a bad idea to also check to make sure HTTP_REFERER was empty. Most robots enter the site without any referrer. If you have HTTP_USER_AGENT checks that may accidentally catch real users, a second check making sure the referrer is empty wouldn't be a bad idea.

  • Note the explicit check for robots.txt and 403.shtml. Without this check, the robots will be forbidden from seeing your custom built 403 message and your robots.txt which tells the robot where it should and should not be.

  • Note the use of the F option on the rewrite rule. This instructs the web server to send a 403 Forbidden.

  • Note the use of regular expressions to pick out IP address ranges. A strong grasp of regular expressions will be very helpful when writing these rules and conditions.
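
  • The extra referrer check suggested above could be sketched as follows. ExampleBot is a hypothetical user agent string standing in for any pattern that might accidentally match real browsers; it is not one of the agents from my file.

```apache
# "ExampleBot" is a made-up pattern used only for illustration
RewriteCond %{HTTP_USER_AGENT} ExampleBot [NC]
# Only treat the request as a robot when it arrives without a referrer
RewriteCond %{HTTP_REFERER} ^$
# Still allow access to robots.txt and the forbidden message
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/403\.shtml$
RewriteRule ^.* - [F,L]
```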

Forbid Access to Only Certain Types of Files from Certain Agents

RewriteCond %{HTTP_USER_AGENT} MMCrawler [NC]
RewriteCond %{REQUEST_URI} !^/.*\.(txt|tex|ps|pdf|php|htm|html|shtm|shtml)$ [NC]
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/403\.shtml$
RewriteRule ^.* - [F,L]
  • This rule keeps MMCrawler from grabbing anything but true CONTENT.

  • Again, notice the explicit exclusion of robots.txt and 403.shtml from this rule. In this rule, this is NOT NECESSARY since these are already excluded by the rest of the rule.

  • Note the rule expression could be more compact, but it is easier to read this way.
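
  • For comparison, a more compact form of the same idea might look like the sketch below. It folds the four HTML extensions into s?html? and drops the two redundant exclusions; treat it as an untested alternative rather than what I actually run.

```apache
RewriteCond %{HTTP_USER_AGENT} MMCrawler [NC]
# s?html? covers htm, html, shtm, and shtml; robots.txt and 403.shtml are covered by txt and shtml
RewriteCond %{REQUEST_URI} !\.(txt|tex|ps|pdf|php|s?html?)$ [NC]
RewriteRule ^ - [F]
```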

Forbid Access to Certain Documents

RewriteCond %{REQUEST_URI} (^|/)\.htaccess$ [NC,OR]
RewriteCond %{REQUEST_URI} ^/guestbook\.csv$ [OR]
RewriteCond %{REQUEST_URI} ^/post_template\.php$ [OR]
RewriteCond %{REQUEST_URI} ^/page_template\.php$ [OR]
RewriteCond %{REQUEST_URI} ^/hitlist\.db$ [OR]
RewriteCond %{REQUEST_URI} ^/includes($|/.*$) [OR]
RewriteCond %{REQUEST_URI} ^/ban($|/.*$) [OR]
RewriteCond %{REQUEST_URI} ^/removed($|/.*$)
RewriteRule ^.* - [F,L]
  • This is a simple rule. It shows more regular expressions and how to block access to important non-content files that just happen to live in the same directory (or close) as web content.

  • NOTE that the last three of these conditions block entire directories AND ALL OF THEIR CHILDREN.
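
  • The same list can also be grouped with alternation. This is only a sketch of an equivalent layout; the one-condition-per-line form above is easier to maintain when entries come and go.

```apache
RewriteCond %{REQUEST_URI} (^|/)\.htaccess$ [NC,OR]
# Individual files in the document root
RewriteCond %{REQUEST_URI} ^/(guestbook\.csv|post_template\.php|page_template\.php|hitlist\.db)$ [OR]
# Whole directories and everything beneath them
RewriteCond %{REQUEST_URI} ^/(includes|ban|removed)($|/.*$)
RewriteRule ^.* - [F,L]
```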

Prevent Good Spiders from Entering Traps for Bad Spiders

RewriteCond %{REMOTE_ADDR} ^61\.(7[89]|8[0-5])\. [OR]
# Googlebot
RewriteCond %{REMOTE_ADDR} ^64\.68\.82\. [OR]
RewriteCond %{REMOTE_ADDR} ^216\.239\.39\.5$ [OR]
RewriteCond %{REMOTE_ADDR} ^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\. [OR]
# Yahoo Slurp
RewriteCond %{REMOTE_ADDR} ^66\.196\.(6[4-9]|(7|8|9|10|11)[0-9]|12[0-7])\. [OR]
RewriteCond %{REMOTE_ADDR} ^68\.142\.(19[2-9]|2[0-4][0-9]|25[0-5])\. [OR]
# msnbot
RewriteCond %{REMOTE_ADDR} ^207\.46\. [OR]
# psbot
RewriteCond %{REMOTE_ADDR} ^62\.119\.133\.([0-5]?[0-9]|6[0-3])$ [OR]
# Cyveillance
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[34][0-9]|5[0-5])$
# Bots don't come from referrers
RewriteCond %{HTTP_REFERER} ^$
# Prohibit suckerdir and trapme access
RewriteCond %{REQUEST_URI} ^/(suckerdir|trapme)(/|$)
RewriteRule ^.* - [F,L]
  • There are a number of methods to trick spambots into areas that record their presence, submit their IP to authorities, and block them from further access to the site.

  • These methods often "poison" the spambot as well by providing fictitious e-mail addresses and (perhaps not obviously) recursive links. Clearly, it would be bad if a GOOD bot ever found its way into such traps. This would waste the resources of the good bot. This would also possibly submit bad content onto a search engine. It additionally might ban a legitimate bot from your site (and others).

  • This rule tries to prevent good bots from wandering into bad traps.

  • Notice the restriction is on a class of requests that start with a particular string.
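
  • On newer servers, address ranges can be written as CIDR blocks instead of regular expressions. The sketch below assumes Apache 2.4 or later, where RewriteCond accepts an expr test string and the -R operator matches the client address; it is not how the rules above are written. It restates the third Googlebot line, since 66.249.64 through 66.249.95 is the block 66.249.64.0/19.

```apache
# Assumes Apache 2.4+ (expr conditions are not available in older versions)
# Same range as ^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\. written as a CIDR block
RewriteCond expr "-R '66.249.64.0/19'"
# Bots don't come from referrers
RewriteCond %{HTTP_REFERER} ^$
# Prohibit suckerdir and trapme access
RewriteCond %{REQUEST_URI} ^/(suckerdir|trapme)(/|$)
RewriteRule ^.* - [F,L]
```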

Prevent Real People from Entering Traps for Bad Spiders

# Real people often do come from referrers. Protect them.
RewriteCond %{HTTP_REFERER} !^$
# Prohibit suckerdir and trapme access
RewriteCond %{REQUEST_URI} ^/(suckerdir|trapme)(/|$)
RewriteRule ^.* - [F,L]
  • It would also be bad if real people came upon these URLs. Most likely, if they hear of these URLs, it'll be from a link that some jerk has put on a page somewhere.

  • Again, remember that bots usually carry no HTTP_REFERER.

  • Since these traps are designed for bots, forbid access that arrives through links. The rule fires whenever HTTP_REFERER is NOT empty, so only requests with an empty HTTP_REFERER reach the trap.

Setup an Environment for Bad Spider Traps

## Setup the sand traps, suckerdir and trapme
# This RedirectMatch makes sure there's a trailing / on "directories"
RedirectMatch /(suckerdir|trapme)$ http://www.tedpavlic.com/$1/
# This RewriteRule makes sure there's a trailing / on "directories"
RewriteCond %{REQUEST_URI} (suckerdir|trapme)/(.+)$
RewriteCond %{REQUEST_URI} !(suckerdir|trapme)/(.+)(\.(html?|php)|/)$
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1/ [R,L]
# This RewriteRule makes index.html the "DirectoryIndex"
RewriteRule ^(suckerdir|trapme)(/|/(.*)/)$ $1$2/index.html
# This RewriteRule actually generates the content for each query
RewriteRule ^(suckerdir|trapme)/.+$ $1/$1.php
  • This set of rules helps to "create" the traps mentioned above. The actual traps also involve some scripts that generate the bad content, but these rules make those scripts more believable. As you can see from the last line, the scripts that make all the magic of this trap are suckerdir.php and trapme.php. Note that the index.html file does not exist and the second-to-last line really isn't needed. It's just there for my own amusement in the wonder and power of mod_rewrite.

  • Notice the use of HTTP_HOST. If and when the site name changes in the future, this makes it easy to transport this rule to the new site name. REMEMBER that one of the FIRST RULES redirected to the desired site name, so HTTP_HOST at this point is a known quantity.

  • Remember that you can only use %{HTTP_HOST} with the mod_rewrite directives. This DOES NOT EXIST with the mod_alias directives.

  • The end effect of this rule is that any request to a DIRECTORY, html FILE, or php SCRIPT that BEGINS with suckerdir or trapme ACTUALLY gets EXECUTED by the suckerdir.php and trapme.php scripts WITHOUT THE AGENT EVER KNOWING.

  • These rules also make sure that requests that look like requests to directories without the trailing slash get redirected to the version that does have the trailing slash before actually getting processed. This helps convince a bot that it's looking at real content.

  • This shows a way to make dynamic content LOOK STATIC.

  • It also shows how one script can operate AN ENTIRE SITE and the user will PERCEIVE that the site is MANY pages with an entire DIRECTORY STRUCTURE.
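
  • The "one script runs the whole site" idea is often written in a more general form than my trap rules. The following is a minimal sketch of that common pattern; index.php is a hypothetical script name and is not one of my files.

```apache
RewriteEngine On
# Leave requests for real files and real directories alone
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Everything else is handled internally by a single (hypothetical) script
RewriteRule ^ index.php [L]
```

    Because there is no R flag, the rewrite is internal and the agent only ever sees the URL it asked for.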

Strip the Query Strings from Requests from Bots

# NOTE: It is okay to match bad bots here too. We just don't want to match
#       real human people.
# To match most bots, check out User-Agent and look for empty referrer
RewriteCond %{HTTP_USER_AGENT} (google|other_bots|wisenutbot) [NC]
RewriteCond %{HTTP_REFERER} ^$
RewriteRule ^.* - [E=HTTP_CLIENT_IS_BOT:1]
# Certain bots actually do have referrers. Catch them too.
RewriteCond %{HTTP_USER_AGENT} (becomebot) [NC]
RewriteRule ^.* - [E=HTTP_CLIENT_IS_BOT:1]
# Match a bot
RewriteCond %{ENV:HTTP_CLIENT_IS_BOT} ^1$
# Look for non-empty query string
RewriteCond %{QUERY_STRING} !^$
# Force it empty and tell the bot that it's a permanent change
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1? [R=permanent,L]
  • I've done both setting and checking environment variables here. This isn't necessary. Those lines could be combined, but I thought this was a good place to show an example of using environment variables in this way.

  • One advantage of using environment variables here is that it passes useful information back to my web scripts. In this case, checking the HTTP_CLIENT_IS_BOT environment variable lets me know that it has met my "bot criteria" set up here in the .htaccess file. I can then tailor my content for bots.

  • The first half of these rules identifies probable web bots. Since nearly all bots have empty referrers, it's easy to reduce false positives by checking for an empty HTTP_REFERER.

  • The second half strips the query string from all queries identified as being from bots by the first half.

  • This is useful to me since many of my pages use query strings to change the display format, but the content stays the same. These rules prevent redundant indexing of content.

  • Notice the use of the ? at the end of the rewriting rule. A single ? at the end of the rule REMOVES THE QUERY STRING. This is one of those documented features that is OFTEN OVERLOOKED.

  • NOTICE that this REDIRECTION does not occur if the QUERY_STRING is EMPTY. This prevents REDIRECT LOOPS!!
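
  • As noted above, the setting and checking steps could be combined. A minimal sketch of a combined form for the empty-referrer bots only is shown below. It also uses the QSD flag in place of the trailing ?, which assumes Apache 2.4 or later; neither change is part of my own file.

```apache
# Combined form for bots that arrive without a referrer (no environment variable)
RewriteCond %{HTTP_USER_AGENT} (google|wisenutbot) [NC]
RewriteCond %{HTTP_REFERER} ^$
# Only redirect when there is a query string to strip; this prevents redirect loops
RewriteCond %{QUERY_STRING} !^$
# QSD discards the query string (assumes Apache 2.4+)
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1 [QSD,R=permanent,L]
```

    The trade-off is that scripts no longer receive HTTP_CLIENT_IS_BOT, and bots that do send referrers need a separate rule of their own.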

Some Cheap and Simple Redirects

# Redirect requests that should be going to subdomains directly
RewriteRule ^osufirst($|/.*$) http://osufirst.tedpavlic.com$1 [R=permanent,L]
RewriteRule ^schedule($|/.*$) http://schedule.tedpavlic.com$1 [R=permanent,L]
RewriteRule ^europa/schedule($|/.*$) http://schedule.tedpavlic.com$1 [R=permanent,L]
RewriteRule ^blocko($|/.*$) http://blocko.tedpavlic.com$1 [R=permanent,L]
RewriteRule ^europa/blocko($|/.*$) http://blocko.tedpavlic.com$1 [R=permanent,L]
# Redirect some renames that may still be linked elsewhere under old names
RewriteRule ^servo\.php$ http://www.tedpavlic.com/post_servo.php [R=permanent,L]
RewriteRule ^riley\.(jpg|gif|png)$ http://links.tedpavlic.com/riley.$1 [R=permanent,L]
  • Notice the permanent keyword. Using this keyword is identical to using R=301. If no keyword is given, R=302 is implied, which performs a temporary redirect.

  • Other status codes may be used as long as they correspond to REDIRECT operations. In other words, other status codes may be used as long as they are greater than 299 and less than 400.

  • To indicate that a page is gone or forbidden, use the G or F flags, respectively, and replace the target URL with a - (hyphen). See the next section for an example.
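
  • When several subdomain redirects share a shape, they can be folded into one rule with backreferences. The sketch below is an untested consolidation of the four schedule and blocko lines above; it assumes the subdomain name always equals the directory name.

```apache
# $1 is the directory (and subdomain) name, $2 is the rest of the path
# (?:europa/)? optionally matches the old europa/ prefix without capturing it
RewriteRule ^(?:europa/)?(schedule|blocko)($|/.*$) http://$1.tedpavlic.com$2 [R=permanent,L]
```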

Redirects with Other Status Messages

# Now pages that have just been removed and replaced
RewriteRule ^opinions\.php$ http://www.tedpavlic.com/general_posts.php [R=seeother,L]
# Now pages that have just been removed entirely
RewriteRule ^analog_ee\.php$ - [G,L]
RewriteRule ^teaching\.php$ - [G,L]
RewriteRule ^wav($|/.*$) - [G,L]
RewriteRule ^toys_.*\.s?html?$ - [G,L]
# Now forbid some pages that I want to keep around but don't want people to see
Redirect 403 /phpinfo.php
  • seeother provides a good alternative to a permanent redirect. It does not imply that the target page is a replacement; it states that the desired page is gone but a similar page is available.

  • gone does not take a target parameter with mod_alias directives; with mod_rewrite directives, the target is a simple hyphen (-). Use this status when a page has been permanently removed. Note that mod_rewrite's RewriteRule uses the G flag rather than R=gone, since - is given in place of a redirection target.

  • The final mod_alias rule uses 403 to indicate that all requests to /phpinfo.php should be given the 403 Forbidden status. There is no mod_alias keyword for forbidding pages; however, mod_rewrite provides the F parameter.

  • NOTE that the final rule is a mod_alias directive, so it applies to all subdirectories. This includes subdomains that happen to be hosted beneath this directory. So this single rule prohibits phpinfo.php access on ANY of my subdomains.

  • Being smart about these very technical issues helps to make sure search engines (and users in general) stay up to date with the layout of your site.
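
  • For comparison, here is a sketch of how some of the same statuses look with mod_alias keywords. These are illustrative equivalents, not lines from my file, and (as discussed earlier) they would also apply to every subdomain hosted beneath this directory.

```apache
# seeother takes a target URL
Redirect seeother /opinions.php http://www.tedpavlic.com/general_posts.php
# gone takes no target URL at all
Redirect gone /teaching.php
RedirectMatch gone ^/wav($|/.*$)
```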

Case Insensitive robots.txt

# Any request for robots.txt, regardless of case, should go through
RewriteRule ^robots\.txt$ robots.txt [NC]
  • This rule allows robots.txt to be fetched under any capitalization of its name. It makes the robots.txt file completely case insensitive.

  • With the addition of this rule, it is a good idea to make every other robots.txt condition case insensitive with the NC parameter.
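
  • As a small sketch of that advice, the robots.txt exclusion used in the blocking rules earlier on this page would gain an NC flag. The 403.shtml line is repeated unchanged only to show where the condition sits.

```apache
# Allow access to robots.txt in any capitalization, and to the forbidden message
RewriteCond %{REQUEST_URI} !^/robots\.txt$ [NC]
RewriteCond %{REQUEST_URI} !^/403\.shtml$
```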