Thursday, January 26, 2012

mod-rewrite




This article covers mod-rewrite from gekko.

Mod-rewrite provides a way to modify incoming URL requests dynamically, based on regular expression (regex) rules. These rules allow us to map arbitrary 
URLs onto our internal URL structure in any way we like, for example by exposing a file under a different name in the URL than the one it has on disk. The rules are written in a file named .htaccess.



Here is an example:


.htaccess file content 

Options +FollowSymLinks
RewriteEngine on
RewriteRule ^page1\.html$ page2.html [R=301,L] 

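1. The .htaccess file is where we rewrite our site's incoming URL structure. 
2. The rules are read top to bottom.
3. Order therefore matters: a rule with the L flag stops processing when it matches, so a specific rule (such as a single page redirect) needs to come before any broader rule that would also match the same URL.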

First off,
Options +FollowSymLinks
This directive instructs Apache to follow symbolic links within the site.
Symbolic links are "abbreviated nicknames" (filesystem shortcuts) for things within the site, and some server configurations disable them. 
Since mod_rewrite requires this option to be enabled when it runs from an .htaccess file, we turn it on explicitly.

Second,
RewriteEngine on
The "RewriteEngine on" directive does exactly what it says.
The rewrite engine is off by default, and this directive enables the processing of the mod_rewrite directives that follow. 

Third,
RewriteRule ^page1\.html$ page2.html [R=301,L] 
In this example, we have a caret at the beginning of the pattern and a 
dollar sign at the end. These are regex special characters called 
anchors.
The caret anchors the match to the start of the string, so the match must begin with the 
character that immediately follows it, in this case a "p".
The dollar sign anchors the match to the end of the string, so nothing may follow the 
final "l". 
The backslash before the dot makes it match a literal dot instead of any character.
In this example the pattern ^page1\.html$ is followed by the substitution page2.html. 
Both page1\.html and ^page1\.html$ match the string page1.html, but they are not interchangeable.
The unanchored page1\.html matches any string that contains page1.html anywhere in the URL (apage1.html, for example), while ^page1\.html$ matches only a string which is exactly equal to page1.html.
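
To see the difference between the anchored and the unanchored pattern, here is a minimal sketch in Python. It uses Python's re module as a stand-in for Apache's regex engine (an assumption made for illustration; the two agree on patterns this simple). The helper name matches and the sample paths are made up for this example.

```python
import re

UNANCHORED = r"page1\.html"
ANCHORED = r"^page1\.html$"


def matches(pattern, path):
    """Return True if the pattern matches anywhere in path."""
    return re.search(pattern, path) is not None


# Sample request paths (hypothetical, chosen to show each case).
for path in ["page1.html", "apage1.html", "page1.html.bak", "page1xhtml"]:
    print(path, matches(UNANCHORED, path), matches(ANCHORED, path))

# Output:
# page1.html True True
# apage1.html True False
# page1.html.bak True False
# page1xhtml False False
```

The last line also shows why the dot is escaped: without the backslash, page1xhtml would match too.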

In a more complex redirect, anchors (and other special regex characters) are often essential. 

And Finally,

In the above example, we also have an [R=301,L]

These are called flags in mod_rewrite and they're optional parameters. 

R=301 instructs Apache to answer with a 301 (permanent) redirect to the new URL. When the code is omitted, as in [R,L], it defaults to 302 (temporary). 

A redirect can use any status code in the 300-399 range, and the square brackets surrounding the flags are REQUIRED.

The L flag tells Apache that this is the last rule that it needs to 
process: when the rule matches, the rules after it are skipped. 
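
The way the R and L flags interact can be sketched with a small Python model. This is a deliberately simplified illustration, not how Apache is implemented: the RULES list, the apply_rules and expand helpers, and the second rule (old/ to new/) are all hypothetical names invented for this example.

```python
import re

DEFAULT_REDIRECT_STATUS = 302  # used when R is given without a code, as in [R,L]

# Each rule: (pattern, substitution, flags). "R": None models a bare R flag.
RULES = [
    (r"^page1\.html$", "page2.html", {"R": 301, "L": True}),
    (r"^old/(.*)$", "new/$1", {"R": None, "L": True}),  # hypothetical extra rule
]


def expand(substitution, match):
    """Replace $1, $2, ... in the substitution with the captured groups."""
    return re.sub(r"\$(\d)", lambda m: match.group(int(m.group(1))), substitution)


def apply_rules(path, rules=RULES):
    """Return (status, path); status is None when no redirect is issued."""
    status = None
    for pattern, substitution, flags in rules:
        match = re.search(pattern, path)
        if match is None:
            continue  # rule does not apply, try the next one
        path = expand(substitution, match)
        if "R" in flags:
            status = flags["R"] or DEFAULT_REDIRECT_STATUS
        if flags.get("L"):
            break  # L flag: stop processing later rules
    return status, path


print(apply_rules("page1.html"))   # (301, 'page2.html')
print(apply_rules("old/a.html"))   # (302, 'new/a.html')
print(apply_rules("other.html"))   # (None, 'other.html')
```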

So this is what mod-rewrite and regex do, huh?

Indeed, it solves many redirect problems with a cleaner, more customizable way to access web pages, but the crux of mod-rewrite is security. 

In the above example, the pattern ^page1\.html$ sends only a request for exactly page1.html to page2.html, and nothing else.
When a rule rewrites internally (without the R flag), the visitor only ever sees the public name, so the real file stays encapsulated behind another layer of security.
By using mod-rewrite, developers can limit which incoming URLs are accepted by defining, in the regex, the specific data type each parameter must have (digits only, for example). 
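
As a rough illustration of restricting a parameter to one data type, the sketch below models a rule that accepts only numeric ids. The rule, the product.php script name and the rewrite_product helper are assumptions invented for this example, and Python's re module again stands in for Apache's regex engine.

```python
import re

# Hypothetical rule being modelled:
#   RewriteRule ^product/([0-9]+)$ product.php?id=$1 [L]
PRODUCT_RULE = re.compile(r"^product/([0-9]+)$")


def rewrite_product(path):
    """Return the internal URL, or None when the path does not match the rule."""
    match = PRODUCT_RULE.search(path)
    if match is None:
        return None  # anything that is not digits never reaches the script
    return "product.php?id=" + match.group(1)


print(rewrite_product("product/42"))       # product.php?id=42
print(rewrite_product("product/abc"))      # None
print(rewrite_product("product/42;drop"))  # None
```

Because the id group is [0-9]+ and the pattern is anchored on both sides, a request carrying letters or extra characters simply does not match the rule.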


There is a whole range of other things that you can do with mod-rewrite. 
More references:
http://etext.lib.virginia.edu/services/helpsheets/unix/regex.html
http://gnosis.cx/publish/programming/regular_expressions.html
http://httpd.apache.org/docs/current/rewrite/


Some Quick References:
---------------------------------------------------------
Patterns ("wildcards") are matched against a string. Special characters:

    . (full stop) - match any character
    * (asterisk) - match zero or more of the previous symbol
    + (plus) - match one or more of the previous symbol
    ? (question) - match zero or one of the previous symbol
    \? (backslash-something) - match a special character literally
    ^ (caret) - match the start of a string
    $ (dollar) - match the end of a string
    [set] - match any one of the symbols inside the square brackets.
    [^set] - match any symbol that is NOT inside the square brackets.
    (pattern) - grouping, remember what the pattern matched as a special variable ($1, $2, ...)
    {n,m} - match the previous symbol from n to m times (m can be omitted to mean >=n times)
    (?!expression) - negative lookahead: match only if expression does NOT match at the current position. Example: "^(/(?!(favicon.ico$|js/|images/)).*)" => "/fgci/$1"
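
The negative lookahead example can be tried out with a short Python sketch. Python's re module is used here as a stand-in for Apache's regex engine (an assumption for illustration), and the is_rewritten helper and the sample paths are hypothetical.

```python
import re

# Pattern copied from the example above: match every path except
# /favicon.ico and anything under /js/ or /images/.
LOOKAHEAD = re.compile(r"^(/(?!(favicon.ico$|js/|images/)).*)")


def is_rewritten(path):
    """Return True if the rule would apply to this path."""
    return LOOKAHEAD.search(path) is not None


for path in ["/index.html", "/favicon.ico", "/js/app.js", "/images/logo.png"]:
    print(path, is_rewritten(path))

# Output:
# /index.html True
# /favicon.ico False
# /js/app.js False
# /images/logo.png False
```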
    
----------------------------------------------------------

[abc]     A single character: a, b or c
[^abc]     Any single character but a, b, or c
[a-z]     Any single character in the range a-z
[a-zA-Z]     Any single character in the range a-z or A-Z
^     Start of line
$     End of line
\A     Start of string
\z     End of string
.     Any single character
\s     Any whitespace character
\S     Any non-whitespace character
\d     Any digit
\D     Any non-digit
\w     Any word character (letter, number, underscore)
\W     Any non-word character
\b     A word boundary (a position, not a character)
(...)     Capture everything enclosed
(a|b)     a or b
a?     Zero or one of a
a*     Zero or more of a
a+     One or more of a
a{3}     Exactly 3 of a
a{3,}     3 or more of a
a{3,6}     Between 3 and 6 of a
--------------------------------------------------------------------

Regular Expression  Matches
foo  The string "foo"
^foo  "foo" at the start of a string
foo$  "foo" at the end of a string
^foo$  "foo" when it is the whole string
[abc]  a, b, or c
[a-z]  Any lowercase letter
[^A-Z]  Any character that is not an uppercase letter
(gif|jpg)  Matches either "gif" or "jpg"
[a-z]+  One or more lowercase letters
[0-9\.\-]  Any digit, dot, or minus sign
^[a-zA-Z0-9_]{1,}$  Any word of at least one letter, number or _
([wx])([yz])  wy, wz, xy, or xz
[^A-Za-z0-9]  Any symbol (not a number or a letter)
([A-Z]{3}|[0-9]{4})  Matches three uppercase letters or four digits
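
A few of the entries above can be checked with a quick Python sketch. The sample strings and the check helper are made up for illustration, and Python's re module is assumed to behave like Apache's regex engine for these simple patterns.

```python
import re


def check(pattern, text):
    """Return True if the pattern matches somewhere in text."""
    return re.search(pattern, text) is not None


# (pattern, sample text, expected result)
EXAMPLES = [
    (r"^foo$", "foo", True),
    (r"^foo$", "foobar", False),
    (r"foo$", "barfoo", True),
    (r"(gif|jpg)", "photo.jpg", True),
    (r"(gif|jpg)", "photo.jpeg", False),
    (r"^[a-zA-Z0-9_]{1,}$", "user_01", True),
    (r"^[a-zA-Z0-9_]{1,}$", "user-01", False),
    (r"([A-Z]{3}|[0-9]{4})", "ABC", True),
    (r"([A-Z]{3}|[0-9]{4})", "12", False),
]

for pattern, text, expected in EXAMPLES:
    result = check(pattern, text)
    print(pattern, repr(text), result, "ok" if result == expected else "UNEXPECTED")
```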
--------------------------------------------------------------------------------
