apache mod-rewrite url-rewriting aem

Apache rewrite condition - directory check not working for trailing slash


Within our virtual host config, I am trying to use a RewriteCond to check if a trailing slash request is requesting a .html page that exists on our website. If so, 302 redirect it to the proper .html page. If not, provide 404.

Note: All pages on our site end in .html

This works: https://example.com/content/go/first-level/ (302 redirects to https://example.com/content/go/first-level.html)

This works: https://example.com/content/go/first-level/second-level (302 redirects to https://example.com/content/go/first-level/second-level.html)

This does not: https://example.com/content/go/first-level/second-level/third-level (returns a 404 and the URL remains https://example.com/content/go/first-level/second-level/third-level)

This is because https://example.com/content/go/first-level/second-level/third-level.html page is not actually a directory, so when I do my directory test, it fails. However, I don't think I can use the -f test, because my %{REQUEST_URI} contains the trailing slash, so appending .html to it would not produce the path of the .html file.

Notes: our site uses .html extensions, so the goal of the code below is to 302 redirect (will update to 301 later) trailing and non trailing slash URLs to .html pages and to 404 any non existent page requests with trailing and non trailing slashes.

# Handle requests to trailing slash if directory exists, add .html
# Fails for the last page in directory structure
RewriteCond %{REQUEST_URI} /content/go/.*
RewriteCond %{REQUEST_URI} /$
RewriteCond %{REQUEST_URI} !.*.json$
RewriteCond %{REQUEST_URI} !.*.sjson$
RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_URI} -d
RewriteRule ^(.*)/$ $1.html [L,R=302]

# Handle requests to trailing slash if directory does not exist, 404
RewriteCond %{REQUEST_URI} /content/go/.*
RewriteCond %{REQUEST_URI} /$
RewriteCond %{REQUEST_URI} !.*.json$
RewriteCond %{REQUEST_URI} !.*.sjson$
RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_URI} !-d
RewriteRule ^(.*?)$ $1 [L,R=404]

# Handle non trailing slash if page exists, add .html
# Working
RewriteCond %{REQUEST_URI} /content/go/.*
RewriteCond %{REQUEST_URI} !.*.json$
RewriteCond %{REQUEST_URI} !.*.sjson$
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI}\.html -f
RewriteRule ^(.*) $1.html [L,R=302]

Solution

  • "This works: https://example.com/content/go/first-level/ (302 redirects to https://example.com/content/go/first-level.html)"

    But why should this be dependent on whether /content/go/first-level/ exists as a directory and not whether the file first-level.html itself exists?

    "This is because https://example.com/content/go/first-level/second-level/third-level.html page is not actually a directory, so when I do my directory test, it fails."

    Presumably you mean ../third-level is not a directory, not ../third-level.html (the intended file target).

    (Aside: You should avoid having filesystem directories and files with the same basename when dealing with extension-less requests since there is an inherent conflict that can take additional steps to overcome.)

    # Handle requests to trailing slash if directory exists, add .html
    # Fails for the last page in directory structure
    RewriteCond %{REQUEST_URI} /content/go/.*
    RewriteCond %{REQUEST_URI} /$
    RewriteCond %{REQUEST_URI} !.*.json$
    RewriteCond %{REQUEST_URI} !.*.sjson$
    RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_URI} -d
    RewriteRule ^(.*)/$ $1.html [L,R=302]
    

    I'm not sure why you are doing a "directory test" at all, or why /content/go/first-level/second-level should have to map to a directory in order to redirect to /content/go/first-level/second-level.html (without actually checking that the .html file exists). Checking whether the request maps to a directory does not appear to be part of your stated requirements:

    "check if a trailing slash request is requesting a .html page that exists on our website. If so, 302 redirect it to the proper .html page. If not, provide 404."

    This would only seem to require 1 rule. Requests with or without a trailing slash can be handled by the same rule. You don't need a separate rule to trigger a 404, since that should happen by default.

    If I understand your question correctly, a request for /content/go/path/to/file (no trailing slash) or /content/go/path/to/file/ (with a trailing slash) should be 302 redirected to /content/go/path/to/file.html if that file exists.

    If /content/go/path/to/file maps to a directory then so be it. However, if /content/go/path/to/file.html exists then the redirect to that file will take priority.

    I'm assuming this is to be used directly in the <VirtualHost> container and not in a <Directory> section inside that virtual host.

    As mentioned this only requires 1 rule, for example:

    # Redirect to ".html" file if it exists (handles optional trailing slash)
    RewriteCond %{REQUEST_URI} !\.(html|json|sjson)$
    RewriteCond %{DOCUMENT_ROOT}$1.html -f
    RewriteRule ^(/content/go/.+?)/?$ $1.html [R=302,L]
    

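    Explanation:

      • The RewriteRule pattern ^(/content/go/.+?)/?$ is processed first. It matches any URL-path below /content/go/, with or without a trailing slash, and captures everything except that optional trailing slash in the $1 backreference.
      • The first condition skips requests that already end in .html, .json or .sjson, so no filesystem check is performed for these.
      • The second condition checks that $1 plus the .html extension maps to an existing file below the document root.
      • Only when both conditions pass is the request 302 redirected to $1.html.

    Any request to /content/go/path/to/file that does not map to a .html file and does not map to a directory will naturally result in a 404. If it does map to a directory then you'll get a 403, unless you have a directory index document to handle the request.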
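
    The rule above assumes a virtual host context. As a rough sketch only, if it were instead placed in the <Directory> section or .htaccess file for the document root (an assumption, not your stated setup), the URL-path matched by the RewriteRule pattern would have no slash prefix, so the pattern, the filesystem check and the substitution would need adjusting along these lines:

    # Per-directory variant (assumes the document root's <Directory> section or .htaccess)
    # $1 has no leading slash here, so one is added in the TestString and the substitution
    RewriteCond %{REQUEST_URI} !\.(html|json|sjson)$
    RewriteCond %{DOCUMENT_ROOT}/$1.html -f
    RewriteRule ^(content/go/.+?)/?$ /$1.html [R=302,L]

    The REQUEST_URI server variable still contains the full URL-path (with the slash prefix) in this context, so the first condition is unchanged.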


    A quick look at your rules:

    # Handle requests to trailing slash if directory exists, add .html
    # Fails for the last page in directory structure
    RewriteCond %{REQUEST_URI} /content/go/.*
    RewriteCond %{REQUEST_URI} /$
    RewriteCond %{REQUEST_URI} !.*.json$
    RewriteCond %{REQUEST_URI} !.*.sjson$
    RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_URI} -d
    RewriteRule ^(.*)/$ $1.html [L,R=302]
    

    The check for /content/go/ should be handled by the RewriteRule pattern. The .* at the end of this pattern is superfluous (an unanchored regex matches anywhere in the string by default). There is no anchor at the start of the regex, so this matches /content/go/ anywhere in the URL-path.

    The second condition that checks for the trailing slash is superfluous since you have already ascertained that a trailing slash is present in the RewriteRule pattern.

    !.*.json$ - The literal dot should be backslash-escaped and the .* is superfluous (as mentioned above). However, these two checks are superfluous since you have already ascertained that the requested URL ends with a slash, so these two negated conditions will always be successful.

    The REQUEST_URI server variable includes the slash prefix, so the expression %{DOCUMENT_ROOT}/%{REQUEST_URI} will result in a double slash when these variables are expanded. This double slash will ultimately be resolved away when the filesystem check occurs, so it should still "work". However, you have correctly omitted the slash separator in the last/3rd rule.

    However, as mentioned at the top of my answer, I don't see why you would "blindly" redirect to a .html file (which may or may not exist) if the original request happens to map to a directory?
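
    Purely to illustrate the points above (this still performs the directory check questioned here), the first rule could be reduced to something like the following. This is a sketch, and note that it anchors /content/go/ at the start of the URL-path, which the original condition did not:

    # Prefix and trailing-slash checks moved into the RewriteRule pattern
    # No slash between DOCUMENT_ROOT and REQUEST_URI (avoids the double slash)
    RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} -d
    RewriteRule ^(/content/go/.+)/$ $1.html [R=302,L]

    The two negated .json/.sjson conditions are dropped because the pattern only matches URLs that end with a slash.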

    # Handle requests to trailing slash if directory does not exist, 404
    RewriteCond %{REQUEST_URI} /content/go/.*
    RewriteCond %{REQUEST_URI} /$
    RewriteCond %{REQUEST_URI} !.*.json$
    RewriteCond %{REQUEST_URI} !.*.sjson$
    RewriteCond %{DOCUMENT_ROOT}/%{REQUEST_URI} !-d
    RewriteRule ^(.*?)$ $1 [L,R=404]
    

    When you trigger a 404 (i.e. R=404) then the substitution string (2nd argument to the RewriteRule directive) is ignored. In this case, you should simply use - (hyphen) as the substitution to explicitly indicate "no substitution".

    (.*?) - the non-greedy capture is not serving any purpose here. .* would do the same (and is arguably more efficient).

    However, I'm not sure why you need to trigger a 404 for any request with a trailing slash that does not map to a directory? This will happen by default. (But you presumably want to serve the corresponding .html file in this scenario?)
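
    If an explicit 404 were still wanted for some reason, a minimal sketch of that rule using the hyphen substitution might look like this. The L flag is omitted because a status code outside the 3xx range stops rewriting as if L were present:

    # Explicit 404 for trailing-slash requests that do not map to a directory
    # (normally unnecessary, since the 404 happens by default)
    RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI} !-d
    RewriteRule ^/content/go/.+/$ - [R=404]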

    # Handle non trailing slash if page exists, add .html
    # Working
    RewriteCond %{REQUEST_URI} /content/go/.*
    RewriteCond %{REQUEST_URI} !.*.json$
    RewriteCond %{REQUEST_URI} !.*.sjson$
    RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI}\.html -f
    RewriteRule ^(.*) $1.html [L,R=302]
    

    But URLs with a trailing slash that don't map to a directory, but do map to a .html file, will trigger a 404 (by the 2nd rule above)?

    No need to backslash-escape the literal dot in the TestString on that last condition since this carries no special meaning here.

    All canonical requests for /content/go/<whatever>.html will also be unnecessarily processed by this rule, which will ultimately fail (unless you happen to have files with a double html extension). You should exclude requests that already end with .html before the filesystem check. (Filesystem checks are relatively expensive so should be avoided where possible.)
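
    As a sketch of that last point, the third rule could exclude .html (along with the JSON extensions) in a single regex condition placed before the filesystem check, so that canonical requests never reach the -f test:

    # Cheap regex exclusion first, then the more expensive filesystem check
    RewriteCond %{REQUEST_URI} !\.(html|json|sjson)$
    RewriteCond %{DOCUMENT_ROOT}%{REQUEST_URI}.html -f
    RewriteRule ^(/content/go/.+)$ $1.html [R=302,L]

    Conditions are evaluated in order and processing stops at the first one that fails, which is why the ordering matters here.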