apache mod-speling

Apache mod_speling falsely "correcting" URLs?


I've been tasked with moving an old dynamic website from a Windows server to Linux. The site was initially written with no regard to character case. Some filenames were all upper-case, some lower-case, and some mixed. This was never a problem in Windows, of course, but now we're moving to a case-sensitive file system.

A quick find/rename command (thanks to another tutorial) got the filenames to all lowercase.
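
For readers following along, a rename of that kind can be sketched roughly as below. This is not the command from the tutorial mentioned above, only one plausible shape for it; `lowercase_tree` is a hypothetical helper name, and the sketch assumes ASCII-only case differences and no newlines in file names.

```shell
# Rename every file and directory below a root to its lowercase form.
# (lowercase_tree is a hypothetical name used for illustration.)
lowercase_tree() {
    root="$1"
    # -depth lists the contents of a directory before the directory itself,
    # so a directory is renamed only after everything inside it is done.
    find "$root" -depth | while IFS= read -r path; do
        # Leave the starting directory itself untouched.
        [ "$path" = "$root" ] && continue
        dir=$(dirname "$path")
        base=$(basename "$path")
        lower=$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]')
        if [ "$base" != "$lower" ]; then
            if [ -e "$dir/$lower" ]; then
                # Two names that differ only in case: do not overwrite.
                echo "skipping $path: $dir/$lower already exists" >&2
            else
                mv -- "$path" "$dir/$lower"
            fi
        fi
    done
}
```

Calling `lowercase_tree /path/to/docroot` (a placeholder path) would walk that tree; names that collide once lowercased are reported and left alone rather than overwritten.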

However, many of the URL references in the code still point to these mixed-case filenames, so I enabled mod_speling to overcome this issue. It seems to work OK for the most part, with the exception of one page: I have a file named haematobium.html, and every time a link points to .../haematobium.html, the URL gets rewritten as .../hæmatobium.html in the browser.

I don't know how this strange character made its way into the filename in the first place, but I've corrected the code in the HTML document to now link to haematobium.html, then renamed the haematobium.html file itself to match.

When requesting .../haematobium.html in Chrome, it "corrects" to .../hæmatobium.html in the address bar, and shows an error saying "The requested URL .../hæmatobium.html was not found on this server."

In IE9, I'm prompted for the login (this is a .htaccess-protected page), I enter it, and then it forwards the URL to .../h%C3%A6matobium.html, which again doesn't load.

In my frustration, I even copied haematobium.html to both hæmatobium.html and hæmatobium.html; still, none of the three pages actually load.

So my question: I read somewhere that mod_speling tries to "learn" misspelled URLs. Does it actually rename files (is that where the odd character might have come from)? Does it keep a cache of what's been called for, and what it was forwarded to (a cache I could clear)?

PS. there are also many mixed-case references to MySQL database tables and fields, but that's a whole 'nother nightmare.


Solution

  • [Cannot comment yet, therefore answering.]

    Your question doesn't make it entirely clear which of the two names (two characters ae [ASCII], or one ligature character æ [Unicode]) for haematobium.html actually exists in your Apache's file system.

    Try the following in your shell:

    $ echo -n h*matobium.html | hd
    

    The output should be one of the following two alternatives. This is ASCII, with 61 and 65 for a and e, respectively:

    00000000  68 61 65 6d 61 74 6f 62  69 75 6d 2e 68 74 6d 6c  |haematobium.html|
    00000010
    

    And this is Unicode (encoded as UTF-8), with the two bytes c3 a6 for the single character æ:

    00000000  68 c3 a6 6d 61 74 6f 62  69 75 6d 2e 68 74 6d 6c  |h..matobium.html|
    00000010
    

    I would recommend using the ASCII version; it makes life considerably easier.
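
    If the file turns out to carry the ligature name, a rename along these lines might be one way to get to the ASCII version. It is only a sketch: `normalize_name` and the two variable names are hypothetical, and the ligature is spelled out as the UTF-8 bytes c3 a6 (octal 303 246) so that the script does not depend on how your terminal or editor encodes æ.

    ```shell
    # Build both candidate names from explicit bytes.
    ligature_name=$(printf 'h\303\246matobium.html')   # h, c3 a6, matobium.html
    ascii_name='haematobium.html'

    # Rename the ligature variant to the ASCII variant inside one directory.
    # (normalize_name is a hypothetical helper name.)
    normalize_name() {
        dir="$1"
        # Only act if the ligature file exists and the ASCII name is free.
        if [ -e "$dir/$ligature_name" ] && [ ! -e "$dir/$ascii_name" ]; then
            mv -- "$dir/$ligature_name" "$dir/$ascii_name"
        fi
    }
    ```

    Run it as `normalize_name .` from the directory that holds the file. If both variants already exist, nothing is renamed and you would have to decide by hand which copy to keep.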

    Now to your actual question. mod_speling neither "learns" misspelled URLs nor renames files, and it does not cache anything. Any caching is done either by your browsers or by proxies between your browsers and the server.

    It's actually best practice to test these cases with command-line tools like wget or curl, which should already be available or easily installable on any Linux distribution.

    Use wget -S or curl -i to actually see the response headers sent by your web server.
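
    As a rough sketch of such a check, the helper below prints only the status line and any Location header, which is usually enough to tell whether the redirect comes from the server or from a browser cache. `show_redirect` is a hypothetical name chosen here; the curl options used (-s, -S, -o, -D) are standard ones.

    ```shell
    # Show the status line and any Location header for a URL.
    # (show_redirect is a hypothetical helper name.)
    show_redirect() {
        # -sS: no progress meter, but still report errors
        # -o /dev/null: discard the response body
        # -D -: dump the response headers to standard output
        curl -sS -o /dev/null -D - "$1" | grep -i -E '^(HTTP/|Location:)'
    }
    ```

    Call it with the full URL of the page, for example `show_redirect 'http://your-server/.../haematobium.html'` with your real host and path filled in. Because the page is protected via .htaccess, an unauthenticated request will most likely return 401; in that case credentials can be passed to curl directly with its -u option.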