Tags: javascript, angular, amazon-s3, unicode, unicode-normalization

Accessing S3 files with multibyte Unicode filenames using TypeScript (JavaScript) from the browser


I am updating an Angular web application to play spoken-language audio files retrieved from an AWS S3 bucket. Many of the files in the S3 bucket have (or will have) multibyte Unicode filenames, because the application will support global users. AWS S3 encodes the filename in a manner that I cannot easily replicate in the browser. A backend Lambda function sends the filename of the audio file to retrieve, and the Angular application then instantiates an HTMLAudioElement, which sends an HTTP GET request to S3.
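For reference, the playback path is essentially the following (a minimal sketch; the bucket URL and variable names are illustrative, not the application's actual identifiers):

    // Minimal sketch of the playback path described above
    const bucketBaseUrl = 'https://example-bucket.s3.amazonaws.com'; // illustrative
    const filename = 'Godzilla [Blue Öyster Cult] instrumental #12.wav'; // as returned by the Lambda
    const audio = new Audio();                  // HTMLAudioElement
    audio.src = `${bucketBaseUrl}/${filename}`;
    audio.load();                               // issues the HTTP GET to S3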

A. Filename as on Windows (before upload to S3):
Godzilla [Blue Öyster Cult] instrumental #12.wav

B. Filename as ingested with the S3 console:
Godzilla %5BBlue %C3%96yster Cult%5D instrumental %2312.wav

C. Filename as shown in the S3 console (download link):
Godzilla+%5BBlue+%C3%96yster+Cult%5D+instrumental+%2312.wav

D. Filename as returned from the application backend Lambda (same as A):
Godzilla [Blue Öyster Cult] instrumental #12.wav

E. Filename as sent in the browser's HTTP GET when audio.load() runs:
Godzilla+[Blue+O%CC%88yster+Cult]+instrumental+%2312.wav

Note: D and E (above) were captured using the browser's network developer tools

The file was uploaded to S3 via the S3 console from Windows. The filename saved in the backend RDS database matches the Windows filename, and because the Lambda retrieves the filename from RDS, it returns the Windows filename to the Angular UI in the browser. The browser's audio.load() encodes the multibyte character differently from S3: the browser sends O%CC%88yster ("O" followed by U+0308 COMBINING DIAERESIS, the decomposed NFD form), while S3 stored %C3%96yster (the precomposed "Ö", U+00D6, the composed NFC form). It looks like the browser encodes the base letter and the combining accent as separate characters, whereas S3 uses the single precomposed character.
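The difference can be reproduced directly in the browser console with the standard normalize() and encodeURIComponent() functions:

    const nfc = 'Ö'.normalize('NFC');      // single code point U+00D6
    const nfd = 'Ö'.normalize('NFD');      // 'O' + U+0308 COMBINING DIAERESIS
    console.log(encodeURIComponent(nfc));  // "%C3%96"  -- what S3 stored
    console.log(encodeURIComponent(nfd));  // "O%CC%88" -- what the browser sent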

I am not allowed to strip the multibyte characters in favor of an ASCII character set. I'm looking for a way, without hand-coding mappings for every possible multibyte-character conversion and ideally without adding a new dependency, to "convince" the browser to behave the same way S3 does. I guess the question boils down to: what is S3's logic for converting multibyte characters in filenames? Can anyone offer an approach to achieve this?

Note: The Angular application already has logic to handle the typical S3 special character cases correctly. This question is just focused upon international character sets.


Solution

  • The approach below uses normalize('NFC') to translate each Unicode character into its canonical composed form (the form S3 used when it stored the key) and encodeURIComponent() to apply UTF-8 percent-encoding without altering the / and : characters, while changing space characters to + characters. The source URL is split() using /, :, and space as delimiters; the delimiters are included in the output of split(), and every segment that is not a delimiter has normalize('NFC') and encodeURIComponent() applied.

    // Split on '/', ':' and ' ' while keeping the delimiters as segments.
    const splits = '[/: ]';
    const splitter = new RegExp(`(?=${splits})|(?<=${splits})`, 'g');

    export function s3Url(url: string): string {
      return url.split(splitter)
        .map((segment: string) =>
          segment === ' ' ? '+'                           // S3 uses '+' for spaces
          : segment === '/' || segment === ':' ? segment  // preserve URL structure
          : encodeURIComponent(segment.normalize('NFC'))) // compose, then UTF-8 encode
        .join('');
    }
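
For example, running the URL from the question through s3Url() (with a hypothetical bucket host) reproduces the encoding shown in the S3 console download link:

    const url = 'https://example-bucket.s3.amazonaws.com/Godzilla [Blue Öyster Cult] instrumental #12.wav';
    console.log(s3Url(url));
    // https://example-bucket.s3.amazonaws.com/Godzilla+%5BBlue+%C3%96yster+Cult%5D+instrumental+%2312.wav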