I understand the various reasons to use the HTML lang attribute for pages and content-within-pages of different languages,[1] but there are two specific cases where I'm not sure whether or not to use it:
Foreign proper nouns. For example, <span lang="en">Charles</span> would indicate a pronunciation of /t͡ʃɑɹlz/ whilst <span lang="fr">Charles</span> would indicate a pronunciation of /ʃaʁl/. But also, <span lang="en">Andrea</span> would indicate a female name whilst <span lang="it">Andrea</span> would indicate a male name (perhaps for some sort of automatic translation that had to determine the appropriate pronouns to use).
Loanwords. For example, 'café' can be written in English with or without an accent on the final 'e', but is always pronounced using French phonetics as /ˈkæfeɪ/. But normally 'cafe' in French would be pronounced as /kaf/ and in English as /keɪf/ (neither of which exists as a word in those languages). So presumably 'cafe/café' should be marked up as en (i.e. inherit the tag from the wider English text they appear in) to ensure both spellings produce the correct pronunciation.
Style guides differ on when loanwords should be styled differently (usually italicised), generally based on how common they are, so café is probably used commonly enough in English to not need highlighting. But many less-common foreign words are used in English whilst retaining their original pronunciation (e.g. this great example); should they be marked up to indicate the pronunciation rules they are following, in the same way one would mark up a phrase like <span lang="en">I'm French, so bad <i lang="fr">pain</i> causes me great pain</span>?
I've never seen instances of lang attributes for either case in the wild, even on sites that are generally good with their a11y/i18n markup, and I can't find any specific reference to either case online (WCAG, etc.).
I'm looking for source-supported answers for both:
1: see https://www.w3.org/International/questions/qa-lang-why.en and What is the 'lang' attribute of the <html> tag used for?
I'm a screen reader user, and I can answer on the practical side. When a screen reader encounters a lang attribute, it switches to a voice for that language to read the text, assuming one is available on the system. If there is no lang attribute, there's simply no voice switch, and the default voice is used.
So, for screen readers, having language information is always beneficial. It lets the screen reader use the correct voice for the intended language and accent. Not having it is a lot more problematic: it is common for a page in a foreign language to be read with the wrong voice because there is no lang attribute anywhere, for example English read with a French voice.
There is probably no big harm in having too many lang attributes rather than too few, except maybe a small computational cost while loading voices, and sometimes a slightly exaggerated accent. If there is no voice available for a given language on the system, or if the user has explicitly disabled automatic voice switching, nothing happens. It's as simple as that.
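To make the behaviour described above concrete, here is a minimal Python sketch of the decision a screen reader makes for each run of text. It is only a simplified model of what I observe as a user, not the API of any real screen reader: the function pick_voice, its parameters, and the voice names are all hypothetical, and falling back from a regional tag such as "fr-CA" to the primary language "fr" is an assumption on my part.

```python
from typing import Dict, Optional


def pick_voice(
    lang: Optional[str],
    installed: Dict[str, str],
    default_voice: str,
    auto_switch: bool = True,
) -> str:
    """Return the voice a hypothetical screen reader would use for a text run."""
    # No lang attribute, or the user disabled automatic switching:
    # there is no voice switch and the default voice is used.
    if not lang or not auto_switch:
        return default_voice
    tag = lang.lower()
    # Assumption: try the full tag first ("fr-ca"), then its primary subtag ("fr").
    for candidate in (tag, tag.split("-")[0]):
        if candidate in installed:
            return installed[candidate]
    # No voice installed for that language: nothing happens.
    return default_voice


# Hypothetical set of voices installed on the user's system.
installed_voices = {"en": "English voice", "fr": "French voice"}

print(pick_voice("fr", installed_voices, "English voice"))  # French voice
print(pick_voice(None, installed_voices, "English voice"))  # English voice
print(pick_voice("it", installed_voices, "English voice"))  # English voice
```

The third call shows why extra lang attributes are mostly harmless: with no Italian voice installed, the text is simply read with the default voice.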
So in all your examples, I encourage you to keep the lang attributes as you wrote them. When playing with words and languages, as in your example with "pain", it may even be crucial for understanding.
At best it will help understanding, at worst nothing will happen, and at the very worst the user may be a little annoyed by an exaggerated French pronunciation. In case of doubt, for examples such as "Charles" and "café", you can probably omit the lang attribute to avoid that effect, as both are common enough in French and English. There are probably no exact rules for deciding what is common enough. It's mostly a matter of common sense.
On the search engine side, as far as I know, Google largely ignores the lang attribute and prefers to make its own guess, probably because a lot of long-established websites don't give correct language information.