html non-ascii-characters iconv recode french

Convert all French accents into HTML character entities


For example, I have a bunch of HTML pages like this:

<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
 <script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>
</head><body
>
<!--l. 125--><div class="crosslinks"><p class="noindent">[<a
href="chapter1.html" >next</a>] [<a
href="#tailcontent.html">tail</a>] [<a
href="/sciences/index.html" >up</a>] </p></div>
<h2 class="likechapterHead"><a
 id="x2-1000"></a>Table des matières</h2>
<div class="tableofcontents">

However, I cannot convert the French accents in these pages into HTML entities. For example, in "Table des matières" above, the "è" still appears literally instead of as "&egrave;".

I tried two things:

  1. for i in $(ls *.html); do iconv -f iso-8859-1 -t utf8 $i > $i"_new"; mv -f $i"_new" $i; done

=> The accents are not converted.

  2. for i in $(ls *.html); do recode ..html $i; done

=> I get the following errors:

recode: section5.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section6.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section7.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section8.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section9.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
...

I don't know how to convert all these French accents.

Does anyone have an idea or suggestion for converting all possible French accents? I would like to use the iconv, recode, or sed commands.

UPDATE 1: Taking a basic example, here is the message I get for a single file:

$ recode ..html table_of_contents.html
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2' 

What's wrong?

UPDATE 2: Here is the output of file -i on one of my original HTML pages:

$ file -i index.html
index.html: text/x-tex; charset=iso-8859-1

and the head of index.html:

<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
 <script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);

If I apply the command:

$ recode -vfd u8..html index.html

Request: UTF-8..:libiconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done

the file becomes:

<!DOCTYPE html>
<html>
<head><title>Table des matires</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
 <script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>

As you can see, the "è" has disappeared.

What can I do?


Solution

  • Assuming the source file is encoded in UTF-8, the following command worked in my environment:

    $ recode -vfd u8..html index.html
    

    Output:

    $ locale charmap
    UTF-8
    
    $ file -i index.html
    index.html: text/html; charset=utf-8
    
    $ recode -vfd u8..html index.html
    Request: UTF-8..:iconv:..ISO-10646-UCS-2..HTML_4.0
    Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
    Recoding index.html... done
    

    The -v option makes recode print the sequence of recoding steps (the Request and Shrunk to lines above), which helps when debugging errors like yours.


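    Update: If the encoding/charset is ISO-8859-1, as file -i reports in your UPDATE 2, then u8 is the wrong source charset. That would explain the missing "è": the ISO-8859-1 byte for "è" is not valid UTF-8, and with -f recode drops it instead of failing. In that case you need to use: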

    $ recode -vfd iso-8859-1..html index.html
    Request: ISO-8859-1..ISO-10646-UCS-2..HTML_4.0
    Recoding index.html... done
    
    # Or use the following:
    
    $ recode -vfd lat1..html index.html
    Request: ISO-8859-1..ISO-10646-UCS-2..HTML_4.0
    Recoding index.html... done
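
    Since you have a whole directory of pages, the two cases above could be combined into one loop. The following is only a sketch under a stated assumption: any file that does not validate as UTF-8 is treated as ISO-8859-1. The helper name detect_charset is made up for this example. recode rewrites files in place, so try it on copies first.

    ```shell
    #!/bin/sh
    # Guess the recode source charset for one file.
    # Assumption: a file that is not valid UTF-8 is ISO-8859-1.
    detect_charset() {
        # iconv exits non-zero when it meets an invalid UTF-8 sequence.
        if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
            echo u8
        else
            echo lat1
        fi
    }

    # Recode every HTML file in the current directory, in place,
    # with the same -f and -d options as in the commands above.
    for f in ./*.html; do
        [ -e "$f" ] || continue    # the glob matched nothing
        recode -fd "$(detect_charset "$f")..html" "$f"
    done
    ```

    A pure-ASCII file also validates as UTF-8, which is harmless here because there is nothing in it to convert.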
    

    ISO-8859-1 has the following aliases in recode:

    l1 
    lat1
    latin1
    Latin-1
    819/CR-LF 
    CP819/CR-LF 
    CSISOLATIN1 
    IBM819/CR-LF 
    ISO8859-1 
    iso-ir-100 
    ISO_8859-1 
    ISO_8859-1:1987
    

    You can use any one of the above in the command.
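
Since the question also mentions iconv and sed, here is a rough alternative that avoids recode: convert the file to UTF-8 with iconv, then let sed replace each accented letter with its entity. The function name accents_to_entities is invented for this sketch. The list of letters is illustrative rather than exhaustive, and the script itself is assumed to be saved as UTF-8.

```shell
#!/bin/sh
# Replace common French accented letters (UTF-8 input on stdin)
# with the corresponding HTML entities. "\&" is a literal ampersand.
accents_to_entities() {
    sed -e 's/à/\&agrave;/g' -e 's/â/\&acirc;/g' \
        -e 's/ç/\&ccedil;/g' \
        -e 's/é/\&eacute;/g' -e 's/è/\&egrave;/g' \
        -e 's/ê/\&ecirc;/g'  -e 's/ë/\&euml;/g' \
        -e 's/î/\&icirc;/g'  -e 's/ï/\&iuml;/g' \
        -e 's/ô/\&ocirc;/g' \
        -e 's/ù/\&ugrave;/g' -e 's/û/\&ucirc;/g' -e 's/ü/\&uuml;/g' \
        -e 's/À/\&Agrave;/g' -e 's/Ç/\&Ccedil;/g' \
        -e 's/É/\&Eacute;/g' -e 's/È/\&Egrave;/g'
}

# Example: feed one UTF-8 line through the filter.
printf 'Table des mati\303\250res\n' | accents_to_entities
# → Table des mati&egrave;res
```

For an ISO-8859-1 file such as the one in UPDATE 2, the input would first go through iconv, for example: iconv -f ISO-8859-1 -t UTF-8 index.html | accents_to_entities > index.new.html. Unlike recode, this writes to a new file and leaves the original untouched.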