For example, I have a bunch of HTML pages like this one:
<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>
</head><body
>
<!--l. 125--><div class="crosslinks"><p class="noindent">[<a
href="chapter1.html" >next</a>] [<a
href="#tailcontent.html">tail</a>] [<a
href="/sciences/index.html" >up</a>] </p></div>
<h2 class="likechapterHead"><a
id="x2-1000"></a>Table des matières</h2>
<div class="tableofcontents">
But I cannot convert the French accents in these HTML pages to HTML entities. For example, in "Table des matières" above, the literal "è" appears instead of "&egrave;".
I tried two things:
for i in $(ls *.html); do iconv -f iso-8859-1 -t utf8 $i > $i"_new"; mv -f $i"_new" $i; done
=> the accents are not converted
for i in $(ls *.html); do recode ..html $i; done
=> I get the following errors:
recode: section5.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section6.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section7.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section8.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: section9.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
...
I don't know how to convert all these French accents.
Does anyone have an idea or suggestion for converting all possible French accents? I would like to use the iconv, recode, or sed commands.
UPDATE 1: taking a basic example, here is the message I get for a single file:
$ recode ..html table_of_contents.html
recode: table_of_contents.html failed: Invalid input in step `CHAR..ISO-10646-UCS-2'
What's wrong?
UPDATE 2: here is what file reports for one of my original HTML pages:
$ file -i index.html
index.html: text/x-tex; charset=iso-8859-1
and here is the head of index.html:
<!DOCTYPE html>
<html>
<head><title>Table des matières</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
If I apply the command:
$ recode -vfd u8..html index.html
Request: UTF-8..:libiconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
and the head of the file becomes:
<!DOCTYPE html>
<html>
<head><title>Table des matires</title>
<meta http-equiv="Content-Type" content="text/html; charset="utf-8"" />
<meta name="generator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<meta name="originator" content="TeX4ht (http://www.tug.org/tex4ht/)" />
<!-- 3,html,xhtml,charset="utf-8" -->
<meta name="src" content="content_final.tex" />
<link rel="stylesheet" type="text/css" href="content_final.css" />
<script type="text/javascript" src="./jquery.js">
</script>
<script type="text/javascript">
$(document).ready(function() {
function capitalizeFirstLetter(string) {
return string.charAt(0).toUpperCase() + string.slice(1).toLowerCase();
}
$('div.caption span.id').each(function() { var result = $(this).text().replace(':','');
result=capitalizeFirstLetter(result);
$(this).text(result);
});
});
</script>
As you can see, the "è" has disappeared.
What can I do?
Assuming the source file is encoded in UTF-8, the following command worked in my environment:
$ recode -vfd u8..html index.html
Output:
$ locale charmap
UTF-8
$ file -i index.html
index.html: text/html; charset=utf-8
$ recode -vfd u8..html index.html
Request: UTF-8..:iconv:..ISO-10646-UCS-2..HTML_4.0
Shrunk to: UTF-8..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
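The UTF-8 assumption can be checked before recoding. The following is a minimal sketch, not part of the original commands: the helper name is_utf8 and the temporary sample file are made up for illustration, and it assumes an iconv that rejects invalid input (as GNU iconv does). It shows that a Latin-1 "è" (the single byte 0xE8) is not valid UTF-8, which is probably why a forced u8..html conversion drops the character instead of converting it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the file decodes cleanly as UTF-8.
# Note that pure ASCII files also pass, since ASCII is a subset of UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Made-up sample: "matières" in ISO-8859-1, where "è" is the byte 0xE8 (octal 350).
sample=$(mktemp)
printf 'mati\350res\n' > "$sample"

if is_utf8 "$sample"; then
    echo "sample is valid UTF-8"
else
    echo "sample is not valid UTF-8"
fi

# Declaring the real source charset lets iconv decode the byte;
# the hex dump shows how 0xE8 is re-encoded in UTF-8.
iconv -f ISO-8859-1 -t UTF-8 "$sample" | od -An -tx1

rm -f "$sample"
```

If the check fails for a page, the source charset given to recode should be the file's real encoding rather than u8.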
You can use the command options to debug the error:
-v
Verbose output. Useful for finding the step in which the error occurred.
-f
Forces completion even if an error occurs. You can compare the output file with the original to figure out which character or location is causing trouble.
-d
For HTML, recode does not convert ASCII characters. This avoids converting < > " & and other HTML markup characters.

Update: if the encoding/charset is iso-8859-1, then you need to use:
$ recode -vfd iso-8859-1..html index.html
Request: ISO-8859-1..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
# Or use the following:
$ recode -vfd lat1..html index.html
Request: ISO-8859-1..ISO-10646-UCS-2..HTML_4.0
Recoding index.html... done
ISO-8859-1 has the following aliases in recode:
l1
lat1
latin1
Latin-1
819/CR-LF
CP819/CR-LF
CSISOLATIN1
IBM819/CR-LF
ISO8859-1
iso-ir-100
ISO_8859-1
ISO_8859-1:1987
You can use any one of the above in the command.
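To apply this to a whole directory of pages, the source charset can be chosen per file. This is only a sketch under stated assumptions: the function names recode_charset_for and convert_html_accents are hypothetical, it relies on file -b --mime-encoding reporting either utf-8 or iso-8859-1, and it skips every other encoding rather than guessing. It also leaves out -f so that a failing file is reported instead of being silently truncated.

```shell
#!/bin/sh
# Hypothetical helper: map the encoding reported by `file -b --mime-encoding`
# to a recode charset name. Returns 1 for encodings this sketch does not handle.
recode_charset_for() {
    case "$1" in
        utf-8)      echo "u8" ;;
        iso-8859-1) echo "lat1" ;;
        *)          return 1 ;;
    esac
}

# Hypothetical helper: recode every *.html file in directory $1
# (default: current directory) in place, using the detected source charset.
convert_html_accents() {
    dir=${1:-.}
    for f in "$dir"/*.html; do
        [ -f "$f" ] || continue              # the glob matched nothing
        enc=$(file -b --mime-encoding "$f")
        if charset=$(recode_charset_for "$enc"); then
            # -d leaves ASCII markup such as < > " & untouched
            recode -d "$charset..html" "$f"
        else
            echo "skipping $f (encoding: $enc)" >&2
        fi
    done
}
```

Calling convert_html_accents . after defining both functions would process the current directory. Working on a copy of the pages first is safer, because recode rewrites files in place.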