python python-3.x utf-8 exec shell-exec

Issue reading Portuguese text file with the "ê" character (as in "você") in Python


I have a text file in Portuguese that I created using PHP, which contains sentences with the character "ê" (e with circumflex accent). I'm trying to read this file in Python, but I'm encountering issues specifically with the "ê" character. I have ensured that both the PHP file and Python script are using the UTF-8 encoding.

The Python script works fine in the terminal, but when I call it from PHP with exec() or shell_exec(), Python fails on the text file content and prints this error:

'ascii' codec can't encode character '\xea' in position 6: ordinal not in range(128)

What could be causing this issue and how can I resolve it?

I have already tried the following steps:

  1. Saving the PHP file with UTF-8 encoding.
  2. Specifying the UTF-8 encoding explicitly when opening the file in Python.
  3. Verifying that the default encoding in Python is set to UTF-8.

operating system: Linux

Python default encoding: utf-8

text file content:

Se você tem 1 laranja e 1 limão faça esse delicioso bolo!

Python code:

filename = "newfile.txt"
with open(filename, "r", encoding="utf-8") as file:
    # Read the first line of the text file
    file_content = file.readline().strip()
    print(file_content)

Terminal output:

When run from the terminal, the script prints the line correctly.

php file code:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>read python</title>
</head>
<body>
<?php 
$pythonScript = "read.py";
$command = "python3 " . $pythonScript;
$output = shell_exec($command); 
echo $output; 
?>
</body>
</html>

I appreciate any insights or suggestions on how to handle this issue. Thank you!


Solution

  • The initial \ufeff in the ASCII representation of the string is the byte order mark (BOM), a character sometimes used as a signature at the start of a UTF-8 file. Use encoding='utf-8-sig' when opening the file to remove it. The rest of the string is correct, so the problem is the encoding of the display, not Python: if your terminal isn't configured for UTF-8, it will mis-decode the result. On Windows with Python 3.11 in the command prompt, a string with that content prints correctly: Se você tem 1 laranja e 1 limão faça esse delicioso bolo!

    @MarkTolonen is right: the terminal was not configured for UTF-8. I set a UTF-8 locale before calling the exec() function in PHP, and now it works.

    PHP code:

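    // Use a UTF-8 locale so the Python child process does not fall back to ASCII output.
    $locale = 'pt_BR.UTF-8';
    setlocale(LC_ALL, $locale);
    putenv('LC_ALL=' . $locale); // inherited by processes started with exec()/shell_exec()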
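
For anyone who would rather fix this on the Python side, here is a minimal sketch of a hardened read.py. It combines the two points above: opening the file with encoding='utf-8-sig' to drop a possible BOM, and forcing UTF-8 on stdout so print() does not depend on the locale inherited from PHP. The helper names read_first_line and force_utf8_stdout are made up for illustration, the temporary file only stands in for newfile.txt, and sys.stdout.reconfigure() assumes Python 3.7 or newer.

```python
import os
import sys
import tempfile


def read_first_line(path):
    """Return the first line of a UTF-8 text file, dropping a leading BOM if present."""
    # "utf-8-sig" also decodes plain UTF-8; it only strips "\ufeff" at the very start.
    with open(path, "r", encoding="utf-8-sig") as handle:
        return handle.readline().strip()


def force_utf8_stdout():
    """Make print() emit UTF-8 even if the inherited locale is ASCII-only (Python 3.7+)."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")


# Demo: a temporary file stands in for newfile.txt and is written with a BOM on purpose.
sample = "Se você tem 1 laranja e 1 limão faça esse delicioso bolo!"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8-sig", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample + "\n")
    sample_path = tmp.name

first_line = read_first_line(sample_path)
os.remove(sample_path)

# Why the original script failed under PHP: "ê" has no ASCII encoding.
try:
    first_line.encode("ascii")
except UnicodeEncodeError as exc:
    ascii_error_position = exc.start  # index of "ê" in the line

force_utf8_stdout()
print(first_line)
```

The failing index recorded in ascii_error_position is that of the "ê" in "você", the same character the original error message complains about. Another option that should work without touching the script is exporting PYTHONIOENCODING=utf-8 from PHP with putenv() before calling shell_exec().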