I am working on a website that lets businesses store a description. The problem I am running into is that when text is copied and pasted from Microsoft Word and a few other places, the strings come back with different characters from the originals. I do not have the best understanding of how UTF-8 works, but I thought it was supposed to handle this.
My question is this: am I incorrect in thinking that UTF-8 will handle characters from Word? If so, what is the proper way to accomplish this?
We have
<?xml version="1.0" encoding="UTF-8"?>
at the top of every page.
The characters have already been changed by the time they reach the database, and they are saved as the wrong characters. I have done a decent amount of searching and haven't come to a good conclusion about why they are being changed. A few examples of characters being switched:
From Word:
– changed to â
From a client's website:
’ to â€™
‘ to â€˜
I would like users to be able to copy text from almost anywhere and have it stored and displayed correctly. How would you recommend doing that?
SOLVED! The problem turned out to be my web.xml configuration: I was not forcing the web application to use Spring's UTF-8 encoding filter. Thank you for the help. The solution (if using Spring) was as follows:
Spring configuration:
```xml
<filter>
    <filter-name>encodingFilter</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
    <init-param>
        <param-name>forceEncoding</param-name>
        <param-value>true</param-value>
    </init-param>
</filter>
<filter-mapping>
    <filter-name>encodingFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
```
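For anyone wondering why this filter makes a difference: the Servlet specification has containers fall back to ISO-8859-1 for request data when the client does not declare a charset, and `CharacterEncodingFilter` overrides that by setting the request encoding before parameters are parsed. The sketch below does not use Spring or a servlet container. It only simulates the decoding step with `java.net.URLDecoder`, and the class and method names (`FormDecodingDemo`, `decodeParam`) are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class FormDecodingDemo {
    // Hypothetical helper: decode a percent-encoded form value with the given charset name.
    public static String decodeParam(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // A browser on a UTF-8 page submits the right single quote (U+2019) as %E2%80%99.
        String raw = "%E2%80%99";

        // Decoded with the fallback charset, each byte becomes its own character.
        String wrong = decodeParam(raw, "ISO-8859-1");
        // Decoded as UTF-8, the three bytes form the single original character.
        String right = decodeParam(raw, "UTF-8");

        System.out.println(wrong.length()); // prints 3
        System.out.println(right.length()); // prints 1
    }
}
```

With ISO-8859-1 the three characters are U+00E2 (â) followed by two invisible control characters, which may be why only "â" showed up for the text pasted from Word.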
This happens when text is converted to bytes using UTF-8 and those bytes are then read back using a single-byte encoding such as ISO-8859-1 or Windows-1252.
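A minimal sketch of that mechanism, in case it helps to reproduce it in isolation. `MojibakeDemo` and `misdecode` are hypothetical names, and the example assumes the JVM provides the `windows-1252` charset (standard JDKs do). Code points are printed as hex so the result does not depend on the console's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: encode text as UTF-8, then (incorrectly) decode the bytes with another charset.
    public static String misdecode(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // En dash, left single quote, right single quote: the characters from the question.
        String[] samples = {"\u2013", "\u2018", "\u2019"};

        for (String sample : samples) {
            StringBuilder hex = new StringBuilder();
            for (char c : misdecode(sample, cp1252).toCharArray()) {
                hex.append(String.format("U+%04X ", (int) c));
            }
            // For the right single quote this prints: U+00E2 U+20AC U+2122
            System.out.println(hex.toString().trim());
        }
    }
}
```

Each of these characters takes three bytes in UTF-8, so a single-byte decoder turns one character into three, always starting with â (byte 0xE2).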
You need to find out where in your code that happens and fix it to read the bytes as UTF-8.
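As a rough illustration of what "read the bytes as UTF-8" looks like in Java: pass the charset explicitly wherever bytes become characters, instead of relying on a platform or container default. The helper below (`readLineUtf8`, a made-up name) uses an in-memory stream as a stand-in for whatever your real source is, such as a request body, file, or socket.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetRead {
    // Hypothetical helper: read the first line of a stream, always decoding the bytes as UTF-8.
    public static String readLineUtf8(InputStream in) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for bytes arriving from a browser: "It’s 9–5" encoded as UTF-8.
        String original = "It\u2019s 9\u20135";
        byte[] stored = original.getBytes(StandardCharsets.UTF_8);

        String text = readLineUtf8(new ByteArrayInputStream(stored));
        System.out.println(text.equals(original)); // prints true
    }
}
```

The same idea applies to `new String(bytes, charset)` and `String.getBytes(charset)`: the overloads without a charset argument are the usual places where a default encoding sneaks in.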