I've gone over this several times and tried different fixes, but I really wanted to find the definitive solution to this issue. And I believe I have.
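In most cases, the root of the problem is not defining which character set should be used to decode the data. When it's undefined, the browser will 'guess' which character set it should use, or fall back to a default setting. Many browsers and other software (XML parsers, for one) default to UTF-8 because it's the best combination of performance and flexibility. Unfortunately, UTF-8 doesn't store our beloved pound sign the same way ISO-8859-1 does: it's the single byte 0xA3 in ISO-8859-1, but the two bytes 0xC2 0xA3 in UTF-8. So if the document is saved in one encoding and decoded as the other, the £ comes out wrong.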
What really bugs me is that our friendly pound sign isn't a mark-up or control character, yet it can actually stop your XML document from being well-formed when its bytes don't match the encoding the parser expects. This tiny little misunderstanding can spiral into a big bad bug...
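To make that concrete, here is a minimal sketch in Python (my choice of language for illustration, nothing to do with the original set-up). It assumes the document was saved as ISO-8859-1; the `<price>` element and the `parses` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# "£5" saved as ISO-8859-1: the pound sign (U+00A3) becomes the single byte 0xA3.
latin1_body = "<price>\u00a35</price>".encode("iso-8859-1")


def parses(xml_bytes):
    """Return True if the bytes are well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


# No declaration: the parser assumes UTF-8 and rejects the lone 0xA3 byte.
undeclared_ok = parses(latin1_body)

# Same bytes, but with a declaration that matches how they were saved.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body
declared_ok = parses(declared)
price_text = ET.fromstring(declared).text
```

The only difference between the failing and the working document is the declaration on the first line.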
Enough chit-chat, as always, here's the grub:
The problem
User sees a replacement character (e.g. '?', '?') instead of £. This will happen when...
- You have not escaped special characters in you XML document
eg.: "&" instead of "& a m p ;" (without the spaces, silly blogger)
OR... - The wrong character set is being used to decode the document.
eg.: UTF-8 instead of ISO-8859-1
A fellow programmer friend of mine asked me today why I don't simply use &pound; everywhere. Well, it's really a matter of opinion, but the idea of this post is to find a general solution that will always work.
Problems with &pound;:
- You have to process all your data/input to find every £ and replace it with &pound;. It's simple, but this could be in so many places/instances that it may not be feasible.
- You can only do the above if the content is available as a string to be manipulated server-side with ASP/PHP. It won't work if the £ sign is in a static HTML file that you need to fetch via Ajax.
- &pound; is only converted to £ by an HTML/XHTML parser. Plain XML doesn't define it unless the DTD does (the numeric reference &#163; works in both). If you want to display it as plain text, you will have to convert it back to £ and still set the correct encoding.
- £ is just one symbol. There may be other characters you will eventually need to use that will be subject to the same issue.
The solution(s)
1. Escape your characters... (Use &pound; instead of £)
This works well for HTML and XML documents, but relies on the data being properly formatted. You can't simply encode every character in a document because a) it would quadruple in size and b) you don't want to encode mark-up characters.
What you can do, if you are a good programmer and your system is flexible enough, is to remove all the mark-up and encode the data alone - then put the mark-up back as it should be. Sounds tricky? Well, it's even trickier than it sounds...
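For the easy case, where you build the mark-up yourself and only the data needs encoding, a rough Python sketch could look like this. The `escape_text` and `price_element` names are hypothetical, and I'm using numeric references (&#163;) rather than &pound; because they are valid in plain XML as well as HTML.

```python
import html


def escape_text(text):
    """Escape the data only: mark-up characters first, then anything non-ASCII."""
    # & < > " ' become entities such as &amp; so they can't be read as mark-up.
    escaped = html.escape(text, quote=True)
    # Every non-ASCII character (e.g. the pound sign) becomes a numeric reference.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def price_element(amount_text):
    """Wrap already-escaped data in mark-up that is added afterwards."""
    return "<price>" + escape_text(amount_text) + "</price>"


example = price_element("\u00a35 & up")  # data is "£5 & up"
```

The result is pure ASCII, so it survives whichever ASCII-compatible character set the client ends up using. Pulling the data out of existing mark-up first is the tricky part, and this sketch doesn't attempt it.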
Which leads me to solution numero duo:
2. Specify a character set to decode the document (ISO-8859-1)
Like I said, the pound sign isn't a mark-up or control character, so as long as you're creating valid XHTML/XML pages, you shouldn't need to worry about encoding every sign under the sun. The quickest and most reliable method to avoid this problem is to define the character set you'd like the client to use when decoding your documents.
If you're using the pound sign and other Latin special characters, you can use the ISO-8859-1 character set, provided the document really is saved in that encoding.
ASP - Use Response.Charset property.
<% Response.Charset = "ISO-8859-1" %>
PHP - Use the header() function.
<?php header("Content-Type:application/xhtml+xml; charset=ISO-8859-1"); ?>
JSP - Use the page directive.
<%@ page contentType="text/html; charset=ISO-8859-1" %>
XML - must be on the first line
<?xml version="1.0" encoding="ISO-8859-1"?>
xHTML - must be on the first line
<?xml version="1.0" encoding="ISO-8859-1"?>
xHTML - Using the Content-Type meta tag
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=ISO-8859-1" />
HTML - Using the Content-Type meta tag
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
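Whichever of these you use, the declared character set has to match the bytes you actually send. As a language-neutral illustration, here is a small Python sketch; `build_response` is a hypothetical helper of mine, not part of any framework. It encodes the body and builds the Content-Type header from the same charset value so the two can't drift apart.

```python
def build_response(text, charset="ISO-8859-1", media_type="text/html"):
    """Encode the body and declare the same charset in the Content-Type header."""
    body = text.encode(charset)  # raises UnicodeEncodeError if a character doesn't fit
    headers = {
        "Content-Type": media_type + "; charset=" + charset,
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_response("<p>Only \u00a35</p>")  # "<p>Only £5</p>"

# A client that honours the header gets the original text back.
declared_charset = headers["Content-Type"].split("charset=")[1]
round_trip = body.decode(declared_charset)
```

If a character can't be represented in the chosen charset, the encode step fails loudly instead of sending something the client will misread.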
References:
- Setting the HTTP Charset Parameter
- ISO-8859-1 Character Set