xslt and nbsp
1. nbsp usage, the definitive, full answer (and you thought it was 42?)
Let's consider this simple stylesheet:

  <?xml version="1.0" encoding="ISO-8859-1"?>

This stylesheet is stored on the hard disk as a series of bytes. The bytes match characters according to the ISO-8859-1 encoding (see the encoding pseudo-attribute on the XML declaration?). When the XML parser reads this in as an XML document, it decodes the bytes into Unicode characters. It also parses the document, recognising things like start tags (e.g. <p>), built-in entity references (e.g. &amp;) and character references (e.g. &#160;). The parser knows that &amp; stands for an & character (because it knows XML) and knows that &#160; stands for a non-breaking space character (because it knows XML and Unicode).

The parser reports to the XSLT processor when elements occur and what characters text is made up of, but doesn't report whether a particular character was originally serialized as the plain character itself (e.g. an actual space character), an entity reference or a character reference. As far as an XSLT processor is concerned, therefore, the following elements in the stylesheet (or in an XML source document) would all be reported as *exactly* the same (a p element containing a text node whose string value is a double-quote character):

  <p>"</p>
  <p>&quot;</p>
  <p>&#34;</p>

The two p elements serialized in the stylesheet look like:

  <p>Non-breaking space</p>
  <p>Non-breaking&#160;space</p>

For the first p element, the XML parser reports the string (here containing no escaping of any kind - every character is a literal character):

  Non-breaking space

For the second p element, the XML parser reports the string (here containing an underscore character as a stand-in for a non-breaking space, since you can't see non-breaking spaces in emails):

  Non-breaking_space

The XSLT processor builds a result tree from the stylesheet. The tree starts at the root node (/), which holds an html element, and inside that the two p elements each hold one of these text nodes. This tree exists in memory. All the characters are Unicode characters. Once the XSLT processor has finished its transformation, it serializes this result tree.
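Only the XML declaration of the stylesheet survives in the text above. A minimal sketch consistent with the description might look like the following. The html root and the two p elements come from the explanation; the body wrapper and the single template are assumptions made only so that the example is complete.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Reconstruction (assumption): the text describes only the XML
     declaration, an html result and two p elements. The body element
     and the template layout are illustrative. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <html>
      <body>
        <!-- ordinary space character (U+0020) -->
        <p>Non-breaking space</p>
        <!-- character reference for the non-breaking space (U+00A0) -->
        <p>Non-breaking&#160;space</p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

By the time the XSLT processor sees the second p element, the character reference has already been replaced by the single character U+00A0, which is why the form used in the stylesheet has no influence on the form used in the output.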
There are three methods that it could use to serialize the result tree: xml, html and text. The choice is controlled by the method attribute of xsl:output. It could also use any encoding - any mapping of characters to bytes - which is controlled by the encoding attribute of xsl:output.

The most straightforward output method is the XML output method. In the XML output method, element nodes are serialized as a start tag, followed by content, followed by an end tag. Any characters in the element content that have to be escaped due to XML rules are escaped. So if you have a less-than sign in your text node, then it is automatically escaped to &lt;. If you have an ampersand in your text node then it is automatically escaped to &amp;. If you have a character that can't be represented by the encoding that you're using, then it is escaped using a character reference (e.g. &#160;).

Let's use a really basic encoding, ASCII, which only covers 128 characters (and doesn't include non-breaking spaces). You can usually make your stylesheet generate ASCII with:

  <xsl:output encoding="ASCII" />

The non-breaking space character isn't covered by ASCII, so the non-breaking space character has to be escaped in the serialization using a character reference. So the serialization of the output tree will look like:

  <html>
    ...
    <p>Non-breaking space</p>
    <p>Non-breaking&#160;space</p>
    ...
  </html>

If you used an encoding that covers the non-breaking space character, such as ISO-8859-1 or UTF-8 or UTF-16, then the non-breaking space character would be output as a literal non-breaking space character, and you'd get (substituting _ for non-breaking space characters again):

  <html>
    ...
    <p>Non-breaking space</p>
    <p>Non-breaking_space</p>
    ...
  </html>

Trouble arises, however, when you try to view a document that's been saved using UTF-8 in an editor that doesn't support UTF-8. The editor tries to interpret the sequence of bytes that it reads from the file as ISO-8859-1 characters. It's a bit like taking an English document and trying to read it as if it were written in German. Some of the words might make sense, but most of the time you get gobbledy-gook. Specifically, because UTF-8 uses two bytes for characters such as the non-breaking space whereas ISO-8859-1 uses one byte for every character, when you try to read a UTF-8 document as if it were ISO-8859-1, you see two characters for each such character. For the non-breaking space, the first byte (hex C2) is the byte that ISO-8859-1 uses to mean the Â character, while the second byte (hex A0) is the one that actually contains the information. So you tend to see Â_ rather than just _, for example.

Let's return to looking at the possible serializations of the result tree. The next possible serialization is HTML. HTML is serialized more-or-less the same as XML, with a few differences. The difference that is pertinent here is that when you use the html output method, XSLT processors are allowed to serialize a character using the entities defined in HTML rather than as a native character (if the character can be represented in the encoding) or a character reference (if it can't). In our case, XSLT processors are allowed to serialize the non-breaking space character as the HTML character entity reference &nbsp;. So serializing as HTML, you may get:

  <html>
    ...
    <p>Non-breaking space</p>
    <p>Non-breaking&nbsp;space</p>
    ...
  </html>

Finally, let's consider the text output method. In the text output method, everything aside from text nodes is ignored, and the text is output without any automatic escaping. If a character can be represented in the encoding that you use, then it will be serialized as a native character. If it can't be, then the XSLT processor gives you an error.
In our case, assuming that we're using an encoding that supports the non-breaking space character, we'd get something like (again with _ representing the non-breaking space):

  Non-breaking spaceNon-breaking_space

> And, how would you suggest someone actually get '&nbsp;' into the [...]

Hopefully, what I've explained above makes it clear that a browser that shows a non-breaking space character as an Â followed by a non-breaking space character is making that error because it is reading the result of the transformation as if it is in one encoding (e.g. ISO-8859-1) when in fact it is in another encoding (e.g. UTF-8). There are several solutions:
Cheers,
Jeni

P.S. There is another solution that will work with some processors, but not all: disabling output escaping for the text node that contains the relevant characters. But since you can solve the problem a lot more elegantly with one of the methods above, there's no reason to use it.

Jeni Tennison
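One approach that follows directly from the explanation in this answer is to serialize in an encoding that cannot represent U+00A0, so that the processor has to write the character as a reference that reads the same whatever encoding the browser guesses. This is not claimed to be one of the solutions the message originally listed. The sketch below assumes the processor accepts "ASCII" as an encoding name, and the head and title elements are added only for illustration.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- ASCII cannot represent U+00A0, so the serializer has to escape it.
       Assumption: the processor recognises this encoding name. -->
  <xsl:output method="html" encoding="ASCII"/>

  <xsl:template match="/">
    <html>
      <!-- illustrative head element, not part of the original example -->
      <head>
        <title>nbsp test</title>
      </head>
      <body>
        <p>Non-breaking&#160;space</p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

With the html output method the non-breaking space may come out as &#160; or as &nbsp;, and both are plain ASCII bytes. XSLT 1.0 also says that the html output method should add a META element stating the encoding when the result contains a HEAD element, which helps a browser pick the right decoding.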
2. nbsp doesn't work
This is the all-time #1 FAQ. Regardless, just pick one:

1. &#160;
2. &#xA0;
3. &nbsp; after putting <!DOCTYPE xsl:stylesheet [<!ENTITY nbsp "&#160;">]> at the top of your stylesheet, after the XML declaration but before anything else (or reference any DTD containing that entity declaration);
4. type the character directly, if your keyboard and/or OS provide a way for you to do so, and your editor can be counted on to save the document in an encoding that supports that character, and you've made the encoding declaration match your editor's output.
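The first three options can be sketched in one stylesheet. The html, body and p elements are illustrative assumptions; the DOCTYPE declaration is the one given in option 3.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Internal DTD subset: declares nbsp so that the XML parser accepts it -->
<!DOCTYPE xsl:stylesheet [
  <!ENTITY nbsp "&#160;">
]>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <xsl:template match="/">
    <html>
      <body>
        <!-- option 1: decimal character reference -->
        <p>one&#160;two</p>
        <!-- option 2: hexadecimal character reference -->
        <p>one&#xA0;two</p>
        <!-- option 3: entity reference, valid only because of the DOCTYPE -->
        <p>one&nbsp;two</p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

All three p elements reach the XSLT processor with the same text, so they are serialized identically. Removing the DOCTYPE makes the third one a well-formedness error.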
3. nbsp in output

> I'm generating HTML from XML
> The output HTML needs to contain some "&nbsp;". But until now I could not
> find a way to implement that.

&nbsp; is, by definition, &#160;. Just put &#160; (or &#xA0;, the hex equivalent) in your stylesheet to represent the non-breaking space character in the stylesheet tree and result tree. When the result tree is output, the character will be output as either &#160; or &nbsp;, assuming you have <xsl:output method="html"/> in the stylesheet.

Wendell Piez outlines a use in tables with empty cells.

Outputting spaces in html table cells

> There is some code before this that generates a table. If the value of "blah" is blank, and I was outputting this to html, then
> netscape would not handle blank <td/> fields in an elegant manner because it would shift
> the next column over one to replace the blank column. Normally, I would insert an '&nbsp;'
> between each <td> tag so that netscape would render a space and not ignore the cell, but as
> you know, '&' is reserved in xml. I tried &amp;, but that doesn't render a space but rather
> the real '&' symbol. So my question is what is the best way to solve this problem?

Use &#160; for a non-breaking space. Your XML parser does not pick up the named entity because it hasn't been declared. But a numbered character reference (which is what &#160; is) will be recognized -- #160 is a non-breaking space. You can even declare nbsp in an internal subset of your stylesheet if you want a friendlier representation of the character.
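For the table case, a rough sketch of a template that fills otherwise empty cells might look like this. The source element names table and row are assumptions, and blah is taken from the question; adapt the match patterns to the real input.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Sketch only: "table", "row" and "blah" are assumed names for the
     source elements. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <xsl:template match="table">
    <table>
      <xsl:apply-templates select="row"/>
    </table>
  </xsl:template>

  <xsl:template match="row">
    <tr>
      <td>
        <xsl:choose>
          <!-- non-blank value: copy it into the cell -->
          <xsl:when test="normalize-space(blah)">
            <xsl:value-of select="blah"/>
          </xsl:when>
          <!-- blank or missing value: emit U+00A0 so the cell is not empty -->
          <xsl:otherwise>
            <xsl:text>&#160;</xsl:text>
          </xsl:otherwise>
        </xsl:choose>
      </td>
    </tr>
  </xsl:template>

</xsl:stylesheet>
```

The non-breaking space is not whitespace in the XML sense, so the text node survives whitespace stripping in the stylesheet. The xsl:text wrapper is there only to make the intent explicit.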
4. Another explanation
In an attempt to reduce the number of 'how do I get &nbsp;' questions, I have tried to update Dave Pawson's FAQ on the subject: text follows. I also sent a message to the list owners to see if we can get the search mechanism tweaked to make it easier to find.

I actually found it quite hard to locate definitive answers on the subject which cover all the angles, partly because it has been discussed so many times, and partly because some need to be edited for language ;-) I have paraphrased my recollections of what has been said about dealing with badly configured / old browsers. I would welcome pointers to actual messages off the list which I could quote instead, and any improvements on the ones I have chosen.

How to output &nbsp; in HTML

[ existing text from the nbsp topic ]

&nbsp; is by definition &#160;. Just put &#160; (or &#xA0;) in your stylesheet to represent the non-breaking space character in the stylesheet tree and result tree. When the result tree is output, the character will be output as either &#160; or &nbsp;, assuming you have <xsl:output method="html"/> in the stylesheet.

> I thought the &nbsp; entity was predefined in xml.

It is not predefined. Only &lt; &gt; &amp; &quot; &apos; are predefined. You can either use &#160; or &#xA0;, or you can define an entity like nbsp for the same. Try:

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE xsl:stylesheet [ <!ENTITY nbsp "&#160;"> ]>
  <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

Apparently one motivation for trying to get &nbsp; into the output is to cope with browsers that either cannot handle the encoding being used or have been set up incorrectly (the advice is to set to 'auto detect' if this option is available).

Mike Brown:

&#160; in an XML document always refers to UCS character code U+00A0. This character must be encoded upon output in a document. If your document is encoded as ISO-8859-1, the character will manifest as the single byte A0 (in hex, or 160 in decimal). If your document is encoded with UTF-8, it will be the pair of bytes C2 A0. If you are looking at the UTF-8 encoded document in an editor or shell/terminal window that doesn't know to interpret hex C2 A0 as a UTF-8 sequence, then you'll probably see Â (the character in many character sets/fonts at position hex C2, aka decimal 194) followed by an invisible character (A0, which if interpreted as an ISO-8859-x character is itself the non-breaking space).

If you don't like the encoding your XSLT processor gives you normally, you can use the encoding attribute on the xsl:output element to specify a particular encoding (provided your processor knows how to deal with it). Ref: http://www.w3.org/TR/xslt#output

If you are having to deal with old browsers and/or misconfigured clients which you do not have the power to change, then you might be left with no choice other than getting &nbsp; into the output. There is no nice way to do this (as I hope we have already established, the standards are constructed such that it should not be necessary). But if it has to be done, here are the choices, and their caveats:

1. Choose a processor, such as Saxon, which gives you additional control over the serialisation. Caveat: ties you to one processor.
2. Use <xsl:text disable-output-escaping="yes">&amp;nbsp;</xsl:text>, possibly with the DTD subset trick described above to keep the stylesheet readable. Caveat: disable-output-escaping doesn't have to be honoured by the processor. Even if it seems to work, it can be fragile because it may be ignored if you later decide to send the output via a DOM, or you use variables and node-set() to store part of your output. See also DOE.
3. Use an element or processing instruction to represent the non-breaking space, and substitute it with a custom serialiser. Caveat: hard work, and ties you to a specific processor or class of processors.

Wendell Piez outlines a use in tables with empty cells.

Outputting spaces in html table cells

Use &#160; for a non-breaking space. Your XML parser does not pick up the named entity because it hasn't been declared. But a numbered character reference (which is what &#160; is) will be recognized -- #160 is a non-breaking space.

Some references:

On the finer points of encodings and character references: List archive
Mike Brown on browser character encodings: List archive
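The disable-output-escaping choice combined with the DTD subset trick can be sketched as follows. This is an illustration under the caveats already listed, not a recommendation; the html, body and p elements are assumptions. A processor that ignores disable-output-escaping will write &amp;nbsp; instead.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- The entity expands to an xsl:text element whose content is the five
     characters & n b s p followed by a semicolon. With output escaping
     disabled, a processor that honours the attribute writes them as-is. -->
<!DOCTYPE xsl:stylesheet [
  <!ENTITY nbsp "<xsl:text disable-output-escaping='yes'>&amp;nbsp;</xsl:text>">
]>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <xsl:template match="/">
    <html>
      <body>
        <p>Non-breaking&nbsp;space</p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Because the entity's replacement text contains an element, &nbsp; declared this way can only be used in element content. Referencing it inside an attribute value is a well-formedness error.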
5. nbsp, why doesn't it work

"&nbsp;" is an HTML entity. XML only knows five entities: "&lt;", "&gt;", "&amp;", "&apos;" and "&quot;". Therefore all other characters that you need must be written with their character code, as you have found with "&#160;".

It doesn't work because XSLT files have to be well-formed XML, and in XML (and HTML) entities must be defined before use. Most HTML browsers implicitly use a catalogue that defines the entities in the HTML DTD, including nbsp, but in general it's just an undefined reference unless you define it.