xml whitespace, xslt
1. | Ignorable whitespace nodes are always included in the source document tree | |||
Question: Why don't XSLT processors pick up the ignorable-whitespace information delivered by validating parsers (via SAX or DOM) and drop those nodes automatically when constructing the source document tree? It would save me from manually listing the elements with non-mixed content models in <xsl:strip-space />.
Michael Kay: The designers of XSLT 1.0 decided that the result of processing a source document should not depend on whether or not it was read by a validating parser. I think this was done partly to ensure predictable results, and partly because of a belief (in 1999) that DTDs were on the way out.
David Carlisle: In particular, the XML rec says: "An XML processor MUST always pass all characters in a document that are not markup through to the application. A validating XML processor MUST also inform the application which of these characters constitute white space appearing in element content."
CR: So a non-validating parser (which is probably the kind most often used with XSLT) need not report which white space nodes are in element content. Making this configurable (only) from XSLT makes it much easier to port stylesheets across different parsers, or at least it would have been if the parsers had implemented the XML spec as written. | ||||
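Since the processor ignores what the parser knows about element content, the stripping has to be declared in the stylesheet. A minimal sketch, assuming a source document whose hypothetical elements `catalog` and `item` have element-only content (those names are illustrative and are not taken from the question):

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Hypothetical element names: list every element whose content model
       is element-only, so its whitespace-only text nodes are dropped. -->
  <xsl:strip-space elements="catalog item"/>

  <!-- Identity transform: copy everything that survives stripping. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the list lives in the stylesheet, the result should be the same whether or not the parser validates.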
2. | White space handling - Avoiding problems | |||
1) NEVER call for help about the xslt until you have developed some html that works exactly right for you. Then you can ask for help with the xslt to duplicate the html that works.
2) Different xslt processors may give different results on some of the more esoteric features. Always try SAXON as one of your tests, if you possibly can. It handles more things correctly or intelligently than just about any other processor.
3) ALWAYS say what processor and browser you are using.
4) For producing html output, make sure to tell the xslt processor, using <xsl:output method='html'/>. This produces output that is legal html but not well-formed xml, such as <br> (you still have to write <br/> in your stylesheet, though).
5) Whitespace may be displayed differently by different browsers in different situations. Examine the raw html output to see if the spaces are ***really*** there or not.
6) Runs of whitespace characters are usually displayed as a single space by most browsers unless the text is wrapped in <pre> tags or styled with the CSS white-space property (there might be some others that I forgot). The nonbreaking space &#160; always displays (see next item).
7) Some browsers (especially older ones) don't interpret these character references the same way. For both the html and xml output methods, the XSLT processor must output &#160; as a legal construction no matter what output encoding is specified. (If character 160 is in the encoding, it may be written as character data; otherwise a numeric character reference or a named entity reference, e.g. &nbsp;, will be used.) For the text output method (or a non-standard output method) it may not be possible to output characters that are not in the output encoding. If &#160; does make it through to the html output and is left for the browser to interpret, from HTML3 onwards the browser is *supposed* to resolve the numeric reference as a Unicode code point and use that to ask the rendering subsystem for an appropriate glyph. It should *not* interpret it as a code point within the current encoding or the system's default code page, though this frequently happens, and is even sometimes quoted as "correct" behaviour. The (very good) idea behind this is that (if properly implemented and supported by the OS as well as the browser) it lets html pages contain characters that are outside the client's default encoding repertoire. XML and HTML documents always represent documents consisting of Unicode characters, even when the documents are transported in encodings based on other character sets. What this means is that &#160; is always U+00A0, regardless of the encoding of the document it appears in. Numeric character references are always to Unicode code points.
8) If you are "using" msxml3, make sure that your browser is really using it by running xmlinst.exe. That will adjust the registry so that the browser will actually use it. Then tell us that you have done so. (It's getting harder to find on the MS site, but you can still track it down.)
People who are outputting to html and having white space preservation problems should mock up the output they are aiming for *in html first* and try the result in a browser. You can tweak your xml and your xslt till you're blue in the face, but if the resulting html puts the white space you're hoping to see in places where a browser is allowed or required to ignore or minimise it, your efforts will be wasted. | ||||
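Points 4 and 6 can be tried out with a stylesheet along these lines. This is only a sketch: the literal result elements and text are made up, and how the non-breaking space is serialized (as a character, as &#160; or as &nbsp;) depends on the processor and the output encoding.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- html output method, no added indentation -->
  <xsl:output method="html" indent="no"/>

  <xsl:template match="/">
    <html>
      <body>
        <!-- Written as <br/> here (the stylesheet must be well-formed XML);
             the html output method serializes it as <br>. -->
        <p>first line<br/>second line</p>
        <!-- &#160; is U+00A0, the non-breaking space. -->
        <p>Root element:&#160;<xsl:value-of select="name(/*)"/></p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Looking at the raw output, rather than at the browser window, shows which form the processor chose.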
3. | Control over whitespace in html output | |||
> In this case the linebreak causes undesirable whitespace to appear > in the resulting document in certain browsers. At the moment I'm > having to rely on the rather crude solution of setting the html > output indent to "no" to avoid this, but this seems a rather > inelegant solution... That's the most elegant solution you'll find :) By default, processors indent HTML output, which means they can add whitespace to make it easier to read. The indent attribute on xsl:output is specifically designed to control that. Mike Kay adds: Yes, but the rules also say that when you use indent="yes", the processor must only add extra whitespace in places where it won't show in the browser. So if it makes a difference to what you see on the browser screen, something is wrong. Jeni follows this with: 'Something', yes :) I think that rule is a bit vague: "how an HTML user agent would render the output" could mean anything. Mike Brown intersperses with: I raised these exact points with xsl-editors a while back [1] and [2], and to date I don't think there has been any public discussion of it nor did any changes in the recommendation appear in the XSLT 1.1 draft. [1] http://lists.w3.org/Archives/Public/xsl-editors/1999OctDec/0033.html (ignored) [2] http://lists.w3.org/Archives/Public/xsl-editors/2000JulSep/0041.html (point 2 was struck down; point 1 ignored) Jeni continues: I could have some CSS that defines { white-space: pre } for all elements - is the XSLT processor supposed to detect that and therefore not add white space in text content throughout? Generally, is an XSLT processor implementer supposed to go through every HTML user agent (which doesn't necessarily mean browser - there are other applications that retrieve and parse HTML for their own nefarious purposes) and test whether the tabs and new lines they add make a difference? Or does 'an HTML user agent' refer to a specific one, with implementers free to choose which user agent they want to work with? 
I suppose that the most authoritative guidance you can hope to get about how HTML should be rendered is to use the sample stylesheet for HTML 4.0 from the CSS2 Rec. I can't see anything there that would imply that the sample that Andy gave, which involved adding a line break before a closing td element, should make a difference to the rendering. So in this case I'd tend towards blaming a dodgy browser (which ones caused problems, btw, Andy?) rather than the XSLT processor.
And David C comes back: No, they trust that the HTML agent implementor faithfully implements the HTML 4 spec, B.3 SGML implementation notes, B.3.1 Line breaks: "SGML (see [ISO8879], section 7.6.1) specifies that a line break immediately following a start tag must be ignored, as must a line break immediately before an end tag. This applies to all HTML elements without exception." That means you can insert a linebreak for indentation purposes at those places. (Or W3C specs are not consistent or browsers don't implement the specs correctly, and neither of those could be the case, surely.)
> On a similar note, is there any way to specify strict xhtml output?
> I've experimented with adding the xhtml doctype declaration but this
> hasn't really affected the output of the elements, e.g. <br/> in the
> stylesheet still gets output as <br> in spite of the doctype...
The rules for the html output method say that <br/> must be output as <br> and so on. If you want well-formed XML output, such as XHTML, then you should tell the processor that's what you want: <xsl:output method="xml" /> You can then set the doctype-public and doctype-system as you know, and use the XHTML namespace as appropriate. However, you should be a little careful doing this. The XML serialisers usually serialise in a way that breaks the HTML compatibility guidelines in XHTML 1.0 (e.g. they output <br/> rather than <br /> and use empty-element syntax for every empty element).
Some processors have a special extension XHTML output method to bring XHTML output into line with the guidelines - I don't know if Xalan has. | ||||
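For the XHTML case, a minimal sketch of the xml output method with the XHTML 1.0 Strict doctype and the XHTML namespace might look like the following. The page content is invented for illustration, and whether the serialized form satisfies the HTML compatibility guidelines still depends on the processor.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.w3.org/1999/xhtml"
                version="1.0">

  <!-- xml output method plus the XHTML 1.0 Strict public and system ids -->
  <xsl:output method="xml" indent="no"
              doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
              doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>

  <xsl:template match="/">
    <!-- Literal result elements pick up the default XHTML namespace. -->
    <html>
      <head><title>Example</title></head>
      <body>
        <!-- Serialized as an empty-element tag, typically <br/>. -->
        <p>line one<br/>line two</p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```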
4. | How do I preserve whitespace in the output | |||
Add the following as a top-level child of your xsl:stylesheet element: <xsl:preserve-space elements="text"/> | ||||
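Whitespace-only text nodes in the source are preserved by default, so xsl:preserve-space mostly matters as an exception to a broader xsl:strip-space. A sketch of that combination, keeping the element name `text` from the answer above and assuming everything else should be stripped:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Strip whitespace-only text nodes everywhere ... -->
  <xsl:strip-space elements="*"/>
  <!-- ... except inside <text>; the name test outranks the wildcard. -->
  <xsl:preserve-space elements="text"/>

  <!-- Identity transform, so the effect is visible in the output. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```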
5. | Stripping extra whitespace | |||
An example is below. The stylesheet "normalize.xsl" will normalize *everything*. Note that I'm only speaking above of not using <xsl:copy> to *normalize* attributes ... if one needs to just copy attributes or other nodes, then <xsl:copy> works just fine.
Input file
<?xml version="1.0"?>
<test xmlns:abc="http://www.CraneSoftwrights.com/s/">
<hello> this is a test </hello>
<world attr1=" an attribute here"> another test </world><hello> a second element named hello</hello>
<?pitarget a value is here ?>
<!-- a commment is here-->
<again abc:attr2="yet another test"/>
</test>
Stylesheet file: normalize.xsl
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <!--the xml output method writes the XML declaration itself-->
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <xsl:apply-templates/>                       <!--copy content-->
  </xsl:template>
  <xsl:template match="*">                       <!--elements are easy-->
    <xsl:copy>
      <xsl:apply-templates select="*|@*|comment()|processing-instruction()|text()"/>
    </xsl:copy>
  </xsl:template>
  <!--hand-craft attributes-->
  <xsl:template match="@*">
    <xsl:variable name="prefix" select="substring-before(name(.), ':')"/>
    <xsl:choose>
      <xsl:when test="$prefix = ''">
        <xsl:attribute name="{local-name(.)}">
          <xsl:value-of select="normalize-space(.)"/>
        </xsl:attribute>
      </xsl:when>
      <xsl:otherwise>
        <!--keep the prefix and the namespace of a qualified attribute-->
        <xsl:attribute name="{name(.)}" namespace="{namespace-uri(.)}">
          <xsl:value-of select="normalize-space(.)"/>
        </xsl:attribute>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
  <!--hand-craft other node types-->
  <xsl:template match="processing-instruction()">
    <xsl:processing-instruction name="{name(.)}">
      <xsl:value-of select="normalize-space(.)"/>
    </xsl:processing-instruction>
  </xsl:template>
  <xsl:template match="comment()">
    <xsl:comment>
      <xsl:value-of select="normalize-space(.)"/>
    </xsl:comment>
  </xsl:template>
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:template>
</xsl:stylesheet>
Output file
<?xml version="1.0"?>
<test xmlns:abc="http://www.CraneSoftwrights.com/s/">
<hello>this is a test</hello>
<world attr1="an attribute here">another test</world>
<hello>a second element named hello</hello>
<?pitarget a value is here?>
<!--a commment is here-->
<again abc:attr2="yet another test"/>
</test> | ||||
6. | White space in HTML | |||
The XSL facilities for whitespace control (xsl:preserve-space, etc) are irrelevant to this: they only affect whitespace that is surrounded on both sides by tags. Whitespace that has text before or after it (or in this case, both before and after) should be copied to the output file. But if the output is HTML, this is irrelevant, because newline characters in HTML are equivalent to spaces. If you want a visible line break in the page as displayed by the HTML browser, your options are: a) generate a <PRE> </PRE> element around the output text b) convert the newlines in the text to <BR> tags (a nice little exercise in the use of substring-before, concat, etc) | ||||
7. | wrap-option or white-space-treatment | |||
Both are XSL-FO formatting properties. "wrap-option" controls the line-breaking treatment of non-stripped return/line-feed characters. "white-space-treatment" controls the stripping of "non-meaningful" non-printing characters (certain redundant spaces, cr, lf, tab, etc.). | ||||
8. | Stripping Whitespace | |||
Q expansion: I have these included in my XSL stylesheet:
<xsl:output method="text" indent="no"/>
<xsl:strip-space elements="*"/>
But the resulting text output file still has all the whitespace from my XSL stylesheet. Anyone know how to get rid of it? In my stylesheet, I was outputting text characters like this:
</xsl:if>
//
</xsl:template>
I was only intending to output the double slashes, but the whitespace leading and trailing the slashes was also being output to my resulting text file.
xsl:strip-space applies to whitespace-only text nodes in the source document, not to the stylesheet, and the whitespace around the slashes is part of the same stylesheet text node as the slashes, so it is never stripped. There are two fixes:
1) enclose the output in text elements, like this:
</xsl:if>
<xsl:text>//</xsl:text>
</xsl:template>
2) use variables, like this:
[set global variable] <xsl:variable name="double_slash" select="'//'"/>
[reference it in template]
</xsl:if>
<xsl:value-of select="$double_slash"/>
</xsl:template> | ||||
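Put together, the first fix might look like this in a complete text-output stylesheet. It is a sketch under assumptions: the element name `item` and the attribute `note` are hypothetical and only give the template something to match.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:output method="text"/>
  <!-- Drops whitespace-only text nodes from the SOURCE document only. -->
  <xsl:strip-space elements="*"/>

  <!-- "item" and "@note" are hypothetical names. -->
  <xsl:template match="item">
    <xsl:if test="@note">
      <xsl:value-of select="@note"/>
    </xsl:if>
    <!-- The indentation around these instructions is whitespace-only
         stylesheet text, so it is stripped; only "//" is output. -->
    <xsl:text>//</xsl:text>
    <xsl:value-of select="."/>
    <!-- An explicit line feed, exactly where it is wanted. -->
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```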
9. | Preserving line breaks from the stylesheet | |||
> How can I create output that preserves non visible line
> breaks (from the Stylesheet )?
Whitespace-only text nodes in the stylesheet are stripped before processing unless they appear inside xsl:text elements, according to the XSLT spec. If using the top-level element <xsl:output method="html" indent="yes"/> does not give you the results you want, you can add <xsl:text>&#10;</xsl:text> to insert a linefeed character where desired. Try the indent attribute first though, and remember that whitespace has significance in HTML as a word separator, so it's not always desirable to have "pretty" HTML. | ||||
10. | Convert newlines to BR | |||
I have text that contains newlines and I want to replace these with <BR/>. I can't use <PRE> for the output because I need word-wrap to work. Any ideas on how to do the conversion?
Given the input document:
<doc>
<p>This is some text.</p>
<programlisting><![CDATA[This is a paragraph
with some newlines
does it work?]]></programlisting>
</doc>
The stylesheet:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="/">
  <html>
    <head/>
    <body>
      <xsl:apply-templates/>
    </body>
  </html>
</xsl:template>
<xsl:template match="p">
  <p><xsl:apply-templates/></p>
</xsl:template>
<xsl:template match="programlisting">
  <span style="font-family:monospace">
    <xsl:call-template name="br-replace">
      <xsl:with-param name="word" select="."/>
    </xsl:call-template>
  </span>
</xsl:template>
<xsl:template name="br-replace">
  <xsl:param name="word"/>
  <!-- </xsl:text> on next line on purpose to get newline -->
  <xsl:variable name="cr"><xsl:text>
</xsl:text></xsl:variable>
  <xsl:choose>
    <xsl:when test="contains($word,$cr)">
      <xsl:value-of select="substring-before($word,$cr)"/>
      <br/>
      <xsl:call-template name="br-replace">
        <xsl:with-param name="word" select="substring-after($word,$cr)"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$word"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
</xsl:stylesheet>
Produces:
<html>
<head>
</head>
<body>
<p>This is some text.</p>
<span style="font-family:monospace">This is a paragraph<br>
with some newlines<br>
does it work?</span>
</body>
</html>
Mike Brown adds: This is exactly the same method that Steve Muench posted, but with comments explaining what's going on (see skew.org). Also, there's no need to put the newline (which, incidentally, is not the same as a carriage return, as implied by $cr) in a result tree fragment. Just use '&#10;' | ||||
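Mike Brown's suggestion can be sketched as follows: the same recursive template, but with the line feed written as the character reference &#10; inside the XPath string literals instead of a variable holding a result tree fragment. This is an illustrative variant, not the code that was originally posted.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:output method="html" indent="no"/>

  <xsl:template match="programlisting">
    <span style="font-family:monospace">
      <xsl:call-template name="br-replace">
        <xsl:with-param name="word" select="string(.)"/>
      </xsl:call-template>
    </span>
  </xsl:template>

  <!-- Emits the text before the first line feed, a <br>, then recurses
       on the remainder until no line feed is left. -->
  <xsl:template name="br-replace">
    <xsl:param name="word"/>
    <xsl:choose>
      <xsl:when test="contains($word, '&#10;')">
        <xsl:value-of select="substring-before($word, '&#10;')"/>
        <br/>
        <xsl:call-template name="br-replace">
          <xsl:with-param name="word" select="substring-after($word, '&#10;')"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$word"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

A character reference in an attribute value is not subject to attribute-value normalization, so the string literal really does contain a line feed.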
11. | Whitespace | |||
<b>RootNodeName : </b>
I want a whitespace to be added after the colon.
In XML, whitespace = any of: space, newline, carriage return, or tab. In a stylesheet, whitespace-only text nodes get stripped by default, unless they are in an xsl:text element. This does not apply in your case because the text node child of the 'b' element contains the character data 'RootNodeName : ' with the space characters intact. If you look at your output document, you should see that the space characters are in fact there. I think you are asking a question about HTML and the way browsers render space characters. In HTML, one or more consecutive whitespace characters are considered to be one "word separator" that is intended to be rendered in a manner appropriate to the language/charset/font in use. In Western language scripts, spaces are rendered as a "breaking space", which can wrap to the next line if there is not enough room, and is considered insignificant at the end of a line. If you do not like the way space characters behave, perhaps you do not really want whitespace -- maybe you want a literal "non-breaking space" character, which looks like whitespace but is not considered whitespace in XML or HTML. If you ensure that a non-breaking space appears in your HTML, it should achieve the effect of adding the space exactly where you want it. The non-breaking space is &#160; or &#xA0; in XML, and either of those or &nbsp; in HTML. See what happens if you do this: <b>RootNodeName :&#160;</b> If you have your output method set to html, you should see &nbsp; in the output document, and the browser will render it the way you intended. | ||||
12. | How to Preserve White-Space | |||
> In my xml file,I've to preserve whitespace for a particular > tag(text)... DTD is as follows... > <!ELEMENT Name (#PCDATA)> > How to set preserve whitespacing property Add to your DTD: <!ATTLIST Name xml:space (default | preserve) "preserve"> This is equivalent to doing this in your document: <Name xml:space="preserve">...</Name> See section 2.10 of the XML 1.0 Recommendation for details. | ||||
13. | whitespace | |||
>why do I need to use xsl:text?
Because if you miss out the xsl:text then 9 times out of 10 you get extra white space in your result tree. This may or may not matter, depending on what you are doing.
<foo>
  <xsl:text>a</xsl:text>
  <xsl:text>b</xsl:text>
</foo>
adds <foo>ab</foo> to your output, but allows you to indent the stylesheet however you like, whereas
<foo>
  a
  b
</foo>
adds exactly that, newlines and indentation included, to your result tree. | ||||
14. | Preserving whitespace in output | |||
(In the stylesheet) By default white space text nodes are removed from the stylesheet. Put your white space in xsl:text elements. (In the source XML) Try xml:space="preserve", e.g. on the xsl:for-each statement. | ||||
15. | Indents and new lines in XSL file appearing in output file | |||
Try wrapping the significant text in an xsl:text element. Whitespace-only text nodes are stripped from the tree built from the stylesheet, so when whitespace that you added to the stylesheet for readability isn't mixed with printing characters, the whitespace doesn't appear in your output.
<xsl:template match="HEADER">
<xsl:text>idr=</xsl:text><xsl:value-of select="ID"/><xsl:text>|</xsl:text>
<xsl:text>buyer=</xsl:text><xsl:value-of select="BID"/><xsl:text>|</xsl:text>
</xsl:template>
If you need linebreaks, you can put them (or their numeric character references) inside the xsl:text elements:
<xsl:template match="HEADER">
<xsl:text>idr=</xsl:text><xsl:value-of select="ID"/><xsl:text>|
</xsl:text>
<xsl:text>buyer=</xsl:text><xsl:value-of select="BID"/><xsl:text>|&#xA;</xsl:text>
</xsl:template> | ||||
16. | White space control when formatting for ascii text | |||
> Is there a way to make the white space more visible for better
> debugging?
<xsl:template match="text()">
  <xsl:value-of select="translate(., '&#9;&#10;', '→↓')"/>
</xsl:template>
(affects text nodes reached via xsl:apply-templates, not the ones you create with xsl:text)
Although this explanation might help even more...
  <field name="SSN" type="varchar2(9)" keytype="NOT NULL">
    <desc>Social Security Number
    </desc>
  </field>
When you do stuff like this, note that the whitespace that is adjacent to the non-whitespace text is considered part of the same text node. This whitespace is not in a node by itself. If it were, then your xsl:strip-space would get rid of it. Before the strip-space takes effect, you have the following nodes (I'll use \t and \n to represent tabs for indenting and newline characters, respectively):
|__text '\t'
|__element 'field' in no namespace
|    \attribute 'name'='SSN'
|    \attribute 'type'='varchar2(9)'
|    \attribute 'keytype'='NOT NULL'
|    \namespace prefix 'xml'='http://www.w3.org/XML/1998/namespace'
|    |__text '\n\t\t'
|    |__element 'desc' in no namespace
|    |    \namespace prefix 'xml'='http://www.w3.org/XML/1998/namespace'
|    |    |__text 'Social Security Number\n\t\t'
|    |__text '\n\t'
xsl:strip-space will get rid of just the text nodes that contain whitespace only. If you can't get the whitespace out of your other text nodes by adjusting the original XML, try using the normalize-space() function, which will chop off leading and trailing whitespace and condense consecutive whitespace characters down to a single space character. | ||||
17. | MSXML vs. Saxon - different handling of tabs and newlines | |||
> I am observing an interesting difference in the way MSXML and > Saxon are treating tabs and newlines in my XML instance when viewing > the resulting HTML. The difference is that for MSXML3, the input you supply to the XSLT processor is in the form of a DOM, and MSXML3 is doing extra whitespace stripping by default when you build the DOM (i.e. before the tree gets anywhere near the XSLT processor). I believe it's possible to suppress this. There are varying views on whether they are conformant in this area, but since you are building the DOM using a proprietary Microsoft API, it's hard to point to the spec that they are not conforming to. The final result certainly defeats the intended effect of the XSLT whitespace rules. It's actually a problem implementing the whitespace-stripping rules when you take input from a DOM, since there's a reasonable expectation that the XSLT processor shouldn't modify the input tree, and doing whitespace-stripping on the fly as you navigate the tree is likely to be incredibly expensive. If you supply a DOM as input to Saxon, I copy the whole thing into a new data structure (which is also expensive). | ||||
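18. | Matching Text that may have white-space | |||
<root> <contact> <name> Suk. </name> <phone> </root>
> <xsl:apply-templates select="//phone[(../name)='Suk.']"/>
>
> gives zero results.
>
> Is this the fault of the processor, or is there another way
> of doing this?
The processor is behaving correctly: the string value of the name element includes the whitespace around 'Suk.', so the comparison fails. Use the normalize-space() function, for example <xsl:apply-templates select="//phone[normalize-space(../name)='Suk.']"/> | ||||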
19. | Whitespace problem | |||
When I select all children of a node, applying normalize-space, the white space between the nodes is omitted. <el> <child>Some text</child> <child>More text</child> </el> When I have <xsl:template match="el"> <xsl:value-of select="*"/> </xsl:template> In text output mode it produces
If you call normalize-space too early then you get the string value of the nodes, which junks any element children. So what you want to do is use apply-templates, then have a template matching "text()" that just does <xsl:value-of select="normalize-space(.)"/> That way you apply normalisation (with a z) to every text part of the mixed content, but not to any element nodes. Solution, add: <xsl:template match="text()"> <xsl:value-of select="normalize-space(.)"/> </xsl:template> | ||||
20. | White space explanation | |||
I have summarized some techniques for whitespace problems that I used recently in an article. I posted the initial version on another list and got some very informative feedback. | ||||
21. | Whitespace article | |||
There is an article by Bob DuCharme on xml.com that talks about how to deal with whitespace issues in XSLT. The article is the most understandable, practical, and complete treatment I have seen. I figured it's appropriate for readers of this group, since many of the problems that beginners ask us are whitespace-related, and frustration with perceived limitations in whitespace handling can often discourage beginners from continuing with XSLT. The article has 3 parts | ||||
22. | Treatise on whitespace handling | |||
See xml.com | ||||
23. | xsl:attribute problem with xml:space | |||
No, this isn't a bug. Because the whitespace after the <xsl:attribute name="a"> element is significant (by virtue of xml:space), it is copied to the result tree as a text node which will form part of the content of the <foo> element. You can't write another attribute ("b") after you have written a child text node to the containing element. With the default error recovery action, Saxon reports the error and then recovers from it by ignoring the <xsl:attribute> instruction. | ||||
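The stylesheet that prompted this answer is not reproduced here, so the following is only a guess at its shape, using the names `foo`, `a` and `b` from the reply. The first template shows an arrangement that avoids the error; the second (never invoked) shows the problematic one.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Works: with xml:space="preserve" in force, the two xsl:attribute
       instructions are adjacent, so no text node is written between them. -->
  <xsl:template match="/">
    <foo xml:space="preserve"><xsl:attribute name="a">1</xsl:attribute><xsl:attribute name="b">2</xsl:attribute> text content </foo>
  </xsl:template>

  <!-- Problem case (named, never called): the line break and indentation
       after the first xsl:attribute are preserved and become a text node
       child of foo, so attribute "b" arrives too late. -->
  <xsl:template name="broken">
    <foo xml:space="preserve"><xsl:attribute name="a">1</xsl:attribute>
      <xsl:attribute name="b">2</xsl:attribute></foo>
  </xsl:template>

</xsl:stylesheet>
```

Note that the xml:space attribute on a literal result element is also copied to the output.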
24. | Whitespace and linebreaks | |||
well yes that's what you did, but using &lb; is the same as hitting return on your keyboard: it inserts a newline char. You probably don't want that; you probably want to insert an xsl:text element containing such a thing, so you could, if you want, define lb to be
<!ENTITY lb "<xsl:text>&#10;</xsl:text>">
but really, I wouldn't. XSLT's white space stripping rules are rather simple once you get used to them, and after a while you just know that
<xsl:if test="...">
  <wibble/>
does not introduce space before <wibble/>, but
<xsl:if test="...">
  wobble
does introduce a newline and two spaces before wobble, as they are part of a non-whitespace text node. However, if you hide things in entities
<xsl:if test="...">
  &lb;
you can't say whether that newline and two spaces before the &lb; will be stripped or not unless you go back and check exactly how you defined &lb;. And really you are not saving much typing: <xsl:text>&#10;</xsl:text> ain't so bad. You get used to typing <xsl:template match=" 100 times every stylesheet, so you can get used to using xsl:text as well :-) | ||||
25. | Cleaning off leading and trailing WS | |||
We are using the following piece of code for this. It's just for screens but you can change it to work on program listings as well.
<!-- normalized screens -->
<xsl:template match="screen/text()">
  <xsl:variable name="before" select="preceding-sibling::node()"/>
  <xsl:variable name="after" select="following-sibling::node()"/>
  <xsl:variable name="conts" select="."/>
  <xsl:variable name="contsl">
    <xsl:choose>
      <xsl:when test="count($before) = 0">
        <xsl:call-template name="remove-lf-left">
          <xsl:with-param name="astr" select="$conts"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$conts"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:variable>
  <xsl:variable name="contslr">
    <xsl:choose>
      <xsl:when test="count($after) = 0">
        <xsl:call-template name="remove-ws-right">
          <xsl:with-param name="astr" select="$contsl"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$contsl"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:variable>
  <xsl:value-of select="$contslr"/>
</xsl:template>
<!-- eats linefeeds from the left -->
<xsl:template name="remove-lf-left">
  <xsl:param name="astr"/>
  <xsl:choose>
    <xsl:when test="starts-with($astr,'&#xA;') or starts-with($astr,'&#xD;')">
      <xsl:call-template name="remove-lf-left">
        <xsl:with-param name="astr" select="substring($astr, 2)"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$astr"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
<!-- eats whitespace from the right -->
<xsl:template name="remove-ws-right">
  <xsl:param name="astr"/>
  <xsl:variable name="last-char">
    <xsl:value-of select="substring($astr, string-length($astr), 1)"/>
  </xsl:variable>
  <xsl:choose>
    <xsl:when test="($last-char = '&#xA;') or ($last-char = '&#xD;') or ($last-char = ' ') or ($last-char = '&#x9;')">
      <xsl:call-template name="remove-ws-right">
        <xsl:with-param name="astr" select="substring($astr, 1, string-length($astr) - 1)"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$astr"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template> | ||||
26. | Unexpected white space in output | |||
There are only three possible sources of space in the output
The most common cause of unwanted space is type 2, from things like
<xsl:template match="kjg">
   something
   <xsl:apply-templates/>
which adds a newline and three spaces before something. However:
<span class="inlineNoteNumber"> [<xsl:number level="any" />]</span></a>
That should not produce any space. Since you are producing html output, indenting is on by default. The system may only add space where it would _not_ change the rendering in a browser, but that's difficult to get right (and the browsers usually don't follow the html spec either), so it may be a bug somewhere. Add <xsl:output indent="no"/> and see if that stops it (if so, report it to your xslt system maintainers as a bug).
That's not an anomaly, that's a FAQ. You are outputting in utf-8 and then looking at the file in a latin-1 editor, so the non-ASCII characters look wrong. ASCII letters have the same encoding in utf-8 and latin-1, so they look the same, but a nbsp takes two bytes in utf-8 (0xC2 0xA0), which viewed as latin-1 come out as an accented A (Â) followed by a nbsp. Either fix your viewer to understand utf-8 or output your file as latin-1 by adding encoding="iso-8859-1" to xsl:output. | ||||
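Both suggestions are settings on xsl:output. A minimal sketch, assuming html output and an identity-style copy of the input (the copy template is only there to make the stylesheet complete):

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- indent="no": the serializer adds no whitespace of its own.
       encoding="iso-8859-1": U+00A0 is written as the single byte 0xA0
       (or as a reference), so a latin-1 editor shows it correctly. -->
  <xsl:output method="html" indent="no" encoding="iso-8859-1"/>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```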
27. | Result still indented despite indent="no" | |||
Simplest (which is what I do) is to give in to overwhelming force and stick xml:space="preserve" on the top-level element of your source file. Then styling with a stylesheet referenced via the xml-stylesheet PI more or less works as expected. Plan B is to use the xml-stylesheet PI to reference a stylesheet that generates a small html file that uses javascript to reload the xml source after setting preserveWhiteSpace. This also works, but I gave up on it in the end as it just seems wrong/expensive (you download and parse the file the first time only to throw it away), and also at the time the script interface on mozilla was under development/constant change, so getting it to work cross-browser was a pain, although I gather that latter problem is no longer there.
Michael Kay adds: The behavior of the Microsoft processor is not due to a different interpretation of the semantics of xsl:strip-space and xsl:preserve-space. Microsoft's XSLT processor is behaving the same as the other processors. The difference is that their XML parser (by default) removes the whitespace before the XSLT processor gets to see it, and before the XSLT rules come into play. Since the conformance rules for XSLT talk only about transforming source trees into result trees, anything that happens to the data before it is turned into an XSLT source tree is outside the scope of the XSLT specification, so legalistically, Microsoft's product is not non-conformant. It's just different from all the others. | ||||
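The first workaround is a change to the source document rather than to the stylesheet. A sketch with made-up names (`doc`, `para` and `style.xsl` are hypothetical):

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="style.xsl"?>
<!-- xml:space="preserve" on the top-level element asks the parser to keep
     whitespace-only text nodes throughout the document. -->
<doc xml:space="preserve">
  <!-- The single space between </b> and <i> is a whitespace-only text
       node; it is the kind of node that gets lost otherwise. -->
  <para>Some <b>bold</b> <i>italic</i> text</para>
</doc>
```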
28. | Whitespace | |||
I do a lot of XML-2-TEXT processing, some of which is tab-delimited. I have some trouble understanding what <xsl:value-of /> (and other constructs) do with tab characters when you explicitly do not want them to be normalized to spaces. Here's what I found by trial and error:
Two entities:
<!ENTITY tab "&#9;" >
<!ENTITY separator "" >
A character map:
<xsl:character-map name="separator">
  <xsl:output-character character="&separator;" string="&tab;"/>
</xsl:character-map>
An applied output method:
<xsl:output method="text" indent="no" use-character-maps="separator" />
A variable:
<xsl:variable name="tabchar" select="'&#9;'" />
The following statements then give:
(a tab) <xsl:value-of select="$tabchar" />
(a tab) <xsl:value-of select="'&#9;'" />
(no tab) <xsl:value-of select="'&tab;'" />
(no tab) <xsl:value-of select="'&separator;'" />
(tabs) <xsl:value-of select="somenode" separator="{$tabchar}" />
(tabs) <xsl:value-of select="somenode" separator="&#9;" />
(no tabs) <xsl:value-of select="somenode" separator="{'&tab;'}"/>
(no tabs) <xsl:value-of select="somenode" separator="{&separator;}" />
Basically the same story applies to other instructions, like copy-of, <xsl:text> etc. I was under the impression that it didn't matter whether you had a numerical entity reference of something, or a named entity reference. If I replace the tab mapping with something different, say " ", the spaces are kept. Or for "|||", the string will be kept. Though I can resolve this by using a global variable, or by interspersing my code everywhere with xsl:text, things get worse when using functions etc. returning strings with tabs. Mostly, they are lost somewhere along the process. So my hope was on using a character map, so that I can freely substitute the separator in the end with a tab character.
It works for everything, including a bunch of spaces, but not for tab characters. With newlines, btw, it is even a bit different: it works when putting the numerical character reference into the character map (the equivalent does not work for tabs), but it does not work if I put the named entity inside the character map.
Ans:
> was under the impression that it didn't matter whether
> you had a numerical entity reference of something, or a named entity
> reference.
That's true: they are all resolved before XSLT starts, so the XSLT engine cannot treat them differently. Except you have to interpret "true" in the previous sentence with some care :-) Tabs in attribute values are normalised by an XML parser to spaces unless they are entered as numeric character references, and due to the details of the way entities are expanded, your tab entity expands to a tab character, not to the character reference. So if you put a literal tab or a tab entity reference into any XML attribute, the XML application (e.g. the XSLT engine) sees a space. To stop this you need to use the character reference &#x09; or, if you want to use a named entity, define it to be a character reference, not a character, as in tab2 below:
<!DOCTYPE x [
<!ENTITY tab "&#x09;" >
<!ENTITY tab2 "&#38;#x09;" >
]>
<x>
<a a="	" b="&#x09;" c="&tab;" d="&tab2;"/>
</x>
(attribute a contains a literal tab). Run that past an XML parser and it will report:
<!DOCTYPE x [
<!ENTITY tab "&#x09;" >
<!ENTITY tab2 "&#38;#x09;" >
]>
<x>
<a d="&#9;" c=" " b="&#9;" a=" "/>
</x>
a and c have value a space, b and d have value a tab. Isn't white space fun??
A further query: One more question that I still don't get. If XML parses these entities before it gets to XSLT, and if XML has the predefined entities &amp; &lt; &gt; &apos; and &quot;, why can't I use the following to the same success?
<!DOCTYPE x [
<!ENTITY tab "&#x09;" >
<!ENTITY tab2 "&amp;#x09;" >
]>
It should produce '&#x09;' where I previously got '	' (tab).
Hmm, thinking out loud here: if I get rid of that character map and stick to the lovely &tab; (defined as your &tab2;), do I risk losing the tab characters in the output stream when I tossle and hossle them hence and forth through my templates during a single pass? Meaning, can normalize-space, strip-space, xsl:value-of, xsl:variable + pipelining, xsl:function etc. be a spoilsport for me? (If so, character maps are safer; if not, I may as well get rid of them.)
brought this response:
Essentially the reason is that amp is _defined_ to be already double quoted (the XML spec declares it as <!ENTITY amp "&#38;#38;">), precisely so that use of amp survives this entity expansion still as a quoted character that isn't taken as markup. Basically this is an edge case where intuition or simplifications like "entities expanded first" don't really help. The XML spec specifies a particular algorithm for normalising attribute values, and how it interacts with character and entity references. Most of the time it just does "the obvious thing", but sometimes, like here, there's no avoiding just stepping through the algorithm and seeing what happens. (Or, as I just did in fact, use a parser like rxp and trust that Richard read the specified algorithm carefully.)
The questioner then continues with this exploration:
I guess this is more about XML than it is about XSLT, but since XSLT *is* XML, I thought to look it up, now that I understand where to look. The following example actually explained enough, just by looking at it ( http://www.w3.org/TR/xml/#intern-replacement ). The following declarations:
<!ENTITY % pub "&#xc9;ditions Gallimard" >
<!ENTITY rights "All rights reserved" >
<!ENTITY book "La Peste: Albert Camus, &#xA9; 1947 %pub;. &rights;" >
become the following replacement string:
La Peste: Albert Camus, © 1947 Éditions Gallimard. &rights;
Then: The general-entity reference "&rights;" would be expanded should the reference "&book;" appear in the document's content or an attribute value.
My guess is that this also happens when using the predefined named entity references. So, you are absolutely right, there's a difference between the expansion of named entity references and numeric character references: the latter are replaced in place, and the former actually end up in the document. In our scenario this would give:
This declaration
<!ENTITY tab "&amp;#x09;">
becomes this replacement string:
&amp;#x09;
This replacement string is used inside the XML document (which is the XSLT stylesheet), and after parsing becomes the literal string:
&#x09;
Only if it were parsed again would it be replaced with a [tab] character. Whereas the following (from your resolution):
This declaration
<!ENTITY tab "&#38;#x09;">
becomes this replacement string:
&#x09;
This, then, will be expanded in the XML as a literal [tab] character.
Never knew there was so much trickery involved. I wonder what normalize-space(normalize-space('&tab;')) would do. Will it remove the whitespace? Well, never mind; to be sure of correct handling, I think I'll stick to the safe haven of character maps, now that I have learned how to apply them. Thanks for all the help, David! |
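The attribute-value normalisation discussed in this entry happens in the XML parser, before any XSLT processor sees the tree, so it can be checked with any conforming parser. The following is a minimal sketch using Python's standard-library ElementTree (backed by expat) rather than rxp. The entity name tab3, the attribute e and the helper attribute_values are hypothetical additions for illustration: tab3 stands for the "&amp;#x09;" variant from the follow-up question.

```python
import xml.etree.ElementTree as ET

# Test document modelled on the example in this entry.
# tab  : the character reference is expanded when the declaration is read,
#        so the replacement text is a literal tab character.
# tab2 : &#38; is expanded to "&", so the replacement text is "&#x09;".
# tab3 : (hypothetical) &amp; is a general entity reference and is kept
#        as-is in the replacement text, which is "&amp;#x09;".
DOC = (
    '<!DOCTYPE x [\n'
    '<!ENTITY tab "&#x09;">\n'
    '<!ENTITY tab2 "&#38;#x09;">\n'
    '<!ENTITY tab3 "&amp;#x09;">\n'
    ']>\n'
    # Attribute a holds a literal tab (the Python escape \t).
    '<x><a a="\t" b="&#x09;" c="&tab;" d="&tab2;" e="&tab3;"/></x>'
)


def attribute_values(xml_text):
    """Parse xml_text and return the attributes of the first <a> child."""
    root = ET.fromstring(xml_text)
    return dict(root.find("a").attrib)


values = attribute_values(DOC)
for name in sorted(values):
    # repr() makes spaces and tabs distinguishable in the printout.
    print(name, repr(values[name]))
```

With a parser that follows the XML 1.0 normalisation algorithm, a and c should come out as a single space, b and d as a tab, and e as the six literal characters &#x09; (no tab at all), which is the difference between "&#38;#x09;" and "&amp;#x09;" that the follow-up question ran into.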