XSLT, include html, mix html and xml
1. | How to use HTML as input documents? | |||||
Basically, use sx, Tidy, JTidy, or NekoHTML. John Cowan's TagSoup can also be used, but the last time I tried it, it messed up the whitespace in pre element content. NekoHTML is bliss with Java XSLT engines: it works just like a SAX parser, and Saxon, for example, has an option in its CLI to use it for parsing. | ||||||
2. | Embedding HTML in XML documents using HTML dtd | |||||
>I would like to enable HTML tags within my XML file - using the HTML >dtd. For example, if there is a list in the XML: ><list> ><UL> > <LI>List item 1</LI> > <LI>List item 2</LI> > </UL> ></list> > >What do I have to add to the XML, the DTD and the XSL to be able to >convert this to a list when I generate an HTML file? I would >like to use the HTML dtd to make this work. There are two approaches to this problem: either you can use what you know about your XML to say "the content of a 'list' element is HTML and should be copied directly" or you can explicitly put the UL and LI elements in the HTML namespace within the source XML, and then within your stylesheet say "all HTML elements should be copied". In either case, you need to know about the xsl:copy and xsl:copy-of elements. xsl:copy copies the current node, but none of its contents or attributes. xsl:copy-of copies a node set that you select, including all of its contents and any attributes or namespace nodes. The first is simpler but less extensible: when you find a 'list' element, you make a copy of its element content: <xsl:template match="list"> <xsl:copy-of select="*" /> </xsl:template> Given an input of: <list> <UL> <LI>List item 1</LI> <LI>List item 2</LI> </UL> </list> This will give: <UL> <LI>List item 1</LI> <LI>List item 2</LI> </UL> The problem is that you have to do something similar anywhere else where you have HTML elements within your XML elements and you want them copied. It might be that 'lists' are the only elements where HTML elements occur, in which case this is the easiest solution. The second solution is to use namespaces to explicitly say that the UL and LI elements are HTML elements. To do that, you associate a namespace prefix (a string that you can choose) to a namespace name (a string that you can choose, but that should probably be a URI pointing to a DTD, schema, or human-readable documentation about the elements you're using). 
For common XML dialects like HTML, there is usually a namespace name defined somewhere, and using that namespace name could enable you to use other people's stylesheets that also process elements in that namespace. In the case of XHTML, the namespace name is: http://www.w3.org/1999/xhtml You can associate the prefix 'html' with this namespace name using a namespace attribute: xmlns:html="http://www.w3.org/1999/xhtml" You don't have to use the prefix 'html' - you can use anything you want. This attribute should be put on an element that is an ancestor of the HTML elements (or is itself an HTML element). A namespace attribute makes a namespace 'in scope' (i.e. usable) for the element that it's on and all its descendents. Usually you'd put it on your document element (i.e. the top-most element). In your case, you could put it on the 'list' element: <list xmlns:html="http://www.w3.org/1999/xhtml"> ... </list> Within the 'list' element, any elements that are within the HTML namespace need to be given qualified names to indicate that fact. You do this by adding the prefix (i.e. 'html') and a colon before the name of the element, so: <list xmlns:html="http://www.w3.org/1999/xhtml"> <html:UL> <html:LI>List item 1</html:LI> <html:LI>List item 2</html:LI> </html:UL> </list> As a quick aside, XHTML defines that element names should be in lower case, so I'd make this: <list xmlns:html="http://www.w3.org/1999/xhtml"> <html:ul> <html:li>List item 1</html:li> <html:li>List item 2</html:li> </html:ul> </list> for compliance to that standard. In terms of the DTD for the source XML, DTDs and namespaces don't mix particularly well: you have to use the same qualified names within the DTD as you use within your XML, which means that the prefix is fixed within the DTD. [You could get around this using a parameter entity.] 
If you have to validate your source XML against a DTD, then the DTD should hold something like: <!ELEMENT list (html:ul)> <!ATTLIST list xmlns:html CDATA #FIXED 'http://www.w3.org/1999/xhtml'> <!ELEMENT html:ul (html:li+)> <!ELEMENT html:li (#PCDATA)> You may be able to draw on some of the XHTML modularisation work to import relevant parts of the HTML DTD, but they may not be using qualified names, I'm not sure. Within the XSLT stylesheet, you have to ensure that all the relevant namespaces are declared so that whenever you use a qualified name (like 'html:UL'), the namespace declaration for it is 'in scope'. This usually means putting the namespace attribute on the xsl:stylesheet document element: <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:html="http://www.w3.org/1999/xhtml"> ... </xsl:stylesheet> Again, you don't have to use the 'html' prefix, but you *do* have to make sure that the namespace name (the http://www.w3.org/1999/xhtml URI) is the same in your source XML and your stylesheet. It's actually the namespace name (or URI) that is used to determine the namespace that an element is in, not its prefix. Within your stylesheet, then, you can now place the rule "copy all HTML elements". The following template matches any element in the source that's within the HTML namespace (whether it's within a 'list' or not): <xsl:template match="html:*"> <xsl:copy-of select="." /> </xsl:template> However, when you're producing HTML output, copying is a bad idea because while the XSLT processor will produce something that is technically correct XML, it will not be interpreted correctly by the vast majority of HTML browsers. The above, for example, produces: <html:ul xmlns:html="http://www.w3.org/1999/xhtml"> <html:li>List item 1</html:li> <html:li>List item 2</html:li> </html:ul> because it literally copies everything, including the namespace nodes. 
Instead, then, you should create by hand the relevant elements and attributes, giving them names corresponding to the local part of their name, without the namespace prefix: <xsl:template match="html:*"> <xsl:element name="{local-name()}"> <xsl:for-each select="@*"> <xsl:attribute name="{local-name()}"> <xsl:value-of select="." /> </xsl:attribute> </xsl:for-each> <xsl:apply-templates /> </xsl:element> </xsl:template> (The attributes are selected with @* rather than @html:* because attributes on HTML elements are normally unprefixed, and an unprefixed attribute is in no namespace.) This has the added advantage that if you have any specialised XML embedded within your HTML elements, it will be treated as that specialised XML rather than simply copied without paying attention to what it is. So, to summarise:
| ||||||
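The namespace-based approach described above can be put together as one small stylesheet. This is a minimal sketch, not a definitive version: it assumes the source uses the XHTML namespace URI exactly as shown, that attributes on the embedded HTML elements are unprefixed, and that the 'list' wrapper should simply disappear from the output.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:html="http://www.w3.org/1999/xhtml">

  <xsl:output method="html"/>

  <!-- Assumption: 'list' is only a wrapper, so just process its children. -->
  <xsl:template match="list">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Rebuild every element in the XHTML namespace under its local name,
       so the result carries no html: prefix and no namespace declaration. -->
  <xsl:template match="html:*">
    <xsl:element name="{local-name()}">
      <!-- Unprefixed attributes are in no namespace, hence @* here. -->
      <xsl:for-each select="@*">
        <xsl:attribute name="{local-name()}">
          <xsl:value-of select="."/>
        </xsl:attribute>
      </xsl:for-each>
      <!-- Any non-HTML children still get their own templates. -->
      <xsl:apply-templates/>
    </xsl:element>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the html:ul / html:li example above, this should give a plain ul element with two li children. Because the children are reached through xsl:apply-templates rather than xsl:copy-of, you can add further templates for your own XML elements nested inside the HTML.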
3. | HTML to XML conversion | |||||
1. XML Global has a two-way conversion process that is under scrutiny for its use in retaining XML semantics within the confines of an HTML document. It has been referred to by its acronym, XHML, which stands for Extensible Hybrid Markup Language. Do not worry: it's not being proposed as another standard! ;-) The idea is that our XML search engine can then index HTML data the same way it captures XML documents for indexing. It converts XML to HTML but retains the original context by embedding the XML tags, prefixed with a namespace, within the HTML document. By default, ours appear as <xhml:MyOriginalXmlTagHere>some content</xhml:MyOriginalXmlTagHere>. This avoids collisions with other namespaces. The user still sees exactly the same page, because browsers ignore tags they don't understand, for forward compatibility. Our goal is to offer this as a way to make the WWW more intelligent to search through one day. The HTML to XML transformation is only valid on properly formed HTML. When we parse the HTML described in the earlier post (XHML), we ignore all tags except those that start with the namespace prefix (in our case <xhml:foo>). I do not think there would be any purpose in transforming any HTML tags, with the exception of properly constructed <html>, <head>, <title>, <body>, and <p> tags.
2. A company called Percussion has a two-way conversion engine that they unleashed this week at XML 99. It converts XML to HTML but retains the original context by embedding the XML tags within the HTML document inside a span of the form <Span ID="PREFIX - OriginalTagHere">. They have a two-way demo for XML to HTML to XML conversions. | ||||||
4. | Exclude-result-prefixes | |||||
5. | How to embed HTML in XML | |||||
> I want to embed HTML into XML. The problem which I am facing is that the
> XSLT processor gives errors for HTML tags and forces me to define them in
> the DTD.
>
> Is there some simple way in which I can make the XSLT processor ignore
> the HTML tags, so that I don't have to define them in the DTD?
HTML and XML do not mix. You can pretend your HTML is one character data section, copy it to a text node, and rely on the disable-output-escaping functionality in XSLT to work with it, but ultimately you may find it easier to make your HTML well-formed XML (XHTML; see the spec at w3.org) and work with that in your XML instead.
As for validation, an XML parser feeds the XSL processor some information about the logical contents of the XML document. It is possible that you are invoking an XSL processor and it is invoking an XML parser for you. The XSL processor does not know about the DTD. You should check the documentation for your XSL processor and/or XML parser to figure out how to get it to parse without validating.
You can't half-validate, though. If you're validating, you must declare all the elements that are used in the document. No way around that. | ||||||
6. | Including BR in an XSL template | |||||
>Some browsers just don't like the <BR/> format, and display two empty
>lines with the <BR></BR> format.
Putting it all together:
1. xsl:output method="html" specified
2. Stylesheet says to output <BR/>
3. XSLT processor takes (1) into account when performing (2), so it actually generates <BR>
As already mentioned, the above behavior is required by the XSLT spec (Section 16.2, paragraph 4), so you can depend on it. | ||||||
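The three steps above can be seen in a stylesheet as small as the following. It is only an illustrative sketch: the paragraph text is made up, and the stylesheet ignores its input document entirely.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Step 1: ask for HTML serialization rather than XML. -->
  <xsl:output method="html"/>

  <xsl:template match="/">
    <html>
      <body>
        <!-- Step 2: the stylesheet itself must be well-formed XML,
             so the empty element is written as <br/>. -->
        <p>First line<br/>Second line</p>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

With the html output method the serializer writes the empty element as a start tag only, which is step 3. The same stylesheet with method="xml" would serialize it in XML empty-element syntax instead.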
7. | HTML include problem | |||||
> I have created my XML documents with schemas and XSLs. Is it possible to
> use 'include' method in XSL so I don't have to type the same piece of 'html'
> code in each XSL? Those 'html' code are static for all XSLs. E.g. header and
> footer.

> Could anyone help me out with my problem.
> I want to embed an HTML document (by giving the path of the file) into either
> an XSL or XML.
> Could you tell me how?

I'll answer these together since they are both the same question.

The first thing is that you need to make sure that your HTML files are well-formed XML. XSLT can't process anything that isn't well-formed XML. You can make your HTML well-formed by running it through Tidy.

Once you have your HTML files well-formed, you can include them in your output in two main ways. Firstly, you can declare them as external parsed entities within the stylesheet and then put references to them wherever you want them. For example:

  <?xml version="1.0"?>
  <!DOCTYPE xsl:stylesheet [
    <!-- declares header.html as an external parsed entity -->
    <!ENTITY header SYSTEM "header.html">
  ]>
  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
      <html>
        <head><title>Test Page</title></head>
        <body>
          <!-- includes header.html directly -->
          &header;
          <xsl:apply-templates />
        </body>
      </html>
    </xsl:template>
  </xsl:stylesheet>

[See archive for more details.]

Secondly, you can access them using the document() function and create copies of their content. For example:

  <xsl:template match="/">
    <html>
      <head><title>Test Page</title></head>
      <body>
        <xsl:copy-of select="document('header.html')" />
        <xsl:apply-templates />
      </body>
    </html>
  </xsl:template>

The second method is probably better in most situations because it limits the effective size of the stylesheet. However, it can give quite verbose output if you've declared the HTML namespace in your HTML (which you should probably have done). xsl:copy-of gives a complete copy of every node in the node set (i.e. the document in this case) and, like xsl:copy, it includes the namespace nodes, so the copied elements carry the xmlns="http://www.w3.org/1999/xhtml" declaration into your output.

A final approach is to have a couple of templates that copy the elements by hand (as it were) and thus don't include the namespace (best to define a separate mode to do this):

  <xsl:template match="*" mode="copy">
    <xsl:element name="{local-name()}">
      <xsl:copy-of select="@*" />
      <xsl:apply-templates mode="copy" />
    </xsl:element>
  </xsl:template>

  <xsl:template match="text()" mode="copy">
    <xsl:value-of select="." />
  </xsl:template>

And then apply templates to the document in copy mode:

  <xsl:template match="/">
    <html>
      <head><title>Test Page</title></head>
      <body>
        <xsl:apply-templates select="document('header.html')" mode="copy" />
        <xsl:apply-templates />
      </body>
    </html>
  </xsl:template>
| ||||||
8. | How to embed xsl:value-of into html tag | |||||
You want an attribute value template, like so: <xsl:template match="productName"> <input type="text" name="productName" value='{.}' size="25" maxlength="30" /> </xsl:template> There's another way to do it with <xsl:attribute>, but this is the easy way. It's not a workaround either. An AVT (recognize it by the { } ) is explicitly provided as a way of saying "inside this attribute, evaluate this expression instead of taking it as a literal." | ||||||
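For comparison, the xsl:attribute form mentioned above looks roughly like this. It is a sketch that produces the same input element as the attribute value template version; the surrounding stylesheet and the html output method are assumptions added only to make the example complete.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <xsl:template match="productName">
    <input type="text" name="productName" size="25" maxlength="30">
      <!-- xsl:attribute must come before any child content is added
           to the element it belongs to. -->
      <xsl:attribute name="value">
        <xsl:value-of select="."/>
      </xsl:attribute>
    </input>
  </xsl:template>

</xsl:stylesheet>
```

The longer form earns its keep when the attribute is conditional (wrapped in xsl:if) or when its name is computed, which an attribute value template cannot do.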
9. | Embedding html in xml problem | |||||
>I want to put lots of html code between <top> and </top> (and all the rest >too). So in my xsl I can have something like: > > <xsl:template match="top"> > <table width="100%" border="0" height="30"> > <tr> > <td> > <!-- whatever it takes to grab the values between the top tags, >value-of or whatever --> > </td> > </tr> > </table> > <xsl:call-template name="left"/> > <xsl:call-template name="bottom"/> > </xsl:template> You will find XSL a lot easier and more gratifying to use if you think in terms of the abstract node trees that you're manipulating rather than the string representation of those trees. It is very very very rare that you need to resort to CDATA sections and disable-output-escaping. If you're wanting to grab all the content within the 'top' element and just copy it node-for-node, then you're looking for xsl:copy-of. xsl:copy-of takes the nodes that you specify (so the HTML content of your 'top' element) and copies them exactly as they are, with all their attributes and all their content intact: <xsl:template match="top"> <table width="100%" border="0" height="30"> <tr> <td> <xsl:copy-of select="node()" /> </td> </tr> </table> <xsl:call-template name="left"/> <xsl:call-template name="bottom"/> </xsl:template> If the content of the 'top' element has some elements in it that you need to process (e.g. embedded in the HTML, you have some XML elements that indicate where the portal user's name should go) then you need to step through the HTML, applying templates that, by default, copy the elements and apply templates to their content, with exceptions for those embedded XML elements: <xsl:template match="*"> <xsl:copy> <xsl:copy-of select="@*" /> <xsl:apply-templates /> </xsl:copy> </xsl:template> <xsl:template match="insert-user"> <xsl:value-of select="$user" /> </xsl:template> Note that the default template (match="*") will match on *any* element. 
It may be worthwhile using namespaces to indicate which elements within your input are HTML and which are XML, because you can then limit the copying template to copying only the HTML elements: <xsl:template match="html:*"> ... </xsl:template> Or the other alternative is to use 'modes' to limit the application of the template. So in your 'top' matching template have: <xsl:template match="top"> <table width="100%" border="0" height="30"> <tr> <td> <xsl:apply-templates mode="copy" /> </td> </tr> </table> <xsl:call-template name="left"/> <xsl:call-template name="bottom"/> </xsl:template> and make the copying template be both within that mode and apply templates in that mode: <xsl:template match="*" mode="copy"> <xsl:copy> <xsl:copy-of select="@*" /> <xsl:apply-templates mode="copy" /> </xsl:copy> </xsl:template> | ||||||
10. | HTML in XML | |||||
> We have some fields in our database that have HTML tags in the data to emphasize the
> text in the browser. A good example of this is to have parts of data being
> bold using <STRONG> tag, or displaying information using bullets with <UL>
> tags. I create XML using this data, and when I send the XML to the XSLT
> processor, I don't get any results back if there are any HTML tags in the
> data.
> Is there any way I can have the XSLT processor to ignore the HTML Tags?
If you are generating query results in which the HTML fragment is carried as escaped text, like: <ROWSET> <ROW> <SKU>12345</SKU> <DESC>Big Red Apple</DESC> <BLURB> &lt;ul&gt; &lt;li&gt;&lt;b&gt;Delicious&lt;/b&gt;&lt;/li&gt; &lt;li&gt;&lt;b&gt;Nutritious&lt;/b&gt;&lt;/li&gt; &lt;/ul&gt; </BLURB> </ROW> </ROWSET> and you want the HTML "blurb" to show up "verbatim" in the output, you'd use (assuming the current node is "ROWSET/ROW"): <xsl:value-of disable-output-escaping="yes" select="BLURB"/> This will produce: <ul> <li><b>Delicious</b></li> <li><b>Nutritious</b></li> </ul> Note that disable-output-escaping is an optional feature, and it only has an effect when the XSLT processor serializes the result itself. | ||||||
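If the BLURB element holds real child elements rather than escaped text, xsl:value-of returns only its string value, so the tags are lost. In that case copying the nodes is the usual route. A minimal sketch follows, assuming the ROWSET/ROW structure from the question; the div wrapper, the h2 heading and the class name "product" are made up for illustration.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <xsl:template match="ROWSET/ROW">
    <!-- Hypothetical wrapper markup; adapt to your own page layout. -->
    <div class="product">
      <h2><xsl:value-of select="DESC"/></h2>
      <!-- Copy BLURB's children (elements and text) as nodes,
           leaving the BLURB wrapper itself out of the output. -->
      <xsl:copy-of select="BLURB/node()"/>
    </div>
  </xsl:template>

</xsl:stylesheet>
```

This only works if the HTML stored in the database is well-formed, because otherwise the XML containing it cannot be parsed in the first place.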
11. | How to copy HTML elements | |||||
Here is a complete HTML 4.01 XSL <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:output method="xml" indent="no" omit-xml-declaration="yes"/> <xsl:param name="reportDir"/> <!-- Hack: copy HTML elements over --> <xsl:template match="/"> <xsl:apply-templates/> </xsl:template> <xsl:template match=" a | abbr | acronym | address | applet | area | b | base | basefont | bdo | big | blockquote | body | br | button | caption | center | cite | code | col | colgroup | dd | del | dfn | dir | div | dl | dt | em | fieldset | font | form | frame | frameset | h1 | h2 | h3 | h4 | h5 | h6 | head | hr | html | i | iframe | img | input | ins | isindex | kbd | label | legend | li | link | map | menu | meta | noframes | noscript | object | ol | optgroup | option | p | param | pre | q | s | samp | script | select | small | span | strike | strong | style | sub | sup | table | tbody | td | textarea | tfoot | th | thead | title | tr | tt | u | ul | var"> <xsl:copy> <xsl:copy-of select="@*"/> <xsl:apply-templates/> </xsl:copy> </xsl:template> <!-- End Hack: copy HTML elements over --> <!-- ADD template here --> </xsl:stylesheet> | ||||||
12. | Headers and Footers in HTML | |||||
I'm working on a project which has standard headers and footers, and we use the following to call the include files:

  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:include href="../document.xsl" />
    <xsl:param name="title">Search Results</xsl:param>
    <xsl:template name="body">
      <!-- your data here -->
    </xsl:template>
  </xsl:stylesheet>

This includes document.xsl, which looks like this:

  <?xml version="1.0" encoding="utf-8"?>
  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="*"/>
    <!-- The main template that wraps the html tags, header and footer content around the body template -->
    <!-- when it finds the top level xml tag (document in my case) the template is called -->
    <xsl:template match="document">
      <html lang="EN-US" dir="LTR">
        <head>
          <script type="text/javascript" language="javascript">
            <!-- any javascript you may want here -->
          </script>
          <link rel="stylesheet" type="text/css" href="stylesheet.css"/>
          <title><xsl:value-of select="$title"/></title>
        </head>
        <body>
          <!-- Your header info here -->
          <xsl:call-template name="body"/><!-- this calls the body template in the main xsl -->
          <!-- Your footer info here -->
        </body>
      </html>
    </xsl:template>
  </xsl:stylesheet>

You will notice that each page has its own title because of the param in the original XSL. | ||||||
13. | Parsing HTML as XML | |||||
| ||||||
14. | Embed and don't escape xml in html | |||||
Evan Lenz has a tidy solution for this at xmlportfolio.com |