1. | Muenchian techniques |
I bumped into this idea while writing up a section in my forthcoming book on Building Oracle XML Applications for O'Reilly, where I was trying to understand and subsequently explain the cases where using the database to do the data "shaping"/grouping was best and where doing a single query and allowing XSLT to do the grouping was best. The data "shaping" techniques supported in Microsoft ADO and in Oracle using nested CURSOR() expressions end up achieving the result using many individual data cursors and trips across the network. Depending on the amount of data this might be too much network traffic. The XSLT grouping approach can be better in some scenarios, when doing a single query to join a master and a detail table doesn't result in tons of duplicated data. So, given a trivial example like a DEPT and an EMP table... If you want to include just a little information (like Dept Name) from the DEPT table, then it's probably better to join DEPT and EMP and let XSLT "group" the joined data. This is especially true if there are hundreds or thousands of employees. If instead you need 30 columns of information from DEPT and 30 columns of info from EMP, then joining them will produce lots of redundant DEPT info. In that case it's better to let the DB do the data shaping and deliver an XML document that is pre-grouped into departments. Jeni Tennison expands: The Muenchian technique is a grouping method discovered by Steve Muench, and explained on this list by the man himself: http://sources.redhat.com/ml/xsl-list/2000-05/msg00276.html Steve M. Using keys, Steve increased the efficiency of grouping when compared to the old technique of using the preceding-sibling or following-sibling axes and predicates. Unfortunately, Steve discovered this *after* Mike Kay's book came out. I have added a special note to my copy at the bottom of page 560 to help remind me to use the Muenchian technique instead whenever I have a grouping problem to solve. 
In addition, there was some discussion a few weeks ago about the best way of comparing nodes when using this technique. The two options use either generate-id() or counting the number of nodes in a node-set union. These were elucidated by the XPath guru, David Carlisle. I believe that it's still up in the air which is more efficient; this is probably processor dependent. Finally, if you want to learn more about keys, I learnt a lot from the online illustration from Crane Softwrights Ltd. at Ken Holman's site. I hope these pointers are useful to you. | |
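The single-query case described above (join DEPT and EMP, then let XSLT group the rows) can be sketched as follows. This is only an illustration under stated assumptions: the input is assumed to be a flat `ROWSET` of `ROW` elements, each carrying a `DNAME` and an `ENAME` child. Those element names, the key name `rows-by-dept`, and the output element names are hypothetical and not taken from the original posts.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Index every joined row by its department name
       (ROW and DNAME are assumed element names) -->
  <xsl:key name="rows-by-dept" match="ROW" use="DNAME"/>

  <xsl:template match="/ROWSET">
    <departments>
      <!-- Keep only the first row the key returns for each distinct DNAME -->
      <xsl:for-each
          select="ROW[generate-id(.) =
                      generate-id(key('rows-by-dept', DNAME)[1])]">
        <department name="{DNAME}">
          <!-- Fetch all rows of the current department through the key -->
          <xsl:for-each select="key('rows-by-dept', DNAME)">
            <employee>
              <xsl:value-of select="ENAME"/>
            </employee>
          </xsl:for-each>
        </department>
      </xsl:for-each>
    </departments>
  </xsl:template>
</xsl:stylesheet>
```

The key is built once per document, so each group lookup avoids rescanning the preceding-sibling or following-sibling axis for every row.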
2. | Muenchian grouping, history |
Oh I didn't mean to undermine that. In fact I was one of the first to see that: SM mailed me (and MK I think, can't remember) outlining his idea to use keys for grouping just shortly before he mailed it to this list. One of the things that finally weaned me off xt (which didn't support keys) Ah found SM's mail (Fri, 5 May 2000, originally sent to Michael Kay then forwarded by Steve to me as well) Not sure I should post other people's mail though, although an interesting historical record:-) Here's Steve's public posting a few days later
| |
3. | Kaysian Technique - Member intersection and difference function |
This is surprisingly difficult. Saxon and xt both provide intersection() as an extension function. The only way of doing it within the standard is to rely on union: count($nodeset) = count($nodeset | $node) will be true iff $node is a member of $nodeset. I had a think about this in the bath. The following expression finds the intersection of $ns1 and $ns2: $ns1[count(.|$ns2)=count($ns2)] Oliver Becker adds: If we add a single exclamation mark $nodes1[count(.|$nodes2)!=count($nodes2)] it will become the difference of the two node-sets $nodes1 and $nodes2. Ken Holman begs to differ, and offers: I figured you would need: ( $ns1[count(.|$ns2)!=count($ns2)] | $ns2[count(.|$ns1)!=count($ns1)] ) and offers the example <?xml version="1.0"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:data="ken" version="1.0"> <xsl:output method="text"/> <data:data> <item>1</item> <item>2</item> <item>3</item> <item>4</item> <item>5</item> </data:data> <xsl:template match="/"> <!--root rule--> <xsl:variable name="ns1" select="document('')//data:data/item[position()>1]"/> <xsl:variable name="ns2" select="document('')//data:data/item[position()&lt;5]"/> <xsl:for-each select="$ns1[count(.|$ns2)=count($ns2)]"> Intersection: <xsl:value-of select="."/> </xsl:for-each> <xsl:for-each select="( $ns1[count(.|$ns2)!=count($ns2)] | $ns2[count(.|$ns1)!=count($ns1)] )"> Difference: <xsl:value-of select="."/> </xsl:for-each> </xsl:template> </xsl:stylesheet> | |
4. | Muenchian examples |
DaveP Source file <?xml version="1.0" standalone="yes"?> <wrapper> <state>xxxx</state> <state>yyyy</state> <state>zzzz</state> <st>xxxx</st> <st>xxxY</st> <st>xxxZ</st> <st>xxxA</st> <st>xxxB</st> <st>xxxC</st> </wrapper> Stylesheet. Note that I used James Clark's XT here. Substitute the appropriate extension namespace if you want to use Saxon or Xalan. <?xml version="1.0"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" xmlns:xt="http://www.jclark.com/xt" extension-element-prefixes="xt"> <xsl:output method="xml"/> <xsl:strip-space elements="*"/> <xsl:template match="/"> <!-- Cycle the definitions of ns1 through the following wrapper/state wrapper/st[1] wrapper/st --> <xsl:variable name="ns1" select="wrapper/st"/> <xsl:variable name="ns2" select="wrapper/st"/> Original values: ns1 <xsl:for-each select="$ns1"> "<xsl:value-of select="."/>" </xsl:for-each> ns2 <xsl:for-each select="$ns2"> "<xsl:value-of select="."/>" </xsl:for-each> ============================= difference: in $ns1, not in $ns2 <xsl:for-each select="xt:difference($ns1,$ns2)"> "<xsl:value-of select="."/>" </xsl:for-each> Intersection: Present in both <xsl:for-each select="xt:intersection($ns1,$ns2)"> "<xsl:value-of select="."/>" </xsl:for-each> Kaysean technique of member - count($ns2) = count($ns2 | $ns1) <xsl:value-of select="count($ns2) = count($ns2 | $ns1)"/> Kaysean technique of intersection - $ns1[count(.|$ns2)=count($ns2)] <xsl:for-each select="$ns1[count(.|$ns2)=count($ns2)]"> "<xsl:value-of select="."/>" </xsl:for-each> Becker Difference: $ns1[count(.|$ns2)!=count($ns2)] <xsl:for-each select="$ns1[count(.|$ns2)!=count($ns2)]"> "<xsl:value-of select="."/>" </xsl:for-each> Holman techique: Difference( $ns1[count(.|$ns2)!=count($ns2)] | $ns2[count(.|$ns1)!=count($ns1)] ) <xsl:for-each select="$ns1[count(.|$ns2)!=count($ns2)]| $ns2[count(.|$ns1)!=count($ns1)]"> "<xsl:value-of select="."/>" </xsl:for-each> </xsl:template> </xsl:stylesheet> And the output, for one 
example, as given. Original values: ns1 "xxxx" "xxxY" "xxxZ" "xxxA" "xxxB" "xxxC" ns2 "xxxx" "xxxY" "xxxZ" "xxxA" "xxxB" "xxxC" ============================= difference: in $ns1, not in $ns2 Intersection: Present in both "xxxx" "xxxY" "xxxZ" "xxxA" "xxxB" "xxxC" Kaysean technique of member - count($ns2) = count($ns2 | $ns1) true Kaysean technique of intersection - $ns1[count(.|$ns2)=count($ns2)] "xxxx" "xxxY" "xxxZ" "xxxA" "xxxB" "xxxC" Becker Difference: $ns1[count(.|$ns2)!=count($ns2)] Holman techique: Difference( $ns1[count(.|$ns2)!=count($ns2)] | $ns2[count(.|$ns1)!=count($ns1)] ) Ed Blachman adds this rider, and another example, with a mathematical concern over the definitions. Regarding the Kaysian technique and its offshoots: there's standard nomenclature for distinguishing between Becker and Holman's difference expressions. Becker's implements what is commonly called the "set difference", whereas Holman's implements the "set symmetric difference". (See set difference and set symmetric difference for details and substantiation.) The references come from a site called "Eric Weisstein's World of Mathematics", which is hosted by Wolfram Research (publishers of Mathematica). I don't know anything about Weisstein or the site that's independent of what I see there (which includes his bio and the claim that the site is "the world's most complete mathematics resource"). The site has a hardcover version, which has been published as The CRC Concise Encyclopedia of Mathematics by CRC Press. (It can be bought in hardcover or CD-ROM from, for example, amazon.com.) The references above are the two pages on the site that define the terms "set difference" and "symmetric difference". The two pages are concise and clear and, as far as I can see as a long-ago math major, unobjectionable (except, perhaps, for the fact that Weisstein insists on putting on both of these two pages how the term being defined is implemented in Mathematica). 
I found these pages by doing a Yahoo search; I don't remember exactly what I looked for, but it was probably something like "set difference". I came across a bunch of pages that gave definitions for difference functions in one or another computer language, all of which agreed with xt:difference (i.e. with Becker). Those did *not* strike me as authoritative because they didn't discuss the symmetric difference (i.e. Holman's concept), leading me to believe they might not be well-grounded, and as an ex-math person, I think of set theory as a branch of math and wanted a math-oriented reference. In contrast, mathworld.wolfram.com looks to me to be authoritative. Ken Holman doubtless didn't just come to his assumption out of the blue. There is probably a community out there that commonly uses the term "difference of two sets" as Holman does, to refer to what I'd now (after seeing Weisstein's site) call the symmetric difference. The XSL FAQ is probably not the right place to host an argument over math terminology. (And here's another example: what you call member -- the boolean answer to the question "is every element of ns1 also an element of ns2?" -- I would call "subset", saving member for the boolean answer to the question "is the set ns1 an element of the set ns2?". Example 2. set1: { a, b, c } set2: { a, b, c, d } set3: { a, q, { a, b, c } } set1 is a subset of set2. set1 is a *member* of set3 -- but *not* a subset of set3. In the XSLT context, we don't have math's generalized sets, but rather the specialized construct the node-set -- a set of nodes. In other words, you can't make an XSLT set that looks like set3 above. So the distinction is a math geek's distinction, except that if people are going to have to differentiate between Becker's difference and Holman's difference, I think they're more likely to do well in talking with random outsiders by saying "set difference" and "symmetric difference" instead.) Ed Blachman's example. 
The stylesheet is as above, but modifies the variables. <?xml version="1.0" standalone="yes"?> <wrapper> <st ns1="yes">xxxx</st> <st ns1="yes" ns2="yes">xxxY</st> <st ns2="yes">xxxZ</st> <st ns1="yes" ns2="yes">xxxA</st> <st>xxxB</st> <st ns1="yes">xxxC</st> </wrapper> <?xml version="1.0"?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0" xmlns:saxon="http://icl.com/saxon" extension-element-prefixes="saxon"> <xsl:output method="text"/> <xsl:strip-space elements="*"/> <!-- <xsl:variable name="ns1" select="wrapper/state[2]"/> <xsl:variable name="ns2" select="wrapper/st"/> --> <xsl:variable name="ns1" select="wrapper/st[@ns1]"/> <xsl:variable name="ns2" select="wrapper/st[@ns2]"/> <xsl:key name="target-node-set" match="/wrapper/st" use="."/> <xsl:template match="/"> <!-- Cycle the definitions of ns1 through the following wrapper/state wrapper/st[1] wrapper/st --> Original values: ns1 <xsl:for-each select="$ns1"> "<xsl:value-of select="."/>" </xsl:for-each> ns2 <xsl:for-each select="$ns2"> "<xsl:value-of select="."/>" </xsl:for-each> ============================= difference: in $ns1, not in $ns2 <xsl:for-each select="saxon:difference($ns1,$ns2)"> "<xsl:value-of select="."/>" </xsl:for-each> Intersection: Present in both <xsl:for-each select="saxon:intersection($ns1,$ns2)"> "<xsl:value-of select="."/>" </xsl:for-each> Kaysean technique of member - count($ns2) = count($ns2 | $ns1) <xsl:value-of select="count($ns2) = count($ns2 | $ns1)"/> Kaysean technique of intersection - $ns1[count(.|$ns2)=count($ns2)] <xsl:for-each select="$ns1[count(.|$ns2)=count($ns2)]"> "<xsl:value-of select="."/>" </xsl:for-each> Becker Difference: $ns1[count(.|$ns2)!=count($ns2)] <xsl:for-each select="$ns1[count(.|$ns2)!=count($ns2)]"> "<xsl:value-of select="."/>" </xsl:for-each> Holman techique: Difference( $ns1[count(.|$ns2)!=count($ns2)] | $ns2[count(.|$ns1)!=count($ns1)] ) <xsl:for-each select="$ns1[count(.|$ns2)!=count($ns2)]| $ns2[count(.|$ns1)!=count($ns1)]"> 
"<xsl:value-of select="."/>" </xsl:for-each> </xsl:template> </xsl:stylesheet> Original values: ns1 "xxxx" "xxxY" "xxxA" "xxxC" ns2 "xxxY" "xxxZ" "xxxA" ============================= difference: in $ns1, not in $ns2 "xxxx" "xxxC" Intersection: Present in both "xxxY" "xxxA" Kaysean technique of member - count($ns2) = count($ns2 | $ns1) false Kaysean technique of intersection - $ns1[count(.|$ns2)=count($ns2)] "xxxY" "xxxA" Becker Difference: $ns1[count(.|$ns2)!=count($ns2)] "xxxx" "xxxC" Holman techique: Difference( $ns1[count(.|$ns2)!=count($ns2)] | $ns2[count(.|$ns1)!=count($ns1)] ) "xxxx" "xxxZ" "xxxC" | |
5. | Allouche's method |
The Allouche Method helps you control whitespace in your result. When the XSLT processor reads in the stylesheet, it strips out any whitespace-only text nodes (text nodes that are made up purely of whitespace), but it leaves in any text nodes that have non-whitespace characters in them. Usually you'd get around this by wrapping the text that you actually want added within an xsl:text element. So for example: <xsl:text>(</xsl:text> <xsl:value-of select="$expression" /> <xsl:text>)</xsl:text> Rather than doing that, you can use an empty xsl:text element (or indeed any other XSLT element, but xsl:text is good because it's short and it doesn't give you any output) to delimit the whitespace that you don't want. So instead of the above, I could do: <xsl:text />(<xsl:value-of select="$expression" />)<xsl:text /> (Note: The other method for dealing with this kind of situation is to wrap the text in a concat() in the xsl:value-of, e.g.: <xsl:value-of select="concat('(', $expression, ')')" /> That's probably a little more efficient and reduces the size of the stylesheet node tree, but the Allouche Method is more general.) | |
6. | Validating and enforcing a list of attribute values - Novatchev-Tennison method |
I have <thesis faculty="arts" department="history" options="foo bar blort"> </thesis> where foo.sty, bar.sty, and blort.sty are among the allowed options (i.e. declared entities in the DTD) for translation to LaTeX's \usepackage{foo,bar,blort}. Just a safety check to make sure that only valid styles are present. Here's a solution that doesn't involve recursive templates, as long as you don't mind repeating in the stylesheet the list of allowed options: <?xml version="1.0"?> <!DOCTYPE xsl:stylesheet [ <!ATTLIST thesis:opt id ID #REQUIRED> ]> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:thesis="http://www.silmaril.ie/thesis" exclude-result-prefixes="thesis"> <thesis:opts> <thesis:opt id="foo" /> <thesis:opt id="bar" /> </thesis:opts> <xsl:template match="thesis"> <xsl:text>Options: </xsl:text> <xsl:variable name="options" select="@options" /> <!-- change the context to the stylesheet document --> <xsl:for-each select="document('')/*"> <!-- get the id attributes of any options identified by ID --> <xsl:for-each select="id($options)/@id"> <xsl:value-of select="." /> <xsl:if test="position() != last()">, </xsl:if> </xsl:for-each> </xsl:for-each> </xsl:template> </xsl:stylesheet> It has the advantage that you can deal with bad values (e.g. if 'blort' wasn't actually allowed) within the stylesheet, if that's an advantage. It also works in cases where the DTD isn't available for some reason, when you wouldn't have the information about the @options attribute being an ENTITIES attribute *or* the entities that could be used in it! Basically, the stylesheet could be used for validation even if the source XML couldn't be validated according to the DTD. | |
7. | Piez method |
The Piez Method is for iterating a number of times. As you know, there's no way that you can do a for loop in XSLT in the same way as you would in a procedural programming language - you can only iterate over a number of nodes. However, what you can do is choose the number of nodes you iterate over, and then use the position() of that node to indicate the number of the iteration. So you set up a random set of nodes (I usually use nodes in the stylesheet itself): <xsl:variable name="random-nodes" select="document('')//node()" /> Then you pick from them the number that you want, and iterate over them: <xsl:for-each select="$random-nodes[position() &lt;= $number]"> ... </xsl:for-each> Whatever you want to do, held in the xsl:for-each, is repeated $number times. (Note: You can do the same thing with a recursive template, and the Piez Method can be tricky if you can't find enough random nodes to use, and can take up a lot of memory if you collect too many random nodes, but most of the time it's a lot less bother than writing a recursive template.) Oliver Becker adds: BTW: if you really want *all* nodes of a document (no matter whether XML source or stylesheet), you should select / | //node() | //@* | //namespace::* I think the namespace nodes in particular increase the number of nodes significantly (once a namespace is declared on some element, every descendant element has its own namespace node for it). Jeni's final caution: If you have "elements that you need to use the random nodes iteration hack for" then you don't want to use the *random nodes* iteration hack, you want to iterate over the elements. If you care about the identity of the nodes that you're iterating over, then you should iterate over the nodes that you care about. 
Jeni's next final comment is: The main reason I use nodes from the stylesheet is because that set of nodes will remain constant until you change the stylesheet - you have a predictable set - it's either going to have more than 100 nodes in it or fewer than 100 nodes in it. So if you know that you need to iterate up to 100 times, no matter what the source XML looks like, then it's beneficial to have a node set you *know* will be (at least) that size. The set of random nodes should be tailored to the number of times you need to iterate. I use the following sets: document('')//* document('')//node() document('')//node() | document('')//@* document('')//node() | document('')//@* | document('')//namespace::* The other advantage, as Wendell said, is that it drums home the point that they're random nodes - you really don't care about what they are or where they come from. If you *do* care, or if the number of times that you're iterating is dependent on the size of the source XML in any way, then you should be using nodes from the source XML. If you want to iterate the same number of times as you have nodes of type X, then you should be iterating over that set of nodes. If you want to iterate twice that amount, then have a nested xsl:for-each and iterate over them twice. In particular, you should *never* be using the Piez method to get a counter that you use in a positional predicate in the same node set each time, like: <xsl:for-each select="$random-nodes[position() &lt;= count($my-nodes)]"> <xsl:variable name="pos" select="position()" /> <xsl:apply-templates select="$my-nodes[$pos]" /> </xsl:for-each> Which gives exactly the same result (but in a very roundabout way) as: <xsl:for-each select="$my-nodes"> <xsl:apply-templates select="." /> </xsl:for-each> (which itself is probably better as <xsl:apply-templates select="$my-nodes"/>, but that's not important right now). | |
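For comparison, the recursive-template alternative mentioned in the note on the Piez Method might look like the sketch below. The template name `repeat` and its parameters `times` and `i` are hypothetical names chosen for illustration; nothing here comes from the original posts beyond the idea of recursing instead of iterating over random nodes.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Run the loop body three times -->
    <xsl:call-template name="repeat">
      <xsl:with-param name="times" select="3"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Hypothetical helper: calls itself with i + 1 until i exceeds times -->
  <xsl:template name="repeat">
    <xsl:param name="times"/>
    <xsl:param name="i" select="1"/>
    <xsl:if test="$i &lt;= $times">
      <!-- Loop body: $i plays the role of the loop counter -->
      <xsl:text>iteration </xsl:text>
      <xsl:value-of select="$i"/>
      <xsl:text>&#10;</xsl:text>
      <xsl:call-template name="repeat">
        <xsl:with-param name="times" select="$times"/>
        <xsl:with-param name="i" select="$i + 1"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Unlike the random-nodes approach, this does not depend on how many nodes the stylesheet happens to contain, at the cost of one level of recursion per iteration.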
8. | Muenchian explanation |
I'd never dare to disagree with Mike, especially when he's right ;) The reason I usually include the [1] when I'm explaining this method of accessing unique values is that it flows naturally from the test that you're doing. What you're doing is comparing the context node with the first node returned by the key. If you translate into the set logic expression it would be: count(. | key('rows', name)[1]) = 1 In comparison, if you took: generate-id(.) = generate-id(key('rows', name)) and naively translated it to: count(. | key('rows', name)) = 1 you'd get a very different result (it would return true if the context node was the *only* node in the document with that name). In XPath 2.0 terms the comparison is: . == key('rows', name)[1] And I *think* that if you did: . == key('rows', name) then you would get an error if there was more than one node returned by the key (but it might be that it's a recoverable error that's covered by the fallback conversions - I don't find the XPath 2.0 WD particularly clear on this point). So in general if you're trying to assess whether two nodes are the same, it's important to pull out the two nodes individually. The only reason that you can get away with *not* using the [1] if you're using the generate-id() method of comparing nodes is because generate-id() automatically looks at only the first node in the node set. Thus I'll usually leave [1] out in practice, but when I'm helping/training/teaching/writing I tend to leave [1] in to make what's happening more explicit. | |
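The two equivalent tests discussed above can be put side by side in a small stylesheet. This is a sketch under assumptions: the key name `rows` and the `name` child come from the expressions in the text, but the `row` element, the `//row` selection, and the output labels are hypothetical.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- 'rows' and 'name' follow the text; 'row' is an assumed element name -->
  <xsl:key name="rows" match="row" use="name"/>

  <xsl:template match="/">
    <xsl:for-each select="//row">
      <!-- generate-id() only looks at the first node of its argument,
           so the [1] can be omitted here -->
      <xsl:if test="generate-id(.) = generate-id(key('rows', name))">
        <xsl:text>generate-id: </xsl:text>
        <xsl:value-of select="name"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:if>
      <!-- The union form needs the explicit [1]; without it the test is
           true only when this row is the sole row with that name -->
      <xsl:if test="count(. | key('rows', name)[1]) = 1">
        <xsl:text>count: </xsl:text>
        <xsl:value-of select="name"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Both tests should fire for the same rows, namely the first row in document order for each distinct name.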
9. | Muenchian Grouping |
<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xalan="http://xml.apache.org/xalan"> <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/> <xsl:key name="by-restr" match="r_ele" use="restr" /> <xsl:template match="/entry"> <entry> <xsl:variable name="rtf"> <xsl:for-each select="r_ele"> <r_ele> <reb> <xsl:value-of select="reb"/> </reb> <restr> <xsl:for-each select="re_restr"> <xsl:sort select="."/> <xsl:value-of select="."/> <xsl:if test="position() != last()"> <xsl:text> ,</xsl:text> </xsl:if> </xsl:for-each> </restr> </r_ele> </xsl:for-each> </xsl:variable> <xsl:for-each select="xalan:nodeset($rtf)/r_ele"> <xsl:if test="generate-id(.) = generate-id(key('by-restr', restr)[1])"> <group> <rebs> <xsl:for-each select="key('by-restr', restr)"> <reb> <xsl:value-of select="reb" /> </reb> </xsl:for-each> </rebs> <re_restrs> <xsl:if test="restr != '' "> <xsl:call-template name="tokenise-restr"> <xsl:with-param name="str" select="restr" /> <xsl:with-param name="delim" select="','" /> </xsl:call-template> </xsl:if> </re_restrs> </group> </xsl:if> </xsl:for-each> </entry> </xsl:template> <xsl:template name="tokenise-restr"> <xsl:param name="str" /> <xsl:param name="delim" /> <xsl:if test="not(contains($str, $delim))"> <re_restr> <xsl:value-of select="$str" /> </re_restr> </xsl:if> <xsl:if test="substring-after($str, $delim) != ''"> <re_restr> <xsl:value-of select="substring-before($str,$delim)" /> </re_restr> <xsl:call-template name="tokenise-restr"> <xsl:with-param name="str" select="substring-after($str, $delim)" /> <xsl:with-param name="delim" select="$delim"/> </xsl:call-template> </xsl:if> </xsl:template> </xsl:stylesheet> |