1. | Text replace |
| Michael Kay
> To display a list of software releases delimited by ", " when they are
> stored in the XML delimited by ";":
> <procedure software="3.3.7;3.3.8;3.3.9;3.4.0;3.4.2;3.4.4">
It's simpler than that; in 2.0 you can do
<xsl:value-of select="replace(@software, ';', ', ')"/> |
2. | Spell check? |
| Dimitre Novatchev
> As far as the XSLT 2.0 working draft goes in
> regards to bringing Perl type text processing to the XML
> developer it is still up to the developer to fine-tune these
> capabilities to cover their specific needs. For example, a spell
> checker.
>
> Can anyone who may have extended experience in regards to the
> development of such capabilities using XSLT share with the rest of us
> your experience?
These days I had fun with an f:binSearch() function and then, logically, with f:spell().
I have a dictionary of about 47000 English wordforms, on which I search with f:binSearch()
I had to produce a faster function than the current quadratic str-split-to-words template: this is the f:getWords() function.
All these functions can be downloaded from the FXSL CVS at Sourceforge. The combination of these functions works quite well.
This transformation (test-FuncSpell.xsl):
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:f="http://fxsl.sf.net/"
exclude-result-prefixes="f xs"
>
<xsl:import href="../f/func-getWords.xsl"/>
<xsl:import href="../f/func-spell.xsl"/>
<xsl:output omit-xml-declaration="yes"/>
<xsl:variable name="vDelim"
as="xs:string"> ,&#8212;:.-&#9;&#10;&#13;!?;</xsl:variable>
<!-- Space, Comma, mdash, Colon, Dot, Dash, Tab, NL, CR, Exclamation mark,
Question mark, Semicolon -->
<!-- To be applied on ../data/othello.xml -->
<xsl:template match="/">
<xsl:variable name="vwordNodes" as="element()*">
<xsl:for-each select="//text()/lower-case(.)">
<xsl:sequence select="f:getWords(., $vDelim, 1)"/>
</xsl:for-each>
</xsl:variable>
<xsl:variable name="vUnique" as="xs:string+">
<xsl:perform-sort select="distinct-values($vwordNodes)">
<xsl:sort select="."/>
</xsl:perform-sort>
</xsl:variable>
<xsl:variable name="vnotFound" as="xs:string*"
select="$vUnique[not(f:spell(.))]"/>
<xsl:value-of separator="&#xA;"
select="$vnotFound"/>
A total of <xsl:value-of select="count($vwordNodes)"/> words
were spelt, (<xsl:value-of select="count($vUnique)"/>) distinct.
<xsl:value-of select="count($vnotFound)"/> not found.
</xsl:template>
</xsl:stylesheet>
when applied on othello.xml (around 29000 words)
produces this result:
<a-list-of-567-unknown-words-omitted/>
A total of 28622 words
were spelt, (3669) distinct.
567 not found.
So, checking 3669 distinct words in 7015 milliseconds makes
523.02 words/sec. The actual speed is faster, as the total time includes splitting up the words and finding the distinct words.
Among the unknown words are such nice words as:
affordeth
affrighted
ariseth
arithmetician
arrivance
bethink
betimes
bewhored |
3. | Breaking string into substrings |
| David Carlisle
> In my project I am dealing with at least 10 delimiters?
If you don't need to do different things for different
delimiters, then: the XPath2 tokenize function allows
the delimiter to be specified by a regular expression, so in that case
you can just specify whatever you want, e.g. [^a-zA-Z]+ for any run of
non (ASCII) letters being a delimiter. |
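A minimal sketch of that idea follows. It assumes a hypothetical input string held in a variable named vText (not part of the original thread) and uses a negated character class, so that any run of characters other than ASCII letters counts as one delimiter.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical input mixing several different delimiters -->
  <xsl:variable name="vText" select="'alpha, beta;gamma - delta/epsilon'"/>

  <xsl:template match="/" name="main">
    <!-- Any run of non-letter characters separates two tokens -->
    <xsl:for-each select="tokenize($vText, '[^a-zA-Z]+')">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Applied to any source document, this should print alpha, beta, gamma, delta and epsilon on separate lines. If the input can start or end with a delimiter, tokenize() returns an empty string at that end, which may need filtering out.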
4. | Unwanted spaces in built strings |
| Michael Kay
> When building a string in a variable, and you want to avoid
> document node creation, is it preferable to use item()+ rather than
> xs:string+, as that allows the merging of adjacent text nodes before
> atomization, creating a sequence of one item and therefore bypassing the
> separator issue?
Mike Kay responds, logical as ever, with
No, if you want to build a string in a variable then you should define the
type of the variable as xs:string. You can either do:
<xsl:variable name="foo" as="xs:string">
<xsl:value-of>
<xsl:text>abc</xsl:text><xsl:value-of select="'def'"/>
</xsl:value-of>
</xsl:variable>
or
<xsl:variable name="foo" as="xs:string">
<xsl:sequence select="concat('abc', 'def')"/>
</xsl:variable>
or of course
<xsl:variable name="foo" select="concat('abc', 'def')"/>
Personally I prefer to work entirely with strings. If you don't need text
nodes, don't create them.
David Carlisle offers the same advice, with
If you want to build a (sequence of) string(s) in a variable
then you have to use xs:string+ as otherwise you will get other things
in your sequence (like nodes). Sometimes you don't need to worry about
the difference between a string and a text node, but sometimes you do
(as you've demonstrated). In general the distinction is far more
important in 2.0 than 1.0.
So, if you want a sequence of strings use xs:string+ and use
separator="" on a value-of if you don't want the spaces.
If you want a sequence of nodes and/or strings use item()+
(but personally if I knew there was a possibility of adding an unwanted
space I'd just always add separator="" rather than run through the six
point simple content construct in my head every time to see if the
spaces wouldn't be added)
From Andrew's earlier query, Michael Kay justifies
the WG decision:
> Indeed, the key here is that text nodes get merged together at stage 2
> before the atomization at stage 3 - which is why I'm still
> confused if I
> specify xs:string+ instead of item()+:
>
> <xsl:variable name="foo" as="xs:string+">
> <xsl:text/>abc<xsl:sequence select="'def'"/>
> </xsl:variable>
>
> <xsl:variable name="foo2" as="xs:string+">
> <xsl:text/>abc<xsl:value-of select="'def'"/>
> </xsl:variable>
>
>
> <xsl:value-of select="$foo" separator=","/>
>
> Gives ,abc,def
>
> <xsl:value-of select="$foo2" separator=","/>
>
> Gives ,abc,def
>
> I would have expected the same behaviour as specifying item()+ as
> atomization occurs after the merging of the text nodes in $foo2
>
> This suggests that by specifying xs:string Saxon is jumping
> the gun and
> converting the text nodes (zero length text nodes as well -
> the leading
> comma) to strings before stage 2.
Indeed, it's converting text nodes to strings before it even starts the
"constructing simple content" process. The overall process here is:
1. Construct the value of the variable
1a. evaluate the sequence constructor, producing a sequence of text nodes
(foo2) or a sequence containing a mixture of text nodes and strings (foo)
1b. apply the "function conversion rules" to convert the result of 1a
to the required type (xs:string+). This causes text nodes to be
individually atomized to strings
2. xsl:value-of select="$foo" then invokes the "constructing simple content"
rules to convert a sequence of text nodes and/or strings to a single text
node.
In both cases the input is a sequence of strings, so the rules for
joining adjacent text nodes don't kick in.
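The effect of the declared type can be seen by running the same sequence constructor under both types. This is a sketch rather than anything from the thread: the variable names vNodes and vStrings are made up for illustration, and the expected results are based on the rules as described above.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">
  <xsl:output method="text"/>

  <xsl:template match="/" name="main">
    <!-- item()+ keeps the two text nodes as nodes -->
    <xsl:variable name="vNodes" as="item()+">
      <xsl:text>abc</xsl:text>
      <xsl:value-of select="'def'"/>
    </xsl:variable>

    <!-- xs:string+ atomizes each text node to a string first -->
    <xsl:variable name="vStrings" as="xs:string+">
      <xsl:text>abc</xsl:text>
      <xsl:value-of select="'def'"/>
    </xsl:variable>

    <!-- Adjacent text nodes are merged before the separator applies -->
    <xsl:value-of select="$vNodes" separator=","/>
    <xsl:text>&#10;</xsl:text>
    <!-- Strings are never merged, so the separator appears -->
    <xsl:value-of select="$vStrings" separator=","/>
  </xsl:template>
</xsl:stylesheet>
```

Under those rules the first line should be abcdef and the second abc,def.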
It's certainly true that this whole business is going to generate a lot of
questions. We've tried to design the rules so that they are (a) backwards
compatible, and (b) do the "right" thing in common cases. The downside of
this is that the rules are quite complicated, and when they don't do the
obvious thing, it's quite tricky to work out why. |
5. | value-of with separator |
| David Carlisle
> I have trouble understanding the separator-attribute of value-of.
> This is my template:
> <xsl:template match="example">
> <helloWorld>
> <xsl:value-of select="element()/text()" separator=", "/>
> </helloWorld>
> </xsl:template>
> The element "example" contains several child-nodes with text. The
> above expression gives the expected values but without the separator
> ("TextTextText"). But if i change the value-of expression to this...
> <xsl:value-of select="*" separator=", "/>
> ... I also get the values, now separated with commas ("Text, Text, Text").
> Now I wonder why the result of the first expression contains no
> separator while the other one does. Any explanations?
If the doc is
<x>
<a>one<!-- here -->two</a>
<a>three</a>
<a>four</a>
</x>
and the current node is x then
element()/text() will select four text nodes with values "one" "two"
"three" "four"
so
<xsl:value-of select="element()/text()" separator=", "/>
will generate one text node with value
"one, two, three, four"
* will select three element nodes, each with name a and with string values
"onetwo" "three" "four"
so
<xsl:value-of select="*" separator=", "/>
will generate one text node with string value
"onetwo, three, four"
> Isn't it that adjacent text nodes are being merged before the
> separator is applied?
no the separator is added between the string value of each item in the
sequence, resulting in a string that is then used to generate a single
text node, this node may merge with other text nodes generated under the
same parent, but that happens after separators are added. |
6. | Entities |
| Michael Kay, David Carlisle
I have to replace occurrences of something like this "^12" (a custom
placeholder) with a special character, for example a cedilla
(http://en.wikipedia.org/wiki/Cedilla). Therefore i use Michael's
"identity template" in combination with this replace function:
<xsl:value-of select="replace(., '\^12', '&ccedil;')"/>
Many questions arise around this approach (at least for me):
1.) If i use the above i get "the entity is not declared", but when i
use "&amp;" instead of the cedilla everything works fine. How do i
know which entities are declared by default and which not? How can i
declare an entity of my own?
2.) What is the difference of the usage of - for example - "&amp;" and
"&#38;"? When do i have to use the one and not the other?
3.) And where can i find a good overview of entities?
David answers
Entities are expanded by the parser _before_ XSLT starts, so XSLT sees
the same input whether you use the entity reference, or just use the
character directly, or if you use a numeric character reference (which
doesn't need to be declared).
So if your keyboard or editor allows you just to type a c-cedilla
character then you can just do that (if your editor uses iso-8859-1
you'd need to say your xsl file was in that encoding by putting
<?xml version="1.0" encoding="iso-8859-1"?>
at the top), or you can use the numeric reference &#xe7;
> Do i have to nest replace-functions for each of them in one
> another like ...
> <xsl:value-of select="replace(replace(., '\^12', '&ccedil;'), '\^13',
> '&amp;')"/>
> ... or is there a more elegant solution for this?
In the most general case of course, nesting replace() calls is always an
option but here I suspect that you can do something like
<xsl:analyze-string select="." regex="\^([0-9]+)">
<xsl:matching-substring>
<xsl:value-of select="$replacements[number(regex-group(1))]"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
together with something that makes $replacements a sequence with a
c-cedilla in position 12
eg
<xsl:variable name="replacements" select="(
'a','b',...,'&#xe7;', '&amp;')
"/>
MK adds
> 2.) What is the difference of the usage of - for example -
> "&amp;" and "&#38;"? When do i have to use the one and not the other?
The first is technically an "entity reference", the second is a "character
reference". The only difference is that entities have to be declared in the
DTD, except for the five built-in ones. Numeric character references can be
used without needing a declaration.
> 3.) And where can i find a good overview of entities?
I use Bob duCharme's "Annotated XML Specification" - very useful because it
gives the actual text of the specification, then Bob's explanation of what
it really means.
>
> Finally i have also a non-entity question:
> I have to replace many of equivalent placeholders in the same
> text too. Do i have to nest replace-functions for each of
> them in one another like ...
> <xsl:value-of select="replace(replace(., '\^12', '&ccedil;'),
> '\^13', '&amp;')"/>
> ... or is there a more elegant solution for this?
>
As well as DC's solution, another approach is to have a table of
replacements:
<xsl:variable name="mods" as="element(mod)*">
<mod from="\^12" to="ç"/>
<mod from="\^13" to="&amp;"/>
</xsl:variable>
and run through them with a recursive function:
<xsl:function name="f:multi-replace" as="xs:string">
<xsl:param name="in" as="xs:string"/>
<xsl:param name="mods" as="element(mod)*"/>
<xsl:choose>
<xsl:when test="$mods">
<xsl:sequence select="f:multi-replace(
replace($in, $mods[1]/@from, $mods[1]/@to),
subsequence($mods, 2))"/>
</xsl:when>
<xsl:otherwise>
<xsl:sequence select="$in"/>
</xsl:otherwise>
</xsl:choose>
</xsl:function> |
7. | Conditional display of parameters |
| Michael Kay
> I have three global xsl:param's which I need to display if
> any of them are selected:
>
> <xsl:param name="showReplicates" select="1"/>
> <xsl:param name="showMetadata" select="1"/>
> <xsl:param name="showElectronicSignature" select="1"/>
>
> if all the flags are turned on, the output would be:
>
> (View - Reps, Meta-Data, Signatures)
<xsl:value-of select="string-join(
('Reps'[$showReplicates=1], 'Meta-Data'[$showMetadata=1],
'Signatures'[$showElectronicSignature=1]),
', ')"/> |
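One way to wrap that expression so that the "(View - ...)" text only appears when at least one flag is set is sketched below. The variable name vLabels and the default parameter values are assumptions made for the example. The predicate on each string literal keeps the string when the test is true and yields an empty sequence otherwise.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">
  <xsl:output method="text"/>

  <!-- Defaults chosen for the example; normally supplied by the caller -->
  <xsl:param name="showReplicates" select="1"/>
  <xsl:param name="showMetadata" select="0"/>
  <xsl:param name="showElectronicSignature" select="1"/>

  <xsl:template match="/" name="main">
    <!-- Each literal survives only if its flag equals 1 -->
    <xsl:variable name="vLabels" as="xs:string*" select="(
        'Reps'[$showReplicates = 1],
        'Meta-Data'[$showMetadata = 1],
        'Signatures'[$showElectronicSignature = 1])"/>

    <!-- Emit nothing at all when no flag is set -->
    <xsl:if test="exists($vLabels)">
      <xsl:value-of
          select="concat('(View - ', string-join($vLabels, ', '), ')')"/>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

With the defaults shown, the output should be (View - Reps, Signatures).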
8. | Restrict a string to n words |
| Michael Kay
> What I need is a function that will limit a string to a
> certain number of words.
In XSLT 2.0, that's
tokenize($in, '\W')[position() = 1 to $n]
where $in is your input string and $n is the number of words.
It's a fair bit harder in XSLT 1.0 (most things are). |
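As a variation, the sketch below wraps the idea in a function and joins the words back into one string. The function name f:first-words and its namespace URI are hypothetical. It splits on whitespace rather than on '\W', because a single-character '\W' pattern yields an empty token wherever two non-word characters are adjacent (for example a comma followed by a space). The trade-off is that punctuation stays attached to the words.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/functions"
    exclude-result-prefixes="xs f">
  <xsl:output method="text"/>

  <!-- Hypothetical helper: first $n whitespace-separated words of $in -->
  <xsl:function name="f:first-words" as="xs:string">
    <xsl:param name="in" as="xs:string"/>
    <xsl:param name="n" as="xs:integer"/>
    <!-- normalize-space() trims and collapses runs of whitespace,
         so splitting on a single space gives no empty tokens -->
    <xsl:sequence select="string-join(
        tokenize(normalize-space($in), ' ')[position() le $n], ' ')"/>
  </xsl:function>

  <xsl:template match="/" name="main">
    <xsl:value-of
        select="f:first-words('The quick,  brown fox jumps', 3)"/>
  </xsl:template>
</xsl:stylesheet>
```

This should output The quick, brown.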
9. | First n words of a text node |
| Michael Kay
> I have an XML file like this:
> >
> > <a> lorem ipsum text ... lorem ipsum <b> dolor </b> lorem ipsum ...
> > lorem ipsum </a>
> >
> > I want to create a string of the five words from <a> right
> > before the <b> tag, and a string of the five words
> > immediately after <b>. All I can do is create strings from
> > the beginning or the end of the <a> tag, but basically I want
> > the text in the middle, relative to the child <b> node.
In 2.0, assuming the <a> element is the context node, the two sequences are
given by
subsequence(tokenize(b/following-sibling::text(), '\s'), 1, 5)
and
reverse(subsequence(reverse(tokenize(b/preceding-sibling::text(),
'\s')), 1,5)) |
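A self-contained sketch of the same approach follows, using made-up sample content in a temporary tree named vDoc. It differs from the expressions above in two ways, both assumptions about what is wanted: it takes only the text node immediately next to the first b element, and it normalizes whitespace before splitting so that leading or trailing spaces do not produce empty tokens.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical sample input, held in a temporary tree -->
  <xsl:variable name="vDoc">
    <a>one two three four five six <b>dolor</b> seven eight nine ten eleven twelve</a>
  </xsl:variable>

  <xsl:template match="/" name="main">
    <xsl:apply-templates select="$vDoc/a"/>
  </xsl:template>

  <xsl:template match="a">
    <!-- Words of the text node just before the first b -->
    <xsl:variable name="vBefore" select="tokenize(
        normalize-space(b[1]/preceding-sibling::text()[1]), ' ')"/>
    <!-- Words of the text node just after the first b -->
    <xsl:variable name="vAfter" select="tokenize(
        normalize-space(b[1]/following-sibling::text()[1]), ' ')"/>

    <xsl:text>before: </xsl:text>
    <!-- The last five words, in document order -->
    <xsl:value-of select="$vBefore[position() gt last() - 5]"/>
    <xsl:text>&#10;after: </xsl:text>
    <!-- The first five words -->
    <xsl:value-of select="$vAfter[position() le 5]"/>
  </xsl:template>
</xsl:stylesheet>
```

For the sample content this should print "before: two three four five six" and "after: seven eight nine ten eleven".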
10. | String match, multiple targets |
| Andrew Welch
> Basically I want to have something that does this:
> contains('$d/ris:organ/text()', 'Hamburg' or 'Koblenz' or 'xxx'...)
> ===> Compare 1 string with multiple strings.
>
some $x in ('Hamburg', 'Koblenz', 'xxx') satisfies
contains($d/ris:organ/text(), $x)
E.g.
<xsl:if test="some $x in ('Hamburg', 'Koblenz', 'xxx') satisfies
contains($d/ris:organ/text(), $x)">
Florent Georges adds
And the OP almost certainly wants the following instead:
some $x in ('Hamburg', 'Koblenz', 'xxx') satisfies
contains($d/ris:organ, $x) |
11. | Quotes in XSLT 2.0 |
| David Carlisle
>I want to hold a string containing both single and double quotes (apos
>and quot) in a variable.
><xsl:variable name="x" select="'...'"/>
>I enclose the XPath expression in double quotes, hence I'll have to use
>entity references or numerical character references to refer to that
>character from within the expression.
Correct.
>I enclose the string in single quotes, hence - I think - I'll have to
>use entity references or numerical character references to refer to that
>character from within the string.
And this is wrong.
in xpath1 it is impossible to have a string literal with both quotes.
that's not much hardship in xslt2 as, as you say, you can use
<xsl:variable name="x">'"'"'"</xsl:variable>
form but it does cause inconvenience in other contexts,
xpath2 allows you to quote the character used to delimit the string by
doubling it so you can have a string literal such as
"'"""
which is the two character string '"
Mike Kay expands on this...
And of course you can escape the attribute delimiter using an XML entity
reference.
<xsl:if test="$x = 'He said, &quot;I can''t&quot;'">
In 1.0 you can't have a string literal containing both single and double
quotes. Use concat:
<xsl:variable name="quot">"</xsl:variable>
<xsl:variable name="apos">'</xsl:variable>
<xsl:if test="$x = concat('He said, ', $quot, 'I can', $apos, 't', $quot)"> |