1. | What's a regex :-) | |||
DaveP Since it's not so long ago that I barely knew what it was, I've done an idiot's guide to regex in XSLT 2.0. Usual format, linked from here, written as a stylesheet creating the HTML you see. Similarly misspelled too. | ||||
2. | Pretty printing comments with long line length | |||
For what it's worth, XSLT 2.0 makes this a whole lot easier with its regular expression support. For example, you could use:
<xsl:template match="comment()">
  <xsl:text>&lt;!-- </xsl:text>
  <xsl:analyze-string select="normalize-space(.)" regex=".{{1,69}}(\s|$)">
    <xsl:matching-substring>
      <xsl:if test="position() != 1"><xsl:text> </xsl:text></xsl:if>
      <xsl:value-of select="." />
      <xsl:text>&#xA;</xsl:text>
    </xsl:matching-substring>
  </xsl:analyze-string>
  <xsl:text> --&gt;</xsl:text>
</xsl:template>
[Note the doubling of the {}s in the regex attribute because it's an attribute value template; that's caught me out twice today!] You could use something more subtle than normalize-space() to preserve *some* of the spacing within the comment but not others. | ||||
3. | Searching for url in text | |||
XSLT 2.0 makes this process a lot easier: you can use <xsl:analyze-string> with regular expressions to do this kind of string processing. In XSLT 2.0, the template would look more like:
<xsl:template name="hyperlink">
  <xsl:param name="string" select="string(.)" />
  <xsl:analyze-string select="$string" regex="http://[^ ]+">
    <xsl:matching-substring>
      <a href="{.}">
        <xsl:value-of select="." />
      </a>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="." />
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
and of course the regex for matching URLs could be a lot more sophisticated. | ||||
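As a sketch of what "more sophisticated" might mean, the stylesheet below also accepts https and ftp, and refuses to let a URL end in sentence punctuation. The regex is an illustration, not a full URL grammar, and the sample sentence and host names are made up. The regex attribute is delimited with single quotes so that a literal double quote can appear inside the character classes.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>

  <xsl:template name="hyperlink">
    <xsl:param name="string" select="string(.)"/>
    <!-- scheme, then any run of non-space, non-markup characters,
         ending in a character that is not sentence punctuation -->
    <xsl:analyze-string select="$string"
        regex='(https?|ftp)://[^\s&lt;&gt;"]*[^\s&lt;&gt;".,;:!?)]'>
      <xsl:matching-substring>
        <a href="{.}"><xsl:value-of select="."/></a>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Hypothetical driver: the sample text is invented for the example -->
  <xsl:template match="/">
    <p>
      <xsl:call-template name="hyperlink">
        <xsl:with-param name="string"
            select="'See http://example.com/a, or ftp://example.org/b.'"/>
      </xsl:call-template>
    </p>
  </xsl:template>
</xsl:stylesheet>
```

With this pattern the comma after the first URL and the full stop after the second stay outside the generated links.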
4. | Reverse a string (xslt 2) | |||
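string-join( for $i in reverse(1 to string-length($in)) return substring($in, $i, 1), "") or using a recursive function if you find that more elegant. (A descending range such as string-length($in) to 1 is an empty sequence in XPath 2.0, so the positions have to be reversed explicitly.) | ||||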
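The recursive alternative mentioned above might look like the following sketch. The function name my:reverse and its namespace URI are made up for the example. The second call shows a different technique that is not in the original answer: reversing the sequence of codepoints.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/my"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: reverse the tail, then append the first character -->
  <xsl:function name="my:reverse" as="xs:string">
    <xsl:param name="in" as="xs:string"/>
    <xsl:sequence select="
      if (string-length($in) le 1)
      then $in
      else concat(my:reverse(substring($in, 2)), substring($in, 1, 1))"/>
  </xsl:function>

  <xsl:template match="/">
    <!-- recursive version -->
    <xsl:value-of select="my:reverse('abc')"/>
    <xsl:text>&#xA;</xsl:text>
    <!-- codepoint version, no recursion needed -->
    <xsl:value-of select="codepoints-to-string(reverse(string-to-codepoints('abc')))"/>
  </xsl:template>
</xsl:stylesheet>
```

Both calls should give cba. For long strings the codepoint version is probably the safer choice, since it does not depend on how deep the processor lets a function recurse.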
5. | Counting characters | |||
Well the XPath 2.0 solution is sum(for $i in preceding-sibling::text() return string-length($i)) For XSLT 1.0 it's much more difficult, it's the classic problem of summing a calculated value over a node-set. There are several workable solutions:
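One common XSLT 1.0 approach, sketched here as an illustration rather than as a reconstruction of the original list, is a recursive named template that carries a running total. The template name sum-lengths, its parameter names and the matched element name marker are all made up for the example.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical helper: adds up string-length() over a node-set -->
  <xsl:template name="sum-lengths">
    <xsl:param name="nodes" select="/.."/>
    <xsl:param name="total" select="0"/>
    <xsl:choose>
      <xsl:when test="$nodes">
        <!-- count the first node, recurse on the rest -->
        <xsl:call-template name="sum-lengths">
          <xsl:with-param name="nodes" select="$nodes[position() &gt; 1]"/>
          <xsl:with-param name="total" select="$total + string-length($nodes[1])"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$total"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- "marker" is a placeholder for whatever element the count is needed at -->
  <xsl:template match="marker">
    <xsl:call-template name="sum-lengths">
      <xsl:with-param name="nodes" select="preceding-sibling::text()"/>
    </xsl:call-template>
  </xsl:template>
</xsl:stylesheet>
```

The order in which the nodes are visited does not matter here because only the total is kept.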
| ||||
6. | Acronym handling | |||
I'll just mention that in XSLT 2.0, you can use <xsl:analyze-string> to do this. Something along the lines of:
<xsl:variable name="acronyms" as="element(acronym)+" select="document('../xml/acronyms.xml')/acronyms/acronym" />
<xsl:variable name="acronym-regex" as="xs:string" select="string-join($acronyms/@acronym, '|')" />
<xsl:analyze-string select="$text" regex="{$acronym-regex}">
  <xsl:matching-substring>
    <xsl:variable name="acronym" as="xs:string" select="." />
    <acronym title="{$acronyms[@acronym = $acronym]}">
      <xsl:value-of select="$acronym" />
    </acronym>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <xsl:value-of select="." />
  </xsl:non-matching-substring>
</xsl:analyze-string> | ||||
7. | Converting a text File to XML | |||
This kind of thing is very much easier using XSLT 2.0:
* use the unparsed-text() function to read the text file
* split it into individual lines using the tokenize() function
* analyse/parse each line using xsl:analyze-string with a regex
* arrange it into a hierarchical structure using xsl:for-each-group
As you see, each stage benefits from new features in XSLT 2.0. All of them can be done using 1.0 (for example, the first step can be replaced by passing the text as a string-valued parameter) but it's much harder work: sufficiently hard work that it's probably easier to use a different language, such as Perl.
Input file:
H-A-HEADER some content
I-AN-ITEM-1 more content
I-AN-ITEM-2 and again
S-A-SUMMARY-1 for variety
I-AN-ITEM-3 and change
S-A-SUMMARY-2 and different again
DaveP generated this stylesheet to show the example.
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="xml" indent="yes" encoding="utf-8"/>
  <xsl:template match="/">
    <xsl:variable name="f" select="unparsed-text('unparsedEntity.txt','utf-8')"/>
    <someRoot>
      <xsl:for-each select='tokenize($f, "\n")'>
        <record>
          <xsl:analyze-string regex="[\-a-zA-Z0-9]+" select=".">
            <xsl:matching-substring>
              <word><xsl:value-of select="."/></word>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
              <other> <xsl:value-of select="."/> </other>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </record>
      </xsl:for-each>
    </someRoot>
  </xsl:template>
</xsl:stylesheet> | ||||
8. | Conditional selection of an attribute value | |||
<xsl:template match="data">
  <a href="{(link[matches(.,'^http://')],link[matches(.,'^ftp://')])[1]}">
    <xsl:value-of select="title"/>
  </a>
</xsl:template>
Ednote. Note the use of parentheses inside the AVT to build a sequence, from which [1] picks the first item (an http link if there is one, otherwise an ftp link); and the use of the matches function in the predicates. | ||||
9. | {{ braces in regular expressions | |||
I quote Michael Kay's reply here verbatim. Unless you have run into this problem, you may not appreciate it. Bottom line: be aware that quantifiers such as {3}, which say that the previous item must match exactly 3 times, use the brace character, which has other interpretations in XSLT. It may cause you problems as it did me.
Ignoring the AVT rules, the regex syntax allows { to be used without escaping inside [], but not outside: outside [] it is reserved for use in regex quantifiers such as x{3}. So it must be escaped as \{. So the regular expression you want is
(\w|\{[^{}]*\})+
which is written in the regex attribute as
regex="(\w|\{{[^{{}}]*\}})+"
It might be less painful to do:
<xsl:variable name="regex">(\w|\{[^{}]*\})+</xsl:variable>
<xsl:analyze-string regex="{$regex}">
though sadly, I suspect that will prevent Saxon precompiling the regular expression :-(
A repetition count in a regex is indicated by curly braces, not square brackets. Remember also that in an attribute value template, curly braces must be doubled. So you want:
regex='(\d{{4}})(\d{{2}})(\d{{2}})'
Alternatively, you could just as well use
regex='(....)(..)(..)' | ||||
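To see the doubling rule at work, here is a minimal sketch that splits an eight-digit date with the regex quoted above. The input value 20061024 and the anchors ^ and $ are additions made for the example. The second call uses replace(): its regex sits in a select attribute, which is an XPath expression and not an attribute value template, so the braces stay single there.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Hypothetical input value -->
    <xsl:variable name="in" select="'20061024'"/>

    <!-- regex is an AVT: every { and } of a quantifier is doubled -->
    <xsl:analyze-string select="$in" regex="^(\d{{4}})(\d{{2}})(\d{{2}})$">
      <xsl:matching-substring>
        <xsl:value-of select="regex-group(1), regex-group(2), regex-group(3)"
                      separator="-"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>

    <xsl:text>&#xA;</xsl:text>

    <!-- select is not an AVT: braces are written once -->
    <xsl:value-of select="replace($in, '^(\d{4})(\d{2})(\d{2})$', '$1-$2-$3')"/>
  </xsl:template>
</xsl:stylesheet>
```

Both lines of output should read 2006-10-24.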
10. | Remove non numeric content | |||
neff.xsl
<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:variable name="in" select="'TUV0062'"/>
    <xsl:value-of select="replace($in,'[^0-9]','')"/>
  </xsl:template>
</xsl:stylesheet>
provides output of 0062. (replace() is an XPath 2.0 function, so this needs an XSLT 2.0 processor; the leading zeros are digits, so they are kept.) | ||||
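replace() only deletes characters, so whatever digits are in the input, leading zeros included, come through unchanged. If a numeric value is wanted instead, the result can be cast, as in this sketch. The variable name digits and the empty-string guard are additions for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:variable name="in" select="'TUV0062'"/>
    <!-- keep only the digits -->
    <xsl:variable name="digits" select="replace($in, '[^0-9]', '')"/>
    <!-- cast to a number; the guard avoids casting an empty string, which is an error -->
    <xsl:value-of select="if ($digits) then xs:integer($digits) else ''"/>
  </xsl:template>
</xsl:stylesheet>
```

For the input TUV0062 this should print 62.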
11. | Value and units | |||
<xsl:analyze-string select="$in" regex="^(\d*(\.\d*)?)(in|cm|pt|em|px)$">
  <xsl:matching-substring>
    <measure><xsl:value-of select="regex-group(1)"/></measure>
    <units><xsl:value-of select="regex-group(3)"/></units>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
    <value><xsl:value-of select="."/></value>
  </xsl:non-matching-substring>
</xsl:analyze-string>
DaveP. Note that the first group (in brackets) matches the whole value, the second (nested inside the first) matches the optional decimal part of the value, and the third matching group holds the units which are known about. | ||||
12. | XPath 2.0 Regex misunderstanding | |||
I have some trouble understanding what your "passing" and "failing" is about. However, if you are trying to remove the "more obvious date format errors", I believe your "matches(...)" needs to become a "not(matches(...))", since your regular expression is about inclusion, not exclusion. That said, you can try the following (assuming American dates: MM/DD/YYYY) for matching any date, disallowing years > 2006 and allowing the format 1/2/2006:
<xsl:variable name="dates">
  <DATE>07/18/2006</DATE>
  <DATE>07/12/2006</DATE>
  <DATE>09/25/2006</DATE>
  <DATE>10/24/2006</DATE>
  <DATE>10/18/2006</DATE>
  <DATE>10/10/2006</DATE>
  <DATE>1/2/2006</DATE>
  <!-- false dates -->
  <DATE>22/12/2006</DATE>
  <DATE>00/10/2000</DATE>
  <DATE>01/32/2006</DATE>
  <DATE>10/10/2007</DATE>
  <DATE>12/12/20006</DATE>
</xsl:variable>
<xsl:variable name="date-regex">^(
  0?[1-9]|   <!-- 01-09 and 1-9 -->
  1[0-2]     <!-- 10, 11, 12 -->
  )/(
  0?[1-9]|   <!-- 01-09 and 1-9 -->
  [1-2]\d|   <!-- 10-29 -->
  3[01]      <!-- 30, 31 -->
  )/(
  1\d{3}|    <!-- 1000-1999 -->
  200[0-6]   <!-- 2000-2006 -->
  )$
</xsl:variable>
<xsl:for-each select="$dates/DATE">
  <xsl:value-of select="concat(., ': ')" />
  <!-- add normalize-space, because of a bug in saxon prior to 8.0.0.4 with leading space -->
  <xsl:value-of select="matches(., normalize-space($date-regex), 'x')" />
  <xsl:text>&#xA;</xsl:text>
</xsl:for-each>
This outputs:
07/18/2006: true
07/12/2006: true
09/25/2006: true
10/24/2006: true
10/18/2006: true
10/10/2006: true
1/2/2006: true
22/12/2006: false
00/10/2000: false
01/32/2006: false
10/10/2007: false
12/12/20006: false
> What your statement implies is: output "bad-date" node when:
1) a date month is in the range (00, 01, 02, 10, 11, 12)
2) a date day is in the range (00, 01,... 09, 10, 11,.... 19, 20, 21, .... 29, 30, 31, ... 39)
3) the year is 2006.
Well, I don't know much of your calendar system, but I can hardly believe you consider a date such as "00/39/2006" as being correct, so here's a part of your problem. I know from my own experience that regexing numeric values is a tricky business (that is: think strings, not numbers). For an article I wanted to write for a long time, but still haven't, I created a template that helps in regexing numeric values. It will simply output the right regexes for you, if you give it a number:
my:regex-from-number('376', 0)
will give:
[0-2]\d{2}|
3[0-6]\d|
37[0-5]|
376|\d{2}
It requires some getting used to, but I recall that Jeffrey Friedl named this: enrolling the number, or something similar. For small numbers you can easily do it by hand, but it is still hard for many mere mortals. It is optimized for repeated digits (like 2006). The output regex works perfectly. A few notes (if you plan to use it):
|\d{2}
  Leave out this part if you require a fixed number of digits, i.e. 034 and 009. By default, 34, 9 etc. are allowed.
376
  The input number. Repeating the number is not necessary for making a bullet-proof regular expression, but it made me feel good. The larger the maximum number you need to match, the easier it gets putting it there: you see instantly what number is being matched.
The rest speaks for itself, I believe. But call in anytime if you want some additional help. The expressions in the opening are taken from this template to ensure I did the right thing; however, I made them a bit more readable.
<xsl:function name="my:regex-from-number"> <xsl:param name="number" /> <xsl:param name="pos" /> <xsl:variable name="digit1" select="substring($number, $pos, 1)" /> <xsl:variable name="digit2" select="substring($number, $pos + 1, 1)" /> <xsl:variable name="len" select="string-length($number)" /> <xsl:value-of select=" if($len = $pos) then concat ( $digit1, '|\d', if($pos - 1 le 1) then '' else concat('{', $pos - 1, '}') ) else if ($digit2 = '0') then concat ( $digit1, my:regex-from-number($number, $pos + 1) ) else concat ( $digit1, if(xs:integer($digit2) - 1 = 0) then '0' else concat('[0-', xs:integer($digit2) - 1, ']'), if($pos + 1 = $len) then '|' else if($len - $pos - 1 = 1) then '\d|' else concat('\d{', $len - $pos - 1, '}|'), '
', substring($number, 1, $pos), my:regex-from-number($number, $pos + 1) )" /> </xsl:function> | ||||
13. | Regex | |||
A good thing to know about regexes is that, besides being powerful, they can be very dangerous too, especially to the unaware, when backtracking causes the regex to run in exponential time for non-matching strings. An example of such a regex is in this post: nabble.com If you are going to use regexes in a production environment, make sure to test them thoroughly for this behavior or your processor may hang occasionally.
> Is there any way I can split this RegEx on separate lines and/or add
> whitespace so that it would be more readable?
You already heard of the 'x' modifier, but there are a few things that you should know before splitting your regex into a more readable format:
If you use Saxon, several bugs concerning whitespace handling have been fixed in the 8.8 and 8.9 releases, some of which you may consider significant, like this one, which is now fixed: nabble.com
The "ignore whitespace" is very literally so. I.e., in XSLT regexes, this:
fn:matches("hello world", "hello\ sworld", "x")
returns true. The "\ s" part in the regex is, with whitespace removed, "\s" and matches a space. Most regex engines (Perl for one) consider an escaped space as a space.
The only place where you must be aware of whitespace with 'x' on is inside character classes, where it is not ignored: [abc ] matches 'a', 'b', 'c' or ' '.
You probably don't want to do this, but this is allowed with the 'x' modifier: "\p{ I s B a s i c L a t i n }+" and is the same as "\p{IsBasicLatin}+".
And a tip for making your regexes more readable: introduce comments inside your regexes. In other regex languages you can do that inside the regex language, but not with a regex in XSLT. You can easily fix this by putting your regexes inside a variable and always calling them with the 'x' modifier:
<xsl:variable name="myregex" as="xs:string">
  (        <!-- grab everything -->
  "        <!-- start of a q. string -->
  [^"]*    <!-- zero or more non-quotes -->
  "        <!-- end of a q. string -->
  )        <!-- closing 'grab all' -->
</xsl:variable>
I use this method to some extent in a format that allows recursive and repetitive regexes on input by just supplying a 'parser' written in XSLT with a set of regexes placed in XML that are then applied to the input. If you have many regexes, you will find that it is easier to maintain them by working on some library and reuse. |
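A minimal sketch of how such a commented variable might be used, assuming a made-up sample string. The stylesheet comments are dropped when the stylesheet is read, and the 'x' flag then removes the remaining whitespace, so the regex that actually runs is ("[^"]*").

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">
  <xsl:output method="text"/>

  <xsl:variable name="myregex" as="xs:string">
    (        <!-- grab everything -->
    "        <!-- start of a quoted string -->
    [^"]*    <!-- zero or more non-quotes -->
    "        <!-- end of a quoted string -->
    )        <!-- closing 'grab all' -->
  </xsl:variable>

  <xsl:template match="/">
    <!-- Hypothetical input: say "hi" now -->
    <xsl:variable name="in" select="'say &quot;hi&quot; now'"/>

    <!-- always pass 'x' so the layout whitespace is ignored -->
    <xsl:value-of select="matches($in, $myregex, 'x')"/>
    <xsl:text>&#xA;</xsl:text>

    <!-- wrap the quoted part in square brackets -->
    <xsl:value-of select="replace($in, $myregex, '[$1]', 'x')"/>
  </xsl:template>
</xsl:stylesheet>
```

The first line of output should be true and the second say ["hi"] now. Forgetting the 'x' flag is the usual mistake with this layout: the whitespace then becomes part of the pattern and nothing matches.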