XML to plain text using XSLT
1. | DocBook to plain text - what do you use? |
Ednote. Wendell's explanation should apply to most XML formats, not just docbook
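If you think about, for example, the way interpolations of inline pseudo-markup (like *this* for emphasis) and similar constructs will affect line wrapping, particularly since

    Some blocks need to get indented like this: it is several lines
    long, and is required to *wrap* nicely, no matter what might turn
    up in it

-- requiring the smart introduction of whitespace both at line ends and at line starts (and maybe the extent of the indent varies as well) -- then it is apparent that creating "pretty plain text" is not as trivial as it may first appear. My guess is that a graceful XSLT-only solution will require two or three passes over the data. Another sad fact of life is that one person's pretty plain text is another's ugly stepsister.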
It seems to be one of those problems that is *nearly* general enough for a generic solution, but that has hidden gotchas and local particularities that have hindered the development of a one-size-fits-all solution. Here's an article about an approach that uses Java (SAX) for the final stage of production of the plain text: ibm.com. So it's not that this problem hasn't come up before. (Not too long ago the list even discussed producing plain-text tables from XML -- a real beast.) | |
2. | Inserting page breaks into plain text output. |
Output a <FF> tag where you want the page break, then post-process the .txt file to change <FF> to x0C. You could do it in 3 lines of Perl or 6 or so lines of JavaScript, and it wouldn't take too long. Something like this, off the top of my head:

Perl

    C:> xalan source.xml transform.xsl sansff.txt
    C:> type sansff.txt | perl convert.pl > withff.txt

convert.pl----------------------

    #!perl
    # Read all of STDIN, replace each <FF> marker with a form feed, write to STDOUT.
    binmode(STDIN);
    binmode(STDOUT);
    my $slurp = join('', <>);
    $slurp =~ s/<FF>/\x0C/g;
    print $slurp;

JavaScript

    C:> xalan source.xml transform.xsl sansff.txt
    C:> cscript convert.js sansff.txt withff.txt

convert.js----------------------

    // Windows Script Host: the arguments are the input and output file names.
    var fso = new ActiveXObject("Scripting.FileSystemObject");
    var file = fso.OpenTextFile(WScript.Arguments(0), 1); // 1 = open for reading
    var fileStr = file.ReadAll();
    file.Close();
    fileStr = fileStr.replace(/<FF>/g, String.fromCharCode(12)); // or "" pasted ff char
    var outFile = fso.CreateTextFile(WScript.Arguments(1), true);
    outFile.Write(fileStr);
    outFile.Close();
| |
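On the XSLT side, the marker has to be written as ordinary text, because a form feed (U+000C) is not a legal character in an XML 1.0 document and so cannot appear in the stylesheet, even as a character reference. The following is only a minimal sketch of that first stage. The `chapter` element is a hypothetical name chosen for illustration; substitute whatever element should start a new page in your own source.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- 'chapter' is a hypothetical element name, not taken from the original post -->
  <xsl:template match="chapter">
    <!-- write the marker before every chapter except the first -->
    <xsl:if test="preceding-sibling::chapter">
      <xsl:text>&lt;FF&gt;</xsl:text>
    </xsl:if>
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
```

With the text output method the escaped marker is written out as the literal characters `<FF>`, which is what the Perl and JavaScript post-processing steps look for.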
3. | Layout dependent on position |
Based on the pseudo XML below:

    <groups>
      <group>First</group>
      <group>Second</group>
      <group>Third</group>
      <group>Fourth</group>
      <group>Fifth</group>
    </groups>

in my XSL I want to get output that "increments" for each item of group data as it is processed, giving HTML output of:

    [#] First
    [#][#] Second
    [#][#][#] Third
    [#][#][#][#] Fourth
    [#][#][#][#][#] Fifth

where "[#]" is some bit of text (not a number) that can be repeated for spacing output in a pseudo tree format.

David offers:

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output method="xml" indent="yes"/>
      <xsl:template match="group">
        <xsl:for-each select="preceding-sibling::*|.">[#]</xsl:for-each>
        <xsl:value-of select="."/>
      </xsl:template>
    </xsl:stylesheet>

Ed Staub also offers:

Two references to the same page in ten minutes; see page 546 etc. in Michael Kay's "XSLT" book, on "Programming without Assignment Statements". Either recurse or use something like

    substring(">>>>>>>>>>>",1,position())

which will only work for single characters. In this case, using the > symbol:

    <xsl:template match="group">
      <xsl:value-of select="substring('>>>>>>>>>>>',1,position())"/>
      <xsl:value-of select="."/>
    </xsl:template>
| |
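The "recurse" alternative mentioned above is not spelled out in the answers, so here is a hedged sketch of one way it could look. The named template `repeat` and its parameters `str` and `count` are hypothetical names introduced for this example, and the sketch writes plain text rather than the HTML the questioner asked for. Unlike the substring() trick, it works for a multi-character string such as `[#]`.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- drop the whitespace-only text nodes between the group elements -->
  <xsl:strip-space elements="groups"/>

  <xsl:template match="group">
    <xsl:call-template name="repeat">
      <xsl:with-param name="str" select="'[#]'"/>
      <!-- the first group gets one marker, the second two, and so on -->
      <xsl:with-param name="count" select="count(preceding-sibling::group) + 1"/>
    </xsl:call-template>
    <xsl:value-of select="concat(' ', ., '&#10;')"/>
  </xsl:template>

  <!-- hypothetical helper: writes $str $count times by calling itself -->
  <xsl:template name="repeat">
    <xsl:param name="str"/>
    <xsl:param name="count" select="0"/>
    <xsl:if test="$count &gt; 0">
      <xsl:value-of select="$str"/>
      <xsl:call-template name="repeat">
        <xsl:with-param name="str" select="$str"/>
        <xsl:with-param name="count" select="$count - 1"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Counting preceding siblings rather than using position() keeps the result independent of how the group elements were selected.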
4. | IDENT |
> I'm transforming a HTML file to a text file with xsl, and I want to
> obtain a plain text, without lines feeds. How I can obtain it?

The short answer: wrap xsl:text elements around the actual text that you want output.

The long answer: have a look at this template:

> <xsl:template match="B" mode="special">
>   &lt;B&gt;
>   <xsl:value-of select="."/>
>   &lt;/B&gt;<xsl:text> </xsl:text>
> </xsl:template>

When a DOM builder goes over that bit of XML, it creates a node tree that looks like:

    +- (element) xsl:template
       +- (text) [NL][SP][SP]<B>[NL][SP][SP]
       +- (element) xsl:value-of
       +- (text) [NL][SP][SP]</B>
       +- (element) xsl:text
       |  +- (text) [SP]
       +- (text) [NL]

All those new lines and spaces are because of the indenting in the stylesheet. When the XSLT processor gets hold of this node tree, it first strips out all whitespace-only text nodes that are in the tree, unless they appear within an xsl:text element. So it gets rid of the last text node to give:

    +- (element) xsl:template
       +- (text) [NL][SP][SP]<B>[NL][SP][SP]
       +- (element) xsl:value-of
       +- (text) [NL][SP][SP]</B>
       +- (element) xsl:text
          +- (text) [SP]

Now, when the XSLT processor applies this template, any plain text nodes that are found are copied directly to the result tree. The xsl:value-of adds the relevant value (e.g. 'x') and the xsl:text adds whatever its content is. You end up with a result tree that looks like:

    +- (text) [NL][SP][SP]<B>[NL][SP][SP]x[NL][SP][SP]</B>[SP]

which is why you get all the whitespace that you do in your output: it's copying over the whitespace that you've used to indent stuff in your stylesheet.

The way around this is to use xsl:text elements to separate the whitespace that should be included in the result tree from the whitespace that is just used to indent your stylesheet code. If you use the template:

    <xsl:template match="B" mode="special">
      <xsl:text> &lt;B&gt; </xsl:text>
      <xsl:value-of select="." />
      <xsl:text> &lt;/B&gt;</xsl:text>
    </xsl:template>

Then the node tree that the DOM builder builds looks like:

    +- (element) xsl:template
       +- (text) [NL][SP][SP]
       +- (element) xsl:text
       |  +- (text) [SP]<B>[SP]
       +- (text) [NL][SP][SP]
       +- (element) xsl:value-of
       +- (text) [NL][SP][SP]
       +- (element) xsl:text
       |  +- (text) [SP]</B>
       +- (text) [NL]

This time, a lot of whitespace disappears when the whitespace-only nodes are cut out by the XSLT processor:

    +- (element) xsl:template
       +- (element) xsl:text
       |  +- (text) [SP]<B>[SP]
       +- (element) xsl:value-of
       +- (element) xsl:text
          +- (text) [SP]</B>

And the output is simply:

    +- (text) [SP]<B>[SP]x[SP]</B>

which I think is what you're after.
| |
5. | Line split and empty content |
A questioner asked: With XML as follows:

    <organisations>
      <orgRecord>
        <orgID>1</orgID>
        <organisation>World Health</organisation>
        <street>3 street </street>
        <city>Liverpool</city>
        <state></state>
        <postalCode>L42 GH</postalCode>
      </orgRecord>
    </organisations>

I need to concatenate the address fields:

    <xsl:value-of select="concat(street, ', ', city, ', ', state, ', ', postalCode)"/>

1. I would prefer the output to look like this:

    street,
    city,
    state,
    postalCode

2. Is there a way to check whether a "state" does not exist, i.e. <state></state> (as in the XML), and then return:

    street,
    city,
    postalCode

Jeni answers:

I guess that you mean you want to have a line break rather than just a space after each comma. Write the line breaks as character references:

    <xsl:value-of select="concat(street, ',&#xA;', city, ',&#xA;', state, ',&#xA;', postalCode)" />

Typing the line breaks literally inside the select attribute does not have the same effect: the XML parser normalizes literal line breaks in attribute values to spaces before the XSLT processor ever sees them, whereas a character reference survives as a real line feed.

> 2. Is there way to check that if a "state" does not exist i.e.
> <state></state> (as in xml) to return:

So you mean that the value of state is an empty string rather than that the element state doesn't exist? You can check whether the value of state is an empty string with:

    state = ''

or:

    not(string(state))

To get the output that you want, you need to split up the xsl:value-of and insert an xsl:if:

    <xsl:value-of select="concat(street, ',&#xA;', city, ',&#xA;')" />
    <xsl:if test="string(state)">
      <xsl:value-of select="concat(state, ',&#xA;')" />
    </xsl:if>
    <xsl:value-of select="postalCode" />
| |
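If several of the address fields may be empty, testing each one with its own xsl:if gets repetitive. As a hedged alternative (not part of Jeni's answer), the sketch below selects only the fields that have some content and writes a separator after all but the last of them. It assumes the element names from the question and, unlike the original expression, also trims stray spaces such as the one at the end of "3 street ".

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- ignore the indentation whitespace of the source document -->
  <xsl:strip-space elements="*"/>

  <xsl:template match="orgRecord">
    <!-- keep only the address fields whose value is not empty or blank -->
    <xsl:for-each select="(street | city | state | postalCode)[normalize-space()]">
      <xsl:value-of select="normalize-space()"/>
      <!-- comma and line feed after every selected field except the last -->
      <xsl:if test="position() != last()">
        <xsl:text>,&#10;</xsl:text>
      </xsl:if>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Because the predicate filters the node-set before the loop starts, position() and last() refer only to the non-empty fields, so no trailing comma is written when the final field is missing.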
6. | Indentation with whitespaces |
XSLT is designed to work in this kind of recursive way. In your case, as you describe, you would have two templates: one matching group elements and one matching field elements, and the different templates would do different things.

This is a very neat way of doing it if you're producing HTML and using CSS to produce the indentation that you want. You can use div elements in HTML to add indentation at each level (or if you want a quick-and-dirty way, use blockquote elements):

    <xsl:template match="group"> ...

If you're producing text, on the other hand, then the easiest way to achieve the indentation is to pass down an 'indent' parameter from template to template. You can make a template accept a parameter with an xsl:param element. When you apply templates to the child elements of the group element, you need to pass on the parameter with the xsl:with-param element. At each group element, add some more to the indent that you pass down by concatenating some more spaces to it:

    <xsl:template match="group"> ...

If you don't like this approach, you could continue to apply templates to all the descendants and then work out the depth of the element by counting how many ancestors it has with:

    count(ancestor::*)

and use that as the basis for an indentation or whatever. However, this is a lot more fiddly and it's likely to be a lot less efficient than going with the XSLT flow as described above.
| |
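The template bodies did not survive in the text above, so the following is only a sketch of the 'indent' parameter approach as it is described, not the original code. The `name` attribute used as the group's label is a hypothetical detail, and the two-space step is an arbitrary choice.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="group">
    <xsl:param name="indent" select="''"/>
    <!-- '@name' is a hypothetical attribute holding the group's label -->
    <xsl:value-of select="concat($indent, @name, '&#10;')"/>
    <xsl:apply-templates>
      <!-- children receive the current indent plus two more spaces -->
      <xsl:with-param name="indent" select="concat($indent, '  ')"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="field">
    <xsl:param name="indent" select="''"/>
    <xsl:value-of select="concat($indent, ., '&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

One thing to watch in XSLT 1.0: the built-in templates do not forward parameters, so if an element without its own template sits between two group elements, the indent starts again from the default at that point.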
8. | text output method, and newline problems. |
When all else fails, the minimally intrusive thing I know to do is this:

    <xsl:template >PDS_VERSION_ID = PDS3
    RECORD_TYPE = FIXED_LENGTH
    </xsl:template>

The reason this (and your earlier <xsl:text/> method) works is not the use of the xsl:text element per se, but rather the fact that there are two tags in a row with only whitespace between them. The processor has to decide if that whitespace is part of real character data or is just there for visual formatting and can therefore be ignored. The standard thing for the processor to do is to ignore such whitespace-only nodes. Therefore, in example 1, any element will do, and if you are using the text output method, you could use <a/> instead of <xsl:text/>. Your text starts after the <a/> element, and there is no line feed there. The line feed between the <xsl:template> and the <a/> is ignored, and you get what you want.

As for new lines, they will be copied to the output as long as they are part of a chunk of character data. If you want them between, for example, two consecutive xsl:value-of elements, similar considerations apply. Whitespace between elements gets ignored. <xsl:text> is one way to put non-whitespace text there. A shorter way is to put a non-breaking space in place, like this:

    <xsl:value-of select='"xxx"'/>&#160;
    <xsl:value-of select='"yyy"'/>

Now the two xsl:value-of elements are no longer separated by whitespace only, and so the new line is honored. In a decent editor or other viewer, the non-breaking space will display properly, but if the viewer expects one encoding and your output is in another, it may display as a strange character. But it is easier than <xsl:text>&#10;</xsl:text>. Any character will do, like a dash or period, but you do not always want something visible.

You can also try to define an entity to use instead of the non-breaking space, like putting this into your stylesheet:

    <!DOCTYPE xsl:stylesheet [
      <!ENTITY CR "<xsl:text>&#10;</xsl:text>">
    ]>

    <xsl:value-of select='"xxx"'/>&CR;
    <xsl:value-of select='"yyy"'/>

This works with Saxon, XT, Xalan, and Sablotron, but I can never get it to work with msxml, which complains about missing namespaces when I run it in XML Cooktop.
| |
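For comparison, here is a small self-contained sketch of the plain xsl:text route, which avoids both the non-breaking space in the output and the DTD entity. It uses the same 'xxx' and 'yyy' placeholder values as the examples above; the surrounding stylesheet is added only so that the fragment can be run.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- the indentation between these instructions is whitespace-only
         and is stripped, so each line break is written explicitly -->
    <xsl:value-of select="'xxx'"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:value-of select="'yyy'"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

This is more verbose than the other two methods, but nothing extra reaches the output and it does not depend on how a particular parser handles entities.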
10. | Splitting Camel Case strings. |
It seems to me that you want to preserve the capital letters? If *not* so, then the following is a most straightforward solution using the "str-split-to-words" template of FXSL.

This transformation:

    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:import href="strSplit-to-Words.xsl"/>

      <!-- This transformation must be applied to: testSplitToWords4.xml -->

      <xsl:output indent="yes" omit-xml-declaration="yes"/>

      <xsl:variable name="vCaps" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>

      <xsl:template match="/">
        <xsl:call-template name="str-split-to-words">
          <xsl:with-param name="pStr" select="/*"/>
          <xsl:with-param name="pDelimiters" select="$vCaps"/>
        </xsl:call-template>
      </xsl:template>
    </xsl:stylesheet>

when applied against this source.xml:

    <t>thisIsACamelCasedWord</t>

Produces:

    <word>this</word>
    <word>s</word>
    <word>amel</word>
    <word>ased</word>
    <word>ord</word>

In case you need to preserve the capital letters, the solution is slightly different. A first pass is made over the string, which inserts a space in front of every capital letter. The newly produced string is then tokenised. In the first pass I also use the "str-map" template from FXSL.

This transformation:

    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:myMark="f:MarkAnUppercase"
     exclude-result-prefixes="myMark">

      <xsl:import href="str-map.xsl"/>
      <xsl:import href="strSplit-to-Words.xsl"/>

      <!-- This transformation must be applied to: testSplitToWords4.xml -->

      <xsl:output indent="yes" omit-xml-declaration="yes"/>

      <xsl:variable name="vCaps" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>

      <myMark:myMark/>

      <xsl:template match="myMark:*">
        <xsl:param name="arg1"/>
        <xsl:if test="contains($vCaps, $arg1)">
          <xsl:text> </xsl:text>
        </xsl:if>
        <xsl:value-of select="$arg1"/>
      </xsl:template>

      <xsl:template match="/">
        <xsl:variable name="vSpaceDelimited">
          <xsl:call-template name="str-map">
            <xsl:with-param name="pFun" select="document('')/*/myMark:*[1]"/>
            <xsl:with-param name="pStr" select="/*"/>
          </xsl:call-template>
        </xsl:variable>

        <xsl:call-template name="str-split-to-words">
          <xsl:with-param name="pStr" select="$vSpaceDelimited"/>
          <xsl:with-param name="pDelimiters" select="' '"/>
        </xsl:call-template>
      </xsl:template>
    </xsl:stylesheet>

when applied against the same source.xml produces:

    <word>this</word>
    <word>Is</word>
    <word>A</word>
    <word>Camel</word>
    <word>Cased</word>
    <word>Word</word>
| |
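If FXSL is not to hand, the first pass (a space in front of every capital letter) can also be sketched with a plain recursive named template. This is an illustrative alternative, not the FXSL implementation: the template name `space-before-caps` is hypothetical, and the sketch writes a space-separated string rather than `word` elements.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:variable name="vCaps" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>

  <xsl:template match="/">
    <xsl:call-template name="space-before-caps">
      <xsl:with-param name="pStr" select="string(/*)"/>
    </xsl:call-template>
  </xsl:template>

  <!-- hypothetical helper: copies pStr one character at a time,
       writing a space before each capital letter -->
  <xsl:template name="space-before-caps">
    <xsl:param name="pStr"/>
    <xsl:if test="$pStr">
      <xsl:variable name="vChar" select="substring($pStr, 1, 1)"/>
      <xsl:if test="contains($vCaps, $vChar)">
        <xsl:text> </xsl:text>
      </xsl:if>
      <xsl:value-of select="$vChar"/>
      <xsl:call-template name="space-before-caps">
        <xsl:with-param name="pStr" select="substring($pStr, 2)"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Two caveats: a string that starts with a capital letter gets a leading space, and the recursion goes one level deep per character, so very long strings may exhaust the stack in processors that do not optimise tail calls.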
11. | Line breaking stylesheet for plain text |
My first attempt was to find the number of 64 byte chunks and to produce them with simple iteration (using my "buildListWhile" template). While this was very simple to implement, the result is not very appealing, because words at the line ends are split in the middle...

An intelligent solution will not split a word that overlaps the line's end position. Instead, this word will be the first word of the next line. Below I'm presenting such a solution. I'm re-using my functional tokenizer, published last month on activestate.com.

The idea is to parse the text and obtain a result structured like the following:

    <line><word>My</word><word>first</word><word>attempt</word><word>was</word></line>
    <line><word>to</word><word>find</word><word>the</word><word>number</word></line>
    <line><word>of</word><word>64</word><word>byte</word><word>chunks</word></line>

where the string() of any "line" is the maximum that would fit the given line-length if words are not split apart.

I'm using once again the str-foldl template, which on every character instantiates the template matching "str-split2lines-func:*". This template recognises every new word and accumulates the result, which is a list of "line" elements, each having a list of "word" children. After the last "line" element there's a single "word", in which the "current word" is being accumulated.

Whenever the current character is one of the specified delimiters, this signals the formation of a new word. This word is either added to the last line (if the total line length will not exceed the specified line-length), or a new line is started and this word becomes the first in the new line.

There are two possible improvements, which are left as an exercise to you:

1. If a single word exceeds the specified line-length, then it must be split apart.
2. Lines could be justified both to the left and to the right.

And here's the code:

    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xmlns:msxsl="urn:schemas-microsoft-com:xslt"
     xmlns:str-split2lines-func="f:str-split2lines-func"
     exclude-result-prefixes="xsl msxsl str-split2lines-func">

      <xsl:import href="str-foldl.xsl"/>

      <str-split2lines-func:str-split2lines-func/>

      <xsl:output indent="yes" omit-xml-declaration="yes"/>

      <xsl:template name="str-split-to-lines">
        <xsl:param name="pStr"/>
        <xsl:param name="pLineLength" select="60"/>
        <xsl:param name="pDelimiters" select="' &#9;&#10;'"/>

        <xsl:variable name="vsplit2linesFun"
                      select="document('')/*/str-split2lines-func:*[1]"/>

        <xsl:variable name="vrtfParams">
          <delimiters><xsl:value-of select="$pDelimiters"/></delimiters>
          <lineLength><xsl:copy-of select="$pLineLength"/></lineLength>
        </xsl:variable>

        <xsl:variable name="vResult">
          <xsl:call-template name="str-foldl">
            <xsl:with-param name="pFunc" select="$vsplit2linesFun"/>
            <xsl:with-param name="pStr" select="$pStr"/>
            <xsl:with-param name="pA0" select="msxsl:node-set($vrtfParams)"/>
          </xsl:call-template>
        </xsl:variable>

        <xsl:for-each select="msxsl:node-set($vResult)/line">
          <xsl:for-each select="word">
            <xsl:value-of select="concat(., ' ')"/>
          </xsl:for-each>
          <xsl:value-of select="'&#xA;'"/>
        </xsl:for-each>
      </xsl:template>

      <xsl:template match="str-split2lines-func:*">
        <xsl:param name="arg1" select="/.."/>
        <xsl:param name="arg2"/>

        <xsl:copy-of select="$arg1/*[position() &lt; 3]"/>
        <xsl:copy-of select="$arg1/line[position() != last()]"/>

        <xsl:choose>
          <xsl:when test="contains($arg1/*[1], $arg2)">
            <xsl:if test="string($arg1/word)">
              <xsl:call-template name="fillLine">
                <xsl:with-param name="pLine" select="$arg1/line[last()]"/>
                <xsl:with-param name="pWord" select="$arg1/word"/>
                <xsl:with-param name="pLineLength" select="$arg1/*[2]"/>
              </xsl:call-template>
            </xsl:if>
          </xsl:when>
          <xsl:otherwise>
            <xsl:copy-of select="$arg1/line[last()]"/>
            <word><xsl:value-of select="concat($arg1/word, $arg2)"/></word>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:template>

      <!-- Test if the new word fits into the last line -->
      <xsl:template name="fillLine">
        <xsl:param name="pLine" select="/.."/>
        <xsl:param name="pWord" select="/.."/>
        <xsl:param name="pLineLength" />

        <xsl:variable name="vnWordsInLine" select="count($pLine/word)"/>
        <xsl:variable name="vLineLength"
                      select="string-length($pLine) + $vnWordsInLine"/>

        <xsl:choose>
          <xsl:when test="not($vLineLength + string-length($pWord) > $pLineLength)">
            <line>
              <xsl:copy-of select="$pLine/*"/>
              <xsl:copy-of select="$pWord"/>
            </line>
          </xsl:when>
          <xsl:otherwise>
            <xsl:copy-of select="$pLine"/>
            <line>
              <xsl:copy-of select="$pWord"/>
            </line>
            <word/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:template>
    </xsl:stylesheet>

When instantiated as follows:

    <xsl:template match="/">
      <xsl:call-template name="str-split-to-lines">
        <xsl:with-param name="pStr" select="/*"/>
        <xsl:with-param name="pLineLength" select="64"/>
        <xsl:with-param name="pDelimiters" select="' &#9;&#10;'"/>
      </xsl:call-template>
    </xsl:template>

and the source xml document is:

Ednote: Please note the following should not be line broken; laid out better for browser viewing :-)

    <text> Dec. 13 - As always for a presidential inaugural, security and
    surveillance were extremely tight in Washington, DC, last January. But as
    George W. Bush prepared to take the oath of office, security planners
    installed an extra layer of protection: a prototype software system to
    detect a biological attack. The U.S. Department of Defense, together with
    regional health and emergency-planning agencies, distributed a special
    patient-query sheet to military clinics, civilian hospitals and even aid
    stations along the parade route and at the inaugural balls. Software
    quickly analyzed complaints of seven key symptoms - from rashes to sore
    throats - for patterns that might indicate the early stages of a
    bio-attack. There was a brief scare: the system noticed a surge in flulike
    symptoms at military clinics. Thankfully, tests confirmed it was just
    that - the flu.</text>

with the text occupying just one line, the result of the transformation is:

    Dec. 13 - As always for a presidential inaugural, security and
    surveillance were extremely tight in Washington, DC, last
    January. But as George W. Bush prepared to take the oath of
    office, security planners installed an extra layer of
    protection: a prototype software system to detect a biological
    attack. The U.S. Department of Defense, together with regional
    health and emergency-planning agencies, distributed a special
    patient-query sheet to military clinics, civilian hospitals and
    even aid stations along the parade route and at the inaugural
    balls. Software quickly analyzed complaints of seven key
    symptoms - from rashes to sore throats - for patterns that might
    indicate the early stages of a bio-attack. There was a brief
    scare: the system noticed a surge in flulike symptoms at
    military clinics. Thankfully, tests confirmed it was just that -
    the flu.

As can be seen, the largest line-length is 64 -- as specified.
| |
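The stylesheet above relies on FXSL's str-foldl and on the msxsl:node-set() extension. For readers who cannot use either, here is a hedged sketch of the same greedy idea in plain XSLT 1.0. It is a simpler alternative, not the author's method: the named template `wrap` and its parameters `pWords`, `pWidth` and `pLine` are hypothetical, it builds no intermediate line/word structure, and it treats any run of whitespace as a single delimiter.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:call-template name="wrap">
      <!-- normalize-space() collapses whitespace runs; the appended
           space terminates the last word -->
      <xsl:with-param name="pWords" select="concat(normalize-space(/*), ' ')"/>
      <xsl:with-param name="pWidth" select="64"/>
    </xsl:call-template>
  </xsl:template>

  <!-- hypothetical helper: greedy wrap, one word per recursive call;
       pLine holds the line being built -->
  <xsl:template name="wrap">
    <xsl:param name="pWords"/>
    <xsl:param name="pWidth" select="64"/>
    <xsl:param name="pLine" select="''"/>
    <xsl:variable name="vWord" select="substring-before($pWords, ' ')"/>
    <xsl:variable name="vRest" select="substring-after($pWords, ' ')"/>
    <xsl:choose>
      <!-- no words left: write out the pending line, if there is one -->
      <xsl:when test="$vWord = ''">
        <xsl:if test="$pLine != ''">
          <xsl:value-of select="concat($pLine, '&#10;')"/>
        </xsl:if>
      </xsl:when>
      <!-- first word of a new line always goes in, even if too long -->
      <xsl:when test="$pLine = ''">
        <xsl:call-template name="wrap">
          <xsl:with-param name="pWords" select="$vRest"/>
          <xsl:with-param name="pWidth" select="$pWidth"/>
          <xsl:with-param name="pLine" select="$vWord"/>
        </xsl:call-template>
      </xsl:when>
      <!-- the word still fits after a separating space -->
      <xsl:when test="string-length($pLine) + 1 + string-length($vWord) &lt;= $pWidth">
        <xsl:call-template name="wrap">
          <xsl:with-param name="pWords" select="$vRest"/>
          <xsl:with-param name="pWidth" select="$pWidth"/>
          <xsl:with-param name="pLine" select="concat($pLine, ' ', $vWord)"/>
        </xsl:call-template>
      </xsl:when>
      <!-- it does not fit: write the line and start a new one with the word -->
      <xsl:otherwise>
        <xsl:value-of select="concat($pLine, '&#10;')"/>
        <xsl:call-template name="wrap">
          <xsl:with-param name="pWords" select="$vRest"/>
          <xsl:with-param name="pWidth" select="$pWidth"/>
          <xsl:with-param name="pLine" select="$vWord"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Like the original, this sketch does not split a single word that is longer than the line length, and it leaves the lines unjustified. The recursion goes one level deep per word, which may be a problem for very long texts in processors that do not optimise tail calls.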