URI usage in XSLT. A mess
I've run a few variations through a number of processors. The results are here. DaveP.
An xml-dev posting on this topic pointed me to this url which seemingly is still a draft. As Rick says:
which was released at the same time as RFC3986. As it says, the issue of the exact syntax of the file: scheme is unresolved. I think all we can do is know the range of tricks. Whenever I teach an XML-related course, I always try to make the point "The thing you type in the address box of your browser is not a URL" and let them know about this kind of problem with file:, because sooner or later they will probably have to face it.
1. | Uri specifications for xslt |
I tested a number of options with a number of processors (see above); the results are here. Conrad mailed me with another informal usage: there is another variant, this one from Netscape. It allows file:///C|/, i.e. a pipe (|) instead of a colon (:) after the drive. | |
2. | file access |
AFAIK, the convention is not « file:/// » but « file:// ». The third slash is only required on UNIX systems, in order to denote the root directory.

Marc-Aurèle DARCHE fills in the picture.

For more information on this, I would suggest referring to the RFC "Uniform Resource Identifiers (URI): Generic Syntax" or the RFC "Uniform Resource Locators (URL)". Here are some chosen pieces:

The URI syntax is dependent upon the scheme. In general, absolute URI are written as follows:

  <scheme>:<scheme-specific-part>

An absolute URI contains the name of the scheme being used (<scheme>) followed by a colon (":") and then a string (the <scheme-specific-part>) whose *interpretation depends on the scheme*. The URI syntax does not require that the scheme-specific-part have any general structure or set of semantics which is common among all URI. However, a subset of URI do share a common syntax for representing hierarchical relationships within the namespace. This "generic URI" syntax consists of a sequence of four main components:

  <scheme>://<authority><path>?<query>

A file URL takes the form:

  file://<host>/<path>

where <host> is the fully qualified domain name of the system on which the <path> is accessible, and <path> is a hierarchical directory path of the form <directory>/<directory>/.../<name>. For example, a VMS file

  DISK$USER:[MY.NOTES]NOTE123456.TXT

might become

  <URL:file://vms.host.edu/disk$user/my/notes/note12345.txt>

As a special case, <host> can be the string "localhost" or the empty string; this is interpreted as `the machine from which the URL is being interpreted'.

BNF

So the following are correct on Unix:

  file://localhost/home/user/data/fileResource

And I can only guess, and cannot test, for Microsoft Windows:

  file://localhost/C:/somedir/anotherone/fileResource

David Carlisle and Peter Flynn expand on this:

Note that So if you use \ you have no reason to expect relative links to work. If the system changes \ to / for you as a silent error recovery, that's possibly nice of it, but you shouldn't rely on it.

is OK, or, if you don't want to name the machine: or For Linux, with setup of xt (jdk1.3), RedHat 6.2,
only | |
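The file://<host>/<path> split quoted from the RFC can be checked with a short Python sketch. It uses the standard library's urllib.parse.urlsplit; the helper name describe_file_uri is made up for illustration, and the sample URIs are the Unix and Windows forms given above plus a two-slash variant. This only shows how one generic URI parser reads these strings; an XSLT processor may well behave differently.

```python
from urllib.parse import urlsplit


def describe_file_uri(uri):
    """Return (host, path) for a file: URI, read as file://<host>/<path>.

    Hypothetical helper: it reports how a generic URI parser splits the
    string, not what any particular XSLT processor will accept.
    """
    parts = urlsplit(uri)
    return parts.netloc, parts.path


if __name__ == "__main__":
    samples = (
        "file://localhost/home/user/data/fileResource",  # named host
        "file:///home/user/data/fileResource",           # empty host
        "file://home/user/data/fileResource",            # "home" is read as the host
        "file://localhost/C:/somedir/anotherone/fileResource",
    )
    for uri in samples:
        host, path = describe_file_uri(uri)
        print(f"{uri}\n    host={host!r} path={path!r}")
```

With only two slashes, the first directory name lands in the host position, which is one way a "correct looking" URL ends up pointing at the wrong place.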
3. | URI form when file is used |
It is if the URL is considered transparent, but if it is to be considered opaque then the server gets to decide how to decode it, and the use of "/" is not required. The RFC does not say whether the file: scheme is supposed to use transparent or opaque URLs, and there have been several different interpretations as well as outright errors (like allowing file:c:) in actual implementations. Sometimes an application will accept one thing from the command line and another in the document() function, too. That is why you have to try various forms if you think you might be having this problem.

> Which actually doesn't

But you cannot always be sure where the URL gets parsed, and it is not always by the OS.

> BTW from within applications, Windows accepts both

Windows Explorer does for local files, but try to use a network host name on an NT network with forward slashes and you will find it does not work. For example, try to open a file using the standard Windows file open dialog on a computer named "computer_a" using "\\computer_a\" and it will work, but not if you use "//computer_a/". Windows is not consistent about forward vs. reverse slashes. Some libraries insist on that form anyway.

> Bottom line: the original form is preferable in

"Should" and "actually does" are not always the same, especially in the case of "file:" URLs. For example, Sablotron 0.70 requires "file://", not "file:///", although "file://" is clearly an error by any interpretation. I think that "file:\\\" is rarely required but it is well to try it if all else fails to work. Actually, there is a good degree of tolerance for variant forms these days, especially among the major processors (probably, as you say, because the OS accepts several variations), but depending on the library in use you may get unlucky. | |
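Since the advice here is to try several spellings, the following Python sketch lists the variants mentioned in this FAQ for a single local path. The function name candidate_file_urls is hypothetical, and nothing in it says which form a given processor accepts; it only saves retyping the candidates.

```python
def candidate_file_urls(path):
    """List the file: URL spellings discussed in this FAQ for one local path.

    Hypothetical helper for trial and error; the order is not a ranking.
    """
    rooted = path.replace("\\", "/")
    if not rooted.startswith("/"):
        rooted = "/" + rooted  # "C:/dir/f.xml" becomes "/C:/dir/f.xml"

    candidates = [
        "file://" + rooted,           # empty host, three slashes in total
        "file://localhost" + rooted,  # explicit localhost
        "file:/" + rooted,            # two slashes only (the Sablotron 0.70 form)
    ]
    # Netscape-style variant: pipe instead of colon after a drive letter.
    if len(rooted) > 2 and rooted[2] == ":":
        candidates.append("file://" + rooted[:2] + "|" + rooted[3:])
    return candidates


if __name__ == "__main__":
    for url in candidate_file_urls(r"C:\dir\f.xml"):
        print(url)
```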
4. | Referring to files in a URL with Xalan |
> The question is, how do I refer to local file:// type URLs when using Xalan? With file:///path/to/file. | |
5. | Encoding URLs containing spaces. |
> In my demo.xsl, I have the following fragment
>
>   <a href="getDemo.xml?Id={@Account}">click</a>
>
> where @Account is holding onto 'Hello World'

The output HTML contains

  href="getDemo.xml?Id=Hello World"

And you should note that the attribute value here is not a legal URI. Raw space characters are not allowed in URIs.

> When I use IE 5, Netscape 6, or Opera 5 this is what
> is sent via the http call:
>
>   http://localhost/demo/getDemo.xml?Id=Hello%20World

Then IE 5, Netscape 6 and Opera 5 are being very nice to translate your unescaped space for you. They are under no obligation to do that.

> When I use Netscape 4.7 this is what is sent via the http call
>
>   http://localhost/demo/getDemo.xml?Id=Hello World

Netscape 4.7 is not doing anything wrong, although one could argue that since an href value must be a URI, any disallowed characters in the value should be considered part of the URI and should be escaped so as not to have this situation. But there is no requirement that this actually happen. You should just translate the spaces in the first place.

If you are sure that your server will accept '+' instead of ' ', you might try using the translate() function:

  <a href="getDemo.xml?Id={translate(@Account,' ','+')}">click</a>

or, the way I would do it, would be to do a string replacement of ' ' with '%20':

  <a>
    <xsl:attribute name="href">
      <xsl:text>getDemo.xml?Id=</xsl:text>
      <xsl:call-template name="replace">
        <xsl:with-param name="stringIn" select="@Account"/>
        <xsl:with-param name="charsIn" select="' '"/>
        <xsl:with-param name="charsOut" select="'%20'"/>
      </xsl:call-template>
    </xsl:attribute>
    <xsl:text>click</xsl:text>
  </a>

The named template is below. Note that the recursive call must use the same name as the template itself.

  <xsl:template name="replace">
    <xsl:param name="stringIn"/>
    <xsl:param name="charsIn"/>
    <xsl:param name="charsOut"/>
    <xsl:choose>
      <xsl:when test="contains($stringIn,$charsIn)">
        <!-- Emit the text before the match plus the replacement, then recurse on the rest -->
        <xsl:value-of select="concat(substring-before($stringIn,$charsIn),$charsOut)"/>
        <xsl:call-template name="replace">
          <xsl:with-param name="stringIn" select="substring-after($stringIn,$charsIn)"/>
          <xsl:with-param name="charsIn" select="$charsIn"/>
          <xsl:with-param name="charsOut" select="$charsOut"/>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$stringIn"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
| |
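For comparison, here is the same space-escaping step as a Python sketch, useful if the href is built or checked outside the stylesheet. It uses urllib.parse.quote and quote_plus from the standard library; the function name demo_href is made up, and the getDemo.xml URL is the one from the question.

```python
from urllib.parse import quote, quote_plus


def demo_href(account):
    """Build the href from the question with the parameter value percent-encoded."""
    # safe="" so that reserved characters such as "/" and "&" are escaped too.
    return "getDemo.xml?Id=" + quote(account, safe="")


if __name__ == "__main__":
    print(demo_href("Hello World"))   # space becomes %20
    print(quote_plus("Hello World"))  # space becomes +, the translate() variant
```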
6. | Specifying a drive in an URI under Windows |
>Aren't there any standard URI syntax supporting Windows file system? Try file:///D:/foo/bar/baz.xml With the three slashes. | |
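A minimal Python sketch of the three-slash form for a Windows path, assuming the path is absolute and starts with a drive letter. The helper name windows_path_to_file_uri is hypothetical; it only flips the backslashes and percent-encodes characters such as spaces.

```python
from urllib.parse import quote


def windows_path_to_file_uri(path):
    """Turn an absolute Windows path such as D:\\foo\\bar into file:///D:/foo/bar.

    Assumes a drive-letter path; UNC paths (\\\\host\\share) are not handled.
    """
    forward = path.replace("\\", "/")
    # Keep "/" and the drive colon as they are; escape everything else that needs it.
    return "file:///" + quote(forward, safe="/:")


if __name__ == "__main__":
    print(windows_path_to_file_uri(r"D:\foo\bar\baz.xml"))
    print(windows_path_to_file_uri(r"D:\my docs\baz.xml"))
```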
7. | Escaping URI's |
The attached XSL code should work with UTF-8 for URIs, since the RFC guarantees that for non-ASCII characters both the XSLT engine AND the browser will UTF-8 encode then hexify when seeing non-ASCII in a URI. However, this means that this only works for URIs; one can't use translate(..., '%', '=') on the output and expect it to work (which is what I was wanting, since I want to output email headers with UTF-8 encoding).

For this case, I automatically detect whether a correct escape-uri(...) function is present in the XSLT engine. If so, we use that, and therefore the translate trick works, since escape-uri will UTF-8 encode and hexify the passed string. If there is no escape-uri to deal with UTF-8, it will default to outputting ?s instead of the real characters. This can be overridden to use the URI hack outlined above.

This code obeys the RFC to the letter. (I even deal with the nasty % case.)

The entry point to the XSL code is:

  <xsl:call-template name="my-escape-uri">
    <xsl:with-param name="str" select="'the url here'"/>
    <xsl:with-param name="allow-utf8" select="true()"/>
  </xsl:call-template>

--------------

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.w3.org/1999/xhtml"
                version="1.0">

  <xsl:output method="html" indent="no" encoding="UTF-8"
              doctype-system="http://www.w3.org/TR/html4/strict.dtd"
              doctype-public="-//W3C//DTD HTML 4.0 Transitional//EN"/>

  <!-- Escape URIs -->
  <xsl:variable name="uri-input">The@dog-z went/__%ab%zu%c</xsl:variable>
  <xsl:variable name="uri-output">The%40dog-z%20went%2F_%7F_%ab%25zu%25c</xsl:variable>
  <xsl:variable name="have-escape-uri"
                select="escape-uri($uri-input, true()) = $uri-output"/>

  <!-- According to the RFC, non-ascii chars will be utf-8 encoded and
       escaped with %s by the xslt-engine when in a 'uri' attribute, or by
       the browser if the xslt-engine doesn't. This is ok, but not enough,
       since we still won't have working RFC822 (email) Froms! since they
       need =s. However, as there is nothing I can do about this, I will
       just hope for the best if an xslt engine doesn't have uri-escape.

       note the linebreaks \ at end of line only! (DaveP) are there to
       enable a moderate display. Concatenate them to one line. -->
  <xsl:variable name="ascii-charset"> !"#$%&amp;'()*+,-./0123456789:;&lt;=&gt;?\
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ </xsl:variable>
  <xsl:variable name="uri-ok">-_.!~*'()0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\
abcdefghijklmnopqrstuvwxyz</xsl:variable>
  <xsl:variable name="hex">0123456789ABCDEFabcdef</xsl:variable>

  <xsl:template name="do-escape-uri">
    <xsl:param name="str"/>
    <xsl:param name="allow-utf8"/>
    <xsl:if test="$str">
      <xsl:variable name="first-char" select="substring($str,1,1)"/>
      <xsl:choose>
        <xsl:when test="$first-char = '%' and string-length($str) >= 3 and contains($hex, substring($str,2,1)) and contains($hex, substring($str,3,1))">
          <!-- The percent char is ok IF it is followed by a valid hex pair -->
          <xsl:value-of select="$first-char"/>
        </xsl:when>
        <xsl:when test="contains($uri-ok, $first-char)">
          <!-- This char is ok inside urls -->
          <xsl:value-of select="$first-char"/>
        </xsl:when>
        <xsl:when test="not(contains($ascii-charset, $first-char))">
          <!-- Non-ascii output raw based on utf8 allowed or not -->
          <xsl:choose>
            <xsl:when test="$allow-utf8">
              <xsl:value-of select="$first-char"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:text>%3F</xsl:text>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:when>
        <xsl:otherwise>
          <!-- URL escape this char: its position in ascii-charset gives its code -->
          <xsl:variable name="ascii-value"
                        select="string-length(substring-before($ascii-charset,$first-char)) + 32"/>
          <xsl:text>%</xsl:text>
          <xsl:value-of select="substring($hex,floor($ascii-value div 16) + 1,1)"/>
          <xsl:value-of select="substring($hex,$ascii-value mod 16 + 1,1)"/>
        </xsl:otherwise>
      </xsl:choose>
      <!-- Recurse on the remainder of the string -->
      <xsl:call-template name="do-escape-uri">
        <xsl:with-param name="str" select="substring($str,2)"/>
        <xsl:with-param name="allow-utf8" select="$allow-utf8"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <xsl:template name="my-escape-uri">
    <xsl:param name="str"/>
    <xsl:param name="allow-utf8"/>
    <xsl:choose>
      <xsl:when test="$have-escape-uri">
        <xsl:value-of select="escape-uri($str, true())"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:call-template name="do-escape-uri">
          <xsl:with-param name="str" select="$str"/>
          <xsl:with-param name="allow-utf8" select="$allow-utf8"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="/">
    <html>
      <xsl:call-template name="my-escape-uri">
        <xsl:with-param name="str" select="'The@dog-zwent/_ÿ_%ab%zu%c'"/>
      </xsl:call-template>
    </html>
  </xsl:template>

</xsl:stylesheet>
| |
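The escaping rules the stylesheet applies (keep a % that starts a valid hex pair, keep the "uri-ok" characters, escape everything else) can be cross-checked with a small Python sketch. This is not a port of the stylesheet: the function name escape_uri is made up here, non-ASCII characters are always UTF-8 encoded and hexified rather than passed through, and the test string assumes a DEL character (0x7F) sits between the two underscores, as the %7F in the expected output suggests.

```python
import re
from urllib.parse import quote

# Punctuation the stylesheet's uri-ok variable allows; quote() already keeps
# letters, digits and "_.-~" unescaped.
URI_OK_MARKS = "-_.!~*'()"

_VALID_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def escape_uri(text):
    """Percent-encode text, leaving already valid %HH escapes untouched."""
    out = []
    i = 0
    while i < len(text):
        if _VALID_ESCAPE.match(text, i):
            out.append(text[i:i + 3])  # keep an existing, well-formed escape
            i += 3
        else:
            # quote() encodes the character as UTF-8 and hexifies each byte.
            out.append(quote(text[i], safe=URI_OK_MARKS))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(escape_uri("The@dog-z went/_\x7f_%ab%zu%c"))
    print(escape_uri("\u00ff"))  # non-ASCII: UTF-8 bytes, then hex
```

A stray % that is not followed by two hex digits is escaped as %25, which is the "nasty % case" mentioned above.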
8. | Uri specification |
Thomas covered this pretty thoroughly, and contributed significantly to related threads on other lists, but I want to reiterate the point that the statement above is *not* a rule. Everything after the scheme in a file: URL is OS-dependent by definition. (Thanks, Netscape, for bestowing yet another abomination upon us). The format is: "file:" + an OS-dependent path, properly escaped Ultimately, there are *no* assumptions you can make about what comes after the scheme, if you don't know the OS the URL is associated with. Examples of paths on different OSes that make life difficult: Mac (before OS X) Windows/DOS UNIX The URI resolver shared by Windows Explorer and Internet Explorer is by far the most forgiving. Its ability to handle pretty much anything you throw at it should not be considered evidence of the equivalence of different kinds of path components, or of there being a canonical format for the paths in file: | |
9. | utf-8 in uri's |
Everyone else seems to have missed your point. You are running into an issue with an underspecified part of the URI, HTTP and HTML specs: there is no standard mechanism for declaring what encoding is being used when representing non-ASCII characters (x80 and above) in the %-escaped format used in URIs and HTML form data submissions. Tomcat interprets %C5%A2 in the HTTP request as bytes C5 A2, and exposes them through the Java/JSP API as 2 chars in a String according to an assumed (and probably wrong) iso-8859-1 encoding. On the receiving end, you must convert these chars back into bytes, assuming iso-8859-1, and then convert them to a String again, this time assuming UTF-8. I did this in JSPs with WebLogic a while back, and it was pretty straightforward. I'm not sure how it works with your particular Tomcat/Cocoon setup, though. | |
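The round trip described above (bytes wrongly decoded as iso-8859-1, then re-encoded and decoded as UTF-8) looks like this as a Python sketch. It only illustrates the byte-level repair; the function name recode_latin1_as_utf8 is made up, and it is not the Java/JSP code itself.

```python
from urllib.parse import unquote


def recode_latin1_as_utf8(misdecoded):
    """Undo an iso-8859-1 decode of bytes that were really UTF-8."""
    # Every iso-8859-1 character maps back to exactly one byte, so the
    # original byte sequence can be recovered and decoded correctly.
    return misdecoded.encode("iso-8859-1").decode("utf-8")


# What a server sees if it assumes iso-8859-1 for the escaped bytes C5 A2:
as_server_sees_it = unquote("%C5%A2", encoding="iso-8859-1")

if __name__ == "__main__":
    print(len(as_server_sees_it))                    # two characters
    print(recode_latin1_as_utf8(as_server_sees_it))  # one character, U+0162
```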
10. | Use of \ in url |
> '\' by itself is not prohibited in URLs, is it?
No, but it has a _single_ path component called "\data\file.xsl", so if that file has an <xsl:include href="foo.xsl"/> then foo.xsl is a relative URI that corresponds to "http://example.com/foo.xsl", which probably is not what was intended. If the same file is served from "http://example.com/data/file.xsl" then the relative foo.xsl URI will resolve to "http://example.com/data/foo.xsl". Note that even if the server tries to be kind and silently maps \ to / (as it may do, since it is free to map URIs to its file system in any way it likes), it will still fail, because it is the _client_ that resolves relative URIs and requests an absolute URI from the server, and only with the second form is the client mandated to ask for http://example.com/data/foo.xsl.
It's legal, but the base URI of the included file might not be what you expect. You can see this in action. If you try http://www.nag.co.uk/numeric/fl/manual/html/FLlibrarymanual.asp and click on the link to "Foreword" you'll follow a relative link to a PDF file with a Foreword to the NAG library (all these docs are generated with XSLT by the way, if you wonder why I hang out here :-) If you try http://www.nag.co.uk/numeric\fl\manual\html\FLlibrarymanual.asp then it depends what you try it with. IE silently corrects your input to the first form, and everything works as before. Mozilla (also on Windows) changes the location bar to http://www.nag.co.uk/numeric%5Cfl%5Cmanual%5Chtml%5CFLlibrarymanual.asp Our (Microsoft) server accepts that and serves the same page, but now none of the relative links work, and all give 404s. The fact that IE changes the user input in its location box is a matter for IE; this is a user input issue. But once the string has been entered and is being passed between different APIs, it's not at all clear whether that kind of user-error correction is really allowed, and the Mozilla behaviour will be far more typical. |
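The client-side resolution described above can be reproduced with Python's urllib.parse.urljoin, which resolves a relative reference against a base URI much as a client would. The example.com URLs are the ones from the answer; this is a sketch of generic resolution, not of any particular XSLT processor.

```python
from urllib.parse import urljoin

# Base URI whose path is the single component "\data\file.xsl".
backslash_base = "http://example.com/\\data\\file.xsl"
# Base URI with real path separators.
slash_base = "http://example.com/data/file.xsl"

resolved_from_backslash = urljoin(backslash_base, "foo.xsl")
resolved_from_slash = urljoin(slash_base, "foo.xsl")

if __name__ == "__main__":
    print(resolved_from_backslash)  # the backslashes are not treated as separators
    print(resolved_from_slash)
```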