Various
The basic question started as a query about how to obtain and display, client-side,
a number of linked pages from a database. Wouldn't it be easier to simply paginate the XML before it gets to the
translator? i.e. throw windows of XML data at the translator and translate
each page separately?
Steve Muench adds:
I've been out to several customers who've tried
to do this "paging" in one way or another with XSLT. The least
efficient approach I've observed was a customer
querying 25,000 rows of database data into XML
(using the Oracle XML SQL Utility) and then using
top-level stylesheet parameters and <xsl:if> elements
in their stylesheet to "filter" the data and emit
only rows N through M of those 25,000 to the browser
(where M-N is usually in the 10-15 range).
Yikes.
Much more efficient is to, as you suggest, query-up/pass-in
only the 15 rows in the current "window" you want to format
and not to use the in-memory filtering capability of XSLT
as a replacement for the fast indexes that a database might
be able to use to quickly return just the rows you want
to "see" on the current page.
Terris" <terris@terris.com> adds:
Easier said than done. Databases can not directly
perform
select * from A
return rows 50-60
This is a simple example. Most queries contain joins.
Queries can be expensive. The query cost can
prohibit executing the same query
over and over again, scrolling through
the results for the desired subset. And anyway
the result set might change between
executions. You would be surprised
how many sites actually do this though.
You either consume disk or memory, take
your pick. If you go the disk route (disk
is much cheaper than memory),
you either use the database or
the file system. For example, you could
store XML data sets in files.
One database-based solution is to
create a temporary table
that holds the query results. This temporary table
has an additional "row number" column which
is indexed. There are several problems with this solution:
1. It consumes database resources.
2. If you use a "real" temporary table you have to associate a database connection with the session -- don't even try it. So you have no choice but to create real tables on the fly.
3. Tables have to be cleaned up after a session times out.
4. You need to make sure the same session can only have one active temporary table at a time, otherwise one session can consume an excessive amount of disk.
5. Database engines were not designed for this -- engines like the schema to stay constant, and creating and dropping tables every few minutes is disastrous.
Instead of using temporary tables,
create one "query results" table that is
also indexed on session ID. You still
have to garbage collect the rows
at some point, but managing rows is easier
than managing entire tables.
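To make the idea concrete, a minimal sketch of such a shared, session-keyed results table and a page fetch against it might look like the following (the table name, column names, and the :session_id placeholder are illustrative assumptions, not from the original posts, and exact types vary by database):

  -- One shared "query results" table, keyed by session and row number.
  CREATE TABLE query_results (
      session_id  VARCHAR(64)   NOT NULL,  -- which user session owns these rows
      row_num     INTEGER       NOT NULL,  -- position within that session's result set
      payload     VARCHAR(4000) NOT NULL,  -- the row data (or XML fragment) to display
      PRIMARY KEY (session_id, row_num)    -- composite index drives fast page lookups
  );

  -- Fetch one 15-row "page" (rows 31-45) for a session via the index:
  SELECT payload
    FROM query_results
   WHERE session_id = :session_id
     AND row_num BETWEEN 31 AND 45
   ORDER BY row_num;

  -- Garbage-collect a session's rows when the session times out:
  DELETE FROM query_results WHERE session_id = :session_id;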
A database solution is appealing because
you can leverage indexes. A database
can be accessed from any machine on
a server farm, so sessions don't have
to be tied to particular servers.
The devil is always in the details.
Where does the row number come from?
Unfortunately, you usually can't use
the select INTO clause to populate
the query results table. This is because
you can't tell SQL to generate an
incrementing row number based on
the ORDER BY clause. Some
dialects of SQL might actually support
this. Without such support you have to
write code that iterates through
the result set, adds an incrementing
row number column, and
inserts the row into the query results table.
This can be very slow for large result
sets.
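For example, in a dialect that does support numbering rows as part of the query (say, one with an analytic ROW_NUMBER() function), the population step could collapse into a single INSERT ... SELECT along these lines -- a sketch only, reusing the hypothetical query_results table above and an invented source query:

  INSERT INTO query_results (session_id, row_num, payload)
  SELECT :session_id,                                        -- placeholder for the current session
         ROW_NUMBER() OVER (ORDER BY lastname, firstname),   -- numbering follows the sort order
         lastname || ', ' || firstname                       -- whatever display data is needed
    FROM all_emps
   WHERE lastname LIKE '%JONES%';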
It seems natural to use XML to represent
temporary result sets, but generally
this approach doesn't scale.
XML files are fundamentally
slow because the file system can't
index XML documents. Stores like
eXcelon can index XML but all of
the undesirable features of temporary
tables described earlier will creep up.
If your data sets are small,
you could get away with managing
temporary XML files in the file system.
You could try to forego files and
cache documents in memory.
Row-based access is fast when the document
has an ID attribute whose
value indicates the row number (r1, r2,
r3, etc.). XML parsers automatically index ID
attributes for fast access. However,
this solution doesn't work on server farms.
In conclusion...
The short answer is this is not easy.
The longer answer is that the
best solution seems to be to store
query results in a permanent table that
is specifically indexed for page-based access.
The best implementation depends
on the size of the result sets. Database-based
solutions scale up to very large result
sets and scale out to web farms.
You can even pre-populate the
query results table with your most popular
queries. This outperforms every other
approach here, and that's an
understatement. Of course, the assumption is
that the underlying data doesn't change frequently,
which may or may not be correct.
I wish I could say that raw XML technologies
somehow solve this problem. There are limitations
to all tools. XML could be a facilitator,
however: you could build a component that
manages result sets in a database and returns "pages" as
XML. NOW you're talking.
Larry Mason warns:
Now throw in user sorting capability and your row number concept is shot.
Pages are in the eye of the consumer not the producer.
Kevin Williams:
I see this mistake everywhere I look, and it's been a problem since before
the days of XML. The cardinal rule when marshalling data should be
"eliminate unnecessary information as early in the process as possible" -
typically in the relational database layer itself. Otherwise, you're just
chewing up bandwidth and memory, and forcing platforms not well suited to
data manipulation (such as XSLT) to do the work instead...
Francis Norton offers:
I've had good results from a two-step process, with variable-length
items.
Step 1 is to mark up the printable elements by simply adding newPage
(boolean), pageNumber and LineNumber attributes to the printable
elements, using a recursive template with page and line parameters,
which knows how to size the elements.
Step 2 is to merge the data into a format stylesheet, which then becomes
fairly simple - match a page template for each element where newPage =
true(), then for each element with that pageNumber, apply-templates in
layout mode to find an item template which prints and positions the
element correctly.
Joshua Allen goes on:
But ... databases *can* directly perform select * from A
return 50 rows where x > the_last_x_in_my_previous_resultset
With SQL Server you can use the "TOP" operator or
"SET ROWCOUNT" to specify how many rows are returned.
I am certain that DB2 has this capability as well, and
I think Oracle goes even beyond this ("bookmarks" or
something? It has been a long time...)
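A minimal sketch of that "next N rows after the last key seen" idea in SQL Server syntax might look like this (the table, columns, and the @last_userid value are purely illustrative):

  DECLARE @last_userid varchar(30);
  SET @last_userid = 'abjones';        -- last userid shown on the previous page

  SELECT TOP 10 userid, firstname, lastname, phone
    FROM all_emps
   WHERE lastname LIKE '%JONES%'       -- the original query criteria
     AND userid > @last_userid         -- only rows beyond the previous page
   ORDER BY userid;                    -- TOP 10 keeps the first 10 of the ordered rows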
Of course, your example of using a column indexed specifically
for paging is a great strategy. But if you already are returning
the data in some non-random order, then you can also use the above
idea to return pages. Using the above avoids the need to scroll
through the resultset each time, and allows page cutoffs to change
as data gets inserted. Materialized views can alleviate the
pain of joins, and heavy-read situations are not friendly to joins
anyway.
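On the materialized-view point, a hedged, Oracle-flavoured sketch of pre-computing an expensive join so that page queries read one indexable object might be (all object names here are invented, and refresh options vary widely by product):

  CREATE MATERIALIZED VIEW emp_dept_mv AS
  SELECT e.userid, e.lastname, e.firstname, e.phone, d.dept_name
    FROM all_emps e, departments d
   WHERE d.dept_id = e.dept_id;

  -- Paging queries then run against emp_dept_mv instead of repeating the join.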
If you can tolerate a small degree of data staleness, it would
also be possible to pull back the whole *entire* resultset as XML and
cache the individual "pages" as separate XML files on the filesystem.
(I mean, I have yet to see a system that can update the user's
web-page when the underlying data in the database changes, so you
always have *some* degree of staleness). Recognizing this and
the relation to caching leads to all sorts of interesting options.
One of the worst possible choices for paging is to scroll through the
*entire* rowset *every* time a user makes a request. You may have to
do this in some situations, but probably far fewer than most people
realize.
Just some additional info to the original idea of paging via
XSLT -- sometimes people want to do this so that they can
target devices with multiple form factors (i.e. use 30 lines
per page if the browser is Netscape, use 11 lines per page
if it's a Palm VII). One word of caution with this approach is
that usability is so challenging on small devices that
usability concerns usually far outweigh the programmability
concerns that lead people to consider this strategy. You
might make it work, but many people have found that the only
way to make their systems tolerable on small devices is to
custom-build the UI for each device (for example, on a web browser
you can show city and state together for input, but on a WML device
you are better off with one screen that lets you choose a letter
range for the first letter of the city, then a next screen with more
detail, etc.; very difficult to XSLT-ize). So don't be too disappointed
if reality foils the well-laid plans :-)
Mike Kay provides a possible (reasonable) break point:
If you've been following other threads you'll have seen the advice that if
you're getting the data from the database it's best to generate one XML
document for each displayed page of data.
However, at the level of 1000 records, handling a single document is quite
feasible and despite what others have said, if you're doing the processing
in the client I think it's quite a reasonable option, because it reduces the
number of hits on the server. You'll find an example using this sort of
architecture on page 608 of my book*: it's not doing paging in the way you
describe, but a similar kind of interactive navigation within a dataset.
Steve Muench closes with an Oracle comment:
As a final posting, I'll note that
the way with Oracle to achieve this is to use
the "TOP-N" query feature in Oracle8i.
If you are ordering your data by a value that
makes each row distinct (e.g. the primary key
or any other unique key), then to retrieve
the "next 10" rows after a certain key that
was "last" on the current page, you can do: SELECT userid, firstname, lastname, phone
FROM ( SELECT userid, firstname, lastname, phone
FROM all_emps
WHERE lastname like '%JONES%' /* This was the orig query critera */
AND userid > 'abjones' /* Return matches beyond 'abjones' */
ORDER BY userid /* Order the inner query by userid */
)
WHERE ROWNUM <= 10 /* Get the "top-10", i.e. first 10, rows */
I cover this topic of stateless paging in my book on pages 643-655.
In response to the comment:
  The "big" advantage I get by paging and sorting at the browser (client) end is that I do not have to hit the web server/db server again for the same (sorted/paged) data. If I have to save some db hits, I may have to cache the data using sessions.
The tradeoff depends on the probability that the users
will visit all of the rows. If the user is likely to
visit all the rows, then one big hit is probably better
and browser-side XSLT can be helpful.
If the user is likely to conclude after paging through
a couple of pages that they *should* have refined their
query criteria a little better (imagine a Google.com
search where you typically don't go beyond the first
couple of pages of "most relevant" hits), then downloading
all the hits might be a bad idea. Depends on the app.