How to do progressive loading with MuPDF.
=========================================

What is progressive loading?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The idea of progressive loading is that as you download a PDF file
into a browser, you can display the pages as they appear.

MuPDF can make use of two different mechanisms to achieve this. The
first relies on the file being "linearized", the second relies on
the caller of MuPDF having fine control over the http fetch and on
the server supporting byte-range fetches.

For optimum performance a file should be both linearized and be
available over a byte-range supporting link, but benefits can still
be had with either one of these alone.


Progressive download using "linearized" files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Adobe defines "linearized" PDFs as being ones that have both a
specific layout of objects and a small amount of extra
information to help avoid seeking within a file. The stated aim
is to deliver the first page of a document in advance of the whole
document downloading, whereupon subsequent pages will become
available. Adobe also refers to these as "Optimized for fast web
view" or "Web Optimized".

In fact, the standard outlines (poorly) a mechanism by which 'hints'
can be included that enable the subsequent pages to be found within
the file too. Unfortunately this is very poorly supported by
many tools, and so the hints have to be treated with suspicion.

MuPDF will attempt to use hints if they are available, but will also
use a linear search of the file to discover pages if not. This means
that the first page will be displayed quickly, and then subsequent
ones will appear with 'incomplete' renderings that improve over time
as more and more resources are gradually delivered.

Essentially the file starts with a slightly modified header, and the
first object in the file is a special one (the linearization object)
that a) indicates that the file is linearized, and b) gives some
useful information (like the number of pages in the file etc).

This object is then followed by all the objects required for the
first page, then the "hint stream", then sets of objects for each
subsequent page in turn, then shared objects required for those
pages, then various other random things.

[Yes, really. While page 1 is sent with all the objects that it
uses, shared or otherwise, subsequent pages do not get shared
resources until after all the unshared page objects have been
sent.]


The Hint Stream
~~~~~~~~~~~~~~~

Adobe intended the Hint Stream to be useful to facilitate the display
of subsequent pages, but it has never used it. Consequently you
can't trust people to write it properly - indeed Adobe outputs
something that doesn't quite conform to the spec.

As a result, very few people actually use it. MuPDF will use it
after sanity checking the values, and should cope with illegal/
incorrect values.


So how does MuPDF handle progressive loading?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

MuPDF has made various extensions to its mechanisms for handling
progressive loading.

 + Progressive streams

   At its lowest level MuPDF reads file data from an fz_stream,
   using the fz_open_document_with_stream call. (fz_open_document
   is implemented by calling this). We have extended the fz_stream
   slightly, giving the system a way to ask for meta information
   (or perform meta operations) on a stream.

   Using this mechanism MuPDF can query:

   + whether a stream is progressive or not (i.e. whether the
     entire stream is accessible immediately)
   + what the length of a stream should ultimately be (which an
     http fetcher should know from the Content-Length header).

   With this information MuPDF can decide whether to use its normal
   object reading code, or whether to make use of a linearized
   object. Knowing the length enables us to check it against the
   length value given in the linearized object - if these differ,
   the assumption is that an incremental save has taken place, thus
   the file is no longer linearized.

   When data is pulled from a progressive stream, if we attempt to
   read data that is not currently available, the stream should
   throw an FZ_ERROR_TRYLATER error. This particular error code
   will be interpreted by the caller as an indication that it
   should retry the parsing of the current objects at a later time.

   When a MuPDF call is made on a progressive stream, such as
   fz_open_document_with_stream, or fz_load_page, the caller should
   be prepared to handle an FZ_ERROR_TRYLATER error as meaning that
   more data is required before it can continue. No indication is
   directly given as to exactly how much more data is required, but
   as the caller will be implementing the progressive fz_stream
   that it has passed into MuPDF to start with, it can reasonably
   be expected to figure out an estimate for itself.
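
The retry behaviour described above can be illustrated with a small
self-contained sketch. It does not use the MuPDF API: prog_source,
prog_read, prog_feed, read_with_retry and the READ_* codes are all
hypothetical names invented for this example, and a plain return code
stands in for MuPDF throwing FZ_ERROR_TRYLATER.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch; MuPDF itself signals the
 * "not yet" case by throwing FZ_ERROR_TRYLATER. */
enum { READ_OK = 0, READ_TRYLATER = 1 };

/* Hypothetical progressive source: 'data' is the complete file, but
 * only the first 'available' bytes have arrived so far. */
typedef struct {
    const unsigned char *data;
    size_t total;
    size_t available;
} prog_source;

/* Copy 'len' bytes starting at 'offset', or report that the caller
 * must retry later because part of the range has not arrived yet. */
static int prog_read(const prog_source *src, size_t offset, size_t len,
                     unsigned char *out)
{
    assert(src->available <= src->total);
    if (offset > src->available || len > src->available - offset)
        return READ_TRYLATER;
    memcpy(out, src->data + offset, len);
    return READ_OK;
}

/* Simulate more of the download arriving. */
static void prog_feed(prog_source *src, size_t nbytes)
{
    size_t remaining = src->total - src->available;
    src->available += nbytes < remaining ? nbytes : remaining;
}

/* Caller-side loop: repeat the same read each time more data arrives.
 * Returns the number of attempts made, or 0 if the requested range
 * lies beyond the end of the file and so can never succeed. */
static int read_with_retry(prog_source *src, size_t offset, size_t len,
                           unsigned char *out, size_t step)
{
    int attempts = 1;
    assert(step > 0);
    if (offset > src->total || len > src->total - offset)
        return 0;
    while (prog_read(src, offset, len, out) == READ_TRYLATER) {
        prog_feed(src, step); /* stands in for waiting on the network */
        attempts++;
    }
    return attempts;
}
```

In a real viewer the wait would be driven by the download, not by
a call such as prog_feed, and the retried operation would be a whole
MuPDF call (opening the document, loading a page) and not a raw read.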

 + Cookie

   Once a page has been loaded, if its contents are to be 'run'
   as normal (using e.g. fz_run_page) any error (such as failing
   to read a font, or an image, or even a content stream belonging
   to the page) will result in a rendering that aborts with an
   FZ_ERROR_TRYLATER error. The caller can catch this and display
   a placeholder instead.

   If each page's data was entirely self-contained and sent in
   sequence this would perhaps be acceptable, with each page
   appearing one after the other. Unfortunately, the linearization
   procedure as laid down by Adobe does NOT do this: objects shared
   between multiple pages (other than the first) are not sent with
   the pages themselves, but rather AFTER all the pages have been
   sent.

   This means that a document that has a title page, then contents
   that share a font used on pages 2 onwards, will not be able to
   correctly display page 2 until after the font has arrived in
   the file, which will not be until all the page data has been
   sent.

   To mitigate this, MuPDF provides a way whereby callers
   can indicate that they are prepared to accept an 'incomplete'
   rendering of the file (perhaps with missing images, or with
   substitute fonts).

   Callers prepared to tolerate such renderings should set the
   'incomplete_ok' flag in the cookie, then call fz_run_page etc
   as normal. If an FZ_ERROR_TRYLATER error is thrown at any point
   during the page rendering, the error will be swallowed, the
   'incomplete' field in the cookie will become non-zero and
   rendering will continue. When control returns to the caller
   the caller can check the value of the 'incomplete' field and
   know that the rendering it received is not authoritative.
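
As a rough model of the cookie behaviour just described, the sketch
below renders a page from a list of resources, some of which have not
arrived. It is not MuPDF code: sketch_cookie, run_page_sketch and the
RUN_* codes are hypothetical, and only the two field names
'incomplete_ok' and 'incomplete' are taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two cookie fields discussed above. */
typedef struct {
    int incomplete_ok; /* set by caller: partial renderings are acceptable */
    int incomplete;    /* set by renderer: number of resources skipped */
} sketch_cookie;

/* Hypothetical result codes; RUN_TRYLATER models FZ_ERROR_TRYLATER. */
enum { RUN_OK = 0, RUN_TRYLATER = 1 };

/* present[i] is non-zero if the i-th resource needed by the page
 * (a font, an image, a content stream) has already arrived.
 * On return *drawn holds the number of resources actually used. */
static int run_page_sketch(const int *present, size_t count,
                           sketch_cookie *cookie, size_t *drawn)
{
    size_t i;
    assert(present != NULL && drawn != NULL);
    *drawn = 0;
    for (i = 0; i < count; i++) {
        if (!present[i]) {
            if (cookie == NULL || !cookie->incomplete_ok)
                return RUN_TRYLATER; /* strict: abort the rendering */
            cookie->incomplete++;    /* tolerant: note it and carry on */
            continue;
        }
        (*drawn)++;
    }
    return RUN_OK;
}
```

A caller that sees a non-zero 'incomplete' count after a successful
run would typically show the result and schedule a re-render once
more data has arrived.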


Progressive loading using byte range requests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the caller has control over the http fetch, then it is possible
to use byte range requests to fetch the document 'out of order'.
This enables non-linearized files to be progressively displayed as
they download, and fetches complete renderings of pages earlier than
would otherwise be the case. This process requires no changes within
MuPDF itself, but rather in the way the progressive stream learns
from the attempts MuPDF makes to fetch data.

Consider, for example, an attempt to fetch a hypothetical file from
a server.

 + The initial http request for the document is sent with a "Range:"
   header to pull down the first (say) 4k of the file.

 + As soon as we get the header in from this initial request, we can
   respond to meta stream operations to give the length, and whether
   byte requests are accepted.

   - If the header indicates that byte ranges are acceptable the
     stream proceeds to go into a loop fetching chunks of the file
     at a time (not necessarily in-order). Otherwise the server
     will ignore the Range: header, and just serve the whole file.

   - If the header indicates a content-length, the stream returns
     that.

 + MuPDF can then decide how to proceed based upon these flags and
   whether the file is linearized or not. (If the file contains a
   linearized object, and the content length matches, then the file
   is considered to be linear, otherwise it is not).
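
The decision in the last step can be written down as a small
predicate. This is a sketch under assumptions, not MuPDF's own code:
open_info and treat_as_linear are hypothetical names, and treating an
unknown content length as "not linear" is a choice made for this
example that the text above does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of what the stream and the parser reported. */
typedef struct {
    int  has_linearization_object; /* first object marks the file linearized */
    long linearized_length;        /* file length recorded in that object */
    long content_length;           /* length reported by the stream, -1 if unknown */
} open_info;

/* Returns 1 if the file should be read as linear, 0 otherwise. */
static int treat_as_linear(const open_info *info)
{
    assert(info != NULL);
    if (!info->has_linearization_object)
        return 0;
    if (info->content_length < 0)
        return 0; /* assumption of this sketch: unknown length is not trusted */
    /* A mismatch suggests an incremental save was appended later. */
    return info->linearized_length == info->content_length;
}
```

For instance, a file whose linearization object records 5000 bytes
but which is served with a content length of 6200 would be treated as
non-linear.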

   If the file is linear:

   - We proceed to read objects out of the file as it downloads.
     This will provide us the first page and all its resources. It
     will also enable us to read the hint streams (if present).

   - Once we have read the hint streams, we unpack (and sanity
     check) them to give us a map of where in the file each object
     is predicted to live, and which objects are required for each
     page. If any of these values are out of range, we treat the
     file as if there were no hint streams.

   - If we have hints, any attempt to load a subsequent page will
     cause MuPDF to attempt to read exactly the objects required.
     This will cause a sequence of seeks in the fz_stream followed
     by reads. If the stream does not have the data to satisfy that
     request yet, the stream code should remember the location that
     was requested (and fetch that block in the background so that
     future retries will succeed) and should raise an
     FZ_ERROR_TRYLATER error.

     [Typically therefore when we jump to a page in a linear file
     on a byte request capable link, we will quickly see a rough
     rendering, which will improve fairly fast as images and fonts
     arrive.]

   - Regardless of whether we have hints or byte requests, on every
     fz_load_page call MuPDF will attempt to process more of the
     stream (that is assumed to be being downloaded in the
     background). As linearized files are guaranteed to have pages
     in order, pages will gradually become available. In the absence
     of byte requests and hints however, we have no way of getting
     resources early, so the renderings for these pages will remain
     incomplete until much more of the file has arrived.

     [Typically therefore when we jump to a page in a linear file
     on a non byte request capable link, we will see a rough
     rendering for that page as soon as data arrives for it (which
     will typically take much longer than would be the case with
     byte range capable downloads), and that will improve much more
     slowly as images and fonts may not appear until almost the
     whole file has arrived.]

   - When the whole file has arrived, then we will attempt to read
     the outlines for the file.
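
The hint sanity check mentioned in the list above (discard all hints
if any value is out of range) might look something like the sketch
below. The page_hint layout and the hints_usable name are
hypothetical and much simpler than a real hint table; the only idea
taken from the text is "any out of range value means no hints".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified per-page hint. */
typedef struct {
    long offset; /* predicted byte offset of the page's first object */
    long length; /* predicted byte length of the page's objects */
} page_hint;

/* Returns 1 if every hint lies inside a file of file_length bytes,
 * or 0 if the whole set should be ignored. */
static int hints_usable(const page_hint *hints, size_t count,
                        long file_length)
{
    size_t i;
    assert(hints != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (hints[i].offset < 0 || hints[i].length < 0)
            return 0;
        if (hints[i].offset > file_length)
            return 0;
        /* Written as a subtraction so the check cannot overflow. */
        if (hints[i].length > file_length - hints[i].offset)
            return 0;
    }
    return 1;
}
```

Rejecting the whole set when one entry is bad matches the cautious
stance taken above: a single impossible value suggests that the tool
which wrote the hints cannot be trusted for the others either.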

   For a non-linearized PDF on a byte request capable stream:

   - MuPDF will immediately seek to the end of the file to attempt
     to read the trailer. This will fail with an FZ_ERROR_TRYLATER
     due to the data not being here yet, but the stream code should
     remember that this data is required and it should be prioritized
     in the background fetch process.

   - Repeated attempts to open the stream should therefore
     eventually succeed. As MuPDF jumps through the file trying to
     read first the xrefs, then the page tree objects, then the page
     contents themselves etc, the background fetching process will
     be driven by the attempts to read the file in the foreground.

     [Typically therefore the opening of a non-linearized file will
     be slower than a linearized one, as the xrefs/page trees for a
     non-linear file can be 20%+ of the file data. Once past this
     initial point however, pages and data can be pulled from the
     file almost as fast as with a linearized file.]

   For a non-linearized PDF on a non-byte request capable stream:

   - MuPDF will immediately seek to the end of the file to attempt
     to read the trailer. This will fail with an FZ_ERROR_TRYLATER
     due to the data not being here yet. Subsequent retries will
     continue to fail until the whole file has arrived, whereupon
     the whole file will be instantly available.

     [This is the worst case situation - nothing at all can be
     displayed until the entire file has downloaded.]

A typical structure for a fetcher process (see curl-stream.c in
mupdf-curl as an example) might therefore look like this:

 + We consider the file as an (initially empty) buffer which we are
   filling by making requests. In order to ensure that we make
   maximum use of our download link, we ensure that whenever
   one request finishes, we immediately launch another. Further, to
   avoid the overheads for the request/response headers being too
   large, we may want to divide the file into 'chunks', perhaps 4k
   or 32k in size.

 + We can then have a receiver process that sits there in a loop
   requesting chunks to fill this buffer. In the absence of
   any other impetus the receiver should request the next 'chunk'
   of data from the file that it does not yet have, following the last
   fill point. Initially we start the fill point at the beginning of
   the file, but this will move around based on the requests made of
   the progressive stream.

 + Whenever MuPDF attempts to read from the stream, we check to see if
   we have data for this area of the file already. If we do, we can
   return it. If not, we remember this as the next "fill point" for our
   receiver process and throw an FZ_ERROR_TRYLATER error.
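
The bookkeeping behind such a fetcher can be sketched without any
networking. This is an illustration only and is not taken from
curl-stream.c: fetch_state, the fetch_* functions, the FETCH_* codes,
the 4k chunk size and the 64 chunk limit are all assumptions made for
the example. Wrapping round to the start of the file once the end is
reached is also an assumption; the text above only says the receiver
continues from the last fill point.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_SIZE 4096u /* assumed chunk size for this sketch */
#define MAX_CHUNKS 64u   /* assumed upper bound to keep the sketch static */

/* Hypothetical result codes; FETCH_TRYLATER models FZ_ERROR_TRYLATER. */
enum { FETCH_OK = 0, FETCH_TRYLATER = 1 };

typedef struct {
    size_t file_length;
    size_t chunk_count;
    unsigned char have[MAX_CHUNKS]; /* 1 once a chunk has arrived */
    size_t fill_point;              /* chunk the receiver should try next */
} fetch_state;

static void fetch_init(fetch_state *st, size_t file_length)
{
    size_t i;
    st->file_length = file_length;
    st->chunk_count = (file_length + CHUNK_SIZE - 1) / CHUNK_SIZE;
    assert(st->chunk_count <= MAX_CHUNKS);
    for (i = 0; i < MAX_CHUNKS; i++)
        st->have[i] = 0;
    st->fill_point = 0; /* start filling from the beginning of the file */
}

/* Receiver side: the next missing chunk at or after the fill point,
 * wrapping round. Returns chunk_count once nothing is missing. */
static size_t fetch_next_chunk(const fetch_state *st)
{
    size_t n;
    for (n = 0; n < st->chunk_count; n++) {
        size_t c = (st->fill_point + n) % st->chunk_count;
        if (!st->have[c])
            return c;
    }
    return st->chunk_count;
}

/* Receiver side: record that a chunk has been downloaded. */
static void fetch_chunk_arrived(fetch_state *st, size_t chunk)
{
    assert(chunk < st->chunk_count);
    st->have[chunk] = 1;
}

/* Reader side: can [offset, offset+len) be served from the buffer?
 * If not, move the fill point to the first missing chunk of the range. */
static int fetch_check_read(fetch_state *st, size_t offset, size_t len)
{
    size_t first, last, c;
    if (len == 0)
        return FETCH_OK;
    assert(offset < st->file_length && len <= st->file_length - offset);
    first = offset / CHUNK_SIZE;
    last = (offset + len - 1) / CHUNK_SIZE;
    for (c = first; c <= last; c++) {
        if (!st->have[c]) {
            st->fill_point = c;
            return FETCH_TRYLATER;
        }
    }
    return FETCH_OK;
}
```

A receiver loop would call fetch_next_chunk, issue a "Range:" request
for that chunk, and call fetch_chunk_arrived when it completes, while
the stream's read callback would call fetch_check_read before copying
data out of the buffer.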