How to do progressive loading with MuPDF.
=========================================

What is progressive loading?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The idea of progressive loading is that as you download a PDF file
into a browser, you can display the pages as they appear.

MuPDF can make use of two different mechanisms to achieve this. The
first relies on the file being "linearized"; the second relies on
the caller of MuPDF having fine control over the HTTP fetch and on
the server supporting byte-range fetches.

For optimum performance a file should be both linearized and
available over a link that supports byte-range requests, but
benefits can still be had with either one of these alone.


Progressive download using "linearized" files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Adobe defines "linearized" PDFs as being ones that have both a
specific layout of objects and a small amount of extra
information to help avoid seeking within a file. The stated aim
is to deliver the first page of a document in advance of the whole
document downloading, whereupon subsequent pages will become
available. Adobe also refers to these as "Optimized for fast web
view" or "Web Optimized".

In fact, the standard outlines (poorly) a mechanism by which 'hints'
can be included that enable the subsequent pages to be found within
the file too. Unfortunately this is very poorly supported by
many tools, and so the hints have to be treated with suspicion.

MuPDF will attempt to use hints if they are available, but will also
use a linear search of the file to discover pages if not. This means
that the first page will be displayed quickly, and then subsequent
ones will appear with 'incomplete' renderings that improve over time
as more and more resources are gradually delivered.

Essentially the file starts with a slightly modified header, and the
first object in the file is a special one (the linearization object)
that a) indicates that the file is linearized, and b) gives some
useful information (such as the number of pages in the file).

This object is then followed by all the objects required for the
first page, then the "hint stream", then sets of objects for each
subsequent page in turn, then shared objects required for those
pages, then various other miscellaneous objects.

[Yes, really. While page 1 is sent with all the objects that it
uses, shared or otherwise, subsequent pages do not get shared
resources until after all the unshared page objects have been
sent.]
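
As a rough illustration of what the linearization object carries,
the following self-contained C sketch scans the start of a file for
the /Linearized key, reads the /L (file length) and /N (page count)
entries, and compares /L with a length supplied by the caller (for
example one taken from a Content-Length header). The names lin_info,
read_int_after, parse_linearized and is_still_linearized are made up
for this example and are not MuPDF functions; the parsing is
deliberately naive and assumes the bytes are held in a NUL-terminated
buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the two values this sketch looks at. */
typedef struct {
    long file_len;   /* value of /L: length of the whole file */
    long page_count; /* value of /N: number of pages */
} lin_info;

/* Return the integer following `key` in `buf`, or -1 if there is none. */
static long read_int_after(const char *buf, const char *key)
{
    size_t klen = strlen(key);
    const char *p = buf;
    while ((p = strstr(p, key)) != NULL) {
        const char *q = p + klen;
        char *end;
        /* Skip longer names that merely start with key, such as
         * /Linearized when we are looking for /L. */
        if (*q == ' ' || (*q >= '0' && *q <= '9')) {
            long v = strtol(q, &end, 10);
            if (end != q)
                return v;
        }
        p = q;
    }
    return -1;
}

/* Returns 1 and fills *out if buf holds a plausible linearization object. */
static int parse_linearized(const char *buf, lin_info *out)
{
    assert(buf != NULL && out != NULL);
    if (read_int_after(buf, "/Linearized") < 0)
        return 0;
    out->file_len = read_int_after(buf, "/L");
    out->page_count = read_int_after(buf, "/N");
    return out->file_len > 0 && out->page_count > 0;
}

/* Treat the file as linearized only if /L matches the real length. */
static int is_still_linearized(const char *buf, long actual_len)
{
    lin_info info;
    return parse_linearized(buf, &info) && info.file_len == actual_len;
}
```

A caller would hand is_still_linearized the first kilobyte or so of
the file together with the length reported by the transport, and fall
back to ordinary (non-linear) handling when it returns 0.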


The Hint Stream
~~~~~~~~~~~~~~~

Adobe intended the hint stream to facilitate the display of
subsequent pages, but has never used it itself. Consequently you
can't trust people to write it properly - indeed Adobe outputs
something that doesn't quite conform to the spec.

As a result, very few people actually use it. MuPDF will use it
after sanity checking the values, and should cope with illegal or
incorrect values.
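
The kind of sanity check meant here can be sketched as follows. This
is an illustration only: page_hint and hints_plausible are
hypothetical names, and real hint tables hold more than one offset
and one length per page. The point is simply that every predicted
location is checked against the known file length before any of the
hints are trusted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical decoded hint: where a page's objects are predicted to be. */
typedef struct {
    long offset; /* predicted byte offset of the page's first object */
    long length; /* predicted number of bytes the page occupies */
} page_hint;

/* Returns 1 if every hint lies inside the file, 0 if the hints
 * should be ignored altogether. */
static int hints_plausible(const page_hint *hints, size_t n, long file_len)
{
    size_t i;
    assert(hints != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (hints[i].offset < 0 || hints[i].length <= 0)
            return 0;
        /* Compared this way round so that offset + length cannot overflow. */
        if (hints[i].offset > file_len ||
            hints[i].length > file_len - hints[i].offset)
            return 0;
    }
    return 1;
}
```

Rejecting the whole table on the first bad entry is one possible
policy, chosen here because a single out-of-range value suggests the
writer got the table wrong in general.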


So how does MuPDF handle progressive loading?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

MuPDF has made various extensions to its mechanisms for handling
progressive loading.

  + Progressive streams

    At its lowest level MuPDF reads file data from an fz_stream,
    using the fz_open_document_with_stream call. (fz_open_document
    is implemented by calling this). We have extended the fz_stream
    slightly, giving the system a way to ask for meta information
    (or perform meta operations) on a stream.

    Using this mechanism MuPDF can query:

      + whether a stream is progressive or not (i.e. whether the
        entire stream is accessible immediately), and
      + what the length of a stream should ultimately be (which an
        HTTP fetcher should know from the Content-Length header).

    With this information MuPDF can decide whether to use its normal
    object reading code, or whether to make use of the linearization
    object. Knowing the length enables us to check it against the
    length value given in the linearization object - if these differ,
    the assumption is that an incremental save has taken place, and
    thus the file is no longer linearized.

    When data is pulled from a progressive stream, if we attempt to
    read data that is not currently available, the stream should
    throw an FZ_ERROR_TRYLATER error. This particular error code
    will be interpreted by the caller as an indication that it
    should retry the parsing of the current objects at a later time.

    When a MuPDF call is made on a progressive stream, such as
    fz_open_document_with_stream, or fz_load_page, the caller should
    be prepared to handle an FZ_ERROR_TRYLATER error as meaning that
    more data is required before it can continue. No indication is
    directly given as to exactly how much more data is required, but
    as the caller will be implementing the progressive fz_stream
    that it has passed into MuPDF to start with, it can reasonably
    be expected to figure out an estimate for itself.
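
    The retry pattern described above can be sketched without MuPDF
    as follows. The code is a stand-in, not MuPDF's API:
    ST_TRY_LATER plays the role of FZ_ERROR_TRYLATER, fake_stream,
    load_page and receive_chunk are hypothetical, and errors are
    modelled as return codes, whereas MuPDF itself reports them
    through its fz_try/fz_catch exception macros.

```c
#include <assert.h>
#include <stddef.h>

enum { ST_OK = 0, ST_TRY_LATER = 1 }; /* stand-in for FZ_ERROR_TRYLATER */

/* Hypothetical progressive stream: only the first `available`
 * bytes of `total` have arrived so far. */
typedef struct {
    size_t available;
    size_t total;
} fake_stream;

/* Pretend that loading a page needs every byte up to `needed`. */
static int load_page(const fake_stream *s, size_t needed)
{
    assert(needed <= s->total);
    return (s->available >= needed) ? ST_OK : ST_TRY_LATER;
}

/* Simulate the background download delivering one more chunk. */
static void receive_chunk(fake_stream *s, size_t chunk)
{
    s->available += chunk;
    if (s->available > s->total)
        s->available = s->total;
}

/* Retry until the page loads; returns the number of attempts made. */
static int load_page_with_retry(fake_stream *s, size_t needed, size_t chunk)
{
    int attempts = 1;
    assert(chunk > 0);
    while (load_page(s, needed) == ST_TRY_LATER) {
        /* A real viewer would return to its event loop here and
         * retry once the fetcher reports that more data is in. */
        receive_chunk(s, chunk);
        attempts++;
    }
    return attempts;
}
```

    In a real application the loop would not block: the failed call
    is simply abandoned and issued again later, which is why the
    stream needs to remember what was asked for.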

  + Cookie

    Once a page has been loaded, if its contents are to be 'run'
    as normal (using e.g. fz_run_page) any failure to read data that
    has not yet arrived (such as a font, or an image, or even a
    content stream belonging to the page) will result in a rendering
    that aborts with an FZ_ERROR_TRYLATER error. The caller can catch
    this and display a placeholder instead.

    If each page's data were entirely self-contained and sent in
    sequence this would perhaps be acceptable, with each page
    appearing one after the other. Unfortunately, the linearization
    procedure as laid down by Adobe does NOT do this: objects shared
    between multiple pages (other than the first) are not sent with
    the pages themselves, but rather AFTER all the pages have been
    sent.

    This means that a document that has a title page, then contents
    that share a font used on pages 2 onwards, will not be able to
    correctly display page 2 until after the font has arrived in
    the file, which will not be until all the page data has been
    sent.

    To mitigate this, MuPDF provides a way for callers to indicate
    that they are prepared to accept an 'incomplete' rendering of
    the file (perhaps with missing images, or with substitute
    fonts).

    Callers prepared to tolerate such renderings should set the
    'incomplete_ok' flag in the cookie, then call fz_run_page etc
    as normal. If an FZ_ERROR_TRYLATER error is thrown at any point
    during the page rendering, the error will be swallowed, the
    'incomplete' field in the cookie will become non-zero and
    rendering will continue. When control returns to the caller
    the caller can check the value of the 'incomplete' field and
    know that the rendering it received is not authoritative.
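
    The behaviour of the two cookie fields can be modelled in a few
    lines of plain C. This is a simulation written for illustration,
    not MuPDF code: the struct below borrows only the incomplete_ok
    and incomplete field names from the text above, while run_page,
    resource and the status codes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { ST_OK = 0, ST_TRY_LATER = 1 }; /* stand-in for FZ_ERROR_TRYLATER */

/* Cut-down model of the cookie: just the two fields discussed above. */
typedef struct {
    int incomplete_ok; /* set by the caller: partial renderings are fine */
    int incomplete;    /* set by the renderer: non-zero if anything was skipped */
} cookie;

/* Hypothetical page resource (font, image, ...) that may not be here yet. */
typedef struct {
    int arrived;
} resource;

/* "Render" a page; *drawn receives the number of resources actually used. */
static int run_page(const resource *res, size_t n, cookie *ck, size_t *drawn)
{
    size_t i;
    assert(drawn != NULL);
    *drawn = 0;
    for (i = 0; i < n; i++) {
        if (!res[i].arrived) {
            if (ck == NULL || !ck->incomplete_ok)
                return ST_TRY_LATER; /* abort: caller shows a placeholder */
            ck->incomplete++;        /* swallow the error and carry on */
            continue;
        }
        (*drawn)++;
    }
    return ST_OK;
}
```

    A caller that sees a non-zero incomplete field after a successful
    run would typically display the result and schedule a re-render
    for when more data has arrived.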


Progressive loading using byte range requests
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the caller has control over the HTTP fetch, then it is possible
to use byte range requests to fetch the document 'out of order'.
This enables non-linearized files to be progressively displayed as
they download, and makes complete renderings of pages available
earlier than would otherwise be the case. This process requires no
changes within MuPDF itself, but rather in the way the progressive
stream learns from the attempts MuPDF makes to fetch data.

Consider, for example, an attempt to fetch a hypothetical file from
a server.

 + The initial HTTP request for the document is sent with a "Range:"
   header to pull down the first (say) 4k of the file.

 + As soon as the response headers arrive from this initial request,
   we can respond to meta stream operations to give the length, and
   whether byte requests are accepted.

   - If the headers indicate that byte ranges are acceptable, the
     stream goes into a loop fetching the file a chunk at a time
     (not necessarily in order). Otherwise the server will ignore
     the Range: header, and just serve the whole file.

   - If the headers indicate a content length, the stream returns
     that.

 + MuPDF can then decide how to proceed based upon these flags and
   whether the file is linearized or not. (If the file contains a
   linearization object, and the content length matches, then the
   file is considered to be linear; otherwise it is not.)

   If the file is linear:

   - we proceed to read objects out of the file as it downloads.
     This will provide us with the first page and all its resources.
     It will also enable us to read the hint streams (if present).

   - Once we have read the hint streams, we unpack (and sanity
     check) them to give us a map of where in the file each object
     is predicted to live, and which objects are required for each
     page. If any of these values are out of range, we treat the
     file as if there were no hint streams.

   - If we have hints, any attempt to load a subsequent page will
     cause MuPDF to attempt to read exactly the objects required.
     This will cause a sequence of seeks in the fz_stream followed
     by reads. If the stream does not have the data to satisfy that
     request yet, the stream code should remember the location that
     was requested (and fetch that block in the background so that
     future retries will succeed) and should raise an
     FZ_ERROR_TRYLATER error.

     [Typically therefore when we jump to a page in a linear file
     on a byte request capable link, we will quickly see a rough
     rendering, which will improve fairly fast as images and fonts
     arrive.]

   - Regardless of whether we have hints or byte requests, on every
     fz_load_page call MuPDF will attempt to process more of the
     stream (which is assumed to be downloading in the background).
     As linearized files are guaranteed to have pages in order,
     pages will gradually become available. In the absence of byte
     requests and hints however, we have no way of getting
     resources early, so the renderings for these pages will remain
     incomplete until much more of the file has arrived.

     [Typically therefore when we jump to a page in a linear file
     on a non byte request capable link, we will see a rough
     rendering for that page as soon as data arrives for it (which
     will typically take much longer than would be the case with
     byte range capable downloads), and that will improve much more
     slowly as images and fonts may not appear until almost the
     whole file has arrived.]

   - When the whole file has arrived, then we will attempt to read
     the outlines for the file.

   For a non-linearized PDF on a byte request capable stream:

   - MuPDF will immediately seek to the end of the file to attempt
     to read the trailer. This will fail with an FZ_ERROR_TRYLATER
     due to the data not being here yet, but the stream code should
     remember that this data is required and it should be prioritized
     in the background fetch process.

   - Repeated attempts to open the stream should therefore
     eventually succeed. As MuPDF jumps through the file trying to
     read first the xrefs, then the page tree objects, then the page
     contents themselves etc, the background fetching process will
     be driven by the attempts to read the file in the foreground.

     [Typically therefore the opening of a non-linearized file will
     be slower than a linearized one, as the xrefs/page trees for a
     non-linear file can be 20%+ of the file data. Once past this
     initial point however, pages and data can be pulled from the
     file almost as fast as with a linearized file.]
     
   For a non-linearized PDF on a non-byte request capable stream:

   - MuPDF will immediately seek to the end of the file to attempt
     to read the trailer. This will fail with an FZ_ERROR_TRYLATER
     due to the data not being here yet. Subsequent retries will
     continue to fail until the whole file has arrived, whereupon
     the whole file will be instantly available.

     [This is the worst case situation - nothing at all can be
     displayed until the entire file has downloaded.]
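
On the HTTP side of the walkthrough above, a ranged request carries
a "Range: bytes=first-last" header, and a server that honours it
replies with status 206 and a "Content-Range: bytes first-last/total"
header from which the total length can also be read. The sketch
below formats the one and parses the other. format_range and
parse_content_range are hypothetical helper names, the header name is
matched case-sensitively for brevity (real HTTP header names are
case-insensitive), and a real fetcher would normally leave most of
this to its HTTP library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "Range: bytes=first-last" for chunk number `chunk` into buf.
 * Returns 1 on success, 0 if buf is too small. */
static int format_range(char *buf, size_t size, long chunk, long chunk_size)
{
    long first, last;
    int n;
    assert(chunk >= 0 && chunk_size > 0);
    first = chunk * chunk_size;
    last = first + chunk_size - 1; /* byte ranges are inclusive */
    n = snprintf(buf, size, "Range: bytes=%ld-%ld", first, last);
    return n > 0 && (size_t)n < size;
}

/* Parse "Content-Range: bytes first-last/total".
 * Returns 1 and fills the outputs on success, 0 otherwise
 * (including the variant in which the total is given as unknown). */
static int parse_content_range(const char *line,
                               long *first, long *last, long *total)
{
    static const char prefix[] = "Content-Range:";
    size_t plen = strlen(prefix);
    if (strncmp(line, prefix, plen) != 0)
        return 0;
    if (sscanf(line + plen, " bytes %ld-%ld/%ld", first, last, total) != 3)
        return 0;
    return *first >= 0 && *first <= *last && *last < *total;
}
```

If the reply to the first ranged request comes back without a usable
Content-Range header, the stream can conclude that the server ignored
the Range: header and is sending the whole file.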

  A typical structure for a fetcher process (see curl-stream.c in
  mupdf-curl as an example) might therefore look like this:

 + We consider the file as an (initially empty) buffer which we are
   filling by making requests. In order to ensure that we make
   maximum use of our download link, we ensure that whenever
   one request finishes, we immediately launch another. Further, to
   avoid the overheads for the request/response headers being too
   large, we may want to divide the file into 'chunks', perhaps 4k
   or 32k in size.

 + We can then have a receiver process that sits there in a loop
   requesting chunks to fill this buffer. In the absence of
   any other impetus the receiver should request the next 'chunk'
   of data from the file that it does not yet have, following the last
   fill point. Initially we start the fill point at the beginning of
   the file, but this will move around based on the requests made of
   the progressive stream.

 + Whenever MuPDF attempts to read from the stream, we check to see if
   we have data for this area of the file already. If we do, we can
   return it. If not, we remember this as the next "fill point" for our
   receiver process and throw an FZ_ERROR_TRYLATER error.
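
A minimal model of this structure, with the network replaced by a
function call, might look like the following. All of the names
(chunked_file, stream_read, receiver_step) are invented for this
sketch and do not correspond to curl-stream.c, ST_TRY_LATER stands in
for FZ_ERROR_TRYLATER, and wrapping round to the start of the file
once the end is reached is an assumption of the sketch rather than
something stated above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHUNKS 64

enum { ST_OK = 0, ST_TRY_LATER = 1 }; /* stand-in for FZ_ERROR_TRYLATER */

typedef struct {
    size_t chunk_size;
    size_t n_chunks;                /* chunks in the file, <= MAX_CHUNKS */
    size_t fill_point;              /* chunk the receiver should try next */
    unsigned char have[MAX_CHUNKS]; /* 1 once a chunk has arrived */
} chunked_file;

static void chunked_file_init(chunked_file *f, size_t file_len, size_t chunk_size)
{
    size_t i;
    assert(chunk_size > 0);
    f->chunk_size = chunk_size;
    f->n_chunks = (file_len + chunk_size - 1) / chunk_size;
    assert(f->n_chunks <= MAX_CHUNKS);
    f->fill_point = 0;
    for (i = 0; i < MAX_CHUNKS; i++)
        f->have[i] = 0;
}

/* Foreground: MuPDF wants bytes [offset, offset + len). On a miss,
 * move the fill point to the first missing chunk and say "try later". */
static int stream_read(chunked_file *f, size_t offset, size_t len)
{
    size_t c, first, last;
    if (len == 0)
        return ST_OK;
    first = offset / f->chunk_size;
    last = (offset + len - 1) / f->chunk_size;
    assert(last < f->n_chunks);
    for (c = first; c <= last; c++) {
        if (!f->have[c]) {
            f->fill_point = c;
            return ST_TRY_LATER;
        }
    }
    return ST_OK;
}

/* Background: "fetch" the next missing chunk at or after the fill
 * point. Returns the chunk index, or -1 once the file is complete. */
static long receiver_step(chunked_file *f)
{
    size_t i;
    for (i = 0; i < f->n_chunks; i++) {
        size_t c = (f->fill_point + i) % f->n_chunks;
        if (!f->have[c]) {
            f->have[c] = 1; /* a real fetcher would issue a Range request here */
            f->fill_point = c + 1;
            return (long)c;
        }
    }
    return -1;
}
```

In this model a failed stream_read followed by one receiver_step is
enough for the retry to succeed when the read fits in one chunk; a
read spanning several missing chunks needs one failed attempt per
chunk.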
