Path: blob/main/crates/polars-parquet/src/arrow/read/deserialize/README.md
6940 views
Design
Non-nested types
Let's start with the design used for non-nested arrays. The (private) entry point of this module for non-nested arrays is simple::page_iter_to_arrays
.
This function expects
a (fallible) streaming iterator of decompressed and encoded pages,
Pages
the source (parquet) column type, including its logical information
the target (arrow)
DataType
the chunk size
and returns an iterator of Array
, ArrayIter
.
This design is shared among all (parquet, arrow)
implemented tuples. Their main difference is how they are deserialized, which depends on the source and target types.
When the array iterator is pulled the first time, the following happens:
a page from
Pages
is pulleda
PageState<'a>
is built from the pagethe
PageState
is consumed into a mutable array:if
chunk_size
is larger than the number of rows in the page, the mutable array state is preserved and a new page is pulled and the process repeated until we fill a chunk.if
chunk_size
is smaller than the number of rows in the page, the mutable array state is returned and the remaining of the page is consumed into multiple mutable arrays of lengthchunk_size
into a FIFO queue.
Subsequent pulls of arrays will first try to pull from the FIFO queue. Once the queue is empty, the a new page is pulled.
PageState
As mentioned above, the iterator leverages the idea that we attach a state to a page. Recall that a page is essentially [header][data]
. The data
part contains encoded [rep levels][def levels][non-null values]
. Some pages have an associated dictionary page, in which case the non-null values
represent the indices.
Irrespectively of the physical type, the main idea is to split the page in two iterators:
An iterator over
def levels
An iterator over
non-null values
and progress the iterators as needed. In particular, for non-nested types, def levels
is a bitmap with the same representation as Arrow, in which case the validity is extended directly.
The non-null values
are "expanded" by filling null values with the default value of each physical type.
Nested types
For nested type with N+1 levels (1 is the primitive), we need to build the nest information of each N levels + the non-nested Arrow array.
This is done by first transversing the parquet types and using it to initialize, per chunk, the N levels.
The per-chunk execution is then similar but chunk_size
only drives the number of retrieved rows from the outermost parquet group (the field). Each of these pulls knows how many items need to be pulled from the inner groups, all the way to the primitive type. This works because in parquet a row cannot be split between two pages and thus each page is guaranteed to contain a full row.
The PageState
of nested types is composed by 4 iterators:
A (zipped) iterator over
rep levels
anddef levels
An iterator over
def levels
An iterator over
non-null values
The idea is that an iterator of rep, def
contain all the information to decode the nesting structure of an arrow array. The other two iterators are equivalent to the non-nested types with the exception that def levels
are no equivalent to arrow bitmaps.