Book a Demo!
CoCalc Logo Icon
StoreFeaturesDocsShareSupportNewsAboutPoliciesSign UpSign In
pola-rs
GitHub Repository: pola-rs/polars
Path: blob/main/crates/polars-row/src/lib.rs
6939 views
1
//! Row format as defined in `arrow-rs`.
2
//! This currently partially implements that format only for needed types.
3
//! For completeness sake the format as defined by `arrow-rs` is as followed:
4
//! Converts [`ArrayRef`] columns into a [row-oriented](self) format.
5
//!
6
//! ## Overview
7
//!
8
//! The row format is a variable length byte sequence created by
9
//! concatenating the encoded form of each column. The encoding for
10
//! each column depends on its datatype (and sort options).
11
//!
12
//! The encoding is carefully designed in such a way that escaping is
13
//! unnecessary: it is never ambiguous as to whether a byte is part of
14
//! a sentinel (e.g. null) or a value.
15
//!
16
//! ## Unsigned Integer Encoding
17
//!
18
//! A null integer is encoded as a `0_u8`, followed by a zero-ed number of bytes corresponding
19
//! to the integer's length.
20
//!
21
//! A valid integer is encoded as `1_u8`, followed by the big-endian representation of the
22
//! integer.
23
//!
24
//! ```text
25
//! ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
26
//! 3 │03│00│00│00│ │01│00│00│00│03│
27
//! └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
28
//! ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
29
//! 258 │02│01│00│00│ │01│00│00│01│02│
30
//! └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
31
//! ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
32
//! 23423 │7F│5B│00│00│ │01│00│00│5B│7F│
33
//! └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
34
//! ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
35
//! NULL │??│??│??│??│ │00│00│00│00│00│
36
//! └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
37
//!
38
//! 32-bit (4 bytes) Row Format
39
//! Value Little Endian
40
//! ```
41
//!
42
//! ## Signed Integer Encoding
43
//!
44
//! Signed integers have their most significant sign bit flipped, and are then encoded in the
45
//! same manner as an unsigned integer.
46
//!
47
//! ```text
48
//! ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
49
//! 5 │05│00│00│00│ │05│00│00│80│ │01│80│00│00│05│
50
//! └──┴──┴──┴──┘ └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
51
//! ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┐ ┌──┬──┬──┬──┬──┐
52
//! -5 │FB│FF│FF│FF│ │FB│FF│FF│7F│ │01│7F│FF│FF│FB│
53
//! └──┴──┴──┴──┘ └──┴──┴──┴──┘ └──┴──┴──┴──┴──┘
54
//!
55
//! Value 32-bit (4 bytes) High bit flipped Row Format
56
//! Little Endian
57
//! ```
58
//!
59
//! ## Float Encoding
60
//!
61
//! Floats are converted from IEEE 754 representation to a signed integer representation
62
//! by flipping all bar the sign bit if they are negative after normalizing nans
63
//! and signed zeros to a canonical representation.
64
//!
65
//! They are then encoded in the same manner as a signed integer.
66
//!
67
//! ## Fixed Length Bytes Encoding
68
//!
69
//! Fixed length bytes are encoded in the same fashion as primitive types above.
70
//!
71
//! For a fixed length array of length `n`:
72
//!
73
//! A null is encoded as `0_u8` null sentinel followed by `n` `0_u8` bytes
74
//!
75
//! A valid value is encoded as `1_u8` followed by the value bytes
76
//!
77
//! ## Variable Length Bytes (including Strings) Encoding
78
//!
79
//! A null is encoded as a `0_u8`.
80
//!
81
//! An empty byte array is encoded as `1_u8`.
82
//!
83
//! A non-null, non-empty byte array is encoded as `2_u8` followed by the byte array
84
//! encoded using a block based scheme described below.
85
//!
86
//! The byte array is broken up into 32-byte blocks, each block is written in turn
87
//! to the output, followed by `0xFF_u8`. The final block is padded to 32-bytes
88
//! with `0_u8` and written to the output, followed by the un-padded length in bytes
89
//! of this final block as a `u8`.
90
//!
91
//! Note the following example encodings use a block size of 4 bytes,
92
//! as opposed to 32 bytes for brevity:
93
//!
94
//! ```text
95
//! ┌───┬───┬───┬───┬───┬───┐
96
//! "MEEP" │02 │'M'│'E'│'E'│'P'│04 │
97
//! └───┴───┴───┴───┴───┴───┘
98
//!
99
//! ┌───┐
100
//! "" │01 |
101
//! └───┘
102
//!
103
//! NULL ┌───┐
104
//! │00 │
105
//! └───┘
106
//!
107
//! "Defenestration" ┌───┬───┬───┬───┬───┬───┐
108
//! │02 │'D'│'e'│'f'│'e'│FF │
109
//! └───┼───┼───┼───┼───┼───┤
110
//! │'n'│'e'│'s'│'t'│FF │
111
//! ├───┼───┼───┼───┼───┤
112
//! │'r'│'a'│'t'│'r'│FF │
113
//! ├───┼───┼───┼───┼───┤
114
//! │'a'│'t'│'i'│'o'│FF │
115
//! ├───┼───┼───┼───┼───┤
116
//! │'n'│00 │00 │00 │01 │
117
//! └───┴───┴───┴───┴───┘
118
//! ```
119
//!
120
//! This approach is loosely inspired by [COBS] encoding, and chosen over more traditional
121
//! [byte stuffing] as it is more amenable to vectorisation, in particular AVX-256.
122
//!
123
//! For the unordered row encoding we use a simpler scheme, we prepend the length
124
//! encoded as 4 bytes followed by the raw data, with nulls being marked with a
125
//! length of u32::MAX.
126
//!
127
//! ## Dictionary Encoding
128
//!
129
//! [`RowsEncoded`] needs to support converting dictionary encoded arrays with unsorted, and
130
//! potentially distinct dictionaries. One simple mechanism to avoid this would be to reverse
131
//! the dictionary encoding, and encode the array values directly, however, this would lose
132
//! the benefits of dictionary encoding to reduce memory and CPU consumption.
133
//!
134
//! As such the [`RowsEncoded`] creates an order-preserving mapping
135
//! for each dictionary encoded column, which allows new dictionary
136
//! values to be added whilst preserving the sort order.
137
//!
138
//! A null dictionary value is encoded as `0_u8`.
139
//!
140
//! A non-null dictionary value is encoded as `1_u8` followed by a null-terminated byte array
141
//! key determined by the order-preserving dictionary encoding
142
//!
143
//! ```text
144
//! ┌──────────┐ ┌─────┐
145
//! │ "Bar" │ ───────────────▶│ 01 │
146
//! └──────────┘ └─────┘
147
//! ┌──────────┐ ┌─────┬─────┐
148
//! │"Fabulous"│ ───────────────▶│ 01 │ 02 │
149
//! └──────────┘ └─────┴─────┘
150
//! ┌──────────┐ ┌─────┐
151
//! │ "Soup" │ ───────────────▶│ 05 │
152
//! └──────────┘ └─────┘
153
//! ┌──────────┐ ┌─────┐
154
//! │ "ZZ" │ ───────────────▶│ 07 │
155
//! └──────────┘ └─────┘
156
//!
157
//! Example Order Preserving Mapping
158
//! ```
159
//! Using the map above, the corresponding row format will be
160
//!
161
//! ```text
162
//! ┌─────┬─────┬─────┬─────┐
163
//! "Fabulous" │ 01 │ 03 │ 05 │ 00 │
164
//! └─────┴─────┴─────┴─────┘
165
//!
166
//! ┌─────┬─────┬─────┐
167
//! "ZZ" │ 01 │ 07 │ 00 │
168
//! └─────┴─────┴─────┘
169
//!
170
//! ┌─────┐
171
//! NULL │ 00 │
172
//! └─────┘
173
//!
174
//! Input Row Format
175
//! ```
176
//!
177
//! ## Struct Encoding
178
//!
179
//! A null is encoded as a `0_u8`.
180
//!
181
//! A valid value is encoded as `1_u8` followed by the row encoding of each child.
182
//!
183
//! This encoding effectively flattens the schema in a depth-first fashion.
184
//!
185
//! For example
186
//!
187
//! ```text
188
//! ┌───────┬────────────────────────┬───────┐
189
//! │ Int32 │ Struct[Int32, Float32] │ Int32 │
190
//! └───────┴────────────────────────┴───────┘
191
//! ```
192
//!
193
//! Is encoded as
194
//!
195
//! ```text
196
//! ┌───────┬───────────────┬───────┬─────────┬───────┐
197
//! │ Int32 │ Null Sentinel │ Int32 │ Float32 │ Int32 │
198
//! └───────┴───────────────┴───────┴─────────┴───────┘
199
//! ```
200
//!
201
//! ## List Encoding
202
//!
203
//! Lists are encoded by first encoding all child elements to the row format.
204
//!
205
//! A "canonical byte array" is then constructed by concatenating the row
206
//! encodings of all their elements into a single binary array, followed
207
//! by the lengths of each encoded row, and the number of elements, encoded
208
//! as big endian `u32`.
209
//!
210
//! This canonical byte array is then encoded using the variable length byte
211
//! encoding described above.
212
//!
213
//! _The lengths are not strictly necessary but greatly simplify decode, they
214
//! may be removed in a future iteration_.
215
//!
216
//! For example given:
217
//!
218
//! ```text
219
//! [1_u8, 2_u8, 3_u8]
220
//! [1_u8, null]
221
//! []
222
//! null
223
//! ```
224
//!
225
//! The elements would be converted to:
226
//!
227
//! ```text
228
//! ┌──┬──┐ ┌──┬──┐ ┌──┬──┐ ┌──┬──┐ ┌──┬──┐
229
//! 1 │01│01│ 2 │01│02│ 3 │01│03│ 1 │01│01│ null │00│00│
230
//! └──┴──┘ └──┴──┘ └──┴──┘ └──┴──┘ └──┴──┘
231
//!```
232
//!
233
//! Which would be grouped into the following canonical byte arrays:
234
//!
235
//! ```text
236
//! ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
237
//! [1_u8, 2_u8, 3_u8] │01│01│01│02│01│03│00│00│00│02│00│00│00│02│00│00│00│02│00│00│00│03│
238
//! └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
239
//! └──── rows ────┘ └───────── row lengths ─────────┘ └─ count ─┘
240
//!
241
//! ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
242
//! [1_u8, null] │01│01│00│00│00│00│00│02│00│00│00│02│00│00│00│02│
243
//! └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
244
//!```
245
//!
246
//! With `[]` represented by an empty byte array, and `null` a null byte array.
247
//!
248
//! These byte arrays will then be encoded using the variable length byte encoding
249
//! described above.
250
//!
251
//! # Ordering
252
//!
253
//! ## Float Ordering
254
//!
255
//! Floats are totally ordered just like in the rest of Polars,
256
//! -inf < neg < -0.0 = 0.0 < pos < inf < nan, with all nans being equal.
257
//!
258
//! ## Null Ordering
259
//!
260
//! The encoding described above will order nulls first, this can be inverted by representing
261
//! nulls as `0xFF_u8` instead of `0_u8`
262
//!
263
//! ## Reverse Column Ordering
264
//!
265
//! The order of a given column can be reversed by negating the encoded bytes of non-null values
266
//!
267
//! [COBS]: https://en.wikipedia.org/wiki/Consistent_Overhead_Byte_Stuffing
268
//! [byte stuffing]: https://en.wikipedia.org/wiki/High-Level_Data_Link_Control#Asynchronous_framing
269
270
extern crate core;
271
272
pub mod decode;
273
pub mod encode;
274
pub(crate) mod fixed;
275
mod row;
276
mod utils;
277
pub(crate) mod variable;
278
mod widths;
279
280
use arrow::array::*;
281
pub type ArrayRef = Box<dyn Array>;
282
283
pub use encode::{
284
convert_columns, convert_columns_amortized, convert_columns_amortized_no_order,
285
convert_columns_no_order,
286
};
287
pub use row::{RowEncodingCategoricalContext, RowEncodingContext, RowEncodingOptions, RowsEncoded};
288
289