// GitHub Repository: pola-rs/polars
// Path: blob/main/crates/polars-arrow/src/io/ipc/mod.rs
//! APIs to read from and write to Arrow's IPC format.
//!
//! Inter-process communication is a method through which different processes
//! share and pass data between them. Its use-cases include parallel
//! processing of chunks of data across different CPU cores, transferring
//! data between different Apache Arrow implementations in other languages, and
//! more. Under the hood, Apache Arrow uses [FlatBuffers](https://google.github.io/flatbuffers/)
//! as its binary protocol, so every Arrow-centered streaming or serialization
//! problem that could be solved using FlatBuffers could probably be solved
//! using the more integrated approach that is exposed in this module.
//!
//! [Arrow's IPC protocol](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc)
//! allows only batch or dictionary columns to be passed
//! around due to its reliance on a pre-defined data schema. This constraint
//! provides a large performance gain because serialized data will always have a
//! known structure, i.e. the same fields and datatypes, with the only variance
//! being the number of rows and the actual data inside the Batch. This dramatically
//! increases the deserialization rate, as the bytes in the file or stream are already
//! structured "correctly".
//!
//! Reading and writing IPC messages is done using one of two variants - either
//! [`FileReader`](read::FileReader) <-> [`FileWriter`](struct@write::FileWriter) or
//! [`StreamReader`](read::StreamReader) <-> [`StreamWriter`](struct@write::StreamWriter).
//! These two variants wrap a type `T` that implements [`Read`](std::io::Read); in
//! the case of the `File` variant it must also implement [`Seek`](std::io::Seek). In
//! practice this means that `File`s can be accessed at arbitrary positions, while
//! `Stream`s are only read in the order they were written (first in, first out).
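//!
//! As a rough sketch of the file round trip (names follow this module's `read` and
//! `write` submodules, but the exact signatures vary between versions, hence `ignore`):
//!
//! ```ignore
//! use polars_arrow::io::ipc::{read, write};
//!
//! // Writing: wrap a `Write` sink in a `FileWriter` and push chunks into it.
//! let file = std::fs::File::create("data.arrow")?;
//! let options = write::WriteOptions { compression: None };
//! let mut writer = write::FileWriter::try_new(file, schema, None, options)?;
//! writer.write(&chunk, None)?;
//! writer.finish()?;
//!
//! // Reading: the `File` variant first reads the metadata (this needs `Seek`),
//! // then yields the record batches one by one.
//! let mut file = std::fs::File::open("data.arrow")?;
//! let metadata = read::read_file_metadata(&mut file)?;
//! for batch in read::FileReader::new(file, metadata, None, None) {
//!     let batch = batch?;
//!     // process the batch...
//! }
//! ```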
mod compression;
mod endianness;

pub mod append;
pub mod read;
pub mod write;
pub use arrow_format as format;

const ARROW_MAGIC_V1: [u8; 4] = [b'F', b'E', b'A', b'1'];
const ARROW_MAGIC_V2: [u8; 6] = [b'A', b'R', b'R', b'O', b'W', b'1'];
pub(crate) const CONTINUATION_MARKER: [u8; 4] = [0xff; 4];

/// Struct containing `dictionary_id` and nested `IpcField`, allowing users
/// to specify the dictionary ids of the IPC fields when writing to IPC.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct IpcField {
    /// optional children
    pub fields: Vec<IpcField>,
    /// dictionary id
    pub dictionary_id: Option<i64>,
}

impl IpcField {
    /// Check (recursively) whether the [`IpcField`] contains a dictionary.
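    ///
    /// A usage sketch (the public crate path is assumed here, hence `ignore`):
    ///
    /// ```ignore
    /// use polars_arrow::io::ipc::IpcField;
    ///
    /// let field = IpcField {
    ///     fields: vec![IpcField { fields: vec![], dictionary_id: Some(0) }],
    ///     dictionary_id: None,
    /// };
    /// // The dictionary is found in the nested child field.
    /// assert!(field.contains_dictionary());
    /// ```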
    pub fn contains_dictionary(&self) -> bool {
        self.dictionary_id.is_some() || self.fields.iter().any(|f| f.contains_dictionary())
    }
}

/// Struct containing fields and whether the file is written in little or big endian.
#[derive(Debug, Clone, PartialEq)]
pub struct IpcSchema {
    /// The fields in the schema
    pub fields: Vec<IpcField>,
    /// Endianness of the file
    pub is_little_endian: bool,
}