Rkyv Zero Copy Deserialization
An introduction to Rkyv, a zero-copy deserialization library for Rust.
Motivation
- First and foremost, the motivation behind rkyv is improved performance.
- The way that it achieves that goal can also lead to gains in memory use, correctness, and security, along the way.
- Most serialization frameworks, like serde, define an internal data model that consists of basic types such as primitives, strings, and byte arrays.
- This splits the work of serializing a type into two stages: frontend and backend.
- The frontend takes some type and breaks it down into the serializable types of the data model.
- The backend then takes the data model types and writes them using some data format such as JSON, Bincode, TOML, etc.
- This allows a clean separation between the serialization of a type and the data format it is written to.
- A major downside of traditional serialization is that it takes a considerable amount of time to read, parse, and reconstruct types from their serialized values.
- In JSON for example, strings are encoded by surrounding the contents with double quotes and escaping invalid characters inside of them:
{ "line": "\"All's well that ends well\"" }
          ^^                          ^ ^
numbers are turned into characters:
{ "count": 42 }
and even field names, which could be implicit in most cases, are turned into strings:
{ "message_size": 334 }
- All those characters are not only taking up space, they are also taking up time.
- Every time we read and parse JSON, we are picking through those characters in order to figure out what the values are and reproduce them in memory.
- An f32 is only four bytes of memory, but written out as text it can take nine bytes (for example, 1234.5678), and we still have to turn those nine characters into the right f32.
- The deserialization time adds up quickly, and in data-heavy applications such as games and media editing, it can come to dominate load times.
- rkyv provides a solution through a serialization technique called zero-copy deserialization.
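To put rough numbers on the f32 point above, here is a small sketch using only the Rust standard library. The function names are made up for this illustration, and 1234.5678 is just an arbitrary nine-character value.

```rust
use std::mem::size_of;

/// Decodes an `f32` from decimal text, as a text-format parser ultimately has to.
/// (Hypothetical helper for illustration.)
fn f32_from_text(text: &str) -> Option<f32> {
    // Every character has to be inspected to produce the value.
    text.parse().ok()
}

/// Decodes an `f32` from its four-byte little-endian binary form.
/// (Hypothetical helper for illustration.)
fn f32_from_binary(bytes: [u8; 4]) -> f32 {
    f32::from_le_bytes(bytes)
}

/// Returns (bytes used by the text encoding, bytes used by an `f32` in memory).
fn encoded_sizes(text: &str) -> (usize, usize) {
    (text.len(), size_of::<f32>())
}
```

Calling encoded_sizes("1234.5678") compares the nine bytes of text against the four bytes the value occupies in memory.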
Zero-Copy Deserialization
- Zero-copy deserialization is a technique that reduces the time and memory required to access and use data by directly referencing bytes in the serialized form.
- This takes advantage of the fact that the serialized data must already be loaded in memory in order to deserialize it. Suppose we had some JSON:
{ "quote": "I don't know, I didn't listen." }
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
- Instead of copying those characters into a String, we could just borrow them from the JSON buffer as a &str.
- The lifetime of that &str would depend on our buffer, and we wouldn't be allowed to drop the buffer until we had dropped the string we were using.
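As a minimal sketch of the borrowing idea above, the following uses only the standard library. The helper names borrow_text and copy_text are hypothetical, and the byte range is assumed to be already known (in the JSON above, the quote occupies bytes 12 to 42 of the buffer); a real parser would have to find it.

```rust
/// Borrows the text at `start..end` directly from `buffer`, without copying.
/// The returned `&str` shares the buffer's lifetime, so the buffer cannot be
/// dropped or mutated while the borrowed string is still in use.
fn borrow_text(buffer: &[u8], start: usize, end: usize) -> Option<&str> {
    let bytes = buffer.get(start..end)?;
    // The bytes are validated as UTF-8 but are not moved or duplicated.
    std::str::from_utf8(bytes).ok()
}

/// Copies the same text into a newly allocated `String` for comparison.
fn copy_text(buffer: &[u8], start: usize, end: usize) -> Option<String> {
    borrow_text(buffer, start, end).map(|text| text.to_owned())
}
```

The borrowed string points into the original buffer, while the copied String owns a separate allocation.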
Partial zero-copy
- Serde and others have support for partial zero-copy deserialization, where bits and pieces of the deserialized data are borrowed from the serialized form.
- Strings, for example, can borrow their bytes directly from the serialized form in encodings like bincode that don't perform any character escaping.
- However, a string object must still be created to hold the deserialized length and point to the borrowed characters.
- A good way to think about this is that even though we are borrowing lots of data from the buffer, we still have to parse the structure out:
struct Example<'a> {
    quote: &'a str,
    a: &'a [u8; 12],
    b: u64,
    c: char,
}
- So a buffer might break down like this:
I don't know, I didn't listen.AAAAAAAAAAAABBBBBBBBCCCC
^-----------------------------^-----------^-------^---
quote: str                    a: [u8; 12] b: u64  c: char
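- We do a lot less work, but we still have to parse, create, and return an Example<'a>:
Example {
    // Borrowed directly from the buffer after UTF-8 validation.
    quote: str::from_utf8(&buffer[0..30]).unwrap(),
    // Borrowed as a fixed-size array reference; bytes have no alignment requirement.
    a: buffer[30..42].try_into().unwrap(),
    // Copied out of the buffer and decoded, not borrowed.
    b: u64::from_le_bytes(buffer[42..50].try_into().unwrap()),
    c: char::from_u32(u32::from_le_bytes(buffer[50..54].try_into().unwrap())).unwrap(),
}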
- And we can't borrow types like u64 or char that have alignment requirements, since our buffer might not be properly aligned.
- We have to immediately parse and store those!
- Even though we borrowed 42 of the buffer's bytes, we missed out on the last 12 and still had to parse through the buffer to find out where everything is.
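For a version of the partial zero-copy parse that can be compiled and run, here is a sketch that hard-codes the 54-byte layout from the diagram above. The names parse_example and sample_buffer are hypothetical, and the little-endian encoding of b and c is an assumption carried over from the snippet above, not the format of any particular library.

```rust
use std::convert::TryInto; // already in the prelude from edition 2021 onward

#[derive(Debug, PartialEq)]
struct Example<'a> {
    quote: &'a str,
    a: &'a [u8; 12],
    b: u64,
    c: char,
}

/// Builds the 54-byte buffer from the diagram: 30 quote bytes, 12 bytes for
/// `a`, 8 bytes for `b`, and 4 bytes for `c`.
fn sample_buffer() -> Vec<u8> {
    let mut buffer = Vec::with_capacity(54);
    buffer.extend_from_slice(b"I don't know, I didn't listen.");
    buffer.extend_from_slice(&[b'A'; 12]);
    buffer.extend_from_slice(&7u64.to_le_bytes());
    buffer.extend_from_slice(&('c' as u32).to_le_bytes());
    buffer
}

/// Parses an `Example` out of `buffer`, borrowing `quote` and `a` and copying
/// `b` and `c`. Returns `None` if the buffer is too short or malformed.
fn parse_example<'a>(buffer: &'a [u8]) -> Option<Example<'a>> {
    if buffer.len() < 54 {
        return None;
    }
    // Borrowed: these two fields point straight into the buffer.
    let quote = std::str::from_utf8(&buffer[0..30]).ok()?;
    let a: &[u8; 12] = buffer[30..42].try_into().ok()?;
    // Copied: the buffer may be misaligned for `u64` and `u32`, so the bytes
    // are read out and decoded into new values.
    let b = u64::from_le_bytes(buffer[42..50].try_into().ok()?);
    let c = std::char::from_u32(u32::from_le_bytes(buffer[50..54].try_into().ok()?))?;
    Some(Example { quote, a, b, c })
}
```

After parsing, quote and a still point into the buffer, while b and c are independent copies, which is the split between borrowed and parsed bytes described above.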
Total zero-copy
- rkyv implements total zero-copy deserialization, which guarantees that no data is copied during deserialization and no work is done to deserialize data.
- It achieves this by structuring its encoded representation so that it is the same as the in-memory representation of the source type.
- This is more like if our buffer was an Example:
struct Example {
    quote: String,
    a: [u8; 12],
    b: u64,
    c: char,
}
- And our buffer looked like this:
I don't know, I didn't listen.__QOFFQLENAAAAAAAAAAAABBBBBBBBCCCC
^-----------------------------  ^---^---^-----------^-------^---
quote bytes                     pointer a           b       c
                                and len
                                ^-------------------------------
                                Example
- In this case, the bytes are padded to the correct alignment and the fields of Example are laid out exactly the same as they would be in memory.
- Our deserialization code can be much simpler:
unsafe { &*buffer.as_ptr().add(32).cast() }
- This operation is almost zero work, and more importantly, does not scale with our data.
- No matter how much or how little data we have, it is always just a pointer offset and a cast to access our data.
- This opens up blazingly fast data loading and enables data access orders of magnitude more quickly than traditional serialization techniques.
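The pointer-offset-and-cast idea can be sketched with the standard library alone. Everything named here (ArchivedExample, AlignedBuffer, ROOT, write_example, access) is hypothetical and is not rkyv's API or its actual format. The sketch assumes native byte order, stores the quote's position as an offset from the start of the buffer, and reorders the fields so that the repr(C) layout contains no padding bytes.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical archived form of `Example`: fixed-size fields only, with the
/// string replaced by an offset/length pair locating its bytes in the buffer.
#[repr(C)]
struct ArchivedExample {
    quote_offset: u32,
    quote_len: u32,
    b: u64,
    a: [u8; 12],
    c: u32, // a `char` stored as its scalar value
}

/// Backing storage with 8-byte alignment so the struct can be viewed in place.
#[repr(C, align(8))]
struct AlignedBuffer([u8; 64]);

/// Assumed position of the struct inside the buffer.
const ROOT: usize = 32;

/// Writes the quote bytes at the front of the buffer and the struct at `ROOT`.
fn write_example(quote: &str, a: [u8; 12], b: u64, c: char) -> AlignedBuffer {
    assert!(quote.len() <= ROOT);
    assert_eq!(ROOT % align_of::<ArchivedExample>(), 0);
    assert!(ROOT + size_of::<ArchivedExample>() <= 64);
    let mut buffer = AlignedBuffer([0; 64]);
    buffer.0[..quote.len()].copy_from_slice(quote.as_bytes());
    let archived = ArchivedExample {
        quote_offset: 0,
        quote_len: quote.len() as u32,
        b,
        a,
        c: c as u32,
    };
    // SAFETY: the asserts above guarantee the destination is in bounds and aligned.
    unsafe { buffer.0.as_mut_ptr().add(ROOT).cast::<ArchivedExample>().write(archived) };
    buffer
}

/// "Deserializes" by offsetting a pointer and casting it; no bytes are copied.
fn access(buffer: &AlignedBuffer) -> &ArchivedExample {
    // SAFETY: in bounds and aligned (see `write_example`); every bit pattern
    // is a valid value for the integer and byte-array fields.
    unsafe { &*buffer.0.as_ptr().add(ROOT).cast::<ArchivedExample>() }
}

impl ArchivedExample {
    /// Resolves the quote by slicing the buffer it was archived in.
    fn quote<'a>(&self, buffer: &'a AlignedBuffer) -> Option<&'a str> {
        let start = self.quote_offset as usize;
        let end = start.checked_add(self.quote_len as usize)?;
        std::str::from_utf8(buffer.0.get(start..end)?).ok()
    }
}
```

In this sketch, access does the same amount of work regardless of how long the quote is. The quote accessor still validates UTF-8 on each call to stay within safe code, a cost that a real implementation would want to pay at most once.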