Binary Serialization with Google Protocol Buffers

Google’s Protocol Buffers are a flexible, compact and extensible mechanism for serializing structured data. We share our implementation that makes it easy to serialize Delphi records to a binary format that is 100% compatible with the Protocol Buffers specification.

Want to go straight to the source? You can find it on GitHub in our GrijjyFoundation repository in the single unit Grijjy.ProtocolBuffers.

Protocol Buffers?

In Google’s own words: “Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.”

You can use it as a binary interchange format to send data over the wire, to communicate with third-party applications that support Protocol Buffers, or even to create a lightweight but extensible file format.

Protocol Buffers are:

  • Flexible and extensible: You can update your data structure (or protocol) without breaking deployed programs that use an “older” format.
  • Compact: The serialized format is compact, making it ideal for transfer over (wireless) networks. According to Google, the output is 3 to 10 times smaller than the corresponding data in XML format.
  • Fast to parse and generate: The binary format can be generated and parsed very quickly. According to Google, it can be 20 to 100 times faster to parse compared to XML.

Of course, there are other binary serialization formats out there, each with its own strengths and weaknesses. For example, in a previous post, we presented our JSON and BSON library, which can be used (among other things) to convert JSON to BSON (binary JSON) and vice versa. BSON is good for representing JSON data in a fast-to-parse format, but it is not designed to be compact. Also, formats like JSON and XML are good for representing unstructured data. Protocol Buffers require that your data is structured (although it is flexible enough to update the structure in the future).

At Grijjy, we use Protocol Buffers to transmit data between our BAAS and our (mobile) apps.

To get a quick taste, this is all that is required to serialize a record:

type
  TPerson = record
    [Serialize(1)] Name: String;
    [Serialize(2)] Id: Integer;
    [Serialize(3)] Email: String;
  end;

procedure SimpleSerialization;
var
  Person: TPerson;
  Data: TBytes;
begin
  Person.Name := 'John Doe';
  Person.Id := 1234;
  Person.Email := 'jdoe@example.com';
  Data := TgoProtocolBuffer.Serialize(Person);
end;

Want to know more? Keep reading!

Default Implementation

Google designed the Protocol Buffers specification with language-neutrality in mind. To achieve this goal, they developed a language-independent protocol definition format to specify how information must be serialized. For example, to serialize person data using the example above, you would have to create a protocol definition file (.proto file) that could look like this:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

It describes a message, which is just a collection of fields, where each field is uniquely identified with an integer Tag.

The next step would be to use a “protocol buffer compiler” to compile this file and generate source code for a specific language. This automatically generated source code can then be used to serialize and deserialize data in protocol buffer format.

To use protocol buffers with Delphi, this means that you would need a specific protocol buffer compiler that generates Delphi source code. There are some open source projects out there that do just this.

We use a different approach though…

Alternative Implementation

At Grijjy, we love the Protocol Buffer format and the efficiency and extensibility it brings. But the need for separate protocol definition files and a protocol compiler: not so much. It is yet another “language” for you to learn and an additional step in your build process.

We think we can achieve the same goal using the language you already know, by defining the protocol as regular Delphi records, decorated with attributes to customize behavior. For example:

type
  TPhoneType = (Mobile, Home, Work);

type
  TPhoneNumber = record
  public
    [Serialize(1)] Number: String;
    [Serialize(2)] PhoneType: TPhoneType;
  public
    procedure Initialize;
  end;

type
  TPerson = record
  public
    [Serialize(1)] Name: String;
    [Serialize(2)] Id: Integer;
    [Serialize(3)] Email: String;
    [Serialize(4)] MainPhone: TPhoneNumber;
    [Serialize(5)] OtherPhones: TArray<TPhoneNumber>;
  public
    procedure Initialize;
  end;

These are plain vanilla Delphi records with just one difference: each field must be preceded by a Serialize attribute with a single integer tag parameter that uniquely identifies the field.

In contrast to XML and JSON, Protocol Buffers use numeric tags instead of strings to associate values with corresponding fields. This allows for more compact serialization since integer values take up less space than string values.

The tags only need to be unique within the record. An exception will be raised when a record contains duplicate tags. Tags don’t have to be (and should not be) globally unique. In the example above, tag number 1 is used to identify both the Number field of the TPhoneNumber record and the Name field of the TPerson record. Even though the TPerson record also has a field of type TPhoneNumber, this doesn’t pose a problem.

Tags start at 1 and must be positive. You should reserve tags 1-15 for the most common fields, since these tags are stored most efficiently (using only a single byte). (In case you are wondering: tags 16-2047 use two bytes, and higher tags take more.)
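The byte counts above follow from how the Protocol Buffers wire format stores a field key: the tag is shifted left by 3 bits, combined with a 3-bit wire type, and written as a VarInt. The following is a minimal, stand-alone sketch of that arithmetic. The names VarIntSize and FieldKeySize are made up for illustration and are not part of Grijjy.ProtocolBuffers.

```delphi
program TagKeySize;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: number of bytes needed to store Value as a
// base-128 VarInt (7 payload bits per byte).
function VarIntSize(Value: UInt64): Integer;
begin
  Result := 1;
  while Value >= $80 do
  begin
    Value := Value shr 7;
    Inc(Result);
  end;
end;

// Hypothetical helper: size of a field key, which is the tag shifted
// left by 3 bits, combined with a 3-bit wire type, stored as a VarInt.
function FieldKeySize(Tag: Cardinal; WireType: Byte): Integer;
begin
  Result := VarIntSize((UInt64(Tag) shl 3) or UInt64(WireType and 7));
end;

begin
  Writeln(FieldKeySize(15, 0));   // 1
  Writeln(FieldKeySize(16, 0));   // 2
  Writeln(FieldKeySize(2047, 0)); // 2
  Writeln(FieldKeySize(2048, 0)); // 3
end.
```

Because three bits of the first byte go to the wire type and one bit is the continuation flag, only four bits remain for the tag, which is why 15 is the largest tag that fits in a single byte.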

Records are serialized in an extensible way. You can add, delete and reorder fields without breaking compatibility with older bit streams. However, you should never change the tag or data type of a field once bit streams are already “published”.

Our engine uses Run Time Type Information (RTTI) to derive the binary format from these record definitions, much like our JSON/BSON serialization engine does. It does so efficiently, however, so serialization and deserialization are pretty fast.

Supported Data Types

You can use a wide variety of Delphi data types for your serializable fields:

  • UInt8 (Byte), UInt16 (Word), UInt32 (Longword/Cardinal), UInt64.
  • Int8 (Shortint), Int16 (Smallint), Int32 (Longint/Integer), Int64.
  • Single, Double.
  • Boolean.
  • Enumerated types, as long as the type does not contain any explicitly assigned values (since Delphi does not provide RTTI for these).
  • Records (that is, your field can be of another record type).
  • Strings (only Unicode strings).
  • TBytes (for raw binary data).
  • 1-dimensional dynamic arrays (TArray<>) of the types described above.

The integer data types are stored in an efficient “VarInt” format. This means that smaller values are stored in fewer bytes than larger values. 32-bit integer types are stored in 1-5 bytes, and 64-bit integer types are stored in 1-10 bytes. Sometimes, you can have integer data that contains random values across the entire 32-bit or 64-bit range. Common examples of these are CRC or hash values and time stamps. In those cases, it is more efficient to store these integers as fixed 32-bit or 64-bit values. You can do this by declaring the field as one of 4 fixed integer types:

  • FixedInt32, FixedUInt32
  • FixedInt64, FixedUInt64

As a tech aside: when you look at the definitions of these types in the source code, you will notice that these are declared as distinct types (using an extra type keyword):

type
  FixedInt32 = type Int32;

In case you are not familiar with this: a distinct type is just an alias for another type, but it has its own RTTI. This makes it possible to use RTTI to distinguish between “regular” integers and these “fixed” integers.
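To make the VarInt versus fixed-size trade-off more concrete, here is a small stand-alone sketch of base-128 VarInt encoding as defined by the Protocol Buffers wire format. EncodeVarInt is a hypothetical helper written for this illustration. It is not the routine used inside Grijjy.ProtocolBuffers, and it only covers unsigned values.

```delphi
program VarIntDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: encodes Value as a base-128 VarInt. Each byte
// carries 7 data bits, least significant group first. The high bit is
// set on every byte except the last one.
function EncodeVarInt(Value: UInt64): TBytes;
var
  Count: Integer;
begin
  SetLength(Result, 10); // a 64-bit value needs at most 10 bytes
  Count := 0;
  while Value >= $80 do
  begin
    Result[Count] := Byte(Value and $7F) or $80;
    Value := Value shr 7;
    Inc(Count);
  end;
  Result[Count] := Byte(Value);
  SetLength(Result, Count + 1);
end;

begin
  Writeln(Length(EncodeVarInt(1)));            // 1
  Writeln(Length(EncodeVarInt(300)));          // 2
  Writeln(Length(EncodeVarInt($FFFFFFFF)));    // 5
  Writeln(Length(EncodeVarInt(High(UInt64)))); // 10
end.
```

Any 32-bit value of 2^28 or more needs 5 VarInt bytes, one more than a fixed 32-bit field. For evenly distributed data such as hashes, that is the common case, which is why the fixed types pay off there.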

All other data types cannot be used for serializable fields. An exception will be raised when an unsupported data type is encountered. However, you can still use these types for regular (non-serializable) fields. In particular, the following types are not supported:

  • Extended, Comp, Currency.
  • Class, Object, Interface.
  • Enumerated types with explicitly assigned values.
  • AnsiString, RawByteString, UTF8String, UCS4String etc.
  • Static arrays.
  • Multi-dimensional dynamic arrays.

Using the Serializer

Serializing is very easy. You just fill your record with the values you want to serialize and call:

TgoProtocolBuffer.Serialize<TPerson>(MyPerson, 'Person.dat');

This is a generic method with a type parameter that must match the type of the record you are serializing.

Since Delphi is able to infer the generic type from the first parameter, you can also write this a little bit shorter:

TgoProtocolBuffer.Serialize(MyPerson, 'Person.dat');

You can serialize to a file, stream or TBytes array.

Deserializing is equally simple:

TgoProtocolBuffer.Deserialize(MyPerson, 'Person.dat');

Because all fields in a record are optional, some fields may not be in the stream. To prevent the record from having uninitialized fields after deserialization, the record is cleared before it is deserialized (that is, all fields are set to 0 or nil).

You can also provide your own means of initializing the record with default values. To do that, you have to add a parameterless Initialize procedure to your record. Then, the deserialization process will call that routine after clearing the record (so it will still clear any fields you don’t initialize yourself).
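As a rough sketch of how such an Initialize procedure could look, the program below gives TPhoneNumber a non-zero default for PhoneType. TOldPhoneNumber is a hypothetical record invented here only to produce data that lacks tag 2, and the file name is arbitrary. The calls use the file-based Serialize and Deserialize forms shown earlier. Going by the description above, PhoneType should keep the value set in Initialize, since the data contains no value for it.

```delphi
program InitializeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  Grijjy.ProtocolBuffers;

type
  TPhoneType = (Mobile, Home, Work);

  // Hypothetical older layout that has no PhoneType field yet.
  TOldPhoneNumber = record
  public
    [Serialize(1)] Number: String;
  end;

  TPhoneNumber = record
  public
    [Serialize(1)] Number: String;
    [Serialize(2)] PhoneType: TPhoneType;
  public
    procedure Initialize;
  end;

procedure TPhoneNumber.Initialize;
begin
  // Per the text above, this is called after the record is cleared,
  // so the value is only kept when tag 2 is absent from the data.
  PhoneType := TPhoneType.Home;
end;

var
  OldPhone: TOldPhoneNumber;
  Phone: TPhoneNumber;
begin
  // Write data that contains only tag 1.
  OldPhone.Number := '555-0100';
  TgoProtocolBuffer.Serialize<TOldPhoneNumber>(OldPhone, 'Phone.dat');

  // Read it back into the newer record type.
  TgoProtocolBuffer.Deserialize(Phone, 'Phone.dat');
  Writeln(Phone.Number);
  Writeln(Ord(Phone.PhoneType)); // ordinal of the PhoneType that was kept
end.
```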

Changing the Protocol

The protocol you define is not set in stone. If it no longer fits your needs, but you still need to be able to read data in an older format, then you don’t have to find some way to import older formats and convert them to a new format. You can just update your protocol definition (attributed records) without breaking existing code as long as you follow these few rules:

  • Don’t change the tag values for existing fields.
  • Don’t change the data types for existing fields.
  • You can always reorder fields.
  • You can always add new fields with new tag values.
  • You can also remove fields you don’t need anymore. Just don’t reuse their tag values for any new fields you add.
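The rules above can be illustrated with two versions of the same record. This is only a sketch: TPersonV1 and TPersonV2 are hypothetical names (in a real project you would edit one record in place), and it assumes the deserializer skips tags it does not recognize, as the compatibility rules imply. It uses the file-based Serialize and Deserialize calls shown earlier.

```delphi
program ProtocolChangeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  Grijjy.ProtocolBuffers;

type
  // Hypothetical "old" version of the protocol.
  TPersonV1 = record
  public
    [Serialize(1)] Name: String;
    [Serialize(2)] Id: Integer;
  end;

  // Hypothetical "new" version: fields reordered and a new field added
  // under a new tag. Existing tags and data types are unchanged.
  TPersonV2 = record
  public
    [Serialize(3)] Email: String;
    [Serialize(2)] Id: Integer;
    [Serialize(1)] Name: String;
  end;

var
  NewPerson: TPersonV2;
  OldPerson: TPersonV1;
begin
  NewPerson.Name := 'John Doe';
  NewPerson.Id := 1234;
  NewPerson.Email := 'jdoe@example.com';
  TgoProtocolBuffer.Serialize<TPersonV2>(NewPerson, 'PersonV2.dat');

  // An "older" program reads the newer data. Tags 1 and 2 still map to
  // Name and Id. Tag 3 is unknown to TPersonV1.
  TgoProtocolBuffer.Deserialize(OldPerson, 'PersonV2.dat');
  Writeln(OldPerson.Name);
  Writeln(OldPerson.Id);
end.
```

Fields are matched by tag rather than by position, which is why reordering them is harmless while changing a tag is not.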

Try it Yourself

Go give it a try. Head over to GitHub to pull our GrijjyFoundation repository. Among all the other goodies in there you will find the Grijjy.ProtocolBuffers unit. You can also check out the unit tests (in the UnitTests sub-directory) for more examples of how to use the engine. There is API documentation as well.
