
Binary Serialization with Google Protocol Buffers

Google’s Protocol Buffers are a flexible, compact and extensible mechanism for serializing structured data. We share our implementation that makes it easy to serialize Delphi records to a binary format that is 100% compatible with the Protocol Buffers specification.

Want to go straight to the source? You can find it on GitHub in our GrijjyFoundation repository in the single unit Grijjy.ProtocolBuffers.

Protocol Buffers?

In Google’s own words: “Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.”

You can use it as a binary interchange format to send data over the wire, to communicate with third-party applications that support Protocol Buffers, or even to create a lightweight but extensible file format.

Protocol Buffers are:

  • Flexible and extensible: You can update your data structure (or protocol) without breaking deployed programs that use an “older” format.
  • Compact: The serialized format is compact, making it ideal for transfer over (wireless) networks. The output is 3 to 10 times smaller than corresponding data in XML or JSON format.
  • Fast to parse and generate: The binary format can be generated and parsed very quickly. According to Google, it can be 20 to 100 times faster to parse compared to XML.

Of course, there are other binary serialization formats out there, each with its own strengths and weaknesses. For example, in a previous post, we presented our JSON and BSON library, which can be used (among other things) to convert JSON to BSON (binary JSON) and vice versa. BSON is good for representing JSON data in a fast-to-parse format, but it is not designed to be compact. Also, formats like JSON and XML are good for representing unstructured data. Protocol Buffers require that your data is structured (although it is flexible enough to update the structure in the future).

At Grijjy, we use Protocol Buffers to transmit data between our BAAS and our (mobile) apps.

To get a quick taste, this is all that is required to serialize a record:

type
  TPerson = record
    [Serialize(1)] Name: String;
    [Serialize(2)] Id: Integer;
    [Serialize(3)] Email: String;
  end;

procedure SimpleSerialization;
var
  Person: TPerson;
  Data: TBytes;
begin
  Person.Name := 'John Doe';
  Person.Id := 1234;
  Person.Email := 'jdoe@example.com';
  Data := TgoProtocolBuffer.Serialize(Person);
end;

Want to know more? Keep reading!

Default Implementation

Google designed the Protocol Buffers specification with language-neutrality in mind. To achieve this goal, they developed a language-independent protocol definition format to specify how information must be serialized. For example, to serialize person data using the example above, you would have to create a protocol definition file (.proto file) that could look like this:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

It describes a message, which is just a collection of fields, where each field is uniquely identified with an integer Tag.

The next step would be to use a “protocol buffer compiler” to compile this file and generate source code for a specific language. This automatically generated source code can then be used to serialize and deserialize data in protocol buffer format.

To use protocol buffers with Delphi, this means that you would need a specific protocol buffer compiler that generates Delphi source code. There are some open source projects out there that do just this.

We use a different approach though…

Alternative Implementation

At Grijjy, we love the Protocol Buffer format and the efficiency and extensibility it brings. But the need for separate protocol definition files and a protocol compiler: not so much. It is yet another “language” for you to learn and an additional step in your build process.

We think we can achieve the same goal using the language you already know, by defining the protocol as regular Delphi records, decorated with attributes to customize behavior. For example:

type
  TPhoneType = (Mobile, Home, Work);

type
  TPhoneNumber = record
  public
    [Serialize(1)] Number: String;
    [Serialize(2)] PhoneType: TPhoneType;
  public
    procedure Initialize;
  end;

type
  TPerson = record
  public
    [Serialize(1)] Name: String;
    [Serialize(2)] Id: Integer;
    [Serialize(3)] Email: String;
    [Serialize(4)] MainPhone: TPhoneNumber;
    [Serialize(5)] OtherPhones: TArray<TPhoneNumber>;
  public
    procedure Initialize;
  end;

These are plain vanilla Delphi records with just one difference: each field must be preceded by a Serialize attribute with a single integer tag parameter that uniquely identifies the field.

In contrast to XML and JSON, Protocol Buffers use numeric tags instead of strings to associate values with corresponding fields. This allows for more compact serialization since integer values take up less space than string values.

The tags only need to be unique within the record. An exception will be raised when a record contains duplicate tags. Tags don’t have to be (and should not be) globally unique. In the example above, tag number 1 is used to identify both the Number field of the TPhoneNumber record and the Name field of the TPerson record. Even though the TPerson record also has a field of type TPhoneNumber, this doesn’t pose a problem.

Tags start at 1 and must be positive. You should reserve tags 1-15 for the most common fields, since these tags are stored most efficiently (using only a single byte). (In case you are wondering: tags 16-2047 use two bytes and other tags take more bytes).
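The byte counts above follow from how the standard Protocol Buffers wire format stores a field key: the tag is shifted left by 3 bits to make room for a wire type, and the result is written as a varint with 7 payload bits per byte. The following is a minimal sketch of that arithmetic. It does not call Grijjy.ProtocolBuffers, and the names VarIntSize and KeySize are made up for this illustration.

```pascal
program TagKeySize;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

{ Number of bytes a value occupies as a base-128 varint
  (7 payload bits per byte). Hypothetical helper for this sketch. }
function VarIntSize(Value: UInt64): Integer;
begin
  Result := 1;
  while Value >= $80 do
  begin
    Value := Value shr 7;
    Inc(Result);
  end;
end;

{ Size of a field key in the standard wire format: the tag is shifted
  left by 3 bits (the low 3 bits hold the wire type) and the result is
  stored as a varint. The wire type itself never changes the size. }
function KeySize(Tag: Integer): Integer;
begin
  Result := VarIntSize(UInt64(Tag) shl 3);
end;

const
  Tags: array[0..4] of Integer = (1, 15, 16, 2047, 2048);
var
  I: Integer;
begin
  for I := Low(Tags) to High(Tags) do
    WriteLn(Format('tag %d -> %d key byte(s)', [Tags[I], KeySize(Tags[I])]));
end.
```

For the five sample tags this reports 1, 1, 2, 2 and 3 key bytes, which matches the 1-15 and 16-2047 boundaries mentioned above.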

Records are serialized in an extensible way. You can add, delete and reorder fields without breaking compatibility with older bit streams. However, you should never change the tag or data type of a field once bit streams are already “published”.

Our engine uses Run Time Type Information (RTTI) to determine the binary format from these record definitions, much like our JSON/BSON serialization engine does. It does this efficiently, however, so serialization and deserialization are pretty fast.

Supported Data Types

You can use a wide variety of Delphi data types for your serializable fields:

  • UInt8 (Byte), UInt16 (Word), UInt32 (Longword/Cardinal), UInt64.
  • Int8 (Shortint), Int16 (Smallint), Int32 (Longint/Integer), Int64.
  • Single, Double.
  • Boolean.
  • Enumerated types, as long as the type does not contain any explicitly assigned values (since Delphi does not provide RTTI for these).
  • Records (that is, your field can be of another record type).
  • Strings (only Unicode strings).
  • TBytes (for raw binary data).
  • 1-dimensional dynamic arrays (TArray<>) of the types described above.

The integer data types are stored in an efficient “VarInt” format. This means that smaller values are stored in fewer bytes than larger values. 32-bit integer types are stored in 1-5 bytes, and 64-bit integer types are stored in 1-10 bytes. Sometimes, you can have integer data that contains random values across the entire 32-bit or 64-bit range. Common examples of these are CRC or hash values and time stamps. In those cases, it is more efficient to store these integers as fixed 32-bit or 64-bit values. You can do this by declaring the field as one of 4 fixed integer types:

  • FixedInt32, FixedUInt32
  • FixedInt64, FixedUInt64
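To see where the 1-5 and 1-10 byte ranges come from, here is a small sketch of base-128 varint encoding as described in the Protocol Buffers specification. It is a standalone illustration, not the library's own writer. EncodeVarInt and Show are hypothetical names, and it only covers unsigned values (how the engine maps signed integers onto varints is not shown here).

```pascal
program VarIntDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

{ Encodes Value as a base-128 varint: 7 bits per byte, least significant
  group first, with the high bit set on every byte except the last. }
function EncodeVarInt(Value: UInt64): TBytes;
var
  Count: Integer;
begin
  SetLength(Result, 10); // a 64-bit value needs at most 10 bytes
  Count := 0;
  while Value >= $80 do
  begin
    Result[Count] := Byte(Value and $7F) or $80;
    Value := Value shr 7;
    Inc(Count);
  end;
  Result[Count] := Byte(Value);
  SetLength(Result, Count + 1);
end;

{ Prints a value followed by its varint bytes in hexadecimal. }
procedure Show(Value: UInt64);
var
  B: Byte;
  Hex: String;
begin
  Hex := '';
  for B in EncodeVarInt(Value) do
    Hex := Hex + IntToHex(B, 2) + ' ';
  WriteLn(UIntToStr(Value), ' -> ', Hex);
end;

begin
  Show(1);              // 1 byte
  Show(300);            // 2 bytes: AC 02
  Show(High(Cardinal)); // 5 bytes
  Show(High(UInt64));   // 10 bytes
end.
```

A value that uses all 32 bits costs 5 bytes as a varint but only 4 bytes as a fixed 32-bit value, which is why the fixed types pay off for CRCs and hashes.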

As a tech aside: when you look at the definitions of these types in the source code, you will notice that these are declared as distinct types (using an extra type keyword):

type
  FixedInt32 = type Int32;

In case you are not familiar with this: a distinct type is just an alias for another type, but it has its own RTTI. This makes it possible to use RTTI to distinguish between “regular” integers and these “fixed” integers.
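If you want to check this for yourself, the sketch below compares the RTTI pointers of a plain alias and a distinct type. PlainAlias is a name invented for this example, and FixedInt32 is redeclared locally so the program does not depend on the Grijjy unit.

```pascal
program DistinctTypeDemo;

{$APPTYPE CONSOLE}

uses
  System.TypInfo;

type
  PlainAlias = Int32;       // plain alias: same type as Int32
  FixedInt32 = type Int32;  // distinct type: gets its own RTTI

begin
  // A plain alias is expected to share the type information of Int32...
  WriteLn('PlainAlias shares RTTI with Int32: ',
    TypeInfo(PlainAlias) = TypeInfo(Int32));

  // ...while the distinct type should have a type information block of
  // its own, which is what lets an RTTI-based engine tell them apart.
  WriteLn('FixedInt32 shares RTTI with Int32: ',
    TypeInfo(FixedInt32) = TypeInfo(Int32));

  WriteLn('RTTI name of FixedInt32: ', GetTypeName(TypeInfo(FixedInt32)));
end.
```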

No other data types can be used for serializable fields. An exception will be raised when an unsupported data type is encountered. However, you can still use these types for regular (non-serializable) fields. In particular, the following types are not supported:

  • Extended, Comp, Currency.
  • Class, Object, Interface.
  • Enumerated types with explicitly assigned values.
  • AnsiString, RawByteString, UTF8String, UCS4String etc.
  • Static arrays.
  • Multi-dimensional dynamic arrays.

Using the Serializer

Serializing is very easy. You just fill your record with the values you want to serialize and call:

TgoProtocolBuffer.Serialize<TPerson>(MyPerson, 'Person.dat');

This is a generic method with a type parameter that must match the type of the record you are serializing.

Since Delphi is able to infer the generic type from the first parameter, you can also write this a little bit shorter:

TgoProtocolBuffer.Serialize(MyPerson, 'Person.dat');

You can serialize to a file, stream or TBytes array.

Deserializing is equally simple:

TgoProtocolBuffer.Deserialize(MyPerson, 'Person.dat');

Because all fields in a record are optional, some fields may not be in the stream. To prevent the record from having uninitialized fields after deserialization, the record is cleared before it is deserialized (that is, all fields are set to 0 or nil).

You can also provide your own means of initializing the record with default values. To do that, you have to add a parameterless Initialize procedure to your record. Then, the deserialization process will call that routine after clearing the record (so it will still clear any fields you don’t initialize yourself).
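As an illustration, here is a rough sketch of a record with its own Initialize routine. It needs the Grijjy.ProtocolBuffers unit on the search path. The TOldPhoneNumber record, the file name and the chosen default value are assumptions made up for this example. The older record is only there to produce a file that lacks tag 2.

```pascal
program InitializeDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.TypInfo,
  Grijjy.ProtocolBuffers;

type
  TPhoneType = (Mobile, Home, Work);

  { Hypothetical older layout that has no PhoneType field yet }
  TOldPhoneNumber = record
    [Serialize(1)] Number: String;
  end;

  TPhoneNumber = record
  public
    [Serialize(1)] Number: String;
    [Serialize(2)] PhoneType: TPhoneType;
  public
    { Called by the deserializer after it has cleared the record }
    procedure Initialize;
  end;

procedure TPhoneNumber.Initialize;
begin
  // Default for this sketch, used when the stream has no field with
  // tag 2. Number is left as the empty string set by the clear.
  PhoneType := TPhoneType.Home;
end;

var
  OldPhone: TOldPhoneNumber;
  Phone: TPhoneNumber;
begin
  OldPhone.Number := '555-0100';
  TgoProtocolBuffer.Serialize(OldPhone, 'Phone.dat');

  // The file contains only tag 1, so PhoneType should keep its default.
  TgoProtocolBuffer.Deserialize(Phone, 'Phone.dat');
  WriteLn(Phone.Number, ' (',
    GetEnumName(TypeInfo(TPhoneType), Ord(Phone.PhoneType)), ')');
end.
```

Without the Initialize routine, PhoneType would simply end up as the enum value with ordinal 0 (Mobile) after the clear.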

Changing the Protocol

The protocol you define is not set in stone. If it no longer fits your needs, but you still need to be able to read data in an older format, then you don’t have to find some way to import older formats and convert them to a new format. You can just update your protocol definition (attributed records) without breaking existing code as long as you follow these few rules:

  • Don’t change the tag values for existing fields.
  • Don’t change the data types for existing fields.
  • You can always reorder fields.
  • You can always add new fields with new tag values.
  • You can also remove fields you don’t need anymore. Just don’t reuse its tag value for any new fields you add.
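The rules above could play out as in the following sketch, where newer code reads a file written with an older record layout. Both record types, the Nickname field and the file name are hypothetical. In a real project there would be a single TPerson that changes over time rather than two types side by side.

```pascal
program ProtocolEvolution;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  Grijjy.ProtocolBuffers;

type
  { Version 1 of a hypothetical protocol }
  TPersonV1 = record
    [Serialize(1)] Name: String;
    [Serialize(2)] Id: Integer;
    [Serialize(3)] Email: String;
  end;

  { Version 2: fields reordered, Email removed (tag 3 is retired and
    never reused), Nickname added under a tag that was not used before.
    Tags and types of the surviving fields are unchanged. }
  TPersonV2 = record
    [Serialize(2)] Id: Integer;
    [Serialize(1)] Name: String;
    [Serialize(4)] Nickname: String;
  end;

var
  Legacy: TPersonV1;
  Current: TPersonV2;
begin
  Legacy.Name := 'John Doe';
  Legacy.Id := 1234;
  Legacy.Email := 'jdoe@example.com';
  TgoProtocolBuffer.Serialize(Legacy, 'PersonV1.dat');

  // Newer code reads the older file. Tag 3 is unknown to TPersonV2 and is
  // expected to be skipped. Tag 4 is absent, so Nickname stays empty.
  TgoProtocolBuffer.Deserialize(Current, 'PersonV1.dat');
  WriteLn(Current.Name, ' / ', Current.Id, ' / "', Current.Nickname, '"');
end.
```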

Try it Yourself

Go give it a try. Head over to GitHub to pull our GrijjyFoundation repository. Among all the other goodies in there you will find the Grijjy.ProtocolBuffers unit. You can also check out the unit tests (in the UnitTests sub-directory) for more examples of how to use the engine. There is API documentation as well.

17 thoughts on “Binary Serialization with Google Protocol Buffers”

  1. Thanks for posting this excellent article and code!
    I’m testing this library with websockets and protobuf.js. What is odd is that any integer value is received as the Delphi value multiplied by 2. I see in TgoProtocolBuffer.TWriter.WriteVarInt that the passed value is shifted to the left by 1, which seems to be causing this. I didn’t check the protocol specification, but it seems to me that there is something wrong over there.


  2. Thanks for posting the code. I hope to use it in my project.
    Actually, I need to translate some C# code to Delphi.
    Unfortunately, I cannot find an example of direct list (de)serialization.

    In the legacy C# project it looks like:
    baseSerializer.SerializeWithLengthPrefix(new List(), filePath);

    So here the list is not wrapped in any class, and I should not wrap it in a Delphi record either.
    How can I work around this with your library? Do you have an example?

    Thanks a lot in advance!


    1. This library only supports (de)serialization of records. The record can contain a dynamic array though. You can create the dynamic array from a list:

      type
        TRec = record
          [Serialize(1)] Items: TArray<String>;
        end;

      var
        Rec: TRec;
        List: TList<String>;
        Data: TBytes;
      begin
        // ...create and fill List here
        Rec.Items := List.ToArray;
        Data := TgoProtocolBuffer.Serialize(Rec);
      end;
      


      1. Thanks a lot for your help.

        One point is not clear to me from your description: should tag numbering always start with 1?

        I ask because my code below does not seem to work. It always returns empty records.

          TSpatialRecord = record
          public
            [Serialize(83)] Geometry: TShape;
            [Serialize(84)] DataTime: Integer; //TimeStamp
            [Serialize(85)] MeterValue: TRepresentationValue;
            [Serialize(86)] AppliedLatency: Integer;
          end;
        
          TSpatialRecords = record
           [Serialize(1)] Records: TArray<TSpatialRecord>;
          end;
        
        procedure TForm1.Button1Click(Sender: TObject);
        var
          records: TSpatialRecords;
        begin
          TgoProtocolBuffer.Deserialize(records, 'C:\export\adm\documents\OperationData-287.adm');
        end;
        


  3. Tags don’t have to start with 1 (although the output will be a bit smaller for tag values <= 15).

    I don't see any problems with your code. I assume that TShape and TRepresentationValue are enumerated types or serializable records. I also assume that the file you are loading from has been created by TgoProtocolBuffer as well.

    Is it possible to send me a small test project that shows the issue, so I can take a look at it? You can send it to erik-at-grijjy-dot-com.


  4. Hi!

    This is a good implementation of protobuf.
    But does your library have a feature to generate records from .proto files?

    Thank you.


    1. Hi Roman!
      Glad you like the implementation. We currently have no way to generate the records from .proto files. Since we control the protocol at both sides, it is easier to create the Delphi records manually than to create a .proto file. But importing .proto files can be useful when integrating with 3rd party protocols…


  5. How compatible is that to everything else?

    Can it parse every file created by Google’s own proto2/proto3 serializer? And in reverse, write files that can be opened elsewhere?

    Does it work on ARM? Big-endian CPUs?

    Does it work with Free Pascal? There is a comment in the source:

    >{ Registers a record for serialization in Protocol Buffer format. This is
    > only needed for use with Free Pascal.

    But Free Pascal does not know the system units with dots like System.Classes

    >Enumerated types, @bold(as long as) the type does @bold(not) contain any explicitly assigned values (Delphi does not provide RTTI for these)

    That is unfortunate. But it would not need RTTI only the size. Perhaps you can build your own RTTI with sizeof?


    1. Our implementation is compatible with proto2 and can read files created by others that conform to the proto2 spec (not all files follow the spec). The other way around should work too. We don’t have specific proto3 support, but as far as I can remember, the binary format is the same.

      It does not work with Free Pascal. One of my earlier implementations did, but since we exclusively use Delphi, I didn’t want to maintain FPC compatibility. And since Delphi only supports Little Endian architectures, there is no support for Big Endian as well.

      Of course, you can always fork the repo and add your own FPC/Big Endian support😊


  6. Thanks a lot for sharing this, very friendly and clean library with fast start.
    Much less pain using this for most of goals rather than alternatives.

    Some quick features which are desirable:
    – uploading into record without cleaning it, ie. make zeroing optional
    when a record have support non-serializable fields, they get zeroed too
    – virtualizing all routines for cleaner extensibility (it’s always a pain to fork lib instead of overriding)

    And one heavier:
    – optional skipping of default values (either via attributes or via Initialize procedure)

    And other naive and arrogant:
    – if this is possible, to introduce support of datetime and single in terms of official utility (https://github.com/protocolbuffers/protobuf/releases), i.e. when you decrypt binary message with protoc.exe — decode_raw, they were properly shown as dates


      Thanks for these suggestions. Could you add these as feature requests to the GitHub repo? That makes it easier to track them.
      Regarding the last request: that would probably require an upgrade to version 3 of the specification. We currently only support version 2.

