A Primer on Protocol Buffers

If you do a lot of work on the web, you’ll find yourself working with serialization protocols like JSON and XML. I’d like to present another serialization protocol that I’ve found to be quite useful: Protocol Buffers. Even if JSON and XML are working just fine for you, there are still some reasons to consider adopting Protocol Buffers for your next project. Besides being the lingua franca for data exchange within Google, Protocol Buffers offer:

  • Compact binary serialization
  • 20-100 times faster serialization than XML
  • 3-10 times more compact output than XML
  • Messages defined with a formal grammar
  • Strict typing for safe data exchange

Other binary serialization protocols such as Thrift share many of these properties. I’ve favored protobuf over something like Thrift because I’ve found more complete and better-maintained implementations in the languages I work with (namely Go and Ruby).

Defining a Message

One of the benefits of Protocol Buffers over JSON is that the messaging contract must be formally specified in a .proto file. An example .proto file might look something like this:

package messages;

message User {
  required string first_name = 1;
  required string last_name = 2;
  optional int64 post_count = 3;
}

By using a grammar-defined schema, we already realize many benefits over something like JSON: without writing any API documentation, we know the field names, which fields are required or optional, and what data types to expect when interacting with User messages.

Once we have our .proto file, we can compile it into our target language. This will give us programmatically generated objects that we can serialize and deserialize within our code. In this example, I’m using Ruby as my target language.

$ rprotoc messages.proto
./messages.pb.rb writing...

This example compiles a Ruby class that I can then require directly:

1.9.3-p194 :001 > require './messages.pb'
 => true 
1.9.3-p194 :002 > joe = Messages::User.new
1.9.3-p194 :003 > joe.first_name = "joe"
 => "joe" 
1.9.3-p194 :004 > joe.last_name = "developer"
 => "developer" 
1.9.3-p194 :005 > joe.post_count = 3
 => 3 
1.9.3-p194 :006 > serialized = joe.serialize_to_string
 => "\n\x03joe\x12\tdeveloper\x18\x03"
1.9.3-p194 :007 > unserialized = Messages::User.new
 => first_name: nil
last_name: nil

1.9.3-p194 :008 > unserialized.parse_from_string(serialized)
 => first_name: "joe"
last_name: "developer"
post_count: 3

What happens when we try to use a type that has missing required fields?

1.9.3-p194 :021 > user = Messages::User.new
 => first_name: nil
last_name: nil
 
1.9.3-p194 :022 > user.serialize_to_string
Protobuf::NotInitializedError: Protobuf::NotInitializedError
...

As you can see, Protocol Buffers refuses to serialize or deserialize an object that is missing its required fields.

Assuming we provide all the required fields for our message type, we get bytes that can be sent across the wire to another consumer. How does another party know how to deserialize the bytes? Just share the .proto schema and recompile it in the consumer’s target language. No bloated WSDLs or silly JSON schemas anywhere in sight.

Here’s an example of a Ruby client that’s consuming a User type from a Go-based HTTP service. Both consumer and producer share the same .proto schema for data interchange. Once again, here’s our User message, this time with a self-referential User type included as a friends list:

message User {
  required string first_name = 1;
  required string last_name = 2;
  optional int64 post_count = 3;

  repeated User friends = 4;
}

The first step is compiling our .proto schema to a native Go object:

$ protoc --go_out=. messages.proto

Next, we import the generated package and serve the message via HTTP:

package main

import (
	"code.google.com/p/goprotobuf/proto"
	"log"
	"./messages"
	"net/http"
)

// userShow builds a canned User message and writes its protobuf
// encoding as the HTTP response body.
func userShow(w http.ResponseWriter, r *http.Request) {
	bob := &messages.User{
		FirstName: proto.String("Bob"),
		LastName:  proto.String("Dole"),
		PostCount: proto.Int64(10),
	}
	joe := &messages.User{
		FirstName: proto.String("Joe"),
		LastName:  proto.String("Developer"),
		PostCount: proto.Int64(4),
		Friends:   []*messages.User{bob},
	}

	// Serialize joe (and his embedded friends) to the binary wire format.
	b, err := proto.Marshal(joe)
	if err != nil {
		http.Error(w, "Failed to Marshal joe!", 500)
		return
	}
	w.Header().Set("Content-Type", "application/x-protobuf")
	w.Write(b)
}

func main() {
	http.HandleFunc("/users/joe", userShow)

	log.Printf("Listening for requests...")
	http.ListenAndServe(":5555", nil)
}

A silly example, but we’re serving up a user object, “Joe Developer”, every time someone hits the /users/joe route. Notice that we also defined a repeated field of our own User type, “friends”. This lets us embed 0..N other users that get serialized along with the HTTP response.
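
Before moving on to the consumer, here’s a minimal sketch (not part of the service above) of what that nesting looks like on the Go side: it round-trips joe through proto.Marshal and proto.Unmarshal, then walks the repeated Friends field using the getters that goprotobuf generates.

package main

import (
	"log"

	"code.google.com/p/goprotobuf/proto"

	"./messages"
)

func main() {
	bob := &messages.User{
		FirstName: proto.String("Bob"),
		LastName:  proto.String("Dole"),
	}
	joe := &messages.User{
		FirstName: proto.String("Joe"),
		LastName:  proto.String("Developer"),
		Friends:   []*messages.User{bob},
	}

	// Serialize joe (friends included) to the binary wire format...
	data, err := proto.Marshal(joe)
	if err != nil {
		log.Fatal(err)
	}

	// ...then parse the bytes back into a fresh User.
	decoded := &messages.User{}
	if err := proto.Unmarshal(data, decoded); err != nil {
		log.Fatal(err)
	}

	// Generated getters return the field value, or a zero value if unset.
	for _, friend := range decoded.Friends {
		log.Printf("friend: %s %s", friend.GetFirstName(), friend.GetLastName())
	}
}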

Now for the consumer. First, let’s share and compile the same .proto file into a Ruby class:

$ rprotoc messages.proto
./messages.pb.rb writing...

Now we can query the server for Joe Developer and his friends.

require 'rubygems'
require 'net/http'
require './messages/messages.pb'

# Fetch the raw protobuf bytes from the Go service...
uri = URI.parse("http://localhost:5555/users/joe")
response = Net::HTTP.get_response(uri)

# ...and parse them back into a User message.
user = Messages::User.new
user.parse_from_string(response.body)

puts "Got user: #{user.inspect}"

Pretty simple eh?

When should you not use Protocol Buffers?

No technology is without its downsides, and Protocol Buffers is no exception. There are a few reasons why you might not want to use Protocol Buffers and might prefer something like JSON instead:

You need dynamic message schemas

Perhaps you’re crafting a service that works with arbitrary, opaque JSON payloads. In that case, something like Protocol Buffers or Thrift may be the wrong choice.

You’re sending a lot of data directly to the browser

Although possible, it’s not preferable to use something like protobuf directly in the browser. JSON originated in JavaScript for a reason.

You need widespread language support

Protobuf has more limited language reach than JSON or XML. Officially, Google only provides compilers for C++, Java, and Python. There is support for many other languages via third-party add-ons, but your mileage may vary between implementations. Compare this to JSON, which has nearly ubiquitous language support.

You require direct human readability

Although this should rarely be a strict requirement, some might require over-the-wire messages to be human readable. Remember that Protocol Buffers is a binary protocol, and its messages will be hard to read without the help of a computer. There are cases where you should optimize for humans, and others where you should optimize for machines; it’s up to you to decide the trade-offs given your problem space. If you know you’re going to be moving large volumes of data across the network, or putting lots of data on disk, protobuf might be the better choice. If you have low volumes of simple messages, something like JSON might be a better fit.

Other Benefits

Besides passing serialized messages, Protocol Buffers can be great for building backend RPC services. Building an RPC service would make great fodder for a future blog post. Protocol Buffers are also quite useful for data persistence: choosing them as your storage format will give you significant space savings over persisting raw JSON or XML.
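
As a quick illustration of the persistence case, here’s a hedged sketch that writes a marshaled User to disk and reads it back. It reuses the goprotobuf package and generated messages package from the example above; the user.db filename is just for illustration.

package main

import (
	"io/ioutil"
	"log"

	"code.google.com/p/goprotobuf/proto"

	"./messages"
)

func main() {
	user := &messages.User{
		FirstName: proto.String("Joe"),
		LastName:  proto.String("Developer"),
		PostCount: proto.Int64(4),
	}

	// Marshal to the compact binary format and persist the bytes to disk.
	data, err := proto.Marshal(user)
	if err != nil {
		log.Fatal(err)
	}
	if err := ioutil.WriteFile("user.db", data, 0644); err != nil {
		log.Fatal(err)
	}

	// Later: read the bytes back and parse them into a User again.
	raw, err := ioutil.ReadFile("user.db")
	if err != nil {
		log.Fatal(err)
	}
	stored := &messages.User{}
	if err := proto.Unmarshal(raw, stored); err != nil {
		log.Fatal(err)
	}

	log.Printf("restored: %s %s", stored.GetFirstName(), stored.GetLastName())
}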

By now you’ve seen that Protocol Buffers can be a terrific choice of serialization protocol. Computers are faster and better equipped to store and process structured binary data than string-based formats. Additionally, with a contract-based grammar and features such as RPC, exchanging large volumes of data between services becomes simpler than with loosely defined JSON APIs. If you’re anxious to give Protocol Buffers a try, check out the Google documentation. For Ruby and Go, I recommend using ruby-protobuf and goprotobuf respectively. You can check out the full example source code from this blog post on GitHub.


about the author

Blake Smith is a Principal Software Engineer at Sprout Social.