1. Introduction
In this tutorial, we’ll explore the Apache Avro data serialization/deserialization framework. We’ll also learn how to define a schema with default values, which are applied when we initialize and serialize objects.
2. What Is Avro?
Apache Avro is a data serialization framework that pairs a compact binary data format with schemas written in JSON. The most popular use cases for Avro involve Apache Kafka, Hive or Impala. Avro comes in handy for handling large volumes of data in real time (write-intensive, big data operations).
Let’s think of Avro data as always being described by a schema, and of that schema as a JSON document.
The advantages of Avro are:
- data is encoded in a compact binary format, and container files can additionally be compressed
- data is fully typed (we’ll see later how we declare the type of each property)
- schema accompanies the data
- documentation is embedded in the schema
- because the schema is plain JSON, data can be read from any language that has an Avro library
- safe schema evolution
3. Avro Setup
First, let’s add the appropriate Avro Maven dependency:
<dependencies>
    <dependency>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro</artifactId>
        <version>1.11.3</version>
    </dependency>
</dependencies>
Next, we’ll configure the avro-maven-plugin, which helps us with code generation:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro-maven-plugin</artifactId>
            <version>1.11.3</version>
            <configuration>
                <sourceDirectory>${project.basedir}/src/main/java/com/baeldung/avro/</sourceDirectory>
                <outputDirectory>${project.basedir}/src/main/java/com/baeldung/avro/</outputDirectory>
                <stringType>String</stringType>
            </configuration>
            <executions>
                <execution>
                    <phase>generate-sources</phase>
                    <goals>
                        <goal>schema</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Now let’s define an example schema, which Avro uses to generate the example class. The schema is a JSON-formatted object definition stored in a text file. We must ensure the file has the .avsc extension and sits in the sourceDirectory configured for the plugin. In our example, we’ll name this file car.avsc.
Here’s what the initial schema looks like:
{
    "namespace": "generated.avro",
    "type": "record",
    "name": "Car",
    "fields": [
        {
            "name": "brand",
            "type": "string"
        },
        {
            "name": "number_of_doors",
            "type": "int"
        },
        {
            "name": "color",
            "type": "string"
        }
    ]
}
Let’s take a look at the schema in a bit more detail. The namespace qualifies the record name and becomes the Java package of the generated class. A record is Avro’s complex type for a collection of named fields; the plugin generates one Java class per record, which saves us from writing that boilerplate by hand. Overall, Avro supports six kinds of complex types: record, enum, array, map, union and fixed.
In our example, type is record, name is the name of the generated class, and fields lists its attributes and their types. The field definitions are also where we handle default values.
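To make these schema properties concrete, here is a minimal sketch that parses the initial schema with Avro's Schema.Parser and prints what it finds. The class name CarSchemaInspector and the inlined JSON string are assumptions made for illustration; in a real project the schema would be read from car.avsc.

```java
import org.apache.avro.Schema;

public class CarSchemaInspector {

    // Same content as the initial car.avsc, inlined so the sketch is self-contained.
    static final String CAR_SCHEMA_JSON = "{"
        + "\"namespace\": \"generated.avro\","
        + "\"type\": \"record\","
        + "\"name\": \"Car\","
        + "\"fields\": ["
        + "  {\"name\": \"brand\", \"type\": \"string\"},"
        + "  {\"name\": \"number_of_doors\", \"type\": \"int\"},"
        + "  {\"name\": \"color\", \"type\": \"string\"}"
        + "]}";

    static Schema parse() {
        return new Schema.Parser().parse(CAR_SCHEMA_JSON);
    }

    public static void main(String[] args) {
        Schema schema = parse();
        // The full name is the namespace followed by the record name.
        System.out.println(schema.getFullName());
        for (Schema.Field field : schema.getFields()) {
            // None of the fields in the initial schema declares a default.
            System.out.println(field.name() + " : " + field.schema().getType()
                + ", has default: " + field.hasDefaultValue());
        }
    }
}
```

Running main lists the three fields with their types, which is a quick way to check a schema before generating code from it.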
4. Avro Default Values
An important aspect of Avro is that a field can be made optional by declaring its type as a union with null and giving it a default of null, or it can be assigned a particular default value that applies when the field hasn’t been set. So, we either have an optional field that defaults to null, or a field that is initialized with the default value we specify in the schema.
Now, let’s look at the new schema that configures the default values:
{
    "namespace": "generated.avro",
    "type": "record",
    "name": "Car",
    "fields": [
        {
            "name": "brand",
            "type": "string",
            "default": "Dacia"
        },
        {
            "name": "number_of_doors",
            "type": "int",
            "default": 4
        },
        {
            "name": "color",
            "type": ["null", "string"],
            "default": null
        }
    ]
}
We see that the attributes use two primitive types, string and int, and that color is now a union of null and string. We also notice that each attribute has a default property next to its type. This allows the attribute to be left unset, in which case it takes the specified value.
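The effect of these defaults can also be observed without generated classes. The following sketch uses Avro's generic API (GenericRecordBuilder) on the same schema; the class name CarDefaultsDemo and the inlined schema string are assumptions made for illustration.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class CarDefaultsDemo {

    // The schema with default values, inlined so the sketch is self-contained.
    static final String CAR_SCHEMA_JSON = "{"
        + "\"namespace\": \"generated.avro\", \"type\": \"record\", \"name\": \"Car\","
        + "\"fields\": ["
        + "  {\"name\": \"brand\", \"type\": \"string\", \"default\": \"Dacia\"},"
        + "  {\"name\": \"number_of_doors\", \"type\": \"int\", \"default\": 4},"
        + "  {\"name\": \"color\", \"type\": [\"null\", \"string\"], \"default\": null}"
        + "]}";

    static GenericRecord buildWithDefaults() {
        Schema schema = new Schema.Parser().parse(CAR_SCHEMA_JSON);
        // No set(...) calls: build() fills every unset field from its schema default.
        return new GenericRecordBuilder(schema).build();
    }

    public static void main(String[] args) {
        GenericRecord car = buildWithDefaults();
        // With the generic API, string values come back as a CharSequence (Avro's Utf8),
        // so we convert explicitly before comparing or printing.
        System.out.println(car.get("brand").toString());
        System.out.println(car.get("number_of_doors"));
        System.out.println(car.get("color"));
    }
}
```

If a field had no default and was left unset, build() would fail with an exception instead of silently producing an incomplete record.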
In order for the default values to be used when we initialize the object, we must use the newBuilder() method of the Avro-generated class. As we can see in the test below, we use the builder design pattern, and through it we only need to set the attributes that have no default.
Let’s also look at the test:
@Test
public void givenCarJsonSchema_whenCarIsSerialized_thenCarIsSuccessfullyDeserialized() throws IOException {
    Car car = Car.newBuilder()
      .build();
    SerializationDeserializationLogic.serializeCar(car);
    Car deserializedCar = SerializationDeserializationLogic.deserializeCar();
    assertEquals("Dacia", deserializedCar.getBrand());
    assertEquals(4, deserializedCar.getNumberOfDoors());
    assertNull(deserializedCar.getColor());
}
We’ve instantiated a new car object without setting any attribute, which is possible because every field in the schema now has a default. Checking the attributes, we see that brand is initialized to Dacia, number_of_doors to 4 (both were assigned the default values from the schema) and color defaulted to null.
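The SerializationDeserializationLogic helper used in the test isn't shown in this article. As a rough idea of what such a round trip involves, here is a hypothetical sketch built on Avro's generic API and container files; the class name CarRoundTripSketch, its method names and the temporary file are all assumptions, not the article's actual helper.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class CarRoundTripSketch {

    static final Schema SCHEMA = new Schema.Parser().parse("{"
        + "\"namespace\": \"generated.avro\", \"type\": \"record\", \"name\": \"Car\","
        + "\"fields\": ["
        + "  {\"name\": \"brand\", \"type\": \"string\", \"default\": \"Dacia\"},"
        + "  {\"name\": \"number_of_doors\", \"type\": \"int\", \"default\": 4},"
        + "  {\"name\": \"color\", \"type\": [\"null\", \"string\"], \"default\": null}"
        + "]}");

    // Writes one record to an Avro container file; the schema is stored in the file header.
    static void serialize(GenericRecord car, File file) throws IOException {
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(SCHEMA))) {
            writer.create(SCHEMA, file);
            writer.append(car);
        }
    }

    // Reads the first record back, using the schema embedded in the file.
    static GenericRecord deserialize(File file) throws IOException {
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            return reader.next();
        }
    }

    // Builds a car from schema defaults only, then writes and re-reads it.
    static GenericRecord roundTrip() {
        try {
            File file = File.createTempFile("car", ".avro");
            file.deleteOnExit();
            GenericRecord car = new GenericRecordBuilder(SCHEMA).build();
            serialize(car, file);
            return deserialize(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The defaults are applied when the builder creates the object, so the serialized file contains the actual values Dacia, 4 and null rather than a reference to the schema defaults.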
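Furthermore, declaring a field as a union with null and giving it a default of null makes it optional, whatever its underlying type. Therefore, even if the field is an int, its value will be null when we don’t set it. Note that null is listed first in the union, since the Avro 1.11 specification requires the default to match the first type of the union. This can be useful when we want to detect that the field hasn’t been set: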
{
    "name": "number_of_wheels",
    "type": ["null", "int"],
    "default": null
}
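As a small illustration of how such an optional field behaves, the sketch below builds one record that leaves number_of_wheels unset and one that sets it. It uses Avro's generic API; the class name OptionalFieldDemo and the reduced two-field schema are assumptions made to keep the example short.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;

public class OptionalFieldDemo {

    // A reduced Car schema containing the optional number_of_wheels field.
    static final Schema SCHEMA = new Schema.Parser().parse("{"
        + "\"namespace\": \"generated.avro\", \"type\": \"record\", \"name\": \"Car\","
        + "\"fields\": ["
        + "  {\"name\": \"brand\", \"type\": \"string\", \"default\": \"Dacia\"},"
        + "  {\"name\": \"number_of_wheels\", \"type\": [\"null\", \"int\"], \"default\": null}"
        + "]}");

    // Leaves number_of_wheels unset, so it takes the null default.
    static GenericRecord unset() {
        return new GenericRecordBuilder(SCHEMA).build();
    }

    // Sets number_of_wheels explicitly; the int branch of the union is used.
    static GenericRecord withWheels(int wheels) {
        return new GenericRecordBuilder(SCHEMA)
            .set("number_of_wheels", wheels)
            .build();
    }

    public static void main(String[] args) {
        // A null value tells us the field was never given a number.
        System.out.println(unset().get("number_of_wheels") == null);
        System.out.println(withWheels(4).get("number_of_wheels"));
    }
}
```

A null check on the field is then enough to distinguish a car whose wheel count was never provided from one where it was.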
5. Conclusion
Avro was created to address the need for efficient serialization in the context of big data processing.
In this article, we’ve taken a look at Apache’s data serialization/deserialization framework, Avro. In addition, we’ve gone over its advantages and setup. Most importantly, we’ve learned how to configure the schema with default values.
As always, the code is available over on GitHub.