1. Introduction
In this tutorial, we’ll explore how to extract the schema from an Apache Avro file in Java. Furthermore, we’ll cover how to read data from Avro files. This is a common requirement in big data processing systems.
Apache Avro is a data serialization framework that provides a compact, fast binary data format. As such, it’s popular in the big data ecosystem, particularly with Apache Hadoop. Therefore, understanding how to work with Avro files is crucial for tasks involving data processing.
2. Maven Dependencies
To get Avro up and running in Java, we need to add the Avro core library to our Maven project:
<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.12.0</version>
</dependency>
For testing purposes, we’ll use JUnit Jupiter. If we’re already using the Spring Boot Starter Test dependency, we don’t need to add JUnit separately, as the starter brings it in transitively. As a side note, it also brings in the Mockito framework.
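In that case, a typical declaration looks like this (the version is usually managed by the Spring Boot parent, so we can omit it here):

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-test</artifactId>
    <scope>test</scope>
</dependency>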
If we’re adding JUnit directly instead, let’s use the latest available version:
<dependency>
    <groupId>org.junit.jupiter</groupId>
    <artifactId>junit-jupiter-api</artifactId>
    <version>5.11.2</version>
    <scope>test</scope>
</dependency>
Whenever we start a new project, it’s good to make sure we’re using the latest stable versions of the respective dependencies.
3. Understanding and Extracting Avro Schema
Before we dive into the code for extracting schemas, let’s briefly recap the structure of an Avro file:
- File header – contains metadata about the file, including the schema.
- Data blocks – the actual serialized data.
- Sync markers – 16-byte markers, defined in the header, that separate the data blocks and make the file splittable.
The schema of an Avro file describes the structure of the data inside it. It’s stored in JSON format and includes information about the fields: their names and their data types.
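For example, here’s a simple schema describing a User record with two fields, the same one we’ll use in our tests later:

{
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}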
Now, let’s write a method to extract the schema from an Avro file:
public static Schema extractSchema(String avroFilePath) throws IOException {
    File avroFile = new File(avroFilePath);
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
    try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(avroFile, datumReader)) {
        return dataFileReader.getSchema();
    }
}
First, we create a File object representing the Avro file. Next, we instantiate a GenericDatumReader without specifying a schema, which allows it to read any Avro file.
Next, we create a DataFileReader using the Avro file and the GenericDatumReader as arguments.
We use the getSchema() method of DataFileReader to extract the schema. The DataFileReader is wrapped in a try-with-resources block to ensure proper resource management.
This approach allows us to extract the schema without knowing its structure beforehand, which makes it a versatile option for working with arbitrary Avro files.
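As a quick sanity check, we can print the extracted schema as formatted JSON; note that the file path below is just a placeholder:

Schema schema = AvroSchemaExtractor.extractSchema("/path/to/users.avro");
// toString(true) pretty-prints the schema's JSON representation
System.out.println(schema.toString(true));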
4. Reading Data from Avro File
Once we have obtained the schema, we can read the data from the Avro file.
Let’s write a reading method:
public static List<GenericRecord> readAvroData(String avroFilePath) throws IOException {
    File avroFile = new File(avroFilePath);
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<>();
    List<GenericRecord> records = new ArrayList<>();
    try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(avroFile, datumReader)) {
        GenericRecord record = null;
        while (dataFileReader.hasNext()) {
            record = dataFileReader.next(record);
            records.add(record);
        }
    }
    return records;
}
First, we create a File from the avroFilePath. Next, we create a GenericDatumReader object to read the Avro data. Since we don’t specify a schema, it can read any Avro file without knowing the schema in advance.
Then, we create a DataFileReader, which is the main tool we’ll use to extract information from the Avro file. Finally, we iterate through the file using the hasNext() and next() methods and add the records to the list.
In addition, it’s good to note that we’re reusing the GenericRecord object in the next() method call. This is an optimization that helps reduce object creation and garbage collection overhead.
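For example, assuming a file written with the User schema from earlier, we can read it back and access the fields by name (the path is again a placeholder):

List<GenericRecord> users = AvroSchemaExtractor.readAvroData("/path/to/users.avro");
for (GenericRecord user : users) {
    // fields are accessed by the names declared in the schema
    System.out.println(user.get("name") + " is " + user.get("age"));
}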
5. Testing
To make sure our code works correctly, let’s write some unit tests. To start with our setup, let’s create a temporary directory. With the @TempDir annotation, JUnit automatically creates the directory before the tests run and deletes it afterward, so we can create temporary files during tests without worrying about cleanup:
@TempDir
Path tempDir;
private File avroFile;
private Schema schema;
Next, we’re going to set up some things before each test:
@BeforeEach
void setUp() throws IOException {
    schema = new Schema.Parser().parse("""
        {
            "type": "record",
            "name": "User",
            "fields": [
                {"name": "name", "type": "string"},
                {"name": "age", "type": "int"}
            ]
        }
        """);

    avroFile = tempDir.resolve("test.avro").toFile();
    GenericRecord user1 = new GenericData.Record(schema);
    user1.put("name", "John Doe");
    user1.put("age", 30);

    try (DataFileWriter<GenericRecord> dataFileWriter =
      new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
        dataFileWriter.create(schema, avroFile);
        dataFileWriter.append(user1);
    }
}
Finally, let’s test our functionality:
@Test
void whenSchemaIsExistent_thenItIsExtractedCorrectly() throws IOException {
    Schema extractedSchema = AvroSchemaExtractor.extractSchema(avroFile.getPath());

    assertEquals(schema, extractedSchema);
}

@Test
void whenAvroFileHasContent_thenItIsReadCorrectly() throws IOException {
    List<GenericRecord> records = AvroSchemaExtractor.readAvroData(avroFile.getPath());

    assertEquals("John Doe", records.get(0).get("name").toString());
}
These tests create a temporary Avro file with a sample schema and data. Then, they verify that our methods correctly extract the schema and read the data.
6. Conclusion
In this article, we’ve explored how to extract the schema from an Avro file and read its data using Java. In addition, we’ve demonstrated how to use GenericDatumReader and DataFileReader to handle Avro files without prior knowledge of the schema.
Furthermore, these techniques are crucial for working with Avro in various Java applications, such as data analytics or big data processing. By applying these methods, we can manage Avro files in a flexible way.
Finally, we should remember to correctly handle exceptions and manage resources properly in our projects. This way, we’ll be able to work with serialized data in an efficient way, especially in Avro-centric ecosystems.
As always, the code is available over on GitHub.