Quantcast
Channel: Baeldung
Viewing all articles
Browse latest Browse all 3550

UTF-8 Validation in Java

$
0
0

1. Overview

In data transmission, we often need to handle byte data. If the data is an encoded string instead of a binary, we often encode it in Unicode. Unicode Transformation Format-8 (UTF-8) is a variable-length encoding that can encode all possible Unicode characters.

In this tutorial, we’ll explore the conversion between UTF-8 encoded bytes and string. After that, we’ll dive into the crucial aspects of conducting UTF-8 validation on byte data in Java.

2. UTF-8 Conversion

Before we jump into the validation sections, let’s review how to convert a string into a UTF-8 encoded byte array and vice versa.

We can simply call the getBytes() method with the target encoding of a string to convert a string into a byte array:

String UTF8_STRING = "Hello 你好";
byte[] UTF8_BYTES = UTF8_STRING.getBytes(StandardCharsets.UTF_8);

For the reverse, the String class provides a constructor to create a String instance by a byte array and its source encoding:

String decodedStr = new String(array, StandardCharsets.UTF_8);

The constructor we used doesn’t have much control over the decoding process. Whenever the byte array contains unmappable character sequences, it replaces those characters with the default replacement character �:

@Test
void whenDecodeInvalidBytes_thenReturnReplacementChars() {
    byte[] invalidUtf8Bytes = {(byte) 0xF0, (byte) 0xC1, (byte) 0x8C, (byte) 0xBC, (byte) 0xD1};
    String decodedStr = invalidUtf8Bytes.getBytes(StandardCharsets.UTF_8);
    assertEquals("�����", decodedStr);
}

Therefore, we cannot use this method to validate whether a byte array is encoded in UTF-8.

3. Byte Array Validation

Java provides a simple way to validate whether a byte array is UTF-8 encoded using CharsetDecoder:

CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
CharBuffer decodedCharBuffer = charsetDecoder.decode(java.nio.ByteBuffer.wrap(UTF8_BYTES));

If the decoding process succeeds, we consider those bytes as valid UTF-8. Otherwise, the decode() method throws MalformedInputException:

@Test
void whenDecodeInvalidUTF8Bytes_thenThrowsMalformedInputException() {
    CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
    assertThrows(MalformedInputException.class,() -> {
        charsetDecoder.decode(java.nio.ByteBuffer.wrap(INVALID_UTF8_BYTES));
    });
}

4. Byte Stream Validation

When our source data is a byte stream rather than a byte array, we can read the InputStream and put its content into a byte array. Subsequently, we can apply the encoding validation on the byte array.

However, our preference is to directly validate the InputStream. This avoids creating an extra byte array and reduces the memory footprint in our application. It’s particularly important when we process a large stream.

In this section, we’ll define the following constant as our source UTF-8 encoded InputStream:

InputStream UTF8_INPUTSTREAM = new ByteArrayInputStream(UTF8_BYTES);

4.1. Validation Using Apache Tika

Apache Tika is an open-source content analysis library that provides a set of classes for detecting and extracting text content from different file formats.

We need to include the following Apache Tika core and standard parser dependencies in pom.xml:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.9.1</version>
</dependency>

When we conduct a UTF-8 validation in Apache Tika, we instantiate a UniversalEncodingDetector and use it to detect the encoding of the InputStream. The detector returns the encoding as a Charset instance. We simply verify whether the Charset instance is a UTF-8 one:

@Test
void whenDetectEncoding_thenReturnsUtf8() {
    EncodingDetector encodingDetector = new UniversalEncodingDetector();
    Charset detectedCharset = encodingDetector.detect(UTF8_INPUTSTREAM, new Metadata());
    assertEquals(StandardCharsets.UTF_8, detectedCharset);
}

It’s worth noting that when we detect a stream that contains only the first 128 characters in the ASCII code, the detect() method returns ISO-8859-1 instead of UTF-8.

ISO-8859-1 is a single-byte encoding to represent ASCII characters, which are the same as the first 128 Unicode characters. Due to this characteristic, we still consider the data to be UTF-8 encoded if the method returns ISO-8859-1.

4.2. Validation Using ICU4J

ICU4J stands for International Components for Unicode for Java and is a Java library published by IBM. It provides Unicode and globalization support for software applications. We need the following ICU4J dependency in our pom.xml:

<dependency>
    <groupId>com.ibm.icu</groupId>
    <artifactId>icu4j</artifactId>
    <version>74.1</version>
</dependency>

In ICU4J, we create a CharsetDetector instance to detect the charset of the InputStream. Similar to the validation using Apache Tika, we verify whether the charset is UTF-8 or not:

@Test
void whenDetectEncoding_thenReturnsUtf8() {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(UTF8_INPUTSTREAM);
    CharsetMatch charsetMatch = detector.detect();
    assertEquals(StandardCharsets.UTF_8.name(), charsetMatch.getName());
}

ICU4J exhibits the same behavior when it detects the encoding of the stream where the detection returns ISO-8859-1 when the data contains only the first 128 ASCII characters.

5. Conclusion

In this article, we’ve explored UTF-8 encoded bytes and string conversion and different types of UTF-8 validation based on byte and stream. This journey equips us with practical code to foster a deeper understanding of UTF-8 in Java applications.

As always, the sample code is available over on GitHub.

       

Viewing all articles
Browse latest Browse all 3550

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>