Channel: Baeldung

Using Apache POI to Extract Column Names From Excel

1. Introduction

Handling Excel files efficiently is crucial, whether reading data for processing or generating reports. Apache POI is a powerful library in Java that allows developers to manipulate and interact with Excel files programmatically.

In this tutorial, we’ll use Apache POI to read column names from an Excel sheet.

We’ll start with a quick overview of the POI API. Then we’ll set up the required dependencies and introduce a simple data example. Next, we’ll walk through the steps to extract column names from a sheet in both the older .xls format and the newer .xlsx format. Finally, we’ll write unit tests to verify that everything works as expected.

2. Dependencies and Example Setup

Let’s start by adding the required dependencies to our pom.xml, including poi-ooxml and commons-collections4:

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>4.1.2</version>
</dependency>
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-collections4</artifactId>
    <version>4.4</version>
</dependency>

Next, let’s prepare some sample data stored in two Excel files. The first one is the food_info.xlsx file, whose header row contains the columns Category, Name, Measure, and Calories.

The second one is the consumer_info.xls file, which holds consumer data in the older .xls format and has the columns Name, Age, and City.

Next, we’ll extract the column names from these files using the POI API.

3. Extracting Column Names From Excel

To read column names from an Excel sheet using Apache POI, we’ll create a method that performs the following steps:

  • Open the Excel file
  • Access the desired sheet
  • Read the header row (first row) to get the column names

3.1. Open Excel File

First, we need to open the Excel file and create a Workbook instance. POI supports both the .xlsx and .xls formats through two different implementations, namely XSSFWorkbook and HSSFWorkbook:

public static Workbook openWorkbook(String filePath) throws IOException {
    // Normalize once so the extension check is case-insensitive
    String lowerCasePath = filePath.toLowerCase();
    try (InputStream fileInputStream = new FileInputStream(filePath)) {
        if (lowerCasePath.endsWith(".xlsx")) {
            // OOXML format (Excel 2007 and later)
            return new XSSFWorkbook(fileInputStream);
        } else if (lowerCasePath.endsWith(".xls")) {
            // Binary OLE2 format (Excel 97-2003)
            return new HSSFWorkbook(fileInputStream);
        } else {
            throw new IllegalArgumentException("The specified file is not an Excel file");
        }
    } catch (OLE2NotOfficeXmlFileException | NotOLE2FileException e) {
        // The file extension and the actual file content disagree
        throw new IllegalArgumentException(
          "The file format is not supported. Ensure the file is a valid Excel file.", e);
    }
}

Essentially, we’re using the Workbook interface, which represents an Excel workbook and is the top-level object for handling Excel files in Apache POI. XSSFWorkbook is a class that implements Workbook for .xlsx files, while HSSFWorkbook implements it for .xls files.
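
The extension check in openWorkbook() trusts the file name, and its catch block covers the case where the content doesn’t match the extension. As a rough illustration of what that mismatch means at the byte level, here’s a minimal JDK-only sketch that inspects the leading bytes of a file: .xls files are OLE2 compound documents, and .xlsx files are ZIP containers. The ExcelContainerSniffer class, its Container enum, and the sniff() methods are hypothetical names introduced for this sketch and aren’t part of Apache POI:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Apache POI
final class ExcelContainerSniffer {

    enum Container { OLE2, ZIP, UNKNOWN }

    // OLE2 compound document signature (container of the binary .xls format)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature "PK\3\4" (container of .xlsx)
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    // Classifies a file by its leading bytes; input that is too short yields UNKNOWN
    static Container sniff(byte[] leadingBytes) {
        if (startsWith(leadingBytes, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(leadingBytes, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    // Reads up to eight bytes from the stream and classifies them.
    // The stream is consumed, so the caller must reopen it before parsing.
    static Container sniff(InputStream in) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int total = 0;
        while (total < header.length) {
            // read() may deliver fewer bytes than requested, so loop until EOF
            int n = in.read(header, total, header.length - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return sniff(Arrays.copyOf(header, total));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A matching signature only identifies the container, so a .doc or .docx file would pass this check as well, and the exceptions raised by the workbook constructors remain the real safety net. If we’d rather not branch on the format ourselves, POI’s WorkbookFactory.create() also selects the matching Workbook implementation for us.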

3.2. Access the Sheet

Now that we have a Workbook, let’s access the desired sheet within the workbook by sheet name:

public static Sheet getSheet(Workbook workbook, String sheetName) {
    return workbook.getSheet(sheetName);
}

The Sheet interface in POI API represents a sheet within an Excel workbook.

3.3. Read the Header Row

Using the Sheet object, we can access its data as desired.

Let’s use the API to read the header row, which holds the names of all the columns in the sheet. The Row interface represents a row in a sheet. Simply stated, we’ll access the first row of the sheet by passing the index 0 to the sheet.getRow() method. Then, we’ll use the Cell interface to extract each column name within the header row.

The Cell interface represents a cell in a row:

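public static List<String> getColumnNames(Sheet sheet) {
    // Row 0 is treated as the header row; getRow() returns null if the row doesn't exist
    Row headerRow = sheet.getRow(0);
    if (headerRow == null) {
        return Collections.emptyList();
    }
    // Row is Iterable<Cell>, so its spliterator can back a sequential stream
    return StreamSupport.stream(headerRow.spliterator(), false)
      .filter(cell -> cell.getCellType() != CellType.BLANK)
      // Assumes text headers: getStringCellValue() throws IllegalStateException for numeric cells
      .map(Cell::getStringCellValue)
      .filter(cellValue -> cellValue != null && !cellValue.trim()
        .isEmpty())
      .map(String::trim)
      .collect(Collectors.toList());
}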

Here, we use Java Streams to iterate over each Cell in the header row. We first filter out blank cells, then extract the string value of each remaining cell using the getStringCellValue() method from Cell. Next, we drop values that are null or contain only whitespace and trim the rest. Finally, we collect the cleaned string values into a list and return it.
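
To see in isolation what the filter and trim steps keep and drop, here’s a minimal sketch of the same cleanup applied to plain strings instead of POI cells, so it runs with the JDK alone. HeaderCleanupSketch and clean() are hypothetical names introduced for this illustration, and the input values in the usage note are made up:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the header pipeline; works on raw strings, not POI cells
final class HeaderCleanupSketch {

    static List<String> clean(List<String> rawHeaderValues) {
        return rawHeaderValues.stream()
          // Drop null values and values that contain only whitespace
          .filter(value -> value != null && !value.trim().isEmpty())
          // Strip leading and trailing whitespace from the survivors
          .map(String::trim)
          .collect(Collectors.toList());
    }
}
```

For instance, passing Arrays.asList("  Category ", "", null, "   ", "Name") to clean() leaves only Category and Name. Because empty entries are dropped, the position of a name in the returned list doesn’t necessarily match its column index in the sheet.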

At this point, it’s also worth touching upon a related method called getRichStringCellValue(), which retrieves the cell value as a RichTextString. This is useful when handling formatted text, such as text with different fonts, colours, or styles within the same cell. If our use case requires us not just to extract column names but also to preserve their formatting, we can map using Cell::getRichStringCellValue instead and store the result as a List<RichTextString>.

4. Unit Tests

Let’s now set up unit tests to see the POI API in action for both .xls and .xlsx files:

@Test
public void givenExcelFileWithXLSXFormat_whenGetColumnNames_thenReturnsColumnNames() throws IOException {
    // try-with-resources closes the workbook even if an assertion fails
    try (Workbook workbook = ExcelUtils.openWorkbook(XLSX_TEST_FILE_PATH)) {
        Sheet sheet = ExcelUtils.getSheet(workbook, SHEET_NAME);
        List<String> columnNames = ExcelUtils.getColumnNames(sheet);
        assertEquals(4, columnNames.size());
        assertTrue(columnNames.contains("Category"));
        assertTrue(columnNames.contains("Name"));
        assertTrue(columnNames.contains("Measure"));
        assertTrue(columnNames.contains("Calories"));
    }
}

@Test
public void givenExcelFileWithXLSFormat_whenGetColumnNames_thenReturnsColumnNames() throws IOException {
    try (Workbook workbook = ExcelUtils.openWorkbook(XLS_TEST_FILE_PATH)) {
        Sheet sheet = ExcelUtils.getSheet(workbook, SHEET_NAME);
        List<String> columnNames = ExcelUtils.getColumnNames(sheet);
        assertEquals(3, columnNames.size());
        assertTrue(columnNames.contains("Name"));
        assertTrue(columnNames.contains("Age"));
        assertTrue(columnNames.contains("City"));
    }
}

The tests verify that the API supports reading column names from both types of Excel files.

5. Conclusion

In this article, we explored how to use Apache POI to read column names from an Excel sheet. We started with an overview of Apache POI, followed by setting up the necessary dependencies. We then saw a step-by-step guide with code snippets to implement the solution and included unit tests to ensure correctness.

Apache POI is a robust library that simplifies the process of working with Excel files in Java, making it an invaluable tool for developers handling data interchange between applications and Excel.

As always, the full implementation of this article can be found over on GitHub.

       
