1. Overview
Data extraction is a common challenge when working with unstructured content. We can use a Large Language Model to address this challenge.
In this article, we’ll learn how to build integration pipelines using Apache Camel. We’ll integrate HTTP endpoints with the LLM using LangChain4j and use Quarkus as the framework to run all our components together.
We’ll also review how to create integration routes that use an LLM as one of the components to structure the data.
2. Introduction to the Components
Let’s review each component that will help us handle the integration pipeline.
2.1. Quarkus
Quarkus is a Kubernetes-native Java framework optimized for building and deploying cloud-native applications. We can use it to develop high-performance, lightweight applications that start quickly and consume minimal memory. We’ll use Quarkus as the framework to run our integration application.
2.2. LangChain4j
LangChain4j is a Java library for working with large language models in applications. We’ll use it to send prompts to the LLM to structure the content. Additionally, LangChain4j integrates well with Quarkus.
2.3. OpenAI
OpenAI is an AI research and development company focused on creating and advancing artificial intelligence technology. We can use OpenAI’s models, like GPT, to perform tasks such as language generation, data analysis, and conversational AI. We’ll use it to extract the data from unstructured content.
2.4. Apache Camel
Apache Camel is an integration framework that simplifies connecting different systems and applications. We can use it to build complex workflows by defining routes to move and transform data across various endpoints.
3. Integration of HTTP Source With Synchronous Response
Let’s build an integration application that will handle HTTP calls with unstructured content, extract data, and return a structured response.
3.1. Dependencies
We’ll start by adding the dependencies. First, we add the camel-quarkus-jsonpath dependency, which lets us extract values from JSON content in our integration pipeline:
<dependency>
    <groupId>org.apache.camel.quarkus</groupId>
    <artifactId>camel-quarkus-jsonpath</artifactId>
    <version>${camel-quarkus.version}</version>
</dependency>
Next, we add the camel-quarkus-langchain4j dependency to support LangChain4j handlers in our routes:
<dependency>
    <groupId>org.apache.camel.quarkus</groupId>
    <artifactId>camel-quarkus-langchain4j</artifactId>
    <version>${quarkus-camel-langchain4j.version}</version>
</dependency>
Finally, we add the camel-quarkus-platform-http dependency to support the HTTP endpoint as a data input for our routes:
<dependency>
    <groupId>org.apache.camel.quarkus</groupId>
    <artifactId>camel-quarkus-platform-http</artifactId>
    <version>${camel-quarkus.version}</version>
</dependency>
3.2. Structurizing Service
Now, let’s create a StructurizingService where we’ll add the prompting logic:
@RegisterAiService
@ApplicationScoped
public interface StructurizingService {

    // Prompt template; {text} and {dateFormat} are filled from the method parameters
    String EXTRACT_PROMPT = """
        Extract information about a patient from the text delimited by triple backticks: ```{text}```.
        The patientBirthday field should be formatted as {dateFormat}.
        The visitReason field should concisely describe the reason for the patient's visit.
        The expected fields are: patientName, patientBirthday, visitReason, allergies, medications.
        Return only a data structure without format name.
        """;

    @UserMessage(EXTRACT_PROMPT)
    @Handler
    String structurize(@JsonPath("$.content") String text, @Header("expectedDateFormat") String dateFormat);
}
We’ve added the structurize() method, which builds the chat model request using the EXTRACT_PROMPT text as a prompt template. The @JsonPath annotation extracts the unstructured text from the content field of the incoming JSON body and binds it to the text parameter, while @Header binds the expectedDateFormat message header to the dateFormat parameter. Both values are then added to the chat message through the matching placeholders in the prompt. We also marked the method as an Apache Camel @Handler, so we can use the bean in our route builders without specifying the method name.
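To make the placeholder substitution more concrete, here is a minimal plain-Java sketch of the effect it has on the prompt. It is only an illustration: the actual rendering is done by the library, and the PromptTemplateSketch class, its fill() helper, and the shortened template are hypothetical names introduced for this example.

```java
import java.util.Map;

class PromptTemplateSketch {

    // Shortened, hypothetical variant of the prompt with the same two placeholders
    static final String TEMPLATE =
        "Extract information about a patient from the text: {text}\n"
        + "The patientBirthday field should be formatted as {dateFormat}.";

    // Replaces every {name} placeholder with the value registered under that name.
    // Naive on purpose: values that themselves contain placeholders are not protected.
    public static String fill(String template, Map<String, String> variables) {
        String result = template;
        for (Map.Entry<String, String> variable : variables.entrySet()) {
            result = result.replace("{" + variable.getKey() + "}", variable.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String prompt = fill(TEMPLATE, Map.of(
            "text", "Patient: Hello, My name is Sara Connor.",
            "dateFormat", "YYYY-MM-DD"));
        System.out.println(prompt);
    }
}
```

The keys of the map play the role of the method parameter names text and dateFormat, which is why those names have to match the placeholders used in the prompt.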
3.3. Route Builder
We use routes to specify our integration pipelines. We can create the route using the XML configuration or Java DSL with RouteBuilder.
Let’s use RouteBuilder to configure our pipeline:
@ApplicationScoped
public class Routes extends RouteBuilder {

    @Inject
    StructurizingService structurizingService;

    @Override
    public void configure() {
        from("platform-http:/structurize?produces=application/json")
          .log("A document has been received by the camel-quarkus-http extension: ${body}")
          .setHeader("expectedDateFormat", constant("YYYY-MM-DD"))
          .bean(structurizingService)
          .transform()
          .body();
    }
}
In our route configuration, we used the HTTP endpoint as the data source. We set the expectedDateFormat header to a constant date format and passed the message to the StructurizingService bean. The value the bean returns becomes the message body, which the route sends back as the HTTP response.
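From a client's point of view, the route expects a POST request whose JSON body carries the unstructured text in a content field. The following is a rough sketch of such a request using the JDK's built-in HTTP client classes. The StructurizeRequestSketch class and its helpers are hypothetical, and the http://localhost:8080 address is an assumption based on the default Quarkus HTTP port:

```java
import java.net.URI;
import java.net.http.HttpRequest;

class StructurizeRequestSketch {

    // Assumption: the application runs locally on the default Quarkus HTTP port
    static final String ENDPOINT = "http://localhost:8080/structurize";

    // Minimal JSON string escaping, enough for backslashes, quotes, and line breaks
    public static String jsonEscape(String value) {
        return value.replace("\\", "\\\\")
            .replace("\"", "\\\"")
            .replace("\n", "\\n");
    }

    // Wraps the conversation into the {"content": "..."} body that the route reads
    public static HttpRequest buildRequest(String conversation) {
        String body = "{\"content\": \"" + jsonEscape(conversation) + "\"}";
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest(
            "Operator: Could you provide your name?\nPatient: Hello, My name is Sara Connor.");
        // Only prints the request line; sending it requires the application to be running:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
        System.out.println(request.method() + " " + request.uri());
    }
}
```

In a real client, a JSON library would be a safer choice than the hand-written escaping used here to keep the sketch dependency-free.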
3.4. Testing the Route
Now, let’s call our new endpoint and check how it handles unstructured data:
@QuarkusTest
class CamelStructurizeAPIResourceLiveTest {

    Logger logger = LoggerFactory.getLogger(CamelStructurizeAPIResourceLiveTest.class);

    String questionnaireResponses = """
        Operator: Could you provide your name?
        Patient: Hello, My name is Sara Connor.
        //The rest of the conversation...
        """;

    @Test
    void givenHttpRouteWithStructurizingService_whenSendUnstructuredDialog_thenExpectedStructuredDataIsPresent() throws JsonProcessingException {
        ObjectWriter writer = new ObjectMapper().writer();
        String requestBody = writer.writeValueAsString(Map.of("content", questionnaireResponses));

        Response response = RestAssured.given()
          .when()
          .contentType(ContentType.JSON)
          .body(requestBody)
          .post("/structurize");

        logger.info(response.prettyPrint());

        response
          .then()
          .statusCode(200)
          .body("patientName", containsString("Sara Connor"))
          .body("patientBirthday", containsString("1986-07-10"))
          .body("visitReason", containsString("Declaring an accident on main vehicle"));
    }
}
We’ve called the structurize endpoint with a conversation between a patient and a healthcare service operator. In the response, we’ve obtained the structured data and verified that the expected fields contain the patient’s information.
Additionally, we’ve logged the entire response, so let’s take a look at the output:
{
    "patientName": "Sara Connor",
    "patientBirthday": "1986-07-10",
    "visitReason": "Declaring an accident on main vehicle",
    "allergies": "Allergic to penicillin; mild reactions to certain over-the-counter antihistamines",
    "medications": "Lisinopril 10 mg, multivitamin, Vitamin D occasionally"
}
As we can see, all the content was structured and returned in a JSON format.
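Since the model's output isn't guaranteed to follow the requested format, it can be worth validating individual fields before using them downstream. Below is a minimal sketch, assuming we want to check the patientBirthday value; the BirthdayCheckSketch class and its parseBirthday() helper are hypothetical. Note that YYYY-MM-DD in the route is a hint for the LLM rather than a Java pattern: in DateTimeFormatter, Y means week-based year and D means day of year, so the sketch relies on the predefined ISO_LOCAL_DATE formatter instead:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

class BirthdayCheckSketch {

    // Returns the parsed date, or an empty Optional when the value
    // does not follow the ISO yyyy-MM-dd layout
    public static Optional<LocalDate> parseBirthday(String value) {
        try {
            return Optional.of(LocalDate.parse(value, DateTimeFormatter.ISO_LOCAL_DATE));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Value taken from the response shown above
        System.out.println(parseBirthday("1986-07-10").isPresent());
        // A differently formatted date is rejected
        System.out.println(parseBirthday("10/07/1986").isPresent());
    }
}
```

Returning an Optional instead of throwing lets the caller decide whether a malformed date should fail the whole pipeline or only be flagged for review.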
4. Conclusion
In this article, we discussed how to structure content using Quarkus, Apache Camel, and LangChain4j. With Apache Camel, we gain access to a wide range of data sources, allowing us to create transformation pipelines for our content. Using LangChain4j, we can implement data structuring processes and integrate them into our pipeline.
As always, the code is available over on GitHub.