Apache Hudi creator Onehouse debuts specialized runtime promising 30X faster data lakehouse queries




As organizations store increasing volumes of information in data lakehouses, queries can potentially become slower and more costly.

That is a challenge that Onehouse is looking to help solve. The data lakehouse technology vendor is a leading contributor to the open source Apache Hudi data lake table format and to Apache XTable, an interoperability project for open table formats. Today, the company is advancing its vision of a universal data lakehouse with its new Onehouse Compute Runtime (OCR), which promises queries accelerated by up to 30X. That speed can translate into cost savings of up to 80%, according to Onehouse.

There are multiple open data lake table formats in use today, including Apache Hudi, Apache Iceberg and Delta Lake. Onehouse has been helping to lead the Apache XTable project (formerly known as OneTable), which enables a degree of interoperability across the open table formats. With the new compute runtime, the goal is to let enterprises more easily manage data in any open table format so it can be queried by popular engines and services such as Amazon Redshift, Databricks, Google BigQuery and Snowflake, among others.

The new offering aims to address the gaps in existing compute engines and provide a more efficient way to run data-intensive applications on open table formats. 

“We feel we need a specialized runtime that is optimized for lakehouse workloads,” Vinoth Chandar, founder and CEO of Onehouse, told VentureBeat in an exclusive interview. “There has been an ongoing gap in the industry, where many vendors have simply adapted their existing engines to read and write from open table formats, which is a great start, but we believe we can go deeper.”

Why there is a need to accelerate open data lake table formats

Widely used data processing frameworks like Apache Spark, while powerful, are often not optimized for the requirements of all open table formats and data lakehouse architectures. 

Kyle Weller, head of product at Onehouse, explained that table formats like Hudi and Iceberg are metadata abstractions that help describe how tables are formed. He noted that, for the most part, Apache Spark is still a generic data processing framework. As such, users need to have specialized knowledge of how to optimize Spark when it comes to using open table formats.

The key differentiator of the Onehouse Compute Runtime is its ability to deeply understand and optimize for specific lakehouse workload patterns, going beyond generic compute optimizations. 

How Onehouse Compute Runtime works

Onehouse Compute Runtime operates as a layer that integrates with open compute engines such as Apache Spark and open table formats. It consists of three main components:

  • Adaptive workload optimizations
  • High-performance lakehouse input/output (I/O)
  • Serverless compute management in an organization’s virtual private cloud (VPC)

The adaptive workload optimizations allow the runtime to intelligently tune the execution of specific workloads, such as data ingestion or query processing, based on observed patterns. The system can automatically optimize file sizes and data organization patterns that typically require manual tuning. 

“Where we see most gains, and also the common pitfall for customers trying to build open data lake houses, is they either don’t get partitioning right, or they don’t sort and organize their data in the right way,” said Chandar.
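The pitfall Chandar describes can be illustrated with a toy sketch. Lakehouse table formats track per-file metadata such as min/max column statistics, and a query engine uses those stats to skip files that cannot match a filter. When data is well sorted and partitioned, each file covers a narrow value range and most files can be pruned; when it is not, every file overlaps the query range and must be scanned. All names below are hypothetical, and this is a simplified illustration of data skipping in general, not OCR's actual mechanism:

```python
# Toy illustration of file pruning via min/max column stats (hypothetical names;
# not Onehouse's implementation). Table formats like Hudi and Iceberg keep such
# stats in their metadata layer.
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    min_ts: int  # smallest value of the sort column stored in this file
    max_ts: int  # largest value

def files_to_scan(files, lo, hi):
    """Keep only files whose [min_ts, max_ts] range overlaps the query range."""
    return [f.path for f in files if f.max_ts >= lo and f.min_ts <= hi]

# Well-organized layout: files cover narrow, non-overlapping ranges.
sorted_layout = [DataFile("f0", 0, 99), DataFile("f1", 100, 199), DataFile("f2", 200, 299)]
# Poorly organized layout: every file spans the whole value range.
unsorted_layout = [DataFile(f"g{i}", 0, 299) for i in range(3)]

print(files_to_scan(sorted_layout, 150, 180))        # → ['f1']  (two files skipped)
print(len(files_to_scan(unsorted_layout, 150, 180)))  # → 3      (nothing can be skipped)
```

The same filter scans one file in the first layout and all three in the second, which is why automatic file sizing and data organization, rather than manual tuning, is where a runtime can recover both performance and cost.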

The enterprise impact of faster queries

Among the early users of Onehouse Compute Runtime is digital optimization vendor Conductor.

Emil Emilov, principal software engineer at Conductor, told VentureBeat that his company has been using Onehouse for a year. He explained that Onehouse provides his company’s central data store, which feeds all of its downstream marketing analytics for end users. The new runtime will help the company in a number of ways.

One key challenge the new runtime helps to solve is ingesting data into Onehouse and then querying it with the right tool for each downstream use case. Onehouse Compute Runtime enables Conductor to provide fresher data, resulting in more up-to-date insights. 

“Onehouse Compute Runtime also accelerates query performance, which means faster access to those insights,” said Emilov. “Ultimately, this means providing better service and higher customer satisfaction.”

Unlocking cost savings and new capabilities 

The performance improvements offered by Onehouse Compute Runtime can translate into significant cost savings for organizations running data lakehouse workloads.

By optimizing data organization and reducing the amount of data that needs to be scanned, the runtime can help lower overall compute costs.

“When it comes to the lakehouse, cost and performance are two sides of the same coin, because all we are doing is running a lot of jobs and scanning a lot of data,” said Chandar. “So, whatever we’re doing here is just making that super efficient, so I think while you get performance benefits, you’re also dropping your cost.”


