
Kostas Pardalis

Why you should keep an eye on Apache DataFusion and its community.

Why Apache DataFusion is one of the most important open source projects right now.


On June 24, 2024, the first San Francisco Bay Area DataFusion meetup happened. I had the opportunity to help with the organization of the event and also attend.

The event had a lot of content from six different companies. These companies ranged from startups to scale-ups and big Fortune 500 companies. Leaving the event, I felt I had experienced something significant, and I want to share it with you.

And trust me, you don’t want to miss out on this!

What are you talking about, dude?

In case you don’t know what Apache DataFusion is, here’s the high-level blurb.

DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust

It’s a pretty good description of what DataFusion technically is, but like many amazing open source projects, it sells itself short.

Here are a few reasons why I say that, and by the end of this list, you will have all the reasons why you should pay close attention to the future of this project.

First, the technology

Databases are notoriously hard to build and get to market. So hard that there’s a whole graveyard of database systems that were built and never made it into a product.

The reason for that is simple. Databases are just very complex systems.

They stand up there together with operating systems and compilers in terms of technical complexity.

Operating systems abstracted this complexity with the genius of Linux. There’s the kernel and then a whole set of layers that build functionality in both user-land and kernel-land.

Similarly, LLVM revolutionized the world of programming languages and compilers. Since its creation, we’ve seen many new and increasingly sophisticated languages emerge.

But databases are still waiting for their LLVM moment. Until today, if you wanted to build a database system, you pretty much had to build every piece of it.

  • Design the grammar of the query language
  • Build a parser
  • Figure out an intermediate representation
  • Logical plans
  • Optimizations of logical plans
  • Query optimizers
  • Physical plans
  • Execution engines
  • Storage

And all that while fighting constantly with performance and correctness.

It can be done, but it takes a lot of time, and in the world of technology, time is the only resource you don’t really have.

As a result, most companies that tried to market a new database didn’t have enough time to figure out what the market needed.

DataFusion is changing this.

Its design lets a team focus on a specific part of the database system they want to change. They can then reuse the rest, which greatly reduces the time it takes to get the product to the market.
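To make that idea concrete, here is a toy sketch in Python. It is not DataFusion's API (DataFusion is a Rust library), and every name in it, such as `Engine`, `parse`, `execute`, and `dedupe_columns`, is hypothetical and made up for illustration. It only shows the shape of the argument: the parser and the executor are shared, and a team plugs in just the optimizer rule it cares about.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, List, Optional

Row = Dict[str, int]


@dataclass
class Plan:
    """Toy logical plan: scan one table, optionally filter, then project."""
    table: str
    columns: List[str]
    filter_col: Optional[str] = None
    min_value: Optional[int] = None  # keep rows where row[filter_col] > min_value


def parse(query: str) -> Plan:
    """Parse 'SELECT a, b FROM t [WHERE col > n]' (a deliberately tiny grammar)."""
    tokens = query.replace(",", " ").split()
    from_idx = tokens.index("FROM")
    plan = Plan(table=tokens[from_idx + 1], columns=tokens[1:from_idx])
    if "WHERE" in tokens:
        w = tokens.index("WHERE")
        plan.filter_col = tokens[w + 1]
        plan.min_value = int(tokens[w + 3])  # tokens[w + 2] is the '>' sign
    return plan


def execute(plan: Plan, tables: Dict[str, List[Row]]) -> List[Row]:
    """Shared execution step: filter the rows, then keep the requested columns."""
    rows = tables[plan.table]
    if plan.filter_col is not None:
        rows = [r for r in rows if r[plan.filter_col] > plan.min_value]
    return [{c: r[c] for c in plan.columns} for r in rows]


# An optimizer rule rewrites a plan without changing the query result.
OptimizerRule = Callable[[Plan], Plan]


def dedupe_columns(plan: Plan) -> Plan:
    """Example custom rule: drop repeated columns, keeping first-seen order."""
    return replace(plan, columns=list(dict.fromkeys(plan.columns)))


@dataclass
class Engine:
    """Reusable pipeline: parse -> apply pluggable rules -> execute."""
    tables: Dict[str, List[Row]]
    rules: List[OptimizerRule] = field(default_factory=list)

    def plan(self, query: str) -> Plan:
        plan = parse(query)
        for rule in self.rules:
            plan = rule(plan)
        return plan

    def run(self, query: str) -> List[Row]:
        return execute(self.plan(query), self.tables)


tables = {"t": [{"a": 1, "b": 5}, {"a": 2, "b": 20}]}
engine = Engine(tables=tables, rules=[dedupe_columns])
query = "SELECT a, a, b FROM t WHERE b > 10"
print(engine.plan(query).columns)  # ['a', 'b']
print(engine.run(query))           # [{'a': 2, 'b': 20}]
```

The only code specific to this imaginary product is `dedupe_columns`. Everything else is the reusable part, which is the role DataFusion plays for the teams building on it.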

The companies using DataFusion today are a testament to that claim.

LanceDB, Cube.dev, InfluxData, Denormalized, and Greptime are building completely different products. What they have in common, though, is that their products are database systems at their core and that they all use DataFusion to build them.

Each project is innovating on a different part of a database system and reusing the rest, which DataFusion provides out of the box.

The community

DataFusion is a young open-source project, but it has managed to build a very healthy community.

That was evident at the meetup event, where everyone was there to share knowledge and seek opportunities to contribute back.

Building such a community is not easy, and it’s primarily the result of the hard work of a very small number of people. Andrew Lamb and Andy Grove have done an amazing job so far, and they deserve recognition for that.

Toxicity and bad governance are what kill many open-source projects, but what I’ve experienced from the community so far makes me feel very optimistic about the future.

Having said that, the work of these folks shouldn’t be taken for granted. Everyone who benefits from the project and the community should try to support it in whatever way they can.

Governance & ownership

DataFusion is blessed to be an open-source project that doesn’t have a single company maintaining it.

The open-core model of monetizing software has left a very bitter taste in the mouths of many practitioners. HashiCorp and Databricks are just a couple of examples of that.

We need a different model for building monetary value on top of open source. Projects like Apache Arrow and Apache DataFusion are great examples of what a better future could look like.

All the companies I mentioned in the previous section benefit from DataFusion and contribute back to the project. They also monetize their technology and build their moats, without being antagonistic to the project.

The stars are aligned

Finally, the market is looking for solutions to problems that will require a lot of innovation to happen in data management systems.

The rise of new use cases like AI and ML is pushing existing solutions to their limits.

We need to build and we don’t have the luxury of iterating over 5+ years to just get a demo out there to the market.

DataFusion and the rest of the Arrow ecosystem are the foundation that will enable that, and it’s already happening.

The companies that presented at the Bay Area meetup have collectively received over $200 million in funding.

All of them use DataFusion for critical parts of their products and contribute back to the project.

Conclusions

The above are just a few of the reasons that make DataFusion such a special project. It’s still early, but the future looks really bright.

I hope I convinced you to keep an eye on the project, and if not, reach out and let me know why. I’m happy to hear your thoughts.

I’ll leave you for now with a picture from the event and a prediction: the day when 1,000 projects are built on DataFusion is not that far away!

[Photo from the meetup]