Data Ingestion Standards.

Why do we need Data Ingestion Standards?

Benn argues in his post for the importance of standardization as a tool for organizing the messiness of the data ecosystem. One of the areas where standards have become important, in many different and not very obvious ways, is data ingestion.

By standardization in data ingestion, we mainly refer to a number of frameworks that help us build connectors that either extract data from or load data into different systems. Benn mentions the messiness of the data ecosystem as the main reason for seeking standardization, but I’ll argue that there are a few other reasons why these data ingestion standards emerged.

In this post I’ll go through the reasons I believe this happened, as I experienced them while building two related products so far. My approach will be heavily product-driven: more specifically, I’ll try to explain why standardization, and especially open standardization, makes a ton of sense from a product perspective.

Before we start talking about the reasons, let’s first set some context.

Who cares?

Pretty much everyone. Any task that involves working with data will sooner or later face the problem of getting access to the necessary data. Data is rarely processed in the system where it was captured. For example, we collect customer data in our CRM, but we need this data in our data warehouse so that a BI tool can help us figure out how well we perform.

This dichotomy between data creation and data processing emerged pretty early on in the industry: the distinction between OLTP and OLAP systems has been around pretty much forever.

As soon as we realize that extracting and ingesting data is part of living and working with data, we start thinking of how to automate this process. Also, as more people and organizations have to deal with this problem, we reach a critical point where a market opportunity emerges, and that’s when we start thinking about how the experience of moving data around can be productized.

At this point, we start designing and developing ETL / ELT / Ingestion products.

But why standardization?

A data ingestion system is a pretty simple product at a high level. All it has to do is:

  1. Extract data from one system
  2. Ingest this data into another system

And do that in a scalable and fault-tolerant way. These two requirements make ingestion much harder, but there are also a few other reasons that are not that obvious.
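To make the two steps concrete, here is a minimal sketch of what the "simple" version looks like. Everything in it is hypothetical: `run_pipeline`, `batched`, and the in-memory `crm_rows` and `warehouse` lists stand in for a real source and destination, and nothing here deals with scale or failures yet, which is exactly where the hard part starts.

```python
from typing import Callable, Iterable, Iterator, List

Record = dict


def batched(records: Iterable[Record], size: int) -> Iterator[List[Record]]:
    """Group an iterable of records into lists of at most `size` items."""
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_pipeline(
    extract: Callable[[], Iterable[Record]],
    load: Callable[[List[Record]], None],
    batch_size: int = 2,
) -> int:
    """Step 1: pull records from the source. Step 2: push them to the destination."""
    total = 0
    for batch in batched(extract(), batch_size):
        load(batch)
        total += len(batch)
    return total


# Hypothetical in-memory source and destination, standing in for a CRM and a warehouse.
crm_rows = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
    {"id": 3, "name": "Edsger"},
]
warehouse: List[Record] = []

copied = run_pipeline(lambda: iter(crm_rows), warehouse.extend)
print(copied, len(warehouse))
```

The interesting design choice is that `run_pipeline` only knows about two callables. That separation between "the thing that extracts" and "the thing that loads" is the seam that connector standards formalize.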

If we want to build a product and a business around ingestion, we need to make sure that we can support an open set of sources and destinations in order to satisfy the needs of the market out there. A few important things to keep in mind here:

  1. The set of sources is much bigger than the set of destinations.
  2. The Cloud has greatly contributed to the complexity and the size of these sets.
  3. This complexity will only grow larger as more and more business moves into the Cloud.

So a question quickly arises to anyone who has to figure out how to build such a system. How do we maintain support for all these different sources and destinations?

Also, how can we accelerate the development around connecting to these systems?

Any PM who has worked on a data ingestion product can tell you what a mess it is to build and maintain these connectors. Why?

First of all, each system is pretty much a mini-product on its own. The way we can interact with it, the limitations it has and the associated use cases are different. Many times we start building a connector and we don’t really know all the different ways it can be used, just as we don’t know what the complexity of interacting with the system will be.

We interact with the system through an API, which is just the tip of the iceberg in terms of technical complexity. The actual system is hidden from us and we have zero control over it. How do we know if it works properly? How do we find out if something goes wrong? How should we react? It’s very rare to find a system that documents everything needed to integrate with it, so we almost always end up facing these questions.

There are systems out there that we don’t even know exist. What if one of our customers has a custom system that she wants to work with? How do we take care of that?

Finally, as we don’t control any of these systems, we have zero control over changes that might break our product. How do we react when this happens?

How can we build relationships to ensure that we learn about an upcoming change early enough to react without breaking our product?

These are some really tough problems to solve at scale, and that’s where standardization comes into the picture.

Still not convinced?

You might not be convinced yet about the value of standardization, and you have every reason not to be. I mentioned the problems, but I haven’t told you why standardization, and specifically open source standards, helps with all of them. So, let’s see how.

One of the tricks that the industry came up with to solve all the above problems is introducing open standards for how to build connectors that extract and load data between systems. Some of these standards have been built and maintained by companies like Confluent (Kafka Connect), Stitch Data (Singer) and Airbyte (Airbyte).

Although different, these standards have something in common: they are all open, and they all encourage an open source model where the creator of a connector can share it with others.
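To give a feeling for what such a standard actually standardizes, here is a small sketch loosely modeled on Singer, where a connector writes JSON messages of type `SCHEMA`, `RECORD` and `STATE`, one per line, for a loader to consume. The helper `singer_style_messages`, the sample tickets, and the `bookmarks` layout inside the state value are my own assumptions for illustration; this is a simplification, not the specification itself.

```python
import json
from typing import Any, Dict, Iterable, List


def singer_style_messages(
    stream: str,
    schema: Dict[str, Any],
    key_properties: List[str],
    records: Iterable[Dict[str, Any]],
    bookmark_field: str,
) -> List[str]:
    """Return one JSON line per message: a SCHEMA, then RECORDs, then a STATE."""
    lines = [
        json.dumps(
            {
                "type": "SCHEMA",
                "stream": stream,
                "schema": schema,
                "key_properties": key_properties,
            }
        )
    ]
    bookmark = None
    for record in records:
        lines.append(json.dumps({"type": "RECORD", "stream": stream, "record": record}))
        value = record[bookmark_field]
        if bookmark is None or value > bookmark:
            bookmark = value
    if bookmark is not None:
        # The shape of "value" is an assumption made for this sketch.
        state = {"bookmarks": {stream: {bookmark_field: bookmark}}}
        lines.append(json.dumps({"type": "STATE", "value": state}))
    return lines


# Hypothetical sample data.
ticket_schema = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "updated_at": {"type": "string"}},
}
tickets = [
    {"id": 1, "updated_at": "2024-01-01T10:00:00Z"},
    {"id": 2, "updated_at": "2024-01-02T09:30:00Z"},
]

messages = singer_style_messages("tickets", ticket_schema, ["id"], tickets, "updated_at")
for line in messages:
    print(line)
```

Because the contract is just a stream of messages, whoever writes the extracting side never needs to know which loader will read it. That is what lets a community build one half without coordinating with the other.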

What’s the benefit of using any of these standards?

First, if we manage to get people to adopt and use these standards, we as PMs can deliver new connectors at a much higher velocity. Now, we have a whole community of people who build connectors for us. Building engineering teams to achieve the same velocity is much, much harder. For example, Airbyte and Singer have hundreds of connectors available, and all of them were built in a short timeframe.

By building an ingestion system that supports such open standards, we achieve better extensibility for our platform. Our customers can now build their own connectors if they want, or even better, professional services providers can learn the standard and offer paid services to build on top of our platform.

Also, better quality is achieved. An open standard means that the people who care about and use a connector can take care of it. The best case of this is when a vendor decides to build and maintain a connector for their own system. No one knows better than Salesforce how to build a connector for SFDC. This is not something that can easily happen, but having an open standard can help in creating the critical mass needed for vendors to start caring about you.

There are some pretty good reasons to try and build an open standard for connectors, but it’s not easy to make it successful. The main reason for this is that a community has to be built around it and even if you manage to build it you will have to make sure to nurture it and keep it happy.

Kafka Connect took a long time and many resources to reach the point where it is today with Confluent Hub, and Airbyte wouldn’t have been successful so fast if it hadn’t built on top of Singer, which already had a community.

Building an open standard for connectors is great but it’s really hard to do well. If you succeed, though, the reward might be bigger than you think.

Using the open standard for growth

From a product development perspective there are some pretty good arguments why an open source standard around connectors is a good thing to have. But there’s another reason why someone would try to build one. It has to do with growth and Go To Market motions.

The success of Airbyte, Singer and Kafka Connect tells us that companies can build whole GTM motions on such standards. Let’s see how.

Data ingestion is one of the fundamental problems that data engineers have to solve. In many cases, they have to do that in a very restricted context, either because of budget or because of other constraints.

Standardized and open sourced connectors that can quickly be deployed and used to extract and ingest data from different systems are an amazing gift to data engineers.

An example will be helpful.

One day the Customer Success leadership of a company decides that it needs some metrics over the data that lives on Zendesk. To do that, we first have to figure out how to make the Zendesk data available to our analysts.

The first step is to build a pipeline that will sync the data between our Zendesk and our data infrastructure. To do that, the data engineer needs to go and figure out the Zendesk API and then use Python or whatever she prefers to write some logic that will hit the API, export the data and then push it into the data warehouse. Then, she also has to figure out how to do that continuously, without having to export everything every time, and of course schedule the process to run at regular intervals. Oh, and also figure out how to react when, for whatever reason, the connector she wrote breaks during execution.
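Here is a rough sketch of the "without exporting everything every time" part, plus a very basic reaction to failures. It assumes an API that can return records changed after a given timestamp; `fetch_updated_since` and `FAKE_TICKETS` are hypothetical in-memory stand-ins, not the Zendesk API, and `sync_once` is a name I made up. Scheduling is left out entirely.

```python
from typing import Callable, Dict, List

Ticket = Dict[str, object]

# Hypothetical stand-in for the source system. Timestamps are UTC ISO 8601
# strings in one fixed format, so plain string comparison orders them correctly.
FAKE_TICKETS: List[Ticket] = [
    {"id": 1, "updated_at": "2024-01-01T10:00:00Z"},
    {"id": 2, "updated_at": "2024-01-02T09:30:00Z"},
]


def fetch_updated_since(cursor: str) -> List[Ticket]:
    """Return tickets changed strictly after `cursor`, oldest first."""
    changed = [t for t in FAKE_TICKETS if str(t["updated_at"]) > cursor]
    return sorted(changed, key=lambda t: str(t["updated_at"]))


def sync_once(
    fetch: Callable[[str], List[Ticket]],
    load: Callable[[List[Ticket]], None],
    state: Dict[str, str],
    max_attempts: int = 3,
) -> int:
    """Fetch everything newer than the saved cursor, load it, then move the cursor."""
    cursor = state.get("cursor", "")
    last_error = None
    for _ in range(max_attempts):
        try:
            tickets = fetch(cursor)
            break
        except ConnectionError as error:
            last_error = error  # try again; a real connector would also back off
    else:
        raise RuntimeError(f"giving up after {max_attempts} attempts") from last_error
    if tickets:
        load(tickets)
        state["cursor"] = str(tickets[-1]["updated_at"])
    return len(tickets)


warehouse: List[Ticket] = []
state: Dict[str, str] = {}

first_run = sync_once(fetch_updated_since, warehouse.extend, state)
FAKE_TICKETS.append({"id": 3, "updated_at": "2024-01-03T08:00:00Z"})
second_run = sync_once(fetch_updated_since, warehouse.extend, state)
print(first_run, second_run, state)
```

Even this toy version has to carry state between runs and decide what to do when the API is unreachable, and it still ignores pagination, rate limits, deleted records and schema changes.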

Building and maintaining even a simple script that does the above takes time and resources. Now, consider that instead of doing the above, the data engineer finds an open source connector that has been built by a vendor using an open standard and is usually maintained by the same vendor.

The data engineer now only has to clone the repo, configure the connector and start ingesting data.

This is something that is pretty straightforward to do with Singer and Airbyte, because of how these standards have been built. It literally takes a few minutes to start pulling data out of something like Zendesk. Kafka Connect is a bit more complicated but if you have Kafka already in the organization, it won’t take much longer either.

What a company like Airbyte has achieved at that point is that data engineers can quickly solve a problem they have, and by doing that, also learn about the brand.

But this happens for free, you might think, so how is this any good for Airbyte or Stitch Data?

Well, the connector is just the tip of the iceberg. Running a Python script or launching a Docker image is easy, but going from running it in a terminal to putting it into production is a whole journey.

What all these companies do is separate the connectors from the infrastructure needed to run them reliably and at scale. For this to happen, at least a broker is needed. Kafka Connect requires Kafka, Airbyte requires the Airbyte platform and of course Singer runs best together with the Stitch platform. These brokers and platforms offer well-defined delivery semantics and functionality like logging and scheduling, things that are mandatory for running data pipelines in production.
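"Delivery semantics" can sound abstract, so here is a toy sketch of one common flavor, at-least-once delivery: progress is recorded only after the destination accepts a batch, so a crash causes a replay rather than a gap, and a destination that upserts by key absorbs the replay. This is not how Kafka, Airbyte or Stitch implement it; `UpsertTable`, `deliver` and the simulated failure are all hypothetical.

```python
import logging
from typing import Callable, Dict, List

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

Record = Dict[str, object]


class UpsertTable:
    """Hypothetical destination that overwrites rows sharing the same key."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.rows: Dict[object, Record] = {}

    def write(self, batch: List[Record]) -> None:
        for record in batch:
            self.rows[record[self.key]] = record


def deliver(
    batches: List[List[Record]],
    write: Callable[[List[Record]], None],
    checkpoint: Dict[str, int],
) -> int:
    """Write batches in order, advancing the checkpoint only after a write returns."""
    start = checkpoint.get("next_batch", 0)
    for index in range(start, len(batches)):
        write(batches[index])  # if this raises, the checkpoint is left untouched
        checkpoint["next_batch"] = index + 1
        log.info("batch %d delivered", index)
    return checkpoint.get("next_batch", 0)


batches: List[List[Record]] = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
table = UpsertTable("id")
checkpoint: Dict[str, int] = {}
crashed = {"done": False}


def write_then_crash_once(batch: List[Record]) -> None:
    """Simulate a write that lands but whose acknowledgement is lost once."""
    table.write(batch)
    if not crashed["done"] and batch[0]["id"] == 3:
        crashed["done"] = True
        raise ConnectionError("acknowledgement lost")


try:
    deliver(batches, write_then_crash_once, checkpoint)
except ConnectionError:
    log.warning("run failed, checkpoint stays at %s", checkpoint)

deliver(batches, write_then_crash_once, checkpoint)  # replays only the second batch
print(checkpoint, sorted(table.rows))
```

The second batch is written twice here, yet the table ends up with three rows because of the upsert. Getting guarantees like this right, together with the logging and scheduling around them, is the part a lone script rarely covers.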

In some cases, the broker is also offered as open source; this is the case for both Kafka and Airbyte. The data engineer can now set up and run the broker too, but very soon she will face a situation where the operations around these systems will just be too much.

This is the point where these vendors have the opportunity to sell. At this point, the cost the company has to bear to maintain the infrastructure in house becomes higher than the cost of buying from the vendor, and that’s how companies become happy Confluent and Airbyte customers.

The end

Hopefully I managed to convince you that open standards do not exist only to add order to the chaos of the data ecosystem.

Open standards, when combined with community-driven growth tactics, can become an amazing tool for both product development and GTM.

Next time, we’ll go through the details of how each one of these standards has been developed.

Please consider sharing this article.

For comments, feedback and everything else, please ping me on Twitter.
