Data Observability.

Introduction to Data Observability

Observability is a concept engineers use when building reliable systems. It is not a new term; you can find a great article about it on Wikipedia. The definition worth keeping in mind is the following:

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

In the context of SRE, we get measurements from the software we operate and from the hardware infrastructure we have. Using that information, we try to infer whether the system is operating as expected and, if not, figure out what the problem is and fix it.

Similarly, in data observability we observe the data infrastructure, take measurements on the data itself, and try to infer whether the data we have can be trusted.

To do that, data observability platforms have to interact, one way or another, with pretty much every part of the data infrastructure we have. So, if we consider the following popular unified data infrastructure architecture:

[Figure: unified data infrastructure architecture]

Data observability platforms should interact with pretty much every component of it, which is why data observability is positioned horizontally.

Creating a platform that covers the whole architecture is challenging and maybe not even completely necessary. To learn what the current state of the industry is and where it's heading, I'll look into a number of vendors and see where they focus and how much of the architecture they cover.

The vendors I'll be considering are the following.

  1. AccelData
  2. Avo
  3. BigEye
  4. Datafold
  5. Great Expectations
  6. Iteratively
  7. Lightup.ai
  8. Metaplane
  9. MonteCarlo

What I want to learn is, for each of the vendors, which parts of the reference architecture they integrate with.

At this point, I have to explain something about how these platforms work. One of the most fundamental functions they perform is to extract metadata from different systems and use it to infer the state of the data and the infrastructure. For example, a platform might pull information about a query that has been executed on a data warehouse and check its latency.
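As a rough illustration of the latency example, the sketch below assumes the query metadata has already been pulled into plain Python records. The `QueryRecord` fields, the sample values and the 30 second threshold are all hypothetical and do not correspond to any vendor's or warehouse's schema.

```python
from dataclasses import dataclass


@dataclass
class QueryRecord:
    # Hypothetical shape of one row of query-history metadata.
    query_id: str
    table: str
    duration_seconds: float


def slow_queries(records, threshold_seconds=30.0):
    """Return the records whose duration exceeds the threshold."""
    return [r for r in records if r.duration_seconds > threshold_seconds]


# Made-up metadata, standing in for what a platform would pull from a warehouse.
history = [
    QueryRecord("q1", "orders", 2.4),
    QueryRecord("q2", "orders", 41.0),
    QueryRecord("q3", "events", 12.9),
]

for record in slow_queries(history):
    print(f"{record.query_id} on {record.table} took {record.duration_seconds}s")
```

A fixed threshold is the simplest possible check; the point is only that the input is metadata about the query, not the queried data itself.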

Of course, these systems are not just metadata aggregators, but they need the metadata to build their functionality. To do anomaly detection, for example, you need some kind of time series that you track and inspect for unexpected behavior.

This time series will come from observing a variable, some kind of metadata coming from the data infrastructure.
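To make the idea concrete, here is a minimal sketch of anomaly detection over such a time series. It uses a simple rolling z-score, which is an assumption made for illustration and not a description of how any of the vendors detect anomalies. The `find_anomalies` function, the window size, the threshold and the row counts are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(series, window=5, threshold=3.0):
    """Flag points that deviate strongly from the preceding window.

    Returns the indices whose value lies more than `threshold` sample
    standard deviations from the mean of the previous `window` values.
    """
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as unexpected.
            if series[i] != mu:
                anomalies.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical daily row counts for one table; the count collapses at index 7.
daily_row_counts = [1000, 1010, 990, 1005, 995, 1002, 998, 120, 1001]
print(find_anomalies(daily_row_counts))
```

The observed variable here is a row count, but the same check applies to any metadata that can be sampled over time, such as freshness or null rates.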

We will get into more technical details in the future but for now let's assume the following data infrastructure components that we will look into.

  1. Sources → Any place where data is generated and captured and from which we have to extract it. Keep in mind that a source can also be a destination and vice versa.
  2. Ingestion & Transport → Services that perform ELT/ETL/EL and/or orchestration.
  3. Data Warehouses → I'll consider DWs as a separate part of the infrastructure.
  4. Data Lakes & Lakehouses → Same as with Data Warehouses.
  5. Transformation Layer → dbt, headless BI etc.
  6. Analysis & Output → Here I consider BI, analytics, embedded analytics, ML and reverse ETL.

All the vendors considered here are still early in the execution of their roadmaps. To try and capture that, I'll be using the following three states for each category.

  1. ✅ → There's good support from the vendor.
  2. ❌ → No support at all.
  3. 🌗 → There's some support but it feels like a work in progress.

I understand that the yes/no/partial definitions above are not very scientific, but you have to trust my product intuition a little bit 🤓. Of course, if you find a mistake or something missing, please let me know and I'll make sure to update everything here.

Vendor | Sources | Ingestion | Data Warehouses | Data Lakes | Transformations | Analysis & Output
AccelData🌗🌗🌗
Avo🌗
BigEye🌗🌗
Datafold🌗🌗
Great Exp.
Iteratively🌗
Lightup
Metaplane🌗
MonteCarlo🌗

A few comments and clarifications on the above table.

When it comes to Sources, I count it as full support when a vendor offers some kind of diffing between a source and the destination. BigEye has deltas, and Datafold focuses a lot on that; they have even published an open source tool to do exactly that.
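For readers unfamiliar with the idea, diffing a source against a destination can be sketched as a comparison of rows by primary key. The example below is a deliberately naive in-memory version. It is not how Datafold's tool or BigEye's deltas are implemented, and the `diff_tables` function and the sample rows are hypothetical.

```python
def diff_tables(source_rows, destination_rows, key="id"):
    """Compare two collections of rows (dicts) by primary key.

    Reports keys missing from the destination, keys that exist only in
    the destination, and keys whose rows differ between the two sides.
    """
    source = {row[key]: row for row in source_rows}
    destination = {row[key]: row for row in destination_rows}

    missing = sorted(source.keys() - destination.keys())
    unexpected = sorted(destination.keys() - source.keys())
    changed = sorted(
        k for k in source.keys() & destination.keys()
        if source[k] != destination[k]
    )
    return {"missing": missing, "unexpected": unexpected, "changed": changed}


# Hypothetical rows from a source database and its warehouse copy.
source_rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
]
destination_rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 25},
    {"id": 4, "amount": 40},
]

print(diff_tables(source_rows, destination_rows))
```

Loading both sides into memory only works for small tables; a real tool would need a strategy that avoids moving all the data, which is outside the scope of this sketch.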

Avo and Iteratively offer instrumentation at the source, but the sources they support are very narrowly focused because they manage events. That's why I marked them as partially supporting Sources.

For AccelData I put partial support there, mainly because they implement the concept of pipeline monitoring and can at least interact with databases that usually act as sources, as well as with systems like Kafka, which are typically used for data delivery.

The Ingestion layer hasn't been a big focus of the observability vendors. There's some support for systems like Airflow, and Avo integrates with systems like Segment and RudderStack, which is why I gave them a ✅. I haven't seen anyone integrating with systems like Airbyte, for example.

I have to mention here that MonteCarlo, for example, integrates with Airflow, although I have given them an ❌ under Ingestion. The reason is that they use that integration to implement circuit breakers.
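A circuit breaker, in this sense, stops a pipeline when a data check fails so that bad data does not flow downstream. The sketch below shows the general pattern in plain Python. It does not use the Airflow API or reflect Monte Carlo's implementation, and `CircuitOpenError`, `run_with_circuit_breaker` and the sample steps are hypothetical names introduced for illustration.

```python
class CircuitOpenError(RuntimeError):
    """Raised when a data check fails and downstream steps must not run."""


def run_with_circuit_breaker(steps, checks):
    """Run pipeline steps in order, checking the data after each one.

    `steps` is a list of (name, callable) pairs; each callable receives
    the previous step's output and returns new data.
    `checks` maps a step name to a predicate over that step's output.
    """
    data = None
    completed = []
    for name, step in steps:
        data = step(data)
        check = checks.get(name)
        if check is not None and not check(data):
            # Trip the breaker: nothing after this step is executed.
            raise CircuitOpenError(f"check failed after step '{name}'")
        completed.append(name)
    return data, completed


# Hypothetical two-step pipeline whose extract step yields a null amount.
steps = [
    ("extract", lambda _: [{"id": 1, "amount": 10}, {"id": 2, "amount": None}]),
    ("load", lambda rows: rows),
]
checks = {"extract": lambda rows: all(r["amount"] is not None for r in rows)}

try:
    run_with_circuit_breaker(steps, checks)
except CircuitOpenError as err:
    print(f"pipeline halted: {err}")
```

The integration here acts on the orchestrator rather than pulling metadata from it, which is the distinction drawn in this section.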

Here I'm looking for integrations where the vendor is pulling data to perform observability. I'm not super confident that this is the right way to go with this, but we'll see. If you disagree please let me know.

Data Warehouses are what almost every vendor is heavily using for implementing data observability. I would argue that there are three types of interactions these vendors have with data warehouses.

Data Lakes are not that popular yet, but they have started to receive more love from the vendors. AccelData, as the only pure enterprise vendor on the list, is doing a good job here, while Monte Carlo is also doing a great job of supporting data lakes and lakehouses.

Transformations are not something the observability vendors are investing in heavily yet. Whatever support I have marked there is mainly some kind of integration with dbt. I haven't found any integration with metric layers, for example.

And finally, Analysis & Output. To be honest, I've been a bit surprised here; I was expecting more integrations. The vendors who are interacting here are doing it mainly with BI tools. I haven't seen much support for ML-related tools, for example. My feeling is that this will also come at some point, especially as lakehouses mature and ML and analytics infrastructures start merging into one.

Final Thoughts

Something that quickly stands out is how important the storage and query layer is for observability. This is evident from the maturity of the data warehouse integrations all the vendors already have.

Another thing that is evident is that there's still a lot of work to be done to deliver an end-to-end data observability platform.

Probably not all of the infrastructure components are equally important, but which ones are and which are not remains to be seen.

You can discuss the post and ask questions on Twitter

References

  1. Datafold Documentation
  2. Datafold Diff tool
  3. Avo Documentation
  4. Iteratively - Amplitude Data Documentation
  5. Monte Carlo Documentation
  6. Bigeye Documentation
  7. Lightup Documentation
  8. AccelData Documentation (for Torch)
  9. Metaplane Documentation
  10. Great Expectations Documentation
