The life of a data consumer is a tough one. Unless you’re lucky enough to have the power to set your own terms, you’re at the mercy of data producers.
Data producers generally get to decide which tech to use, like the format and protocol of the data transfer. More importantly, the data producer holds all the cards for understanding the data they’re making available. It’s their data after all! It's likely no coincidence that they own it either. They’re closest to the business domain it represents, so they’ve got expert knowledge of the real-world thing it relates to.
Data these days also tends to be a bit of a nomadic creature. It travels far and wide within and between organisations. For you, the data consumer, who meets data at random points in its travels, the further it is from its origin, the greater the cost to use it.
You’re at a disadvantage from the get-go, and as a result there’s a cost to being a data consumer.
There are three aspects of the cost of consuming data that I’d like to cover in this post:
- Knowledge transfer cost - going beyond the technical details and understanding the meaning of the data.
- Initial vs Maintenance costs - there’s an initial cost on Day 1, as well as ongoing costs over time
- The 3 cost dimensions - breadth of usage, distance from source & time
Knowledge Transfer Cost
Knowledge transfer is a key aspect of consuming data which involves understanding what the data is and how to use it. This isn’t free and usually looks like one of the following:
You have to reverse engineer or guess the meaning of the data. You don’t know where it came from, or you can’t ask anyone who created or owns it. Your only shot is to take a look at what values are present and how they relate to the rest of the data set.
E.g. Is this timestamp when an event physically happened, or when it was recorded? Who knows.
You know someone who runs the service it comes from and who probably knows the domain. You can email, call or go visit them to learn about it.
At least you don’t have to guess! Hope you enjoy meetings and emails though, cause that’s what it takes to get to the bottom of how to use the data.
Hurrah! There’s API documentation for the host system which at least has some brief descriptions. This commonly takes the form of an OpenAPI spec. Some API-first companies will have detailed documentation pages, but you’re unlikely to find this for your run-of-the-mill internal microservice, which is what most engineers are working against.
Once the data leaves the initial service, and gets combined with data from other sources, tracking it back through to the original source becomes a bit of an archaeological exercise.
The data is self-describing, using a data language of shared business terms. These terms are machine-readable and can be used to automate integrations, describe relationships with other data in the business, and automatically populate data catalogs, API catalogs, and lineage tools.
The key point here is that the meaning of the data being passed around should travel as an atomic unit with the value. Without this context, the meaning needs to be reverse engineered by every consumer. It also means that populating a data catalog or lineage tool requires a huge investment to collate the right information and keep it up to date. Understanding how to use an API with poorly modelled and described data is a tax that’s paid in time-consuming knowledge sharing, incorrect implementations, and general confusion.
The most efficient place to attach meaning to data is at the source.
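To make the "atomic unit" idea concrete, here is a minimal Python sketch. The `Tagged` wrapper and the term names (`order.EventOccurredAt`, `order.EventRecordedAt`) are made up for illustration and don't come from any particular tool. It just shows a consumer finding the timestamp it wants by meaning, instead of guessing what a field called `ts` is.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Tagged:
    """A value that travels together with the business term describing it."""
    term: str      # hypothetical shared business term
    value: object


# A bare record: is "ts" when the event happened, or when it was recorded?
bare = {"ts": "2024-03-01T09:30:00+00:00"}

# The same kind of record with meaning attached at the source.
described = {
    "ts": Tagged("order.EventOccurredAt",
                 datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc)),
    "logged": Tagged("order.EventRecordedAt",
                     datetime(2024, 3, 1, 9, 31, tzinfo=timezone.utc)),
}


def find_by_term(record: dict, term: str):
    """Look a value up by what it means, not by what the field is called."""
    for field in record.values():
        if isinstance(field, Tagged) and field.term == term:
            return field.value
    return None  # nothing in the record declares that meaning


occurred_at = find_by_term(described, "order.EventOccurredAt")
```

With the bare record, `find_by_term` has nothing to go on and returns `None`, which is the code equivalent of "Who knows."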
Initial vs Maintenance Costs
Often in software, we consider the upfront costs involved in building something but fail to recognise ongoing maintenance costs. (FWIW, maintenance costs account for over 60% of the total cost of a platform - so that’s a pretty big piece of the pie we’re ignoring).
Versions need updating, bugs need a fixin’, and course corrections need applying as the business requirements change over time.
The same is true when consuming data. Once we’ve finally found the data set to consume, the work doesn’t end there. We’ve actually committed to a series of activities to ensure the system continues working as intended.
This isn’t sexy work either. It doesn’t deliver new features to customers, or insights for the business. It’s things that keep the business ticking over, such as:
- Managing updates as the source data changes over time
- Responding to unexpected breakages when changes aren’t communicated properly
- Knowledge transfer to new team members about what the data is and how it’s used
- Revising understanding as you learn more about the business domain it represents
The cost of managing these changes can easily dwarf the initial costs.
Regardless of the software architecture practices that are used to decouple systems (such as APIs and message queues), we’ve created a data plane which is coupled. As a consumer we have a dependency on the producer. If the structure or the meaning of the data is updated, we’re going to be affected.
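As a rough illustration of that coupling, here is a small Python sketch of the kind of defensive check a consumer ends up writing. The field names and the `EXPECTED` contract are invented for the example. Note that it can only catch changes in structure; a change in meaning (say, a timestamp switching from event time to recorded time) would sail straight through.

```python
# Hypothetical contract: the fields and types this consumer relies on.
EXPECTED = {"customer_id": int, "signup_date": str, "email": str}


def contract_violations(payload: dict, expected: dict = EXPECTED) -> list[str]:
    """Return a description of every way the payload breaks our expectations."""
    problems = []
    for name, expected_type in expected.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected_type):
            actual = type(payload[name]).__name__
            problems.append(f"{name}: expected {expected_type.__name__}, got {actual}")
    return problems


# The producer changed the id format and renamed a field without telling us.
changed = {
    "customer_id": "C-1001",
    "signup_date": "2024-03-01",
    "email_address": "a@example.com",
}
problems = contract_violations(changed)
```

Every consumer of the same producer needs some version of this, which is part of why the maintenance cost multiplies.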
Not all consumers are equally affected. There are three dimensions that increase data consumption costs for consumers:
- Breadth of usage
- Distance from original source
- Time since authored / maintained
Breadth of usage
We’ve all seen examples of APIs where some fields are obscure or there’s critical data in the comments field because it’s become too hard to change the API. Once we reach this tipping point of the paradox of success for an API, the cost to consume skyrockets.
A general rule of thumb is that the more widely used an API is, the more inertia there is to change that API. The more restricted change becomes, the less likely the API is to reflect the best possible representation of that data. The pursuit of backwards compatibility leads to hacks and workarounds. As a result, the data becomes harder for consumers to navigate.
A corollary to this is that consumers become less able to request changes to data from a producer. A simple change by the producer might make the data better suited to the consumer’s needs. Instead, the consumer needs to add complexity on their side.
The broader the usage of a data set, the more use cases a producer needs to cater for. A change that helps one consumer could have a detrimental effect on another.
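Here is what that consumer-side complexity can look like, as a hedged Python sketch. The scenario is invented: the producer can't add a proper field, so a priority flag ends up in free-text comments as `PRIORITY=high`, and each consumer has to dig it back out.

```python
import re
from typing import Optional

# Hypothetical convention agreed by email, not captured in any schema.
PRIORITY_PATTERN = re.compile(r"\bPRIORITY=(\w+)", re.IGNORECASE)


def extract_priority(order: dict) -> Optional[str]:
    """Recover the priority flag that has been stuffed into the comments field."""
    comments = order.get("comments") or ""  # field may be absent or null
    match = PRIORITY_PATTERN.search(comments)
    return match.group(1).lower() if match else None


priority = extract_priority({"id": 1, "comments": "Leave at door. priority=HIGH"})
```

A dedicated `priority` field on the producer side would remove this code from every consumer, but that's exactly the change that has become too hard to make.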
Distance from original source
Data is being shared across and between organisations more than ever. The communication challenge this introduces is immense.
The further we take data from the original source, the more difficult it becomes to understand how to consume it. It’s harder to fall back to the option of having a conversation to increase your understanding.
If the data owner works at the open plan desk 2 metres from you, 3 days a week, at least you’ve got a shot at pinning them down and getting some answers.
It's a different situation when:
- it's from a company on the other side of the world, in a different timezone
- they've munged it together from a bunch of other sources, and
- changed it to match their own idea of what a good representation looks like
Getting back the original context here becomes a lot more challenging.
Time since authored / maintained
Let’s not forget about all those network shares with mountains of random data sets in Excel files. In extreme cases, the cost to figure out what the data was supposed to mean is so prohibitive that it’s easier to hit delete. Trying to unpick what your predecessor was thinking 10 years ago when they named all those columns with abbreviations just isn't worth it.
The same applies wherever data is being provided from. Consider an internal team that manages 40 different applications. It's likely that not all are being worked on at any one time. Where no recent changes have been made to an application, it's likely that the knowledge a consumer needs has been lost.
How can you fix this?
The best place to fix this problem is at source, with data producers. Any measures put in place after this rely on inefficient hacks and workarounds to derive what the producer already has access to.
The first step you can take is to attack the foundational problem. With their understanding of the domain, it's much easier for data producers to assign semantic metadata to the data. This saves every consumer from doing the same work independently, and with much more difficulty. Building more advanced tools becomes much easier once this information is machine-readable.
Embedding this information within code and schemas has a lot of potential. This would be richer than your standard language primitive types like strings, integers and booleans. Instead we can embed a set of semantic tags that represent the business domain. A meaningful library of tags such as ‘First name’ or ‘Date of birth’ becomes the language used to represent your data.
Using similar strategies to embed descriptions of other aspects of the data can also be useful for consumers. This has a huge advantage of being co-located with what it’s describing. When the underlying data changes, so too can the metadata. Storing things together that change in tandem is a common best practice in software development and relevant here.
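One way to picture semantic tags living next to the code is with Python's `typing.Annotated`. This is only a sketch under assumptions: the `Semantic` marker, the `person.*` term names and the `Customer` class are all hypothetical, and a real setup would likely keep the tag library somewhere shared.

```python
from dataclasses import dataclass
from datetime import date
from typing import Annotated, get_type_hints


@dataclass(frozen=True)
class Semantic:
    """Marker carrying a shared business term (hypothetical naming scheme)."""
    term: str


# A tiny "library of tags": richer than plain str and date.
FirstName = Annotated[str, Semantic("person.FirstName")]
DateOfBirth = Annotated[date, Semantic("person.DateOfBirth")]


@dataclass
class Customer:
    fname: FirstName   # the field name is cryptic, the tag is not
    dob: DateOfBirth


def semantic_terms(cls) -> dict[str, str]:
    """Read the business term attached to each field of a class."""
    hints = get_type_hints(cls, include_extras=True)
    terms = {}
    for field_name, hint in hints.items():
        for meta in getattr(hint, "__metadata__", ()):
            if isinstance(meta, Semantic):
                terms[field_name] = meta.term
    return terms


terms = semantic_terms(Customer)
```

Because the tag sits on the field itself, renaming `fname` or changing its type happens in the same place as the metadata, so the two are less likely to drift apart.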
The second step involves recognizing the imbalance in incentives for producers and consumers of data.
This stems from the fact that the costs of poorly described data, and the benefits of well described data, both lie with the consumer. Producers generally aren't held accountable for publishing poorly described data despite the significant knock-on costs. It's likely that part of the reason for this is that a low bar is set on average. The best one can expect for data documentation is a docs page on Confluence or the vendor’s site.
You would expect there to be some situations, such as data marketplaces, where we would see innovation in this area. Alas, even when there are inbuilt economic incentives to make data as easy to consume as possible, little progress has been made. For internal teams, under pressure to deliver features where the cost of consuming data isn’t high on the priority list, the challenge is greater.
We can rebalance this by putting in place measures to reward and recognise producers of well described data. Features in our data and API catalogs can measure and rank progress from data producers. Consumers can be directed to the better ones and make more informed choices. As a result, efforts by data producers to make the lives of their consumers better can be recognised.
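As a sketch of what "measure and rank" could mean in practice, the Python below scores each producer by the share of its fields that carry either a semantic term or a description. The catalog shape, the producer names and the scoring rule are all assumptions made for illustration; a real catalog would probably weigh more than this.

```python
def description_coverage(fields: list[dict]) -> float:
    """Fraction of fields that have a semantic term or a description."""
    if not fields:
        return 0.0
    described = sum(1 for f in fields if f.get("term") or f.get("description"))
    return described / len(fields)


def rank_producers(catalog: dict[str, list[dict]]) -> list[tuple[str, float]]:
    """Order producers from best described to worst described."""
    scores = {name: description_coverage(fields) for name, fields in catalog.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical catalog entries.
catalog = {
    "orders-api": [
        {"name": "id", "term": "order.Id"},
        {"name": "ts"},  # undescribed: consumers have to guess
    ],
    "customers-api": [
        {"name": "fname", "term": "person.FirstName"},
        {"name": "dob", "description": "Date of birth"},
    ],
}
ranking = rank_producers(catalog)
```

Even a crude score like this gives consumers something to compare, and gives producers something visible to improve.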
Leaders of data-producing teams can also make better decisions about where to invest effort. This might allow the redirection of the operational cost for poor and underutilised data. For critical data, focused investment can be made to reduce the total organisational cost.
We build Vyne - an automated data integration platform.
It uses metadata from data sources to build integrations on the fly, eliminating the need for integration code. Because there's no integration code, it can automatically adapt when data is changing or moving, giving organisations a new level of flexibility in their data architectures.
Check it out at vyne.co