Data Catalogs form the map of your data and service ecosystem. However, it’s important to remember they’re a means to an end: facilitating data discovery.

In this post, we look at how teams are evolving towards decentralized responsibilities, and how open source tools are helping with this. We also look at emerging trends and ask what comes after a catalog - what the future of catalogs looks like, and how it can further lower the cost of data discovery across the enterprise.

The desire to catalog it all

Data Catalogs and API Catalogs have become standard fare at organisations with ambitious data plans.  A central directory where people can look up where to find data, or which APIs exist for a requirement, is key to reducing the time-to-value for data and engineering teams.

In fact, without a catalog, the Data and API landscape at a large enterprise can quickly become overwhelming - like being lost in a new city, it’s difficult to know where to look.

Like a good map, a catalog is key to helping people navigate the maze of your ecosystem, and track down the data they’re trying to find.  Creating a catalog that’s complete and accurate is a huge undertaking, and can be a long journey.

When teams assemble to start populating catalogs, it’s important they keep sight of the end goal -- helping people find data faster.  They’re travel agents, not map builders - facilitating faster travel and better destinations - the catalog is a means to an end.

Catalogs are directions

A well-populated data catalog is key to making it easy to find the data or API you need - telling you where to find it, which system holds it, and who’s responsible for it.

These instructions are like directions from a street map - a collection of steps, based on metadata that has been gathered.

In the world of maps, the Google Maps navigation experience is incredible - no matter your location and your destination, Google will find you a way.  Routes are constantly updated based on road closures and traffic.  If conditions change while I’m en route, Google notifies me and updates my path.

If our data catalogs are to be the Google Maps of our data ecosystem, the user experience needs to be on par.  Otherwise, they quickly get out of date - and inevitably you’ve taken a wrong turn.  It can be tempting to marry the roll-out of a catalog with a large-scale data collection exercise, and a series of rules and procedures teams must follow to update the catalog with any changes.

However, when you’re laying the foundations of your catalog, it’s important to ask - are you building a team of surveyors, or a phenomenal navigation experience?

Surveyors are responsible for visiting each road and mapping it out - arduous and time-consuming work.  Google Maps leaves this to the authorities, and simply consumes their map data to deliver the best navigation experience.  For Google Maps, the business is about getting you where you need to go, rather than deploying an army of surveyors.

Be a knowledge hub, not a local surveying office

[Image: RTK survey in a quarry - photo by Valeria Fursa / Unsplash]

Back in the enterprise data world, all too often we see teams that aspire to a Google Maps view of their data ecosystem, but fall into the trap of building a centralized surveying team rather than a decentralized metadata-gathering function.  Before you know it, a flurry of local-council-esque processes and bylaws has been published, mandating steps teams must comply with to evolve the data ecosystem.

If we’re to learn anything from the Google model, it’s that decentralizing maintenance and automating data capture are key.

Rather than adding a series of additional processes, it’s important to leverage the metadata and procedures teams are already following, with as little augmentation as possible.

Rather than enforcing maintenance through mandates, governance teams need to ask the same question Google Maps did:

How do we make it dead simple for local data owners to publish their own metadata automatically, so maintenance is automated?
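As a concrete sketch of what “dead simple” could look like (the file name, fields, and directory layout here are purely hypothetical): imagine each team shipping a small descriptor file alongside its service, which the catalog crawls and aggregates automatically - teams update metadata as part of their normal workflow, not as a separate governance process.

```python
import json
from pathlib import Path

# Hypothetical convention: each team commits a small `catalog.json`
# next to its service, e.g.
#   {"name": "orders-db", "owner": "payments-team"}
# The catalog crawls the repositories and aggregates these descriptors,
# so publishing metadata is part of the team's normal workflow.

def harvest_descriptors(root: str) -> list[dict]:
    """Collect every team-owned catalog.json found under `root`."""
    entries = []
    for descriptor in Path(root).rglob("catalog.json"):
        with descriptor.open() as f:
            entry = json.load(f)
        entry["source"] = str(descriptor)  # provenance: where this came from
        entries.append(entry)
    return entries
```

The catalog team's job shrinks to defining the (tiny) descriptor convention and running the crawler - the metadata itself stays with the people who own it.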

It’s important to remember your catalog is just a means to an end.  It’s critical infrastructure, but it’s an Enabler - it’s not The Thing.

The reality is no-one really wants a map - they want the destination.  Likewise, no-one really wants a catalog - they want information.

Remember when building your catalog that you're in the business of facilitating destinations - not building beautiful maps.

Moving from Inventory to Hub

The critical shift around the role of data catalogs (and the teams that manage them), is from one of "building" to one of "facilitating".

Early catalog initiatives saw data stewards responsible for building and maintaining a catalog of data inventory -- classic surveyor territory.  As the role of the catalog has become more central and critical to the data landscape, this hasn’t scaled well.

Instead, catalogs & teams are now moving towards facilitating metadata capture - in all its varied formats.

It turns out that most data platforms already have metadata that describes their data. Modern catalogs leverage this to automate the catalog - generating listings and lineage by interrogating that metadata.  The interface becomes less about data entry and more about curation - shifting teams from surveyors to (meta)data hubs - just like in the Maps analogy.
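A minimal sketch of what that interrogation looks like, using SQLite’s own metadata tables as a stand-in for any platform’s schema catalog (the same idea applies to, say, `information_schema` in most warehouses):

```python
import sqlite3

# Sketch: build catalog listings by querying the platform's own
# metadata, rather than asking humans to type it in.  SQLite exposes
# its schema via sqlite_master and PRAGMA table_info.

def generate_listings(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Return a table -> column-names mapping read from the platform itself."""
    listings = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        listings[table] = [col[1] for col in cols]  # col[1] is the column name
    return listings
```

Because the listings are regenerated from the source system, they can never drift out of date the way manually entered inventory does.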

By replacing data entry with consumption of metadata, the task of building and updating the catalog has been decentralized - pushing responsibility back to the platform owners.  These are arguably the best people to provide the metadata anyway - platform owners are the subject-matter experts of their own systems.

This drive towards decentralized ownership and collaboration in the data space is a key facet of the emerging Data Mesh trend, which focuses on accessing and linking data products in-situ across the organization, rather than mandating a move into a data lake for consumption.

Open Source, Metadata and Catalogs

Another (sometimes overlooked) benefit of leveraging schema metadata to drive catalogs is eliminating vendor lock-in.

Previously, the enterprise catalog was typically locked away in a fairly high-cost proprietary platform.  As catalogs become more critical infrastructure within data organisations, having a central component in proprietary tooling can become unpalatable.

Furthermore, as the role of catalogs shifts from data entry to data discovery, the main sources of data become either schemas published by data platforms (almost all schema languages are open source) or metadata about those schemas.  As catalogs become tools for navigating this open source metadata, it makes less sense to capture catalog data in a proprietary platform.

There are a few key elements missing from the existing open source stack that are needed to describe all the metadata required to power a full data catalog.  (We’re not talking about the catalog itself here - the UI layer - but the data that powers it.)

Organisational metadata - the who’s who of data

We need a way to describe who system owners are, so that consumers know who to discuss consumption with.

Lineage

What are the upstream sources of data?  How was it derived?

Linking columns / fields to glossary terms

How do columns and attributes relate to terms within the business glossary? Looking beyond column names, to the semantic meaning of data within each system.  By answering this question, we can see how data across the organisation all link together.
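To make these three gaps concrete, here’s an illustrative record shape combining ownership, lineage, and glossary links. This is not Taxi syntax - just a hypothetical sketch in Python, with made-up dataset and term names:

```python
from dataclasses import dataclass, field

# Hypothetical shape for the three missing metadata elements:
# who owns it, where it came from, and what its fields mean.

@dataclass
class DatasetMetadata:
    name: str
    owner: str                     # organisational metadata: who to discuss consumption with
    upstream: list[str] = field(default_factory=list)  # lineage: upstream sources
    glossary_terms: dict[str, str] = field(default_factory=dict)  # column -> business term

# Example entry (all names invented for illustration):
orders = DatasetMetadata(
    name="orders",
    owner="payments-team",
    upstream=["raw_checkout_events"],
    glossary_terms={"cust_email": "CustomerEmailAddress"},
)
```

The glossary mapping is what lets two systems with differently named columns be recognised as holding the same business concept.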

Closing these gaps is the reason we built Taxi.

Taxi - A language for describing data

Taxi is an Open Source language designed to enhance existing schemas with additional metadata.

Taxi provides tooling for documenting your data glossary, describing lineage, and tagging columns and attributes with glossary terms.

Taxi also has tools to read metadata from, embed it in, and interoperate with most standard schema languages, along with open source editing and browsing tools.

We won’t go into detail about Taxi here, but if this is something of interest, reach out on our Slack channel - we’d love to chat.

What comes after catalogs?

As catalogs gradually embrace decentralized workflows, and open source tooling helps prevent vendor lock-in, what comes next?  Returning to our Google Maps metaphor, what can we learn from how the world of street maps has evolved, and what might that look like in the world of enterprise data?

Today, navigation technology is on the cusp of yet another evolution, with the impending emergence of self-driving cars.  Self-driving cars are changing the role of maps, with turn-by-turn directions taking a back seat (pun intended) to a more streamlined navigation experience - “Where do you want to go today?”.

Maps are still there, and remain a major component of the self-driving experience, but to the user they’ve been supplanted by a much simpler interface.  The natural evolution of Destination Facilitation is to make the map invisible.

How does that translate to the world of enterprise data?  How can we move from catalogs - as a set of directions on where to look to find data - to a hub of knowledge discovery - a place where people & applications find answers, rather than directions?

Data teams racing to build a Google-like experience for their data platform need to be careful that they’re building a tool for delivering information, not directions.  The key question a data platform needs to answer is: “What information would you like today?”

Delivering answers, rather than directions, is about integration & connectivity.  Data discovery is moving to platforms that leverage metadata to automate the navigation between data sources to deliver information on-demand.

The catalog isn’t going anywhere, but we see its role - and its primary users - evolving from end-users to software: a metadata hub that informs integration platforms where to look and how to connect data sources together, to discover and power data-led enterprises.

Expectations that data will always be centralized into a data lake have shifted - enterprise data will continue to be distributed across multiple platforms, and trends like Data Mesh are about making this distribution a first-class citizen.

Data platforms are about delivering information, not instructions.