While the concept of data mesh as a data architecture model has been around for a while, it was hard to define how to implement it easily and at scale. That is, until now. Two data catalogs went open-source this year, changing how companies manage their data pipelines.
Let’s examine how open-source data catalogs can simplify data mesh implementation and what your organization can do to prepare for this change.
Understanding data mesh
Data mesh is a decentralized architecture type that allows different departments to access data independently. It’s different from traditional data architecture, which usually has dedicated data engineering teams that provide access to information after other departments request it.
Both approaches have advantages, but data mesh stands out in scalability, flexibility, and access speed: more teams can work with the relevant data, faster.
Think about the difference between traditional broadcast television, where the programming is centralized, and streaming services, which work independently and provide access to different types of custom content.
The creator of the concept, Zhamak Dehghani, explains that it is a shift defined by the following principles:
- Domain-oriented data ownership. The departments closest to data should own it. For example, marketing teams should fully manage the entire marketing data pipeline.
- Data as a product. Internally used domain-oriented data should be manageable and usable, just as product data would be. It has to be discoverable, secure, and interoperable, among other things.
- Self-service data infrastructure. Although data ownership is domain-oriented, every team should be able to access the data quickly, allowing cross-functional collaboration.
- Federated computational governance. Data should be governed by considering legal, compliance, and security requirements. This means that it has to be interoperable, its governance must be transparent, and its policies should be automatically applied to every data product across the entire organization.
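To make the "data as a product" principle more concrete, a domain team might publish a small, self-describing metadata record alongside each dataset so that it is discoverable and interoperable. The sketch below is a minimal illustration in Python; all class and field names are hypothetical, not taken from any specific catalog:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Hypothetical metadata record a domain team publishes with a dataset."""
    name: str
    owner_domain: str      # domain-oriented ownership (e.g. "marketing")
    description: str       # makes the asset discoverable
    schema: dict           # column name -> type, for interoperability
    access_policy: str     # e.g. "internal", applied automatically
    tags: list = field(default_factory=list)

    def is_discoverable(self) -> bool:
        # A product counts as discoverable only if it carries
        # both an owner and a human-readable description.
        return bool(self.owner_domain and self.description)

campaigns = DataProduct(
    name="campaign_performance",
    owner_domain="marketing",
    description="Daily ad campaign spend and conversions",
    schema={"campaign_id": "string", "spend_usd": "float", "conversions": "int"},
    access_policy="internal",
    tags=["marketing", "daily"],
)
print(campaigns.is_discoverable())  # True
```

The point is not the specific fields but the contract: every dataset ships with enough metadata that other domains can find, understand, and safely use it.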
Data mesh principles are most valuable to large enterprises and quickly growing mid-sized businesses with hundreds or thousands of data users across multiple departments and business units that need domain-specific data management. In these cases, decentralized ownership improves data access and reduces the dependency on central data teams.
Since most large organizations already have the resources to purchase commercial data catalogs such as Collibra, open-source access might be most revolutionary for medium-sized organizations seeking to optimize their costs and prepare their infrastructure for future growth.
So, how do open-source data catalogs unlock these opportunities?
The emergence of open-source data catalogs
Data catalogs provide a bird’s-eye view of your data assets. However, this service used to get quite expensive as your infrastructure expanded. As a result, companies would stick to the usual structure without a comprehensive inventory. After all, data storage and management are costly by default, even without paying for nice-to-have services.
This summer, multiple providers decided to go open-source, including Polaris Catalog by Snowflake and Unity Catalog from Databricks, in addition to other open-source data catalogs, such as Apache Atlas or DataHub.
So, how do these tools help to enable data mesh?
Open-source data catalogs provide several features that are key for a data mesh, starting with a centralized metadata repository that enables the discovery of data assets across decentralized data domains.
These tools also help enforce governance policies, track data lineage, ensure data quality, and make data assets understandable through a single control layer spanning all assets, regardless of where they reside. They also enable role-based access control (RBAC) and attribute-based access control (ABAC).
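As an illustration of how a catalog layer might combine the two access-control models, here is a hedged Python sketch: a role-based check first, refined by an attribute-based check. The function, roles, and tags are invented for illustration and do not mirror any particular catalog's API:

```python
def can_access(user: dict, asset: dict) -> bool:
    """Illustrative access check: RBAC gate first, then ABAC refinement."""
    # Role-based: the user's role must be on the asset's allow list.
    if user["role"] not in asset["allowed_roles"]:
        return False
    # Attribute-based: PII-tagged assets additionally require PII clearance.
    if "pii" in asset["tags"] and not user.get("pii_cleared", False):
        return False
    return True

analyst = {"role": "analyst", "pii_cleared": False}
customer_table = {"allowed_roles": {"analyst", "engineer"}, "tags": {"pii"}}

# Role check passes, but the PII attribute check does not.
print(can_access(analyst, customer_table))  # False
```

In a real catalog these policies live in the control layer, so the same rule is enforced no matter which storage system actually holds the asset.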
Advantages of open-source data catalogs in data mesh
Open-source data catalogs unlock a new era of data mesh because these tools are now free and very flexible. There are many options to choose from, and you can integrate multiple platforms and tools without vendor lock-in.
In addition, open-source tools are supported by a community of millions of developers and engineers ready to share their experiences and assist in solving everyday problems. At the same time, open source enables the community to develop new plugins or extensions, which unlocks multiple opportunities and new use cases.
At its essence, data mesh also improves data observability – another important element every organization should consider. With granular access controls, data lineage, and domain-specific audit logs, data catalogs give engineers and developers a better view of their systems than before.
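Data lineage in particular can be pictured as a directed graph from source datasets to derived assets. The toy sketch below traces all transitive upstream dependencies of an asset; the graph structure and dataset names are invented for illustration:

```python
# Toy lineage graph: derived asset -> datasets it was built from.
LINEAGE = {
    "revenue_dashboard": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
}

def upstream(asset: str, graph: dict = LINEAGE) -> set:
    """Return every transitive upstream dependency of an asset."""
    seen: set = set()
    stack = list(graph.get(asset, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            # Follow the dependency's own upstreams, if any.
            stack.extend(graph.get(dep, []))
    return seen

print(upstream("revenue_dashboard"))
# e.g. {'orders_clean', 'fx_rates', 'orders_raw'}
```

With lineage recorded this way, an engineer can answer "what breaks if this table changes?" before it breaks.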
However, while the tools have multiple advantages, data catalogs might not be right for smaller organizations. Setting up this architecture and integrating multiple domains requires a lot of technical knowledge. At the same time, using data catalogs for data mesh might not be a good option for companies with on-premises setups or those working in multi-cloud environments.
As always, it’s important to research your options and ensure that the platforms you use are compatible with your organization’s current situation and future goals.
How to implement data mesh with open-source data catalogs
To build data mesh architecture in your organization, you should start by engaging with the community and exploring how other organizations have used these tools.
While every journey will be different, generally, you will have to achieve the following milestones:
- Assess the needs. First, evaluate the current data infrastructure and define your organization’s key domains. Explore how each domain would be structured and whether your organization is big enough to need the restructuring.
- Try out different data catalogs. Compare the features and capabilities of different open-source options to ensure you (and your team) pick the right ones for testing. Consult the community while installing, configuring, and customizing the tools.
- Onboard the domains. Once you have selected a tool, it’s time to establish domain teams and assign data ownership.
- Define and implement governance policies. Together with domain owners, legal, compliance, and other responsible teams, define the data governance standards and set up the policies.
- Integrate with existing data infrastructure. Once the domains are defined and onboarded and the data governance rules are clear, you must connect the catalog to data sources, pipelines, and business intelligence tools.
- Train the teams. Providing training for domain teams and data consumers ensures each team has enough knowledge to own their domain fully.
- Maintain the data mesh infrastructure. Once everything is set up, the final step is to review the policies regularly and update the metadata and governance practices.
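The governance milestones above often boil down to automated checks that run over every registered asset, so that policies are applied uniformly rather than enforced by hand. Here is a minimal sketch of such a policy sweep; the policy names and asset fields are invented for illustration:

```python
# Hypothetical governance sweep: every registered asset must pass every policy.
POLICIES = {
    "has_owner": lambda a: bool(a.get("owner")),
    "has_retention": lambda a: "retention_days" in a,
    "pii_is_restricted": lambda a: "pii" not in a.get("tags", [])
                                   or a.get("access") == "restricted",
}

def audit(assets: list) -> dict:
    """Return a mapping of asset name -> list of failed policy names."""
    failures = {}
    for asset in assets:
        failed = [name for name, check in POLICIES.items() if not check(asset)]
        if failed:
            failures[asset["name"]] = failed
    return failures

catalog = [
    {"name": "orders", "owner": "sales", "retention_days": 365, "tags": []},
    {"name": "customers", "owner": "crm", "tags": ["pii"], "access": "open"},
]
print(audit(catalog))
# {'customers': ['has_retention', 'pii_is_restricted']}
```

Running a sweep like this on a schedule turns the "maintain the data mesh infrastructure" step into a routine report instead of a manual review.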
Once you have the initial direction, you should start small – pick just one domain – and scale up once all the key stakeholders are on board.
What’s next for data mesh?
Data mesh architecture allows organizations to be faster, work smarter, and work as flexibly as necessary. With interoperability, multiple integration options, and customizable plugins, every domain can build its preferred data infrastructure more efficiently without compromising the speed of access.
While this is a change that most organizations should explore in the near future, data mesh integration cannot be done without prioritizing data security, privacy, and compliance. At the same time, it means that teams that are typically not as tech-savvy will also need to build the necessary skill sets to succeed.
With the emergence of machine learning and advanced data analytics, accessing data faster is crucial for business. This is an exciting time of transformation for data infrastructure – and open-source data catalogs will allow it to happen.
This article was originally published on Open Data Science.