Almost every sufficiently large enterprise has a problem with its data. Applications are created in silos, and in those silos they collect gigabytes of data. Over the years, departments have built high, thick walls around their data. As a result, companies are now spending millions of dollars to counteract those problems. The usual solution is to recruit a suicide squad of data engineers and ask them not only to create some common data platform but also to retrieve data from every possible application and/or database in the company. Additionally, they have to create a Super-Generic-Uber-Data-Model which will accommodate any kind of data. We all know how this story ends…
I think there is a good solution. I call it Event-Driven Data Mesh. This data architecture is not a silver bullet, but it is a good fit for more complex scenarios. The solution is based on a combination of a few already discovered patterns, concepts, and methodologies. Two of them were most influential: turning the database inside-out (described by Martin Kleppmann) and Data Mesh (described by Zhamak Dehghani). But before we dive into the solution, let’s first see exactly what kinds of problems with data we have.
Problems with data
Data is not FAIR
The FAIR acronym stands for findability, accessibility, interoperability, and reusability. It was established in 2016 by a consortium of scientists and organizations in the “FAIR Guiding Principles for scientific data management and stewardship”. The principles were meant for the scientific environment, but if we take into consideration that most big organizations nowadays are (or should be) data-centric, we can easily adopt them in every branch of the economy.
findability – data is discoverable by its metadata
accessibility – even if data is protected, everyone should be able to search through its metadata and then ask for access; if people don’t know the data exists, they also won’t know whether they need it
interoperability – the format of the data should be an open standard that everyone understands
reusability – data should be modeled in a way that lets it be easily integrated with other data sources
Data engineers don’t understand the domain of data they govern
In most cases, there is one team of data engineers responsible for retrieving and transforming data coming from multiple sub-domains of the business. This team thus has an extremely hard time understanding every detail of every sub-domain and, as a consequence, creating a proper model.
We are losing the time dimension
Most systems track just the current state and lose the knowledge of how that state changed over time. Even if we don’t need that information now, who knows whether in the future, with some fancy machine learning, we will be able to extract profitable patterns and insights out of it.
We need to understand the use cases/needs upfront
With the Data Warehouse kind of approach, we need to understand the use cases upfront, because based on those use cases the team will prepare the data model, and as we all know, one data model will never be a good fit for all use cases, especially those not yet known.
We now understand some of the problems we are facing in our organizations, and that a possible solution to those problems could be Event-Driven Data Mesh. But what exactly does it mean?
To understand Event-Driven Data Mesh, we first need to understand the core concepts it is built around: turning the database inside-out and Data Mesh.
Turning the database inside-out
Turning the database inside-out is about exposing a stream of events directly from the applications’ databases.
In his talk, Martin Kleppmann focused on four core capabilities of relational databases: replication, caching, secondary indexes, and materialized views. All of these capabilities create some kind of derived data; let’s focus on two of them:
Replication is an operation in which we copy the leader database into a follower.
The idea is simple, but the way it is accomplished is interesting. The leader database produces a stream of events, each capturing a change that happened in the leader; on the follower side, those changes are applied to the database. This mechanism is very reliable and, on top of that, it happens in real time.
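To make this concrete, one change event in such a replication stream could look roughly like this (a simplified, Debezium-style sketch; the table and field names are made up):

```json
{
  "op": "u",
  "source": { "table": "customers", "ts_ms": 1700000000000 },
  "before": { "id": 42, "email": "old@example.com" },
  "after":  { "id": 42, "email": "new@example.com" }
}
```

Any follower that applies these before/after changes in order ends up with the same state as the leader.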
A materialized view is a projection of the data stored in the database. Whenever we need to expose data coming from some complex SQL statement consisting of many time-consuming joins, we can use a materialized view, which runs that SQL statement upfront (refreshing it from time to time) and saves the results in new tables. This approach has many advantages, but unfortunately it also has a disadvantage: the data is always stale.
What if we could have a self-updating materialized view?
Let’s leave this thrilling question for a while and sail to the open sea.
Aye aye captain!
On every ship, there is a captain’s logbook. During every watch, an officer is responsible for recording all important information there: speed, course, temperature, depth. In the case of an accident, it is the captain’s duty to take the logbook, because it will be the evidence during the trial.
Every row in the logbook is very similar to a change in a database emitted during replication. Now imagine that after adding every row, the officer grabbed the radiotelephone and broadcast all the information from that row.
Based on that information, miles from his position, the staff in his home harbor master’s office could prepare temperature and air-pressure charts along with a localized weather forecast. In addition, they could show his route on a map, and much more.
Going back to the question: what if we could have a self-updating materialized view?
Now imagine we could make the same request of the database as of our officer: please grab your radiotelephone and tell everybody out there what just changed in your tables! It would be awesome, because on the receiving end we could create any kind of projection, materialized view, or derived data, similar to creating the weather forecasts and charts. Moreover, we could update those projections on every change published by the database, in real time. So on the consumer side we have derived data systems, and on the producer side we have systems that could be called Systems of Record.
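As an illustration, here is a minimal sketch of such a self-updating view built with Kafka Streams (Kafka reappears later as the Event Store). The topic, keys, and store name are all hypothetical; assume one event per logbook row, keyed by ship id, with the reported position as the value:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

public class LatestPositionView {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "latest-position-view");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // One event per logbook row arrives on the "ship-reports" topic.
        KTable<String, String> latestPosition = builder
                .<String, String>stream("ship-reports")
                .groupByKey()
                // Keep only the most recent report per ship: the view
                // updates itself on every incoming event.
                .reduce((previous, current) -> current,
                        Materialized.as("latest-position-store"));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The latest-position-store state store is the materialized view: it is queryable at any time and never goes stale, because every new event updates it immediately.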
Data Mesh
In her article, Zhamak Dehghani tackled two very difficult questions:
- Who is responsible for data?
- Should we treat data as a product?
Let’s try to answer those.
Who is responsible for data?
Picture a post-apocalyptic world full of bloodthirsty zombies. In the middle of that chaos there is an island of peace: a small town. Its people were lucky, because among the survivors there were geography, physics, chemistry, and English teachers, so they were even able to run a small school. After a few years, the geography, physics, and chemistry teachers disappeared; they were probably bitten and turned into zombies. The only things left were the notes taken by the children. So the community gave the English teacher a task: recreate the curriculum for geography, physics, and chemistry based on the children’s notes.
The teams of data engineers who are responsible for running the Data Warehouses and Data Lakes in your company probably feel exactly the same as that English teacher. They are trying to retrieve and transform data they don’t understand, from many domains in which they have no expertise. This is one of the problems solved by Zhamak Dehghani’s Data Mesh concept.
The idea is to head in a similar direction to the DevOps culture/movement. We should split this data monolith down the middle and move its responsibilities and roles into the teams that own the source data (Systems of Record) and the teams that own the consumer projections (Derived Data Systems).
This way we would end up with truly cross-functional teams; we could call it DevDataOps.
Data as a Product
Companies like Facebook, Apple, and Google are all about data. They collect every possible piece of information about us, and they sell that information to every company willing to pay for it. They treat their data as a product. We should do the same in our operational applications: besides exposing functionality as an API, we should also expose our product data in an easily consumable way. One of the best ways to do that is through a stream of events. But that alone is not enough; to be treated as a product, data has to meet certain standards: findability, accessibility, interoperability, and reusability (FAIR data).
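As a minimal sketch of what “exposing product data as a stream of events” can mean in practice, here is an application publishing a business fact to Kafka alongside serving its regular API. The topic name, key, and JSON payload are all hypothetical:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventsPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Alongside the REST API, every business fact is also published
            // as an event on a well-known topic for anyone in the company.
            producer.send(new ProducerRecord<>(
                    "sales.order-placed",
                    "order-42",
                    "{\"orderId\":\"order-42\",\"currency\":\"EUR\",\"total\":\"99.90\"}"));
        }
    }
}
```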
Event-Driven Data Mesh
When I combined those concepts (and a few others, e.g. Event Sourcing, CQRS, Event-Driven Architecture, and Domain-Driven Design), I ended up with the architecture described below.
In the center of our system, there is an Event Store. The Event Store is responsible for storing all events from the beginning. In this example, I used Kafka because it is a very good fit for this particular use case.
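One practical consequence of “storing all events from the beginning” is that the event-store topics must never expire data. Here is a minimal sketch of provisioning such a topic with Kafka’s AdminClient (broker address, topic name, and sizing are assumptions; a replication factor of 3 requires at least three brokers):

```java
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateEventStoreTopic {
    public static void main(String[] args) throws Exception {
        try (AdminClient admin = AdminClient.create(
                Map.of("bootstrap.servers", "localhost:9092"))) {
            // retention.ms = -1 disables time-based deletion, so the topic
            // really does keep all events from the beginning.
            NewTopic orders = new NewTopic("sales.order-placed", 6, (short) 3)
                    .configs(Map.of("retention.ms", "-1"));
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```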
Systems of Record
On the left side, we have the Systems of Record. These could be new systems implemented in an Event-Driven Architecture way, but that is not necessary; a System of Record could also be a very old legacy application whose database we turn inside-out, retrieving the stream of events directly from it using tools like Kafka Connect.
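As a hedged illustration of how little code this requires, here is roughly what such a Kafka Connect source connector configuration could look like, using Debezium for PostgreSQL (property names follow recent Debezium releases; hostnames, credentials, and table names are placeholders):

```json
{
  "name": "legacy-orders-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "legacy-db.internal",
    "database.port": "5432",
    "database.user": "cdc_reader",
    "database.password": "change-me",
    "database.dbname": "orders",
    "topic.prefix": "legacy.orders",
    "table.include.list": "public.orders,public.order_lines"
  }
}
```

Once posted to the Connect REST API, this turns every insert, update, and delete in those tables into events on Kafka topics, with no changes to the legacy application itself.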
Derived Data Systems
On the right side, there are the Derived Data Systems. A Derived Data System could be a data mart, a small data warehouse, an ElasticSearch instance, a self-service BI tool, or some application responsible for reporting in our organization. There can be many of them, and because everything is in the Event Store, we can create a new Derived Data System whenever the need appears.
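Feeding such a derived system often requires no code either. As a sketch (connector class from the Confluent Elasticsearch sink; topic name and URL are placeholders), an ElasticSearch projection could be declared like this:

```json
{
  "name": "orders-to-elasticsearch",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "legacy.orders.public.orders",
    "connection.url": "http://elasticsearch.internal:9200",
    "key.ignore": "false",
    "schema.ignore": "true"
  }
}
```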
What do we gain with this solution?
- Data is populated in projections in real-time.
- We are not losing the time dimension.
- We have highly focused teams concentrated only on their sub-domain.
- It can easily be applied in an organization’s ecosystem that contains both new applications written in a modern way and old legacy applications.
- We can create new projections whenever new requirements or needs appear.
Stream of Events as a Data Product
In this solution, the streams of events and the projections are the data products. But as we said before, those products have to meet certain standards.
By creating Event & Data Catalog which easily accessible within our organization we are gaining findability.
Every event and projection is sufficiently described in Event Catalog in a way that data scientists/analytics in the organization can use the data for there purposes.
Everyone can access metadata in Event Catalog and when founded, if he only has permission to access, he can easily access data ( in case of Kafka from topics).
By using some common industry-standard like Avro or JSON schema for events, we can assure that everyone will be able to use it. Another thing is to make the data semantically understandable and we can achieve that by using common dictionaries.
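Putting those standards together, here is a sketch of what such an event schema might look like in Avro; the doc fields are what feeds the catalog, and the ISO 4217 currency code is an example of a common dictionary (all names are illustrative):

```json
{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.example.sales.events",
  "doc": "Emitted whenever a customer places an order.",
  "fields": [
    { "name": "orderId", "type": "string", "doc": "Business key of the order" },
    { "name": "currency", "type": "string",
      "doc": "ISO 4217 currency code, e.g. EUR (a common dictionary)" },
    { "name": "placedAt",
      "type": { "type": "long", "logicalType": "timestamp-millis" },
      "doc": "Event time" }
  ]
}
```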
I purposely skipped many important details so as not to make this article too long. I will try to make a series out of it and resolve, one by one, all the challenges we can meet along the way.
To sum up, Event-Driven Data Mesh is a data architecture for data-centric organizations that are not scared to try out modern approaches to software and data. It is not the easiest approach to understand and implement properly, but in many cases I think it can be very beneficial.
In the next post, I will show how this data architecture can be applied to a specific example. Stay tuned!