Why I Hate Microservices Part 2: The Who's Telling the Truth Problem 🤷
When designing a system, you should not limit yourself to a single type of database. Some business needs call for something more flexible, such as a NoSQL/document database. One of these is Elasticsearch, a topic I covered in a previous post that you should definitely check out. However, once your system uses more than one database, keeping all of them in sync becomes crucial.
How It Started 🚩
As I said in my previous post, the other kind of problem we faced while working on the microservices project was data inconsistency. To explain it, I have to tell you why the system needed an Elasticsearch index. In microservices projects you usually have more than one database, one for each service/domain. The client side needed an aggregated document built from the data in all of these databases. We called this document the Aggregate Root (a term borrowed from Domain-Driven Design): basically an entity that represents consolidated data from different domains/entities.
To do so, after the data was saved successfully in all the different databases, the different sets of data were aggregated into one document. That document was then published as a message to a Kafka topic linked to a Confluent Kafka Connector, which consumed the messages from the topic and created a new document (the Aggregate Root) in an Elasticsearch index.
In theory this is good so far, and it is needed, because retrieving the data is so much easier now: you look up a single document that includes all the required data instead of retrieving data from 4 or 5 databases and aggregating it before finally sending it to the client side.
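To make the write path above a bit more concrete, here is a minimal sketch in Python. It is not the project's actual code: the service names, the `build_aggregate` and `publish` helpers, and the plain list standing in for the Kafka topic are all assumptions made for illustration.

```python
import json
from typing import Any


def build_aggregate(record_id: str, parts: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Merge the per-service records into one document, keyed by service name."""
    document: dict[str, Any] = {"id": record_id}
    for service_name, data in parts.items():
        document[service_name] = dict(data)  # copy so later edits don't leak in
    return document


def publish(topic: list[bytes], document: dict[str, Any]) -> None:
    """Hypothetical stand-in for a Kafka producer: append the serialized message."""
    topic.append(json.dumps(document, sort_keys=True).encode("utf-8"))


# The list plays the role of the Kafka topic that the connector consumes from.
topic: list[bytes] = []
doc = build_aggregate(
    "42",
    {"customer": {"name": "Acme"}, "billing": {"plan": "pro"}},
)
publish(topic, doc)
```

The point of the sketch is only the shape of the flow: every service saves its own part first, and the aggregated document is a derived copy that travels through the topic to the index.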
Two Databases...Two Exhausting ✌
Here's where it gets ugly. Imagine that you had to edit some of this data. You're not only going to edit one field in one record in a single database; you also need to fetch that document from the Elasticsearch index, edit that field, and then save it again in the index. The same goes for delete, because you have to maintain both the Postgres database AND the Elasticsearch index and make sure they stay identical to prevent data inconsistency. That was done by publishing messages to a Kafka topic that the impacted service/domain was subscribed to, and the service then edited its own fields in the Elasticsearch document accordingly. Another way to handle this is to use a JDBC Kafka Connector.
The problem was not in delete or edit, as these were actually handled. The huge problem was that if Confluent Kafka Connect somehow failed to store the document in the Elasticsearch index (usually due to a schema conflict, as I explained in part 1), then the client side would not see the record.
And if the user then tried to create a record with the same name, an error would occur saying that a record with the same name already exists in the database.
That's because the record was indeed saved in the Postgres database but was never stored in the Elasticsearch index.
So, as a user, you have actually created the record with the name you wanted, but you cannot view it, and you cannot even delete that corrupted record so that you can create a new one with the needed name.
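The stuck-record scenario is easier to see in a tiny simulation. This is only a sketch under my own assumptions: two in-memory sets stand in for Postgres and the Elasticsearch index, and `create_record`, `list_records`, and the `index_is_healthy` flag are hypothetical names, not anything from the real system.

```python
class DuplicateName(Exception):
    """Raised when the name already exists in the (simulated) Postgres table."""


postgres_names: set[str] = set()  # stand-in for the Postgres table
index_names: set[str] = set()     # stand-in for the Elasticsearch index


def create_record(name: str, index_is_healthy: bool) -> None:
    if name in postgres_names:
        raise DuplicateName(name)
    postgres_names.add(name)  # the write to Postgres succeeds
    if index_is_healthy:
        index_names.add(name)
    # Otherwise the indexing failure is silently lost: no retry, no rollback.


def list_records() -> list[str]:
    """The client side only ever reads from the index."""
    return sorted(index_names)


create_record("report-a", index_is_healthy=False)
visible = list_records()

try:
    create_record("report-a", index_is_healthy=True)
    second_attempt = "created"
except DuplicateName:
    second_attempt = "rejected"
```

After the first call the record exists in Postgres only, so the client sees nothing, and the second attempt is rejected even though the index is healthy again.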
And why did that happen?
- There was no rollback mechanism that detected the Kafka message that failed to process and removed its related data from the Postgres database, so that the user could recreate their record.
- There was no retry mechanism to make sure that a failed Kafka message was processed again 2 or 3 times before rolling back.
- And there was no scheduled job, for example, that detected inconsistencies between the Postgres database and the Elasticsearch index and tried to fix them, or at least allowed the record to be recreated.
And what could have been done about it?
- Create an orchestrated/choreographed Saga that defines the stages this record should go through if anything bad happens, like rollback or retry, and have a service bus like Rebus implement this Saga.
- Create a dead letter queue that stores failed messages, and have a service listen to this queue and try to process each message again 2 or 3 times. If it keeps failing, the service should roll back the saved records in the Postgres databases.
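The dead letter queue idea from the list above can be sketched roughly like this. It is a simplified illustration, not a Rebus or Kafka implementation: `handle_dead_letter`, the callbacks, and the in-memory dictionaries are hypothetical, and a real consumer would also log each failure and wait between attempts.

```python
from typing import Any, Callable

Message = dict[str, Any]


def handle_dead_letter(
    message: Message,
    index_document: Callable[[Message], None],
    rollback: Callable[[Message], None],
    max_attempts: int = 3,
) -> str:
    """Retry indexing a failed message; compensate in Postgres if it never works."""
    for _ in range(max_attempts):
        try:
            index_document(message)
            return "indexed"
        except Exception:
            continue  # real code would log the error and back off here
    rollback(message)  # compensating action: undo the Postgres write
    return "rolled_back"


# In-memory stand-ins for the two stores (assumptions for the demo).
postgres: dict[str, Message] = {"42": {"id": "42", "name": "report-a"}}
index: dict[str, Message] = {}


def always_fails(message: Message) -> None:
    raise RuntimeError("schema conflict")


attempts = {"count": 0}


def fails_twice(message: Message) -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    index[message["id"]] = message


def delete_from_postgres(message: Message) -> None:
    postgres.pop(message["id"], None)


outcome = handle_dead_letter(postgres["42"], always_fails, delete_from_postgres)
recovered = handle_dead_letter({"id": "7", "name": "report-b"}, fails_twice, delete_from_postgres)
```

Either way the user ends up in a state they can act on: the record is visible, or it is gone and the name is free again.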
This sort of data inconsistency leaves you with more than one source of truth: for the client side, the record doesn't exist, but for the service that saved it, the record does exist in the database.
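A scheduled reconciliation job is one way to get back to a single source of truth. The sketch below assumes Postgres is the authority and compares record ids only; `reconcile`, `reindex`, and the dictionaries are hypothetical stand-ins for real queries against both stores.

```python
from typing import Any, Callable

Row = dict[str, Any]


def reconcile(
    postgres_rows: dict[str, Row],
    index_docs: dict[str, Row],
    reindex: Callable[[str, Row], None],
) -> dict[str, list[str]]:
    """Re-index rows missing from the index and report what could not be fixed."""
    missing = sorted(set(postgres_rows) - set(index_docs))
    orphaned = sorted(set(index_docs) - set(postgres_rows))
    repaired: list[str] = []
    still_broken: list[str] = []
    for record_id in missing:
        try:
            reindex(record_id, postgres_rows[record_id])
            repaired.append(record_id)
        except Exception:
            still_broken.append(record_id)  # candidate for rollback or manual review
    return {"repaired": repaired, "still_broken": still_broken, "orphaned": orphaned}


postgres_rows = {"1": {"name": "report-a"}, "2": {"name": "report-b"}}
index_docs = {"1": {"name": "report-a"}, "9": {"name": "ghost"}}


def reindex(record_id: str, row: Row) -> None:
    index_docs[record_id] = dict(row)


report = reconcile(postgres_rows, index_docs, reindex)
```

Here record "2" is repaired because it was saved but never indexed, and document "9" is reported as orphaned because nothing in Postgres backs it.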
Circular Dependency 🔄
There was another problem, more of an integration problem, between this system and one of its sub-systems. Some changes to records in the main system were propagated to the sub-system through a Kafka topic that one side published to and the other subscribed to.
But one day, the client wanted the sub-system to also be able to propagate changes back to the main system, which meant a clear circular dependency problem.
Having two databases impact each other is not good design, because you can never track what changed what. Circular dependency in data sources is never a good idea.
Always remember to propagate changes in a single direction to avoid circular impacts.
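One way to keep yourself honest about this rule is to write down which data source propagates changes to which, and check that map for cycles. The sketch below is just a depth-first search over a hypothetical `propagates_to` map; the system names are made up for the example.

```python
from typing import Optional


def find_cycle(propagates_to: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one propagation cycle as a list of names, or None if there is none."""
    path: list[str] = []   # nodes on the current depth-first path
    done: set[str] = set()  # nodes that were fully explored without a cycle

    def visit(node: str) -> Optional[list[str]]:
        if node in done:
            return None
        if node in path:
            # We came back to a node on the current path: that is a cycle.
            return path[path.index(node):] + [node]
        path.append(node)
        for target in propagates_to.get(node, []):
            cycle = visit(target)
            if cycle is not None:
                return cycle
        path.pop()
        done.add(node)
        return None

    for source in propagates_to:
        cycle = visit(source)
        if cycle is not None:
            return cycle
    return None


one_way = {"main": ["sub"], "sub": []}
two_way = {"main": ["sub"], "sub": ["main"]}

one_way_cycle = find_cycle(one_way)
two_way_cycle = find_cycle(two_way)
```

With the one-way map there is nothing to report, while the two-way map comes back as `["main", "sub", "main"]`, which is exactly the situation the client request created.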
Problems in your data sources usually come from initial design decisions. When designing a system, you should have strict rules for how different data sources interact with each other.
And defining rollback/retry mechanisms is not optional. A failsafe procedure that lets you recover from unplanned exceptions or system instability is not a luxury, or something that you add as an enhancement after your system is up and running.
Your system will never be in a deliverable state as long as these aspects are missing from your design. And the way your data sources impact each other is a priority.
You have to know for sure: is your system's consistency eventual or transactional? It can never be both. If you apply both, you're going to face the problems that I'm going to tell you about in the next article.