🔧Error Handling in Asynchronous Systems.

Reeshabh Choudhary
5 min read · Dec 11, 2023

👷‍♂️ Software Architecture Series — Part 11.

🤔Imagine a user logging in to a social media (SM) platform and commenting on a post. Posting a comment can take some time, because the process usually comprises several critical steps such as scanning the content for abuse and checking for misspelled words. If we used REST APIs and waited for the server to confirm that the comment had been posted correctly (persisted), we would have to wait for a considerable time. These parsing operations can take thousands of milliseconds to complete, and the network latency of sending a request and receiving a response comes on top of that. However, when we post something on an SM platform, we don’t experience this delay as end users. We simply go back to scrolling the feed after posting a comment, and if there is an issue with our comment, we get a notification about it.

👉Asynchronous communication is a powerful technique to improve the overall responsiveness of the application. The example we just discussed relies completely on this mode of communication, where we don’t wait for the response of a particular action but get on with other tasks. Here, we are trusting the system (in this case SM platform) to take care of the process once we have initiated an action (posting a comment).
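To make the idea concrete, here is a minimal sketch in Python using `asyncio`. It is not how any real SM platform is built; the function names (`post_comment`, `moderate_and_persist`), the "spam" rule, and the in-memory `notifications` list are all assumptions made up for illustration. The point is only that the caller gets an acknowledgement right away while the slow work continues in the background.

```python
import asyncio

async def moderate_and_persist(user, comment, notifications):
    """Slow background work (abuse check, spell check, persistence), all simulated."""
    await asyncio.sleep(0.05)  # placeholder for the slow parsing steps
    verdict = "rejected" if "spam" in comment.lower() else "posted"
    # Hypothetical notification channel: just a list in this sketch.
    notifications.append(f"{user}: your comment was {verdict}")

async def post_comment(user, comment, notifications, background):
    """Accept the comment and return immediately; processing continues later."""
    task = asyncio.create_task(moderate_and_persist(user, comment, notifications))
    background.add(task)                        # keep a reference to the running task
    task.add_done_callback(background.discard)  # drop the reference once it finishes
    return "accepted"

async def demo():
    notifications, background = [], set()
    ack = await post_comment("alice", "Nice post!", notifications, background)
    seen_at_ack = list(notifications)   # nothing has been processed yet
    await asyncio.gather(*background)   # demo only: wait so we can inspect the result
    return ack, seen_at_ack, notifications

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the user is told "accepted" before moderation has even started, and only learns the final outcome through a later notification, which mirrors the experience described above.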

📲When we design an application, we must evaluate the architectural characteristics it requires and then set our priorities. In the case of an SM platform, responsiveness is the more pertinent factor, since the user does not need any information back once the action has been initiated. If you dig into the architecture of SM giants like Twitter, FB, etc., you will find that feeds are also prepared well in advance; it is not the usual flow that you log in, request a feed, and the feed is then built and delivered to you. The priority of such applications is to provide a smooth experience so that end users stay engaged for longer on these platforms, and to ensure this experience, responsiveness of the application is the top priority.

💡“Responsiveness is all about notifying the user that the action has been accepted and will be processed momentarily, whereas performance is about making the end-to-end process faster.” - Mark Richards and Neal Ford, in “Fundamentals of Software Architecture”.

⏱However, a major issue with responsive applications that rely on asynchronous communication is error handling, which makes the actual implementation of such applications a complex task. Even in the simple process of posting a comment, which we just discussed, there are numerous scenarios where things can go haywire. Let us look at some possibilities:

1. Slow or congested network connections might cause delays or timeouts during the posting process, impacting the user experience and potentially leading to incomplete posts or retries.

2. If the internet connection is lost, the comment posting process will fail.

3. If the platform’s servers are overloaded, undergoing maintenance, or facing technical issues, users might encounter errors when attempting to post comments.

4. Unexpected server errors or database issues might occur during the comment submission process, leading to failed postings or data corruption.

5. Users might provide invalid, abusive, or malformed comments, triggering validation errors that prevent successful submission.

6. The user’s session might have expired, or the user might not be authorized to post the comment at all.

7. Multiple users posting comments simultaneously on the same content might result in conflicts, causing race conditions or unexpected behavior in updating the comment threads.

🖇These are just some common scenarios. In an actual system, there can be hundreds of such possibilities. A good architect must evaluate the situations in which a system can fail and design capabilities so that it is well prepared for adverse scenarios. One way to handle errors gracefully in an asynchronous workflow is the workflow event pattern.

⚙Workflow Event Pattern:

The workflow event pattern leverages a workflow processor, which aids in providing resiliency and responsiveness to the application. Let us look at the conceptual setup for this pattern:

[Figure: The Workflow Event]

➡Events containing data are asynchronously passed from an event producer to an event consumer through a message channel. This enables continuous processing of events without blocking the system.

➡When an event consumer encounters an error while processing an event, instead of halting to resolve the issue, it immediately delegates the error to the workflow processor by publishing it to a separate event channel dedicated to the workflow processor.

➡The event consumer swiftly moves on to process the next message in the event queue, ensuring that overall system responsiveness isn’t affected. By not dwelling on the error, it prevents delay in processing subsequent messages.

➡Upon receiving the error notification, the workflow processor takes charge of diagnosing and potentially repairing the problematic message data without human intervention.

➡This diagnosis might involve deterministic checks or even leverage machine learning algorithms to detect anomalies or patterns within the data that might indicate errors.

➡Programmatically, the workflow processor attempts to repair the erroneous data by making changes or corrections. These changes aim to rectify the issue within the original message.

➡The repaired message is then placed back into the event queue as a new item, ready for reprocessing.

➡The event consumer, treating the repaired message as a new event in the queue, attempts to process it again. This time, it is hoped that the repairs conducted by the workflow processor have addressed the issue, enabling successful processing.

➡If, after a certain number of tries, the workflow processor is still unable to repair the error, it escalates the issue by notifying administrators or designated personnel responsible for handling such errors.

➡Escalation could trigger alerts or notifications through monitoring systems or logging mechanisms, drawing attention to the persistent error that couldn’t be resolved automatically. In situations where automated fixes consistently fail, the system might require manual intervention by operators or support personnel.

➡Continuous failures to fix errors might prompt a reassessment of the workflow design or the repair mechanisms themselves. This could lead to rethinking the workflow logic, refining error handling strategies, or even redesigning the repair process entirely.
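The steps above can be sketched in a few lines of Python. This is a simplified, single-threaded simulation under stated assumptions, not a production design: the two `deque` objects stand in for real message channels, the "malformed text" rule and the whitespace-stripping repair are hypothetical, and `MAX_REPAIR_ATTEMPTS` is an assumed value since the pattern only calls for "a certain number of tries".

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

MAX_REPAIR_ATTEMPTS = 2  # assumed limit for this sketch

@dataclass
class Event:
    comment_id: int
    text: str
    repair_attempts: int = 0

def consume(event: Event) -> str:
    """Event consumer: raises when the payload is malformed (hypothetical rule)."""
    if not event.text or event.text != event.text.strip():
        raise ValueError("malformed comment text")
    return f"comment {event.comment_id} persisted"

def try_repair(event: Event) -> Optional[Event]:
    """Workflow processor: deterministic repair, or None once it gives up."""
    if event.repair_attempts >= MAX_REPAIR_ATTEMPTS:
        return None
    # Hypothetical repair: strip stray whitespace and count the attempt.
    return Event(event.comment_id, event.text.strip(), event.repair_attempts + 1)

def run(events):
    event_queue = deque(events)  # stand-in for the main event channel
    error_queue = deque()        # stand-in for the workflow processor's channel
    processed, escalated = [], []
    while event_queue or error_queue:
        while event_queue:
            event = event_queue.popleft()
            try:
                processed.append(consume(event))
            except ValueError:
                error_queue.append(event)  # delegate the error and move on
        while error_queue:
            failed = error_queue.popleft()
            repaired = try_repair(failed)
            if repaired is None:
                escalated.append(failed.comment_id)  # hand over to a human
            else:
                event_queue.append(repaired)  # resubmit as a new event
    return processed, escalated

if __name__ == "__main__":
    print(run([Event(1, "hello"), Event(2, "  padded  "), Event(3, "")]))
```

Here the consumer never stops to fix anything itself. The padded comment is repaired and succeeds on its second pass, while the empty comment cannot be repaired and ends up in the escalation list after the allowed attempts are used up.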

📌The workflow event pattern strives for efficient handling of errors within an event-driven system by separating error resolution from the immediate processing flow. It ensures the continuous processing of events while attempting automated error repair, ultimately aiming to reduce the impact of errors on overall system responsiveness.

#softwarearchitect #responsiveness #architecture #softwaredevelopment #design #resiliency #performance

Written by Reeshabh Choudhary

Software Architect and Developer | Author : Objects, Data & AI.