Multibank Group is a one the largest financial derivatives companies globally with 20+ offices and a paid-up capital of $322 million. The group with various member companies is active in areas from asset management to brokerage with a leading trading platform.
One of the groups entities was planning to migrate a mission critical application to AWS handling real time financial transactions. The migration was conditional on achieving a highly redundant architecture with cluster failover of less than 5 seconds. The application can’t handle an active-active architecture hence an active-passive architecture.
                  Active/Passive clustering has always been a very well-known
                  strategy cloud experts have used to provide a fully redundant
                  architecture where a secondary node is brought online when a
                  primary node fails to respond. Nevertheless, setting up an
                  Active/Passive cluster really depends on the client’s needs
                  based on a specific scenario that meets their availability
                  requirements.
                  
                  Although Amazon Route53 can be configured for an
                  active-passive failover through health checks and failover
                  routing policies, the minimum failover time for Route53 is
                  around 70 seconds computed as follows:
                  
                  Failover = TTL + (Interval * Threshold)
                  
                  If you have a TTL of 60 seconds, checks at 10 second
                  intervals, and a threshold of 1, the failover process will
                  take 70 seconds:
                  
                  Failover = 60 + (10 * 1)
                  
                  Application load balancer (ALB) target groups health check
                  wasn’t a valid solution for this scenario since the minimum
                  period between each check is 5 seconds and at least 2
                  successive failures are needed to mark an instance as
                  unhealthy, which lead us to 10 seconds at best.
                  
                  Zero&One knows exactly what innovation stands for in the world
                  of cloud, and that’s where our team put all the skills and
                  expertise needed to come up with a tailor-made solution within
                  a short period of time:
                  
                  • Application Load Balancer • Two target groups, one that is
                  active and another that is passive, with weighted target group
                  routing that allows us to control the distribution of traffic
                  to the application. • Lambda functions: • Check current Load
                  Balancer weight configuration • Conduct a health check on the
                  active server • Flip the weight of target groups upon failover
                  • Step functions to orchestrate the whole Lambdas
                
 
                
                  The above diagram shows the flow of our Lambdas and how we
                  were able to achieve a 3-second failover time.
                  
                  Even though this solution achieved a less than 5-second
                  failover time, Lambda requests and step function states
                  required around 86,400 Lambda invokes and more than 250,000
                  states daily. This led us to use CloudWatch events to restart
                  Step Function executions every 30 minutes in order to avoid
                  reaching the maximum number of registered state machines,
                  which is 10,000.
                  
                  Fortunately, our team was able to overcome this hindrance by
                  leveraging Step Function Activity instead of Lambda, which
                  consists of a program code that waits for an operator to
                  perform an action or to provide input. The next step was for
                  us to write a simple code that sends heartbeats using the
                  SendTaskHeartbeat API action, running it on each of the EC2
                  instances. Once the Step Function Activity stops receiving
                  heart beats from the active server, it will automatically
                  failover to the standby target group.
                
 
                
                  Step Function Activities enabled us to:
                  
                  • Reduce the number of Lambda invocations to get the health of
                  the active instance behind a target group. • Reduce the number
                  of Step Function State Machines. • Use Activity Heartbeat to
                  confirm the health of the server. • Use Activity Heartbeat
                  timeout to allow a grace period on health check, so we don’t
                  fake flip target groups and to correctly flip if a server is
                  completely down. • Use State Retry and Error Handling to
                  confirm we do flip the target group and nominate the passive
                  server into active.
                
                  Note:
                  
                  Step Function maximum time is 1 year, so we prepared a
                  schedule to rerun it on daily basis.
                
Client obsession is at the center of Zero&One. We made sure to leverage our expertise and skills to natively build a custom solution on AWS to meet and exceed Multibanks expectations. We were able to ensure that the customer does not suffer revenue loss or loss of reputation in the case of an IT disruption by enabling them with a highly redundant architecture with failover clustering of less than a 5-second while maintaining a cost effective, clean and future proof solution that does not reach the boundary of any AWS service.