AWS Step Functions as Cloud Batch Scripts

AWS Step Functions is a powerful tool that can help any team leverage AWS. When the AWS team developed Step Functions, they were essentially building a way to do batch scripting in the cloud. So, when deciding whether to use Step Functions, a good rule of thumb is to ask yourself whether the workflow would be a good candidate for a batch or shell script in an on-prem environment.

The first task I used Step Functions for was a monitoring job. The state machine needed to watch a DynamoDB table for changes, verifying that the proper data was inserted based on previously inserted data, somewhat like a messaging system. The state machine needed to run at all times, and once a message was found to be delayed, it had to email a specific set of users to alert them to the delay. The initial implementation looked like this:
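The delay check itself can be sketched as ordinary code. The function below is a hypothetical version of the logic the checking task inside the state machine might run: given items read from the DynamoDB table, it returns the messages whose follow-up has not arrived within a threshold. The field names (`message_id`, `inserted_at`, `acknowledged`) and the threshold are assumptions for illustration, not the real schema:

```python
from datetime import datetime, timedelta

def find_delayed(items, now, threshold=timedelta(minutes=15)):
    """Return IDs of items inserted more than `threshold` ago
    that still have no follow-up/acknowledgement recorded."""
    delayed = []
    for item in items:
        # 'inserted_at' assumed to be an ISO-8601 timestamp string
        inserted_at = datetime.fromisoformat(item["inserted_at"])
        if not item.get("acknowledged") and now - inserted_at > threshold:
            delayed.append(item["message_id"])
    return delayed
```

Inside the state machine, a Choice state would then branch on whether this list is empty; if not, a notification state (e.g. an SNS publish) alerts the users.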

However, after a few months, the state machine failed: after 25,000 state transitions, AWS stops the execution. From the documentation: “AWS Step Functions has a hard quota of 25,000 entries in the execution history”. This is great for state machines that have accidentally gone into an infinite loop, but in this case an infinite loop was the goal, since the check for job delays had to happen at all times. Luckily, there was an alternative implementation. In the end, an EventBridge rule was triggered every 15 minutes to start the job-checking process by calling the state machine, which now runs one round of checks and exits. The final state machine looked like this:
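To see why the always-on loop hit the quota: every iteration of a Wait → Check → Choice loop writes several entries to the execution history, so the 25,000-entry quota bounds how many iterations a single execution can run. A rough back-of-the-envelope calculation (the events-per-iteration count and the wait interval are assumptions; the exact numbers depend on the states in the loop):

```python
HISTORY_QUOTA = 25_000      # hard quota per execution, per the AWS docs
EVENTS_PER_ITERATION = 5    # assumed: Wait, Task, and Choice states each log enter/exit events
WAIT_MINUTES = 5            # assumed polling interval inside the loop

# How many loop iterations fit in one execution's history
max_iterations = HISTORY_QUOTA // EVENTS_PER_ITERATION

# How long the execution survives before hitting the quota
hours_until_quota = max_iterations * WAIT_MINUTES / 60

print(max_iterations)              # 5000 iterations
print(round(hours_until_quota))    # 417 hours, i.e. a few weeks
```

So the loop does not fail quickly; it runs fine for weeks and then dies, which is exactly the kind of error that surfaces "after a few months".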

And the triggering EventBridge rule looked like this:

Event schedule:

cron(0/15 * ? * * *)

Target:

[arn of step function]
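For reference, the rule above can also be created programmatically. This is a minimal sketch using boto3; the rule name, role ARN, and state machine ARN are placeholders, and EventBridge needs an IAM role it can assume to call `states:StartExecution` on the target:

```python
RULE_NAME = "check-job-delays"                # placeholder rule name
SCHEDULE = "cron(0/15 * ? * * *)"             # every 15 minutes
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:check-job-delays"  # placeholder
ROLE_ARN = "arn:aws:iam::123456789012:role/eventbridge-invoke-stepfn"                       # placeholder

# Parameters for the schedule rule itself
rule_params = {
    "Name": RULE_NAME,
    "ScheduleExpression": SCHEDULE,
    "State": "ENABLED",
}

# Parameters pointing the rule at the state machine
target_params = {
    "Rule": RULE_NAME,
    "Targets": [{
        "Id": "job-delay-state-machine",
        "Arn": STATE_MACHINE_ARN,
        "RoleArn": ROLE_ARN,  # role EventBridge assumes to start the execution
    }],
}

def create_rule():
    # boto3 imported here so the dicts above can be inspected without AWS access
    import boto3
    events = boto3.client("events")
    events.put_rule(**rule_params)
    events.put_targets(**target_params)
```

With this in place, each 15-minute tick starts a fresh, short-lived execution, so no single execution ever approaches the 25,000-entry history quota.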