How Company 'X' resolved 1000s of tasks stuck in queues

Hey! This is Pranjal from Engineering Diary. It's a chill Monday holiday with beautiful morning snow here in Boston.

Here's what I got for you today:

How Company 'X' resolved 1000s of tasks stuck in queues

Some terminology:

  • Celery - Asynchronous task queue

  • RabbitMQ - Message broker (enables services to store and retrieve messages)
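To make that concrete, here's a minimal sketch of how the two fit together (the app name and broker URL are illustrative, not from the incident):

    from celery import Celery

    # A Celery app that uses RabbitMQ as its message broker over AMQP.
    # Tasks queued through this app sit in RabbitMQ until a worker picks them up.
    app = Celery("loans", broker="amqp://guest:guest@localhost:5672//")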

Problem:

Last Thursday evening, X found out that there were 1000s of tasks stuck in their RabbitMQ queues on one of their production servers. They have a lot of async tasks. For example:

  • Send an email to a customer when they get pre-approved for a loan

  • Update Salesforce (CRM) when a loan application changes

  • Send invoices to customers for their upcoming loan payment

  • Send an SMS to customers with a unique payment link if their payment bounces

You get the idea.
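Each of those is just a Celery task definition that gets enqueued on RabbitMQ. A minimal sketch of what one might look like (the task name and arguments are hypothetical):

    from celery import Celery

    app = Celery("loans", broker="amqp://guest:guest@localhost:5672//")

    @app.task
    def send_preapproval_email(customer_id, loan_id):
        """Hypothetical async task: email a customer once they are pre-approved."""
        ...  # look up the customer, render the email, send it

    # Callers enqueue the task instead of running it inline;
    # a Celery worker picks it up whenever one is free.
    send_preapproval_email.delay(customer_id=42, loan_id=1337)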

Temporary Solution:

They started by looking into why the Celery workers were stuck and not processing the tasks. They found a command that lists the tasks the workers are currently running - celery inspect active
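The same information is available programmatically through Celery's remote control API; a rough sketch (broker URL assumed):

    from celery import Celery

    app = Celery("loans", broker="amqp://guest:guest@localhost:5672//")

    # Ask every worker which tasks it is currently executing.
    # Returns a dict of {worker hostname: [task info dicts]}, or None if no worker replies.
    active = app.control.inspect().active()
    for worker, tasks in (active or {}).items():
        print(worker, [t["name"] for t in tasks])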

They have two workers, and each of them was running the same task for different loans. Digging deeper into the code, they found that because there is a time.sleep in the code for a particular scenario, the workers might not be able to send heartbeats back to Celery. But that theory didn't hold, as Celery starts new workers automatically if the existing workers stop sending heartbeats.
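For reference, the suspected pattern was roughly this (a hypothetical sketch, not their actual task):

    import time
    from celery import Celery

    app = Celery("loans", broker="amqp://guest:guest@localhost:5672//")

    @app.task
    def sync_loan(loan_id):
        # ... do some work ...
        # The suspicious bit: a blocking sleep for a particular scenario,
        # which keeps the worker process busy doing nothing.
        time.sleep(60)
        # ... do more work ...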

In the meantime, they increased the number of workers to 6 so that the task backlog could be cleared. After a while, they also sent a soft timeout to the previous workers that they thought were stuck, so that those workers could take on new tasks too.
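Celery can push new time limits to running workers over its remote control API, which is roughly what "sending a soft timeout" looks like; a sketch (the task name and limits are assumptions):

    from celery import Celery

    app = Celery("loans", broker="amqp://guest:guest@localhost:5672//")

    # Tell all workers to apply a 90s soft / 120s hard time limit to this task type.
    # On the soft limit the task gets a SoftTimeLimitExceeded exception it can catch;
    # on the hard limit the worker process running it is killed and replaced.
    app.control.time_limit("loans.tasks.sync_loan", soft=90, hard=120, reply=True)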

In about 30 minutes, all the tasks were cleared out.

Problem Identification:

They still didn't know the root cause, but they had a lead: it was the same task definition that was stuck on both workers, just for different loans. But the weird part was that 100s of runs of this task definition happen every day without getting stuck, so why now? 🤔

After looking at the task definition code, it looked fine. So it could only be one thing: a particular problem with these two loans. The special thing about this task definition was that it called a third-party API inside a while loop to retrieve data. So they logged into the third-party dashboard, and for these two loans they saw continuous requests every second returning a 200 response code and then a 429 response code (which means rate limiting), with this behavior repeating.

This told them that the workers were not stuck but were continuously running the task. After running the task on a local machine for one of the loans, they recreated the same behavior. Digging deeper, it turned out that the third-party API was sending them bad data, and because of that they had an infinite loop.
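A hypothetical sketch of the kind of loop that produces exactly this behavior (the API URL, field names, and pagination scheme are illustrative, not the actual integration):

    import time
    import requests

    def fetch_loan_data(loan_id):
        items, cursor = [], None
        while True:
            resp = requests.get(
                f"https://api.provider.example/loans/{loan_id}/transactions",
                params={"cursor": cursor},
                timeout=10,
            )
            if resp.status_code == 429:
                # Rate limited: back off and retry, which shows up in the
                # provider's dashboard as the repeating 200/429 pattern.
                time.sleep(1)
                continue
            data = resp.json()
            items.extend(data.get("items", []))
            cursor = data.get("next_cursor")
            # BUG: this is the only exit condition. If the API returns bad data
            # (e.g. a cursor that never changes), it never fires and the loop
            # runs forever.
            if cursor is None:
                break
        return items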

Final Solution:

They wanted to fix it quickly, as this could potentially send other workers into an infinite loop too. Hence they added a 90-second soft timeout on this particular task, as they knew it wouldn't take any longer in any possible scenario. But this was not ideal, as the rate limiting would cause delays in the data pull across their system. Later on Friday, they added a specific check on the response from the third-party API and terminated the task within a few seconds.
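A sketch of what the two fixes could look like, using Celery's built-in soft time limit and reusing the fetch_loan_data sketch from above (names remain hypothetical):

    from celery import Celery
    from celery.exceptions import SoftTimeLimitExceeded

    app = Celery("loans", broker="amqp://guest:guest@localhost:5672//")

    @app.task(soft_time_limit=90)  # Thursday's quick fix: hard cap at 90 seconds
    def sync_loan(loan_id):
        try:
            # Friday's fix would live inside fetch_loan_data: validate every
            # response from the third-party API and raise within seconds if the
            # data looks wrong, instead of looping on it.
            return fetch_loan_data(loan_id)
        except SoftTimeLimitExceeded:
            # Soft limit hit: stop cleanly so the worker can take other tasks.
            return None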

Lessons Learned:

  • Put soft timeouts on long-running async tasks.

  • Don't blindly trust data from a third-party API (even if it's a multi-billion dollar company 😅). Check and code defensively.

  • Put alerts on your queues when there is a lot of backlog.
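For that last point, a rough sketch of a backlog check against RabbitMQ's management API (the queue name, threshold, credentials, and alert hook are assumptions):

    import requests

    # Default Celery queue on the default "/" vhost (%2F), via the RabbitMQ
    # management plugin on port 15672.
    QUEUE_URL = "http://localhost:15672/api/queues/%2F/celery"
    BACKLOG_THRESHOLD = 1000

    def check_backlog():
        resp = requests.get(QUEUE_URL, auth=("guest", "guest"), timeout=5)
        resp.raise_for_status()
        messages = resp.json()["messages"]  # ready + unacknowledged messages
        if messages > BACKLOG_THRESHOLD:
            alert(f"RabbitMQ backlog: {messages} messages waiting")

    def alert(text):
        print(text)  # stand-in for Slack/PagerDuty/etc.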

Do you have any incidents like the one above? Please share 😊

That's a wrap for today. Stay thirsty & see ya soon!

If you have any suggestions or questions, I would love to hear from you.

Please share with your friends and colleagues.
