I was happily boarding early testers to my upcoming Stripe analytics app, when the logs got filled with “Error R14 (Memory quota exceeded)” warnings. I’m new to both Ruby and Rails, so I thought “Gee, time to download a profiler”.
After couple of hours I found out there was no profiler. More precisely, there are plenty of tools for the front-end of your app, like Oink and MiniProfiler. They just don’t work for background jobs.
There are also some older profilers (MemoryProf) that just don’t work with Ruby 2.0 and Rails 4.0. Ruby-Prof, which does work, does not give you information about the memory usage. It writes html files, so it’s more suitable to be used in development anyway.
So, I cried a little. Then I rolled up my sleeves and frankenstained a somewhat working solution from the pieces that I had.
1. Confirming the leak/bloat
Heroku can write summaries about memory usage.
heroku labs:enable log-runtime-metrics for your app.
Then, run the problematic worker again and follow the logs. Now you can see the memory usage statistics in between your own log entries and you can easily estimate if the memory behaves like expected.
In my case, the problem worker was fetching all customer’s data from Stripe. It had several separate sections, and in between them, the garbage collection should have picked up the trash. As they say Ruby MRI doesn’t really give you memory back if it gets reserved, I was expecting an upwards trend with plateaus every now and then when garbage gets collected. What I got was a constant upwards trend. A memory leak. Or a bloat.
You’ll also want to set up the New Relic monitoring for your app.
If you put
config/initializers/new_relic.rb, you can see your memory usage in New Relic.
New Relic also has a thread profiler that can see your Sidekiq worker. In my case, these tools were nice for checking out the overall situation, but didn’t give me enough details. Maybe the tools in the paid version are different.
2. Finding out what’s eating the memory
The top part shows the memory usage in KB. The next part shows the garbage collector stats. Here are instructions for reading them. And the bottom shows the object instances in memory. In development env it will run also RubyProf, and it will write the results files in your app’s log dir. In Heroku it will print the results to the log, and skip RubyProf.
You can get the tool code for Mac & Linux(Heroku) from Snipplr.
Start it with
MemTool.start(“some info, like class name”) and stop it with
I wish I would have been able to find out who reserved the objects, but I’m still newbie enough that patching Ruby runtime sounds too scary. If you have any advice/experiences on that, let’s hear it in the comments.
You’ll also want to run Oink. It tells you if you let ActiveRecord queries eat up the memory. Remember to use the -r option, or you won’t see the ActiveRecord stats.
Unfortunately Oink is meant for profiling the front-end, so you’ll also need to create a new controller request that calls the code under test. That means you’ll have to do this test in development env, as Heroku will kill your non-responding requests after 30 seconds.
Edit: After a couple of months I found a better way to check for ActiveRecord bloat and that finally helped me to fix the memory leak/bloat.
I installed the Rack-Mini-Profiler, loaded my development environment with data and created a temporary controller & view that only run the background task. Then I was able to use the Mini-Profiler’s great UI to check the actual SQL queries by the time they took. When you load a lot of things to memory, it takes time. So I just checked out the queries that took the most time and quickly found the bloat.
In my case, it was a total newbie mistake. I was calling blank? on large scoped searches.
3. Profile the job without Sidekiq
You can run your background jobs using rake in Heroku. Just call your worker’s
#perform from a rake task. This way you get to take Sidekiq out from the equation and you’ll find out if the leak happens also when you are running outside of it.
4. Fix things
That’s the phase I am. I was able to find several problem spots from my code, which I’m fixing now. After reading couple of articles by Mike Perham, the creator of Sidekiq, I don’t think I can fix my workers to be bloat-free, but I’m targeting to getting them into a condition where once-a-day restart with Monit would be enough.
I’m also trying to reduce the base memory usage, in case I need to switch to Resque. Resque forks a new process for every job, loading the whole env each time - that’s why it in general needs more resources than Sidekiq. But that would also mean that every job has a clean plate.
I’ll write a follow-up when I’m done, and if you don’t want to miss it, you can drop your email here: [optin]