Building a Sumo Logic Dashboard for a Rails App on Heroku
Note: The dashboard and graphs outlined in this article are simplified examples of what I have used successfully in a production environment in the past. You should modify the dashboard and queries to fit your application’s specific needs.
You can download the JSON config to import this dashboard for your own application.
Overview
Our goal is to create a Sumo Logic dashboard that we can use to help us identify trends and problems with the performance of our Heroku Ruby on Rails application.
To do this we are going to focus on a few key areas:
- Rails Application (Web Dynos)
- Web dyno load [1]
- Web dyno memory utilization
- Heroku Postgres
- Postgres load [2]
- Worker Dynos and Sidekiq Jobs
- Sidekiq worker error rate
- Sidekiq worker failures by job
In order to build dashboards for these different areas, we first need to consider what kinds of logs Heroku makes available.
- Heroku runtime dyno performance logs (e.g. CPU load and memory usage) are available if you enable the Heroku Labs log-runtime-metrics feature:
$ heroku labs:enable log-runtime-metrics
- Heroku Postgres logs are available by default for Heroku Postgres Standard and Premium tier databases.
- Sidekiq logs are available by default as standard application logs from worker dynos.
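If your application's logs are not already flowing into Sumo Logic, the usual approach is to add a Heroku log drain that points at a Sumo Logic hosted HTTP collector source. The endpoint URL below is only a placeholder; substitute the URL of your own collector source:
$ heroku drains:add "https://YOUR_SUMO_COLLECTOR_ENDPOINT/receiver/v1/http/YOUR_SOURCE_TOKEN" --app my-app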
Sumo Logic Graph Implementations
Some things to note with the implementations below:
- All queries timeslice with buckets: this is done so that the time period can be changed on the dashboard and the graphs automatically adjust their precision accordingly.
- All queries omit application-specific filters: this is done so that the dashboard can be applied to any application with a Sumo Logic dashboard filter (see the example below).
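For example, when running one of these queries outside of a dashboard (where no dashboard filter is applied), you could scope it to a single application by prepending a metadata filter. The _sourceCategory value below is an assumption about how your Heroku log drain is tagged in Sumo Logic; adjust it to match your own source configuration. The rest is the web dyno load query shown later in this post:
_sourceCategory=heroku/my-app "source=web."
| parse "sample#load_avg_1m=* " as load_avg_1m
| timeslice 60 buckets
| avg(load_avg_1m) as load_avg_1m by _timeslice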
Heroku Web Dyno Load
This graph is designed to provide insight into the CPU usage of your application. It uses Heroku’s CPU load average calculation, which, in my experience, is mostly useful for identifying relative spikes in load.
Graph
Logs
$ heroku logs --dyno web --source heroku --app my-app | grep load_avg | head -1
2020-01-28T23:59:31.883407+00:00 heroku[web.1]: source=web.1 dyno=heroku.93034875.4635d2a8-08be-4bf5-93c8-aed05a2fda5c sample#load_avg_1m=0.00 sample#load_avg_5m=0.00 sample#load_avg_15m=0.01
Query
"source=web."
| parse "sample#load_avg_1m=* " as load_avg_1m
| timeslice 60 buckets
| avg(load_avg_1m) as load_avg_1m by _timeslice
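If you suspect a single dyno is running hot, a variation of this query (a sketch that parses the dyno name from the source= field in the log line above) breaks the load average out per dyno so each one charts as its own series:
"source=web."
| parse "source=* " as dyno_name
| parse "sample#load_avg_1m=* " as load_avg_1m
| timeslice 60 buckets
| avg(load_avg_1m) as load_avg_1m by _timeslice, dyno_name
| transpose row _timeslice column dyno_name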
Heroku Web Dyno Memory Utilization
This graph is designed to help identify your application’s utilization of the available memory.
Graph
Logs
$ heroku logs --dyno web --source heroku --app my-app | grep memory | head -1
2020-01-29T00:02:49.243705+00:00 heroku[web.1]: source=web.1 dyno=heroku.93034875.4635d2a8-08be-4bf5-93c8-aed05a2fda5c sample#memory_total=338.52MB sample#memory_rss=252.74MB sample#memory_cache=85.79MB sample#memory_swap=0.00MB sample#memory_pgpgin=5641245pages sample#memory_pgpgout=5554583pages sample#memory_quota=512.00MB
Query
"source=web."
| parse "sample#memory_rss=*MB" as memory_rss
| parse "sample#memory_quota=*MB" as memory_quota
| timeslice 60 buckets
| max(memory_rss) as max_memory_rss, max(memory_quota) as max_memory_quota by _timeslice
| max_memory_rss / max_memory_quota * 100 as memory_utilization
| fields _timeslice, memory_utilization
In this query we divide the resident memory (memory_rss) by the memory quota (memory_quota) to get a maximum memory utilization rate for our dynos.
At the end of the query we use fields _timeslice, memory_utilization so that memory_rss and memory_quota don’t show up in our graph.
- Resident memory (memory_rss): the portion of the dyno’s memory (in megabytes) held in RAM.
- Memory quota (memory_quota): the resident memory (memory_rss) value (in megabytes) at which Heroku triggers an R14 error.
Read more about the fields Heroku includes for memory consumption.
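If you want an explicit signal for when dynos actually exceed their quota, a simple companion panel (a minimal sketch) can count Heroku’s R14 “Memory quota exceeded” log lines over the same timeslices:
"Error R14 (Memory quota exceeded)"
| timeslice 60 buckets
| count by _timeslice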
Heroku Postgres Load
This graph is designed to help identify if your Postgres database is CPU-bound.
Graph
Logs
$ heroku logs --dyno heroku-postgres --app my-app | head -1
2020-01-29T00:36:10+00:00 app[heroku-postgres]: source=HEROKU_POSTGRESQL_RED addon=postgresql-aerodynamic-12345 sample#current_transaction=1012345 sample#db_size=3743972487bytes sample#tables=10 sample#active-connections=8 sample#waiting-connections=0 sample#index-cache-hit-rate=0.99682 sample#table-cache-hit-rate=0.84557 sample#load-avg-1m=0 sample#load-avg-5m=0 sample#load-avg-15m=0 sample#read-iops=0 sample#write-iops=0.12397 sample#tmp-disk-used=33849344 sample#tmp-disk-available=72944943104 sample#memory-total=4044932kB sample#memory-free=78976kB sample#memory-cached=3700368kB sample#memory-postgres=23292kB
Query
"heroku-postgres"
| parse "source=* " as db_name
| parse "sample#load-avg-1m=* " as load_avg_1m
| (load_avg_1m / 1) * 100 as utilization_rate
| timeslice 60 buckets
| max(utilization_rate) as max_utilization_rate by _timeslice
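Per footnote 2, Heroku already normalizes the Postgres load average by the number of available CPUs, so multiplying it by 100 yields a rough utilization percentage. The same heroku-postgres log line also reports cache hit rates, so an optional companion panel (a sketch based on the sample log above) could chart those to help spot an I/O-bound database:
"heroku-postgres"
| parse "sample#index-cache-hit-rate=* " as index_hit_rate
| parse "sample#table-cache-hit-rate=* " as table_hit_rate
| timeslice 60 buckets
| min(index_hit_rate) as min_index_hit_rate, min(table_hit_rate) as min_table_hit_rate by _timeslice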
Sidekiq Error Rate
This graph is designed to give you insight into your frequency of failed Sidekiq jobs over time.
Graph
Logs
$ heroku logs --source app --dyno worker --app my-app | grep MyExampleJob
2020-01-26T09:02:44.353201+00:00 app[worker.1]: pid=4 tid=6bjk class=MyExampleJob jid=b027792cdb4a453576377c69 elapsed=10.576 INFO: done
2020-01-26T09:02:33.777529+00:00 app[worker.1]: pid=4 tid=6bjk class=MyExampleJob jid=b027792cdb4a453576377c69 INFO: start
2020-01-26T09:02:07.485427+00:00 app[worker.1]: pid=4 tid=5si0 class=MyExampleJob jid=b027792cdb4a453576377c69 elapsed=44.526 INFO: fail
2020-01-26T09:01:22.959446+00:00 app[worker.1]: pid=4 tid=5si0 class=MyExampleJob jid=b027792cdb4a453576377c69 INFO: start
Query
"host app worker."
| parse "elapsed=* INFO: *" as run_time, status
| if (status = "fail", 100, 0) as unsuccessful
| timeslice 60 buckets
| avg(unsuccessful) as error_rate by _timeslice
In this query we filter for the Sidekiq log lines that mark a job as done or failed. We assign the failure messages a value of 100 and the successful messages a value of 0, then take the average across all of those logs to get an error rate as a percentage.
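The same done/fail log lines also carry the job’s elapsed time, so a companion panel (a sketch reusing the parse expression above) can track how long jobs are taking overall:
"host app worker."
| parse "elapsed=* INFO: *" as run_time, status
| timeslice 60 buckets
| avg(run_time) as avg_job_seconds by _timeslice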
Sidekiq Job Failure Count by Class
This table is designed to work with the Error Rate graph to help quickly identify which jobs are contributing to the error rate.
Graph
Logs
$ heroku logs --source app --dyno worker --app my-app | grep MyExampleJob
2020-01-26T09:02:44.353201+00:00 app[worker.1]: pid=4 tid=6bjk class=MyExampleJob jid=b027792cdb4a453576377c69 elapsed=10.576 INFO: done
2020-01-26T09:02:33.777529+00:00 app[worker.1]: pid=4 tid=6bjk class=MyExampleJob jid=b027792cdb4a453576377c69 INFO: start
2020-01-26T09:02:07.485427+00:00 app[worker.1]: pid=4 tid=5si0 class=MyExampleJob jid=b027792cdb4a453576377c69 elapsed=44.526 INFO: fail
2020-01-26T09:01:22.959446+00:00 app[worker.1]: pid=4 tid=5si0 class=MyExampleJob jid=b027792cdb4a453576377c69 INFO: start
Query
"host app worker."
| parse "class=* " as job_name
| parse "elapsed=* INFO: *" as run_time, status
| where status = "fail"
| count by job_name
| order by _count
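If it is also useful to see how long failing jobs run before they die, the same query can be extended (a sketch) to aggregate the parsed elapsed time alongside the failure count:
"host app worker."
| parse "class=* " as job_name
| parse "elapsed=* INFO: *" as run_time, status
| where status = "fail"
| count as failures, avg(run_time) as avg_seconds_before_failure by job_name
| sort by failures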
[1] CPU load average is a complicated calculation that I’ve found to be useful only relative to itself over time. The calculation uses an exponentially damped moving average of the number of runnable tasks from the previous 30 minutes. Read more here.
[2] Postgres load average is calculated differently than Heroku’s CPU load average for dynos: it is the average system load (over a 1-, 5-, or 15-minute period) divided by the number of available CPUs.