Troubleshooting Performance Issues

Powerful, multi-core CPUs; fast, high-capacity networks; and well-balanced storage tiers often hide the real culprits behind a data center’s performance bottlenecks. Most problems result from inefficient application code; poorly designed system services and software; and less-than-optimal database, operating system, and virtual machine configurations. In fact, studies have found that more than 75% of performance problems in a data center can be traced to the application, services, and database layers of its infrastructure stack. Yet, regrettably, it’s the hardware infrastructure that generally takes the blame for performance bottlenecks and is therefore often the focus of system performance scrutiny.

Of course, starting with the hardware layer seems to make sense. Data center monitoring tools measure CPU and memory utilization, along with storage and network I/O speed and utilization. On the surface, therefore, it can appear that hardware is the rightful source of performance bottlenecks. Why else would hardware be the near-exclusive focus of many monitoring tools? Well, it’s because monitoring the metrics for CPU, memory, storage, and network performance helps identify potential bottlenecks, hopefully before they impact users. Accordingly, if data centers are to stay ahead of bottlenecks that impact users, it’s essential to know how to interpret these performance metrics to identify problems, and then how to correct them.

By the way, a common technique cloud administrators use while investigating problem applications and processes (workloads) is to shift those workloads onto pre-optimized cloud resources. This temporarily improves the performance of problem workloads and buys time while the root causes are uncovered and resolved. In public clouds, such as Microsoft Azure and Amazon Web Services (AWS), provisioning these resources is easy and cost-effective.
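As a rough sketch only, the Python example below uses the boto3 SDK to move a struggling workload onto a larger, pre-optimized AWS EC2 instance type by resizing its instance. The region, instance ID, and target instance type are placeholder assumptions rather than values from this text, and a production resize would normally go through change control rather than an ad hoc script.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")   # region is a placeholder

    INSTANCE_ID = "i-0123456789abcdef0"   # placeholder ID of the struggling workload's instance
    TARGET_TYPE = "m5.2xlarge"            # placeholder larger, pre-optimized instance type

    # An EC2 instance must be stopped before its instance type can be changed.
    ec2.stop_instances(InstanceIds=[INSTANCE_ID])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

    # Resize the instance, then bring it back online.
    ec2.modify_instance_attribute(InstanceId=INSTANCE_ID,
                                  InstanceType={"Value": TARGET_TYPE})
    ec2.start_instances(InstanceIds=[INSTANCE_ID])
    ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])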

When it comes to troubleshooting performance issues, it’s best to be proactive, watching for symptoms before users call out problems. For instance, is the CPU under constant pressure, operating at or above 80% of capacity? Is its processor queue length consistently high? Other symptoms to proactively monitor include, but are not limited to, the following (a monitoring sketch follows this list):

  • High memory utilization and/or spikes in memory usage
  • Sudden increases in the number of transaction and/or error logs
  • Unbalanced usage over time across processors and hyper-threads
  • Load averages that are less than half the number of CPU cores (idle servers)
  • Load averages that are 4x the number of CPU cores or greater (overloaded servers)
  • High CPU utilization and/or sustained network queuing
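To make these proactive checks concrete, the sketch below uses the psutil library to test several of the symptoms above against the rules of thumb just listed (80% CPU utilization, load averages below half or above 4x the core count). The 90% memory threshold, the per-core imbalance threshold, and the alert() helper are assumptions added for illustration only.

    import psutil

    def alert(message):
        # Stand-in for a real alerting hook (email, pager, dashboard, etc.)
        print("WARNING:", message)

    cores = psutil.cpu_count(logical=True) or 1
    cpu_pct = psutil.cpu_percent(interval=1)              # overall CPU utilization (%)
    per_cpu = psutil.cpu_percent(interval=1, percpu=True) # per-core utilization (%)
    mem_pct = psutil.virtual_memory().percent             # memory utilization (%)
    load_1m, load_5m, load_15m = psutil.getloadavg()      # 1-, 5-, 15-minute load averages

    if cpu_pct >= 80:
        alert(f"CPU under sustained pressure: {cpu_pct:.0f}% utilization")
    if mem_pct >= 90:                                      # assumed threshold
        alert(f"High memory utilization: {mem_pct:.0f}%")
    if load_5m < cores / 2:
        alert(f"Load average {load_5m:.2f} is below half of {cores} cores (idle server?)")
    if load_5m >= 4 * cores:
        alert(f"Load average {load_5m:.2f} is 4x or more the {cores} cores (overloaded server)")
    if max(per_cpu) - min(per_cpu) > 50:                   # assumed imbalance threshold
        alert(f"Unbalanced usage across processors/hyper-threads: {per_cpu}")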

On the other hand, being reactive becomes the reality when users start experiencing symptoms such as the following (a simple response-time probe follows this list):

  • High application and/or service response times
  • Completely unresponsive applications and/or services
  • System errors, lost data, and incorrect query results
  • Slow and/or unstable network connectivity
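As one way to quantify the first of these reactive symptoms, the sketch below times a request to a hypothetical application health endpoint using only the Python standard library. The URL and the two-second threshold are assumptions made for the example.

    import time
    import urllib.request

    URL = "http://app.example.internal/health"   # hypothetical application endpoint
    THRESHOLD_SECONDS = 2.0                      # assumed acceptable response time

    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=10) as response:
            response.read()
        elapsed = time.monotonic() - start
        if elapsed > THRESHOLD_SECONDS:
            print(f"Slow response: {elapsed:.2f}s exceeds the {THRESHOLD_SECONDS}s threshold")
        else:
            print(f"Response OK in {elapsed:.2f}s")
    except Exception as exc:
        # Covers unresponsive services and unstable connectivity alike
        print(f"Request failed after {time.monotonic() - start:.2f}s: {exc}")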

Therefore, it’s important to establish a troubleshooting plan before performance problems are called out by users. Then execute the plan when the metrics being monitored start showing symptoms of potential issues.