How To: Understanding a cloud deployment with metrics

Welcome to another “How To” post. Today’s edition covers tracking cloud deployments, their performance, and any related issues. It was written by @jim, one of our engineers on the web and cloud API team.

Don’t hesitate to reach out with follow-up questions or to share your own take on the subject!

The importance of metrics

Metrics are numerical indicators of the state of your SpatialOS games: resource usage, number of connected workers, etc.

Metric values are reported at 5-second intervals to help you visualise trends. They are useful for tuning and debugging your game code: you can compare the metrics of two versions of your game to run regression tests and spot abnormal behaviour. Some users have created dedicated metric dashboards to monitor their games on big TV screens in their offices.
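
As a rough illustration of comparing two versions, the sketch below checks whether the mean of a metric has grown by more than a chosen tolerance. The sample values, the function name and the 10% tolerance are assumptions made for this example, not something the platform provides.

```python
from statistics import mean

SAMPLE_INTERVAL_SECONDS = 5  # reporting interval stated in this post


def regression_report(baseline, candidate, tolerance=0.10):
    """Compare the mean of two metric series sampled at the same interval.

    Returns (relative_change, is_regression). A higher value is assumed
    to be worse, which fits metrics such as latency or CPU usage.
    The baseline mean must be non-zero.
    """
    base_mean = mean(baseline)
    cand_mean = mean(candidate)
    relative_change = (cand_mean - base_mean) / base_mean
    return relative_change, relative_change > tolerance


# Hypothetical CPU usage samples (one value every 5 seconds) for two builds.
build_a = [0.40, 0.42, 0.41, 0.39, 0.43]
build_b = [0.52, 0.55, 0.51, 0.54, 0.53]

change, regressed = regression_report(build_a, build_b)
print(f"mean change: {change:+.1%}, regression: {regressed}")
```

Comparing means is the simplest possible check; for spiky metrics you may prefer to compare a high percentile instead.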

When something unexpected happens, you may want to be notified immediately. You can integrate our Prometheus metrics into your own alerting solution.
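
To give an idea of what a home-grown alert check might look like, here is a minimal sketch that parses a response in the shape returned by Prometheus's instant query endpoint (`/api/v1/query`) and reports every series that falls below a minimum. The metric name `connected_workers`, the `worker_type` label and the canned response are hypothetical; the actual metric names and the way you reach the Prometheus endpoint depend on your setup.

```python
import json

# Canned response in the shape returned by Prometheus's /api/v1/query
# endpoint for an instant vector. Metric and label names are hypothetical.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "connected_workers", "worker_type": "UnityWorker"},
       "value": [1530000000.0, "8"]},
      {"metric": {"__name__": "connected_workers", "worker_type": "UnityClient"},
       "value": [1530000000.0, "0"]}
    ]
  }
}
"""


def series_below(response_text, minimum, label="worker_type"):
    """Return {label value: sample value} for every series below minimum."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    low = {}
    for series in payload["data"]["result"]:
        _timestamp, raw_value = series["value"]  # the value arrives as a string
        value = float(raw_value)
        if value < minimum:
            low[series["metric"].get(label, "unknown")] = value
    return low


for worker_type, count in series_below(SAMPLE_RESPONSE, minimum=1).items():
    print(f"ALERT: only {count:.0f} {worker_type} instance(s) connected")
```

In a real setup you would fetch the response over HTTP and forward the result to whichever notification channel you already use.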

Discovering metric dashboards

You can access metrics in your Console.

On the top right of the dashboard, there is a “Dashboards” drop-down menu, where you will find many ready-to-use examples. It also lists some undocumented legacy dashboards, such as Four Days and Gamex, which will be removed very soon. As a rule of thumb, if there is no explanation next to the graphs, don’t use the dashboard.

The ready-to-use dashboards are built with specific problems in mind. One of the most common questions is whether a cloud deployment is healthy. The “Check deployment health” dashboard is there to answer this very question.

You can tell a metric is abnormal by comparing its graph with that of a known healthy deployment. We find that the “Worker to Runtime Latency (P99)” graph is one of the best indicators of worker health, as it shows the latency of the slowest workers, which tends to correspond to a deployment’s overall health.
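
If you export latency samples yourself, the same comparison against a healthy baseline can be sketched in a few lines. The example below computes a nearest-rank P99 and flags the current deployment when its P99 exceeds the healthy one by a chosen factor. The sample values and the factor of 2 are assumptions made for illustration, and the dashboard may compute its percentile differently.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def looks_abnormal(healthy, current, pct=99, factor=2.0):
    """True if the current percentile exceeds factor times the healthy one."""
    return percentile(current, pct) > factor * percentile(healthy, pct)


# Hypothetical worker-to-runtime latencies in milliseconds.
healthy_ms = [20] * 98 + [35, 40]
current_ms = [20] * 95 + [150] * 5

print("healthy P99:", percentile(healthy_ms, 99))
print("current P99:", percentile(current_ms, 99))
print("abnormal:", looks_abnormal(healthy_ms, current_ms))
```

Note that the median of both series is identical here; only the tail differs, which is why a high percentile is the more telling number.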

Some of the graphs point you towards performance optimisations. For example, if the “Overall command latency” in the “Debug entities and commands” dashboard is more than 0.5 seconds, you should find out exactly which commands are being sent in the “Detailed command count” summary and try to reduce their frequency.
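
The rule above can be written down as a small check. This is only a sketch: the command names and counts are made up, and in practice you would read these numbers off the dashboard rather than compute them yourself.

```python
from collections import Counter

LATENCY_THRESHOLD_SECONDS = 0.5  # threshold quoted in this post


def commands_to_review(overall_latency_seconds, command_counts, top_n=3):
    """Return the most frequent commands if overall latency is too high.

    command_counts maps a command name to how often it was sent.
    An empty list means the latency is within the threshold.
    """
    if overall_latency_seconds <= LATENCY_THRESHOLD_SECONDS:
        return []
    return Counter(command_counts).most_common(top_n)


# Hypothetical command counts for one reporting window.
counts = {"TakeDamage": 1200, "OpenDoor": 15, "SyncInventory": 4300, "Respawn": 40}

for name, sent in commands_to_review(0.8, counts, top_n=2):
    print(f"{name}: sent {sent} times")
```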

Debugging worker performance issues is the main challenge when developing on SpatialOS, so we will provide an example in the next section.

Use metrics to debug a worker performance issue

A typical SpatialOS deployment has more than one worker instance for any given worker type. Server-side physics may require tens of Unity workers to co-simulate a large seamless world. How can you tell which workers are unhealthy?

You can open the Inspector and sort workers by load. For Unity workers built using our UnitySDK, load is reported automatically; for other workers, you will need to implement load reporting yourself. A worker is considered overloaded if its load value is above 1. Once you know which worker has issues, click the worker and check its logs.
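
The Inspector does this sorting for you, but the logic is simple enough to sketch in case you collect load values elsewhere. The worker IDs and load values below are hypothetical; the limit of 1 is the overload threshold mentioned above.

```python
def overloaded_workers(loads, limit=1.0):
    """Return (worker_id, load) pairs above limit, highest load first."""
    ranked = sorted(loads.items(), key=lambda item: item[1], reverse=True)
    return [(worker, load) for worker, load in ranked if load > limit]


# Hypothetical load values reported by four worker instances.
reported = {
    "UnityWorker0": 0.45,
    "UnityWorker1": 1.80,
    "UnityWorker2": 0.97,
    "UnityWorker3": 1.10,
}

for worker, load in overloaded_workers(reported):
    print(f"{worker} is overloaded (load {load:.2f}); check its logs")
```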

You can store additional logs or heap dumps by saving your files to the “/improbable/logs” directory (please note that this path may change in the future). You will then be able to download them via the “Raw logs” window, accessible in the “Advanced” tab of your deployment’s Console page. These files are only accessible while the deployment is running, as they are not archived on shutdown.
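
For a worker written in a language with file access, saving an extra diagnostic file could look like the sketch below. The helper name and file name are made up for this example, and because the directory may change, the path is kept as a parameter rather than hard-coded at each call site.

```python
import tempfile
from pathlib import Path

DEFAULT_LOG_DIR = Path("/improbable/logs")  # path from this post; it may change


def save_diagnostic(name, text, log_dir=DEFAULT_LOG_DIR):
    """Write text to log_dir/name and return the resulting path."""
    target_dir = Path(log_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_text(text, encoding="utf-8")
    return target


# Local demonstration against a temporary directory; inside a cloud worker
# you would rely on the default log_dir instead.
with tempfile.TemporaryDirectory() as scratch:
    written = save_diagnostic("physics-debug.log", "tick took 48 ms\n", log_dir=scratch)
    print(written.name, written.stat().st_size, "bytes")
```

Remember to download anything you need before stopping the deployment, since these files are not archived.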

Advanced: Setting up your own dashboard

Once you have mastered the basic dashboards and your game’s needs are growing, you are ready to create your own dashboards. You can tailor them to your needs and use our Prometheus-based infrastructure as the data source for your own Grafana instance. Instructions for such a setup can be found in our documentation. To get started, please reach out on our forums so that we can set you up with the necessary access.