System health



The dashboard

On the dashboard of AGR you will find a section called System Health. This section will show you, by means of text and colored lights, the status/health of AGR.

Each element in the system health list has 4 components:

  • Header - this is the name of the element (e.g. “Orders”, “Data transfer”, …)
    • This message can be a static string or a translation key
  • Last run datetime - An optional component, showing the last time something ran
  • Status message - An optional message, useful for showing extra info, such as an error message
    • This message can be a static string or a translation key
  • Severity (color) - This is a colored circle. It has 3 options that could be iterpreted as follows:
    • Green = OK
    • Yellow = warning
    • Red = error

inst

The internals

When you install AGR a few system health elements are included. This includes elements for the daily job (e.g. Reports, Orders, Data transfer etc.) and a couple of forecast checks. You are, however, able to completely customize what is shown on the system health dashboard with a few easy steps

Let’s start by listing up the main components and how they work:

EXEC core.system_health

inst

This procedure, located in the prod database is the main procedure that supplies the system health dashboard with its data. It returns 4 columns, one for each of the previously mentioned components:

  • resource_category - NVARCHAR(200) - This is the “header”. As mentioned above, this can be a translation key or just a text
  • last_run - DATETIME2(0) - The last (successful) run for this element. If NULL then this component will be hidden
  • status_message - NVARCHAR(200) - This is a status/error message, if NULL then this component will be hidden
  • status_severity - SMALLINT - This one can have 3 values
    • 0 = OK = green light
    • 1 = warning = yellow light
    • 2 = error = red light

You never change this procedure but it’s useful to know that the output of this procedure is exactly what should be displayed on the dashboard

When the procedure is run it does 2 things:

  1. Executes a set of “health” procedures, defined in the core.system_health_check table. Each procedure must return results in a specific form. These procedures are executed on the fly so they must be FAST. If they are not it will cause a delay in the loading of the dashboard. With the initial setup of AGR you get one procedure which will check all the daily job staging elements. inst In the picture above you can see (id=1) the procedure that comes with the setup and also another I created for this document. Note that since the procedures can exist in either the staging or prod database there are 2 options for the sp_database column

    • stg = the staging database
    • prod = the prod database

    If you wish to create your own custom procedure there is a useful template procedure, core.system_health_check_template, that you can use as a base

  2. Fetches data from the core.system_health_status table. This table is used to store health check data for checks that are not being run using the procedures mentioned above. This could be checks like the forecast checks that simply take too long to execute “on the fly” and are therefore running as a part of the daily job. A forecast check procedure called dbo.system_health_forecast_check is a part of the daily job and if something is wrong, rows are inserted into the table. Note that an empty table is not an error, it simply means there is nothing to display.

inst

The results of the procedures and the table is union-ed together and returned.

The staging element check

As mentioned above, included with the system is the daily job / staging element check. This check looks at the status of the staging elements, the runtimes and the errors and is also highly configurable. The objects used to control this check are placed in the staging database. Let’s take a look at the objects involved:

  • core.system_health_group - This is where you define the groups. Each group will become an element on the system health dashboard inst
    • name - a text or translation key
    • yellow_light_minutes - this is the time (last_run_time), in minutes before we change the severity to 1 (yellow light). Most groups use 1440 minutes which is 24 hours
    • red_light_minutes - this is the time (last_run_time), in minutes before we change the severity to 2 (red light). Most groups use 2880 minutes which is 48 hours
    • include_element_errors - if this is set to true, we will display how many staging elements failed in the status message
  • core.stg_group - In this table we configure which staging group is linked to which system health group. Pretty straight forward inst
  • core.stg_run - In this table there is a column include_in_system_health which will allow you to select which runs are included in the system health