Measuring how well an application meets its service level agreement (SLA) is a critical IT Operations (ITOps) responsibility. I had the pleasure of co-presenting a session on this topic with Sunny Kathuria of Mitratech Corporation at Oracle CloudWorld 2023 (LRN2384). This blog summarizes key points from the session, including the techniques Oracle itself uses to track its SLAs with the Oracle Cloud Infrastructure Application Performance Monitoring service (OCI APM), and how Mitratech ensures its services meet SLAs using OCI APM tracing, profiling, synthetic monitoring, alarms, and dashboards.
SLAs, service availability, and OCI APM Synthetic Monitoring techniques
An SLA defines the expected availability of a service – the percentage of the time the service is available in a set period (e.g., the service will be available 99.5% of the time in any month). The SLA also defines the availability metric – a yes/no test that can be checked periodically (e.g., the service is available if a user can access its UI and perform transactions).
Synthetic monitoring is used to run the tests that determine the availability metric value, and to calculate and present the service availability based on the results of all the tests.
Figure 1: Synthetic monitors enable service availability as well as location, network, and backend performance reporting
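To make the availability arithmetic concrete, here is a minimal Python sketch. It is not OCI APM code: the names `RunResult`, `availability_percent`, and `allowed_downtime_minutes` are hypothetical, and it assumes availability is simply the share of counted test runs that passed, with runs taken during maintenance left out.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """One synthetic test execution (hypothetical structure for illustration)."""
    passed: bool
    in_maintenance: bool = False


def availability_percent(runs):
    """Percentage of counted runs that passed; maintenance runs are excluded."""
    counted = [r for r in runs if not r.in_maintenance]
    if not counted:
        raise ValueError("no test runs to evaluate")
    return 100.0 * sum(r.passed for r in counted) / len(counted)


def allowed_downtime_minutes(sla_percent, period_days=30):
    """Downtime budget implied by an SLA target over a period of whole days."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0
```

Under these assumptions, a 99.5% target over a 30-day month leaves a budget of 216 minutes (3.6 hours) of downtime, and 199 passing runs out of 200 counted runs works out to exactly 99.5% availability.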
Monitoring SLA challenges
At Oracle, upwards of 10,000,000 synthetic test runs are executed every day. These are some of the challenges our Ops teams encounter, and how OCI APM is used to address them.
High test volume
A mid-size organization may have tens of applications and services, while a large enterprise or global service provider may have hundreds or even thousands. Each one requires multiple availability tests, executed from multiple internal and external locations and repeated several times an hour. Operations teams need to set up, maintain, and analyze results for a very large number of test executions. This leads to another challenge: handling false positives, such as tests that fail due to intermittent network problems. While these are a small percentage of runs, when many scripts are running they add up to too many false positives to handle manually.
OCI APM automation (Terraform, CLI, SDKs) helps with the creation and maintenance of the tests. That includes an option to set a test to maintenance mode, which excludes it from the availability calculation. Executing the test from multiple vantage points provides built-in redundancy, increasing the reliability of the tests. Flexible availability calculation options (e.g., at least 2 out of at least 5 test executions need to fail before a test is considered down) can be used to address the false positive problem.
Figure 2: Public Vantage Points available in OCI and other cloud providers' regions
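The failure-threshold idea can be sketched in a few lines of Python. This is an illustration, not the OCI APM implementation: the class name `DownDetector` and its parameters are hypothetical, and it assumes one possible reading of the rule, namely that a monitor counts as down only when at least `min_failures` of its most recent `window` executions failed, and only once a full window has been collected.

```python
from collections import deque


class DownDetector:
    """Sliding-window failure threshold (hypothetical helper for illustration)."""

    def __init__(self, min_failures=2, window=5):
        if not 1 <= min_failures <= window:
            raise ValueError("need 1 <= min_failures <= window")
        self.min_failures = min_failures
        self.window = window
        # Keeps only the most recent `window` results.
        self._recent = deque(maxlen=window)

    def record(self, passed):
        """Record one execution; return True if the monitor now counts as down."""
        self._recent.append(passed)
        if len(self._recent) < self.window:
            return False  # not enough executions yet to decide
        failures = sum(1 for ok in self._recent if not ok)
        return failures >= self.min_failures
```

With the defaults, a single failed run caused by an intermittent network problem never marks the monitor as down; a second failure within the same five executions does.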
Security
Applications and services have to be protected against unauthorized access with one or more authorization schemes. Synthetic monitors need to be able to complete the required authorization to progress with the test flow, and also to test the different authorization options, since they are part of the application under test. A second concern is the need to protect the passwords, tokens, usernames, etc., that are used in test scripts.
The OCI APM service supports multi-factor authentication (MFA) login and resource principal authentication, commonly used to control access to REST services. Integration with the OCI Vault service is used to protect sensitive content (like usernames and passwords) used in the scripts.
Scheduling
Some applications allow a limited number of concurrent authentication sessions, and in some cases more concurrent sessions mean a higher bill from the authorization provider. Running tests from multiple vantage points at the same time increases the number of concurrent sessions in use.
OCI APM supports several scheduling schemes, allowing configuration that will limit the number of tests executed concurrently. A second benefit can be the use of a round-robin scheme to enable wider geographical coverage while keeping a lower total test count.
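As a rough illustration of the round-robin idea, the Python sketch below rotates through vantage points so that each interval runs the test from only one location. It does not reflect how OCI APM schedules monitors internally; the function name `round_robin_schedule` and the vantage point names are hypothetical.

```python
from itertools import cycle, islice


def round_robin_schedule(vantage_points, interval_minutes, runs):
    """Return (start_minute, vantage_point) pairs, one vantage point per interval."""
    if not vantage_points:
        raise ValueError("at least one vantage point is required")
    rotation = cycle(vantage_points)
    return [
        (index * interval_minutes, vantage_point)
        for index, vantage_point in enumerate(islice(rotation, runs))
    ]


# Hypothetical vantage point names, one run every 10 minutes.
schedule = round_robin_schedule(["us-east", "eu-west", "ap-south"], 10, 6)
```

Because only one vantage point runs per interval, the test never holds more than one authentication session at a time, yet every location is still exercised once per full rotation (every 30 minutes in this example).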
Triage and diagnostics
Network test and data explorer, in-script markers (custom metrics), HAR files for each browser test execution, manual and automatic screenshot capture, and built-in integration with real user monitoring (RUM) and application tracing are powerful capabilities. They help quickly identify network, infrastructure, and application issues as the service proactively exposes them.
Figure 3: Network Data Explorer – diagnose network issues on the path from vantage points to the monitored application/service
Figure 4: Service topology and tracing used to identify slow or problematic steps in a transaction
Mitratech’s journey featuring OCI APM
Mitratech is a service provider and a proven global technology partner for corporate legal departments, risk & compliance teams, and HR professionals seeking to raise productivity, control expenses, and mitigate risk by deepening organizational alignment, increasing visibility, and spurring collaboration across the enterprise.
Mitratech was looking to consolidate multiple application performance monitoring solutions to reduce costs and streamline application monitoring for their DevOps users. Mitratech evaluated OCI APM as part of their move to run their hosted environment on OCI. Today, OCI APM is used to monitor availability, performance, and diagnostics for their Java-based microservices application.
Mitratech’s OCI Application Performance Monitoring setup
- 8 APM domains configured to monitor applications
- 10+ applications configured with OCI APM’s Synthetic Monitoring
- 350+ Java applications configured with OCI APM
- Integrations with PagerDuty and Grafana
Mitratech’s application troubleshooting workflow
When troubleshooting issues, tracing multi-span interactions and being able to drill down into the server side (including databases, hosts, and network) can be quite a challenge, one that many have had to acquire multiple tools to address. Mitratech found all those requirements could be met using OCI APM’s Tracing capability.
- Trace Explorer is used to view traces and spans and identify performance issues and bottlenecks in the monitored application, from browser to database
- Diagnostics features report CPU consumption, allocated memory, and GC impact per thread, and collect thread stack snapshots including state and lock information
Integration with PagerDuty
OCI APM is auto-discovered by PagerDuty making integration across the two solutions easy. From the service directory page in PagerDuty, search for “OCI” and select “OCI-APM.”
Integration with Grafana
Mitratech uses the OCI APM Service dashboard for all internal performance and availability needs. As a service provider, they also need to give their customers visibility into some of this data without giving them access to the OCI console. Using the OCI APM Service API, Mitratech enabled Grafana dashboards with the required data, utilizing an external authentication system.
Figure 5: OCI APM data accessed in a Grafana dashboard
To wrap up, we reviewed key points covered in session LRN2384 at Oracle CloudWorld 2023 for monitoring SLAs with examples of how Oracle and Mitratech use modern observability solutions natively available from the Oracle Cloud like OCI APM. To learn more about OCI Observability and Management including OCI APM, visit the links below.
Source: oracle.com