DevOps-SRE Exam - PeopleCert Actual Exam Questions and Answers - 40 Questions

Question No : 1

What is the MOST widely tracked Service Level Objective (SLO)?

A. Performance
B. Observability
C. Securability
D. Availability

Answer: D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Availability is the most widely tracked and commonly understood SLO across nearly all digital services. It measures whether users are able to successfully access and use the system. Because unavailability directly impacts user experience, revenue, trust, and reliability, it is the primary SLO used across industries.
The Site Reliability Engineering Book, Chapter “Service Level Objectives,” states:
“Availability is one of the most common and important SLOs since it reflects the basic ability of the service to function for users.”
The SRE Workbook also notes:
“Availability targets (e.g., 99.9%, 99.99%) are the most widely used form of SLOs and form the foundation of error budget policies.”
While performance SLOs are also common, availability SLOs are almost universal and foundational.
Thus, D. Availability is the correct answer.
Reference: Site Reliability Engineering Book, “Service Level Objectives”
SRE Workbook, “Implementing SLOs”
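Worked example: an availability target such as 99.9% or 99.99% translates directly into a downtime allowance. The sketch below is illustrative only; the function name `allowed_downtime_minutes` and the 30-day window are assumptions made for the example, not something the exam material prescribes.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime budget, in minutes, implied by an availability target.

    slo_target is a fraction (0.999 means 99.9%); window_days is an assumed
    measurement window.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


for target in (0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} over 30 days allows about {minutes:.2f} minutes of downtime")
```

Each additional "nine" cuts the allowance by a factor of ten, which is why the choice of availability target drives the error budget policy.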

Question No : 2

Which of the following is a principle of SRE-Led Service Automation?

A. No automated tests in production
B. Environments provisioned using IaC
C. Using unsigned artifacts in production
D. Adding as much hardware as possible

Answer: B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
SRE-led service automation focuses on making environments reproducible, reliable, and consistent. One of the key principles aligned with Google SRE practices is the use of Infrastructure as Code (IaC), which allows environments to be provisioned automatically, consistently, and predictably.
The Site Reliability Engineering Book, in its discussions on automation, states:
“Automation implemented as code ensures that environments are consistent, repeatable, and less prone to human error.”
The SRE Workbook expands on this concept:
“Infrastructure as Code allows services to scale and evolve reliably by ensuring that configuration and infrastructure changes are automated and version-controlled.”
IaC is fundamental to:
Reducing toil
Increasing reliability
Enabling consistent automation across environments
Reducing configuration drift
Why the other options are incorrect:
A. SRE supports testing in production; it does not ban automated tests.
C. Using unsigned artifacts violates security and reliability best practices.
D. Adding hardware is not an automation principle and contradicts efficiency goals.
Thus, the correct answer is B.
Reference: Site Reliability Engineering Book, “Eliminating Toil” and automation sections
SRE Workbook, “Automation and Infrastructure as Code”
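Worked example: the core idea behind IaC is that the environment is declared as data and a tool computes the changes needed to reach that declaration. The sketch below simulates this with plain dictionaries; `plan`, `converge`, and the resource names are hypothetical and do not correspond to any real IaC tool's API.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare declared resources with what exists and list the changes needed."""
    return {
        "create": sorted(name for name in desired if name not in actual),
        "update": sorted(
            name for name in desired
            if name in actual and desired[name] != actual[name]
        ),
        "delete": sorted(name for name in actual if name not in desired),
    }


def converge(desired: dict, actual: dict) -> dict:
    """Return the environment state after applying the plan (simulated)."""
    return {name: dict(spec) for name, spec in desired.items()}


# Hypothetical declared configuration, as it might live in version control.
desired_state = {
    "web": {"size": "small", "count": 3},
    "db": {"size": "large", "count": 1},
}
# Hypothetical environment that has drifted from the declaration.
actual_state = {
    "web": {"size": "small", "count": 2},
    "cache": {"size": "small", "count": 1},
}

print(plan(desired_state, actual_state))
actual_state = converge(desired_state, actual_state)
print(plan(desired_state, actual_state))  # nothing left to change
```

Running the plan a second time yields no changes, which illustrates the repeatability and drift reduction the explanation attributes to IaC.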

Question No : 3

How does automation reduce toil?

A. Automated releases can replace manual releases
B. We can use artificial intelligence to tell us where we are wasting all of our time
C. We can use video conference facilities to prevent travel to meetings
D. Automation doesn’t reduce toil. In fact, creating automation requires more toil.

Answer: A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Automation is the primary method of reducing toil in SRE. The Google Site Reliability Engineering Book, Chapter “Eliminating Toil,” states:
“Automation is the most effective tool for reducing toil. Any recurring, manual, automatable task should be automated to prevent it from consuming engineering time.”
Automated release systems directly eliminate toil by:
Removing manual deployment steps
Removing repeated, error-prone human processes
Increasing reliability and consistency
Freeing engineers for high-value project work
The SRE Workbook reinforces this:
“CI/CD pipelines and release automation remove significant operational toil by replacing manual processes with repeatable, reliable automation.”
Why the other answers are incorrect:
B. AI is not required for toil reduction.
C. Meeting travel is not an SRE toil concern.
D. Incorrect; automation substantially reduces long-term toil, even though the initial setup requires effort.
Thus, A is the correct answer.
Reference: Site Reliability Engineering Book, “Eliminating Toil”
SRE Workbook, “Toil Reduction Strategies”

Question No : 4

Why is observability potentially better than traditional monitoring?

A. Observability is less expensive than traditional monitoring
B. Traditional monitoring does not adapt well to the cloud since it focuses on discrete components and applications
C. Traditional monitoring can struggle to scale when service growth is rapid
D. Traditional monitoring cannot support containers

Answer: B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Traditional monitoring works well when systems are static and predictable. However, cloud-native, distributed, and microservice-based architectures create highly dynamic environments. In these cases, observability becomes more effective because it provides visibility across entire systems, rather than focusing on individual components.
From Google’s Observability guidance:
“Traditional monitoring relies on predefined dashboards and known failure modes. In modern cloud systems, component-level monitoring becomes insufficient because failures occur in ways that cannot always be predicted.”
Further, in the SRE Workbook:
“Monitoring individual components does not provide adequate visibility into complex distributed systems. Observability enables teams to understand system-wide behavior and user impact.”
Why options are incorrect:
A. Observability is not inherently cheaper.
C. While true, it is not the best reason; observability's benefit is broader than scale alone.
D. Traditional monitoring can support containers but often becomes noisy and ineffective.
Thus, the best answer is B.
Reference: SRE Workbook, “Monitoring and Observability”
Google Cloud Architecture Framework, “Observability vs Monitoring”

Question No : 5

Which of the following is the definition for Application Performance Management (APM)?

A. The highly automated communications process by which measurements are made and other data collected at remote or inaccessible points and transmitted to receiving equipment for monitoring
B. The monitoring and management of performance and availability of software applications
C. The use of a hardware or software component to monitor system resources and performance of a computer system
D. Ways for engineers to communicate quantitative data about systems

Answer: B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Application Performance Management (APM) refers to a set of tools and practices used to monitor and manage the performance, behavior, and availability of software applications. Although APM is not explicitly defined in the Google SRE Book, it is described within the broader context of monitoring and observability.
In the SRE Workbook, under Monitoring:
“Application monitoring tools provide insights into the performance, latency, availability, and behavior of applications to help engineering teams maintain reliability.”
Industry-standard APM frameworks (including Google Cloud Operations Suite, formerly Stackdriver) define APM as:
“The monitoring and management of application performance and availability.”
Why the other options are incorrect:
A describes telemetry, not APM.
C describes system monitoring (infrastructure), not application performance monitoring.
D refers to communication of metrics, not the monitoring of application performance.
Therefore, B is the correct definition.
Reference: SRE Workbook, “Monitoring”
Google Cloud Operations Suite (APM documentation)

Question No : 6

A bank has been using traditional monitoring tools for ensuring that their systems are available and operating as planned. Their strategic initiatives now include a renewed focus on customer experience as well as identifying ways to scale service.
Why would migrating to an observability approach be important now?

A. It’s better for managing container workloads and dynamic architectures
B. Monitoring at the component level may no longer provide the right data
C. It is impossible to anticipate all potential problems
D. All of the above

Answer: D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
All the listed reasons correctly describe why observability becomes essential in modern, user-focused, dynamically scaling architectures.
The SRE Workbook and Google Observability guidance both emphasize that traditional monitoring is insufficient in environments where:
Services are distributed
Traffic is unpredictable
Customer experience is a priority
Cloud-native, containerized, or microservice architectures are used
Key excerpts:
From Google’s Observability guidance:
“Monitoring relies on known failure modes; observability enables teams to explore unknown-unknowns and understand complex, dynamic systems.”
From the SRE Workbook:
“As systems scale and architectures shift toward microservices or containers, component-level monitoring provides an incomplete picture. Observability enables teams to understand user impact and system behavior holistically.”
Thus:
A. Observability is critical for containerized and dynamic environments.
B. Component monitoring alone cannot show customer experience or end-to-end reliability.
C. Observability helps teams diagnose issues that could not be predicted in advance ("unknown unknowns").
All statements are correct, making D the correct answer.
Reference: SRE Workbook, “Monitoring and Observability”
Google Cloud Architecture Framework: “Observability vs Monitoring”
Site Reliability Engineering Book, Alerting & Monitoring chapters

Question No : 7

Which of the following BEST describes observability?

A. Monitoring applications to detect problems and anomalies
B. Performing fitness tests and health checks
C. A measure of how well internal states of a system can be inferred from knowledge of its external outputs
D. Collecting data from multiple endpoints to aggregate and observe application performance

Answer: C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
The term observability comes directly from control theory and refers to the ability to infer the internal state of a system from its external outputs. Modern SRE and observability practices adopt this definition.
Google’s Site Reliability Engineering guidance (SRE Book Addendum on Observability) states:
“Observability is a property of a system that allows operators to understand its internal state by examining its outputs such as logs, metrics, and traces.”
This aligns exactly with Option C, the formal definition.
Why the other options are incorrect:
A. Monitoring is part of observability, but observability is much broader.
B. Health checks are simply one signal; they do not represent observability.
D. Data collection is a mechanism, not the definition of observability itself.
Thus, C is the correct and academically accurate definition.
Reference: Site Reliability Engineering Book Addendum: Observability
Google Cloud Architecture Framework: Observability Principles

Question No : 8

Service Level Indicator data helps to understand how much Error Budget is left.
TRUE or FALSE?

A. True
B. False

Answer: A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Service Level Indicators (SLIs) provide the quantitative measurements needed to determine whether a Service Level Objective (SLO) is being met. Since the error budget is defined as the allowable amount of unreliability, SLI data is the source of truth for calculating how much of that budget remains.
From the Site Reliability Engineering Book, Chapter “Service Level Objectives”:
“SLIs provide the measurements used to determine compliance with SLOs. Error budgets are computed directly from the SLI measurements over the defined time window.”
The SRE Workbook further explains:
“Error budgets quantify the inverse of SLO performance. SLIs provide the raw data that allow teams to calculate how much of the budget has been consumed and how much remains.”
Thus, SLI data is the basis for calculating the remaining error budget.
Therefore, the statement is True.
Reference: Site Reliability Engineering Book, “Service Level Objectives”
SRE Workbook, “Implementing SLIs and SLOs”
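Worked example: the relationship between SLI data and the remaining error budget can be sketched as simple arithmetic. The request-based SLI, the function name `error_budget_remaining`, and the sample counts below are assumptions chosen for illustration.

```python
def error_budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Return the fraction of the error budget still unspent.

    A negative result means the budget for the window has been overspent.
    """
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1.0 - slo_target) * total_events
    if allowed_bad == 0:
        raise ValueError("a 100% target leaves no error budget")
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Hypothetical window: 1,000,000 requests, 600 of which failed, against a 99.9% SLO.
remaining = error_budget_remaining(good_events=999_400, total_events=1_000_000, slo_target=0.999)
print(f"Error budget remaining: {remaining:.0%}")
```

With a 99.9% target, 1,000 failures are allowed in this window; 600 observed failures leave roughly 40% of the budget.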

Question No : 9

Why would some Service Level Indicators require client-side data?

A. There may be metrics affecting users that are not reflected on the server side
B. It would be difficult to negotiate service level agreements with customers without client data
C. It would be difficult to engineer external automation without client side data
D. Service Level Objectives may not be achievable without client side data

Answer: A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
SLIs must measure user experience, and sometimes server-side metrics alone do not show the full picture. Client-side data may reveal issues such as:
Slow networks
Browser rendering delays
Mobile device limitations
CDN performance issues
Last-mile latency
The Site Reliability Engineering Book, Chapter “Service Level Indicators,” states:
“Server-side metrics do not always fully capture the user experience. In many cases, client-side measurements are required to understand the actual reliability delivered to users.”
The SRE Workbook reinforces:
“Some SLIs require client instrumentation because user-visible performance problems may not be observable from backend systems alone.”
Why the other options are incorrect:
B. SLA negotiation has nothing to do with SLI selection.
C. Automation engineering is unrelated to client-side measurement needs.
D. Achievability of SLOs does not determine whether client-side data is needed; accuracy of user-experience measurement does.
Thus, the correct answer is A.
Reference: Site Reliability Engineering Book, “Service Level Indicators”
SRE Workbook, “Choosing the Right SLIs”
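Worked example: the gap between server-side and client-side measurements can be made concrete with a latency SLI. The latency samples and the 300 ms threshold below are invented for illustration, and `latency_sli` is a hypothetical helper.

```python
def latency_sli(latencies_ms: list, threshold_ms: float) -> float:
    """Return the fraction of requests that completed within threshold_ms."""
    if not latencies_ms:
        raise ValueError("at least one measurement is required")
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return fast / len(latencies_ms)


# Hypothetical measurements of the same five requests.
server_side_ms = [80, 95, 110, 120, 130]    # time spent inside the backend
client_side_ms = [180, 240, 310, 420, 650]  # backend time plus network and rendering

print("server-side SLI:", latency_sli(server_side_ms, threshold_ms=300))
print("client-side SLI:", latency_sli(client_side_ms, threshold_ms=300))
```

In this made-up data every request looks healthy from the server, while only two of five meet the threshold as experienced by the user, which is the situation option A describes.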

Question No : 10

What is one of the key characteristics of a Service Level Indicator (SLI)?

A. It must be captured in a Service Level Agreement (SLA)
B. It should focus on server-side metrics
C. It must have a time horizon
D. It must be agreed to by the SRE team and the Agile Team

Answer: C
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
A Service Level Indicator (SLI) is a measurement of some aspect of reliability (e.g., latency, availability, quality). One of its defining characteristics is that it must be measured over a specific time window. Without a time horizon, the SLI has no actionable meaning.
From the Site Reliability Engineering Book, Chapter “Service Level Indicators”:
“An SLI is a quantitative measure of some aspect of the level of service that is provided. SLIs are evaluated over a specific period of time in order to understand reliability as experienced by the user.”
The SRE Workbook further states:
“Every SLI must define a measurement window. Without a time horizon, the indicator cannot be used to calculate SLO compliance.”
Why the other options are incorrect:
A. SLIs do not need to appear in an SLA; SLAs are external contracts, whereas SLOs and SLIs are internal engineering tools.
B. SLIs may include client-side, server-side, or network metrics depending on what reflects user experience.
D. SLI agreement is not defined by SRE vs. Agile teams; it is defined by business and user need.
Thus, the correct answer is C.
Reference: Site Reliability Engineering Book, “Service Level Indicators”
SRE Workbook, “Defining SLIs and SLOs”
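Worked example: the same event stream yields different SLI values depending on the measurement window, which is why the time horizon has to be stated. The event data, the reference time, and the helper `windowed_availability` below are assumptions made for this sketch.

```python
from datetime import datetime, timedelta


def windowed_availability(events, now, window):
    """Return the success ratio of events inside (now - window, now].

    events is an iterable of (timestamp, succeeded) pairs. Returns None when
    no event falls inside the window.
    """
    start = now - window
    in_window = [succeeded for timestamp, succeeded in events if start < timestamp <= now]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)


# Hypothetical probe results relative to an arbitrary reference time.
now = datetime(2024, 1, 31, 12, 0)
events = [
    (now - timedelta(days=40), False),  # outside a 30-day window
    (now - timedelta(days=10), True),
    (now - timedelta(days=5), True),
    (now - timedelta(days=1), False),
    (now - timedelta(hours=1), True),
]

print("30-day SLI:", windowed_availability(events, now, timedelta(days=30)))
print("7-day SLI:", windowed_availability(events, now, timedelta(days=7)))
```

The 30-day and 7-day windows disagree on the same data, so an SLI quoted without its window cannot be checked against an SLO.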

Question No : 11

Which of the following describes work that would be considered "toil"?

A. Work that is devoid of enduring value
B. Work that has some enduring value but requires manual tasks
C. Engineering work to add service features
D. Engineering work that does not add enduring value

Answer: A
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
“Toil” in SRE has a very specific meaning. According to the Site Reliability Engineering Book, Chapter “Eliminating Toil”:
“Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, has no enduring value, and scales linearly as the service grows.”
The key phrase is “no enduring value.” Toil does not produce lasting improvement, even though it may be necessary in the short term. It consumes engineering effort without making the system better over time.
Why the other options are incorrect:
B. Work that has some enduring value cannot be classified as toil by definition.
C. Engineering work that adds service features is explicitly non-toil, because SRE defines feature work as “project work,” not operational toil.
D. Seems close but is misleading: engineering work without enduring value is poor engineering, not necessarily toil. Toil refers specifically to operational workload.
Thus, A is the correct and precise definition of toil.
Reference: Site Reliability Engineering Book, “Eliminating Toil”

Question No : 12

Which type of engineering work will reduce toil within the service?

A. Continuous delivery pipelines
B. Scripts and automation tools outside of the service
C. Scalable infrastructure
D. Internal automation

Answer: D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Toil-reduction engineering focuses on making the service itself easier to operate. The most direct way to achieve this is through internal automation, meaning automation built into the service that eliminates repetitive, manual operational tasks.
The Site Reliability Engineering Book, Chapter “Eliminating Toil,” states:
“Automation that replaces manual, repetitive operational tasks is the primary mechanism for reducing toil. The most effective form of toil reduction is automation that is integrated directly into the service itself.”
The SRE Workbook reinforces:
“Internal automation contributes directly to service reliability and reduces the operational burden by ensuring that manual tasks are permanently removed.”
Why the other options are not the best answer:
A. Continuous delivery pipelines reduce release friction but do not directly remove service-operational toil.
B. External scripts and tools help but are less effective and harder to maintain than internal automation.
C. Scalable infrastructure reduces linear-scaling toil but does not address broader operational burdens.
Thus, the correct answer is D.
Reference: Site Reliability Engineering Book, “Eliminating Toil”
SRE Workbook, “Toil Reduction Approaches”

Question No : 13

An organization is experiencing significant turnover of IT operational staff with most not staying more than one year. The HR Director and IT Director are trying to determine why they are having difficulty retaining IT operations professionals.
What could be one of the reasons?

A. Overload and disruptive work patterns
B. Lack of time for skills development
C. More time spent managing the backlog than fixing problems
D. All of the above

Answer: D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
High turnover in IT operations roles is often driven by a combination of factors, not just one. The Google SRE Book, Chapter “Eliminating Toil,” outlines that excessive toil, unpredictable work, and overload contribute to burnout and churn:
“Excessive operational workload and interrupt-driven work lead to burnout and high attrition among engineering and operational staff.”
The SRE Workbook adds:
“Teams overwhelmed with toil struggle to innovate, automate, or develop new skills, creating frustration and increasing turnover.”
Each option listed represents a recognized driver of burnout in SRE and operations environments:
Overload and disruptive work patterns are known contributors to burnout.
Lack of time for skills development demotivates engineers and prevents career growth.
Backlog-driven cultures force teams into reactive rather than proactive work.
The combination of these factors matches common causes of attrition in operations teams.
Therefore, D (All of the above) is the correct answer.
Reference: Site Reliability Engineering Book, “Eliminating Toil”
SRE Workbook, “Addressing Operational Overload”

Question No : 14

Which of these approaches can alleviate linear scaling toil?

A. Manual scaling of services
B. Using auto-scaling capabilities
C. Outsourcing development
D. Switching cloud providers

Answer: B
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Linear-scaling toil refers to work whose effort increases proportionally to service growth, such as manually provisioning servers or handling capacity expansion. The Google SRE Book, Chapter “Eliminating Toil,” explains:
“Toil is work that scales linearly with the size of your service. A core strategy for reducing toil is to introduce automation that breaks the linear relationship.”
Auto-scaling capabilities directly address linear-scaling toil by automating resource allocation based on load or demand. This prevents engineers from repeatedly and manually adjusting infrastructure as usage grows.
The SRE Workbook also emphasizes:
“Infrastructure automation such as auto-scaling removes a major source of linear scaling toil by ensuring that capacity adjusts automatically as services grow.”
Why the other options are incorrect:
A. Manual scaling is linear-scaling toil, not a solution.
C. Outsourcing development does not reduce operational toil.
D. Switching cloud providers alone does not solve toil unless automation is introduced.
Thus, B is the correct answer.
Reference: Site Reliability Engineering Book, “Eliminating Toil”
SRE Workbook, “Toil Reduction Strategies”
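Worked example: auto-scaling replaces a recurring manual capacity decision with a rule. The sketch below uses a simple proportional rule that resembles the one documented for the Kubernetes Horizontal Pod Autoscaler; the function name `desired_replicas`, the load units, and the replica limits are assumptions for illustration only.

```python
import math


def desired_replicas(current_replicas: int, current_load: float, target_load: float,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Return the replica count that brings per-replica load back to target_load.

    current_load and target_load are per-replica values in the same unit
    (for example, requests per second per replica).
    """
    if current_replicas < 1 or target_load <= 0:
        raise ValueError("current_replicas must be >= 1 and target_load must be > 0")
    raw = math.ceil(current_replicas * current_load / target_load)
    # Clamp to the configured bounds so a traffic spike cannot scale without limit.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, current_load=90, target_load=60))    # scale out
print(desired_replicas(4, current_load=20, target_load=60))    # scale in
print(desired_replicas(10, current_load=300, target_load=60))  # capped at max_replicas
```

Because the rule runs on every evaluation cycle without human input, engineer effort no longer grows in step with traffic, which is the linear relationship the explanation says automation should break.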

Question No : 15

Known workarounds represent what type of toil?

A. Linear scaling
B. Tactical
C. Automatable
D. No enduring value

Answer: D
Explanation:
Comprehensive and Detailed Explanation From Exact Extract:
Known workarounds represent toil that has no enduring value, one of the key characteristics of toil defined by the SRE framework.
From the Site Reliability Engineering Book, Chapter “Eliminating Toil”:
“Toil is work that is manual, repetitive, automatable, tactical, has no enduring value, and scales linearly with service size.”
Known workarounds fit this definition because:
They solve the same recurring problems repeatedly
They do not permanently fix the underlying issue
They consume engineer time without contributing long-term improvements
These activities lack enduring value and should be eliminated through automation or engineering fixes.
How each option compares:
A. Linear scaling: Many forms of toil scale linearly, but this does not specifically describe workarounds.
B. Tactical: Tactical means short-term, but not all tactical work is a workaround.
C. Automatable: While some workarounds can be automated, not all are.
D. No enduring value: This is the defining trait of workaround-type toil.
Therefore, option D is correct.
Reference: Site Reliability Engineering Book, “Eliminating Toil”
SRE Workbook, “Toil Reduction Strategies”
