Exam Dumps
Every month, we help more than 1,000 people prepare for and pass their exams.

Databricks Certified Associate-Developer for Apache Spark 3.5 Exam

Databricks Certified Associate Developer for Apache Spark 3.5 - Python Online Practice

Last updated: December 9, 2025

By working through these online practice questions, you can gauge how well you know the Databricks Certified Associate-Developer for Apache Spark 3.5 exam material and then decide whether to register for the exam.

If you want to pass the exam with confidence and cut your preparation time by about 35%, choose the Databricks Certified Associate-Developer for Apache Spark 3.5 dumps (latest real exam questions), which currently include the 135 most recent questions and answers.


Question No : 1


A data engineer uses a broadcast variable to share a DataFrame containing millions of rows across executors for lookup purposes.
What will be the outcome?

Answer:
Explanation:
In Apache Spark, broadcast variables are used to efficiently distribute large, read-only data to all worker nodes. However, broadcasting very large datasets can lead to memory issues on executors if the data does not fit into the available memory.
According to the Spark documentation:
"Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. This can greatly reduce the amount of data sent over the network."
However, it also notes:
"Using the broadcast functionality available in SparkContext can greatly reduce the size of each serialized task, and the cost of launching a job over a cluster. If your tasks use any large object from the driver program inside of them (e.g., a static lookup table), consider turning it into a broadcast variable."
But caution is advised when broadcasting large datasets:
"Broadcasting large variables can cause out-of-memory errors if the data does not fit in the memory of each executor."
Therefore, if the broadcasted DataFrame containing millions of rows exceeds the memory capacity of the executors, the job may fail due to memory constraints.
Reference: Spark 3.5.5 Documentation - Tuning
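For illustration, a minimal sketch of the pattern in question; the lookup path and the id/value column names are assumptions:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical lookup data with millions of rows (path and column names are assumptions).
lookup_df = spark.read.parquet("/data/lookup")

# Collecting to the driver and broadcasting copies the whole map to every executor.
lookup_map = {row["id"]: row["value"] for row in lookup_df.collect()}
bcast_lookup = spark.sparkContext.broadcast(lookup_map)

# If the broadcast payload exceeds available executor (or driver) memory,
# tasks can fail with out-of-memory errors.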

Question No : 2


Which feature of Spark Connect is considered when designing an application to enable remote interaction with the Spark cluster?

Answer:
Explanation:
Spark Connect introduces a decoupled client-server architecture. Its key feature is enabling Spark job submission and execution from remote clients, in languages such as Python and Java.
From Databricks documentation:
“Spark Connect allows remote clients to connect to a Spark cluster and execute Spark jobs without being co-located with the Spark driver.”
Option A is close, but "any language" is overstated (Spark Connect currently supports Python, Java, etc., not literally all languages).
Option B refers to REST, which is not Spark Connect's mechanism.
Option D is incorrect; Spark Connect is not focused on ingestion.
Final Answer C
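A minimal sketch of connecting through Spark Connect; the endpoint URL is an assumption:
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect endpoint (host and port are assumptions).
spark = SparkSession.builder.remote("sc://spark-connect-host:15002").getOrCreate()

# The DataFrame plan is built on the client and sent to the server;
# execution happens on the cluster.
spark.range(10).show()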

Question No : 3


You have:
DataFrame A: 128 GB of transactions
DataFrame B: 1 GB user lookup table
Which strategy is correct for broadcasting?

Answer:
Explanation:
Broadcast joins work by sending the smaller DataFrame to all executors, eliminating the shuffle of the larger DataFrame.
From Spark documentation:
“Broadcast joins are efficient when one DataFrame is small enough to fit in memory. Spark avoids shuffling the larger table.”
DataFrame B (1 GB) is small enough to fit in executor memory and should be broadcast (for example, with an explicit broadcast() hint, since 1 GB exceeds the default 10 MB spark.sql.autoBroadcastJoinThreshold).
It eliminates the need to shuffle the large DataFrame A.
Final Answer B
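A sketch of the intended broadcast join; the source paths and the user_id join key are assumptions:
from pyspark.sql.functions import broadcast

df_a = spark.read.parquet("/data/transactions")   # ~128 GB of transactions (assumed path)
df_b = spark.read.parquet("/data/users")          # ~1 GB user lookup table (assumed path)

# Broadcasting the small table ships it to every executor and avoids shuffling df_a.
result = df_a.join(broadcast(df_b), on="user_id", how="left")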

Question No : 4


A data engineer replaces the exact percentile() function with approx_percentile() to improve performance, but the results are drifting too far from expected values.
Which change should be made to solve the issue?



Answer:
Explanation:
The approx_percentile function in Spark is a performance-optimized alternative to percentile. It takes an optional accuracy parameter:
approx_percentile(column, percentage, accuracy)
Higher accuracy values → more precise results, but increased memory/computation.
Lower values → faster but less accurate.
From the documentation:
“Increasing the accuracy improves precision but increases memory usage.”
Final Answer D
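As a sketch (the column name and accuracy value are assumptions), raising the third argument above its default of 10000 tightens the approximation:
from pyspark.sql.functions import expr

# Higher accuracy (e.g. 50000 instead of the default 10000) yields results
# closer to the exact percentile, at the cost of more memory and computation.
df.select(expr("approx_percentile(price, 0.5, 50000)").alias("approx_median"))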

Question No : 5


A data engineer is streaming data from Kafka and requires:
Minimal latency
Exactly-once processing guarantees
Which trigger mode should be used?

Answer:
Explanation:
Exactly-once guarantees in Spark Structured Streaming require micro-batch mode (default), not continuous mode.
Continuous mode (.trigger(continuous=...)) only supports at-least-once semantics and lacks full fault-tolerance.
trigger(availableNow=True) is a batch-style trigger, not suited for low-latency streaming.
So:
Option A uses micro-batching with a tight trigger interval → minimal latency + exactly-once guarantee.
Final Answer A
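A minimal sketch of the micro-batch approach, assuming a streaming DataFrame events_df already read from Kafka and hypothetical sink/checkpoint paths:
query = (events_df.writeStream
    .format("parquet")
    .option("path", "/data/output/events")                      # assumed sink path
    .option("checkpointLocation", "/data/checkpoints/events")   # checkpointing enables exactly-once
    .trigger(processingTime="1 second")                         # micro-batch with a tight interval
    .start())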

Question No : 6


A data engineer wants to write a Spark job that creates a new managed table. If the table already exists, the job should fail and not modify anything.
Which save mode and method should be used?

Answer:
Explanation:
The method saveAsTable() creates a new managed table, and with the default ErrorIfExists save mode it fails if the table already exists.
From Spark documentation:
"The mode 'ErrorIfExists' (default) will throw an error if the table already exists."
Thus:
Option A is correct.
Option B (Overwrite) would overwrite existing data, which is not acceptable here.
Option C and D use save(), which doesn't create a managed table with metadata in the metastore.
Final Answer A
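A short sketch; the table name is an assumption, and since errorifexists is the default mode, mode() could be omitted:
# Fails if the table already exists; nothing is modified.
df.write.mode("errorifexists").saveAsTable("sales.new_orders")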

Question No : 7


A data engineer noticed improved performance after upgrading from Spark 3.0 to Spark 3.5. The engineer found that Adaptive Query Execution (AQE) was enabled.
Which operation is AQE implementing to improve performance?

Answer:
Explanation:
Adaptive Query Execution (AQE) is a Spark 3.x feature that dynamically optimizes query plans at runtime.
One of its core features is:
Dynamically switching join strategies (e.g., from sort-merge to broadcast) based on runtime statistics.
Other AQE capabilities include:
Coalescing shuffle partitions
Skew join handling
Option A is correct.
Option B refers to statistics collection, which is not AQE's primary function.
Option C is too broad and not AQE-specific.
Option D refers to Delta Lake optimizations, unrelated to AQE.
Final Answer A
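The relevant configuration flags, for reference (AQE is enabled by default starting with Spark 3.2):
# AQE re-optimizes the query plan at runtime using shuffle statistics,
# which allows switching to a broadcast join when a side turns out to be small.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Related AQE features mentioned above:
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")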

Question No : 8


A data engineer needs to write a Streaming DataFrame as Parquet files.
Given the code:



Which code fragment should be inserted to meet the requirement?
A)



B)



C)



D)



Answer:
Explanation:
To write a structured streaming DataFrame to Parquet files, the correct way to specify the format and output directory is:
.writeStream
.format("parquet")
.option("path", "path/to/destination/dir")
According to Spark documentation:
“When writing to file-based sinks (like Parquet), you must specify the path using the .option("path", ...) method. Unlike batch writes, .save() is not supported.”
Option A incorrectly uses .option("location", ...) (invalid for Parquet sink).
Option B incorrectly sets the format via .option("format", ...), which is not the correct method.
Option C repeats the same issue.
Option D is correct: .format("parquet") + .option("path", ...) is the required syntax.
Final Answer D
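Assembling the fragment into a runnable sketch; the paths are assumptions, and file sinks also require a checkpoint location:
query = (streaming_df.writeStream
    .format("parquet")
    .option("path", "/data/output/parquet")                     # assumed destination directory
    .option("checkpointLocation", "/data/checkpoints/parquet")  # assumed checkpoint directory
    .start())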

Question No : 9


A Data Analyst is working on the DataFrame sensor_df, which contains two columns:
Which code fragment returns a DataFrame that splits the record column into separate columns and has one array item per row?
A)



B)



C)



D)



Answer:
Explanation:
To flatten an array of structs into individual rows and access fields within each struct, you must:
Use explode() to expand the array so each struct becomes its own row.
Access the struct fields via dot notation (e.g., record_exploded.sensor_id).
Option C does exactly that:
First, explode the record array column into a new column record_exploded.
Then, access fields of the struct using the dot syntax in select.
This is standard practice in PySpark for nested data transformation.
Final Answer C
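A sketch of the Option C pattern, assuming the structs contain fields named sensor_id and value:
from pyspark.sql.functions import explode, col

# Each element of the record array becomes its own row.
exploded_df = sensor_df.select(explode(col("record")).alias("record_exploded"))

# Access the struct fields with dot notation.
result_df = exploded_df.select(
    col("record_exploded.sensor_id"),   # assumed struct field
    col("record_exploded.value"),       # assumed struct field
)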

Question No : 10


A Spark application is experiencing performance issues in client mode because the driver is resource-constrained.
How should this issue be resolved?

Answer:
Explanation:
In Spark’s client mode, the driver runs on the local machine that submitted the job. If that machine is resource-constrained (e.g., low memory), performance degrades.
From the Spark documentation:
"In cluster mode, the driver runs inside the cluster, benefiting from cluster resources and scalability." Option A is incorrect ― executors do not help the driver directly. Option B might help short-term but does not scale.
Option C is correct ― switching to cluster mode moves the driver to the cluster.
Option D (local mode) is for development/testing, not production.
Final Answer C

Question No : 11


Given a CSV file with the content:



And the following code:
from pyspark.sql.types import *
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])
spark.read.schema(schema).csv(path).collect()
What is the resulting output?

Answer:
Explanation:
In Spark, when a CSV row does not match the provided schema, Spark does not raise an error by default. Instead, it returns null for fields that cannot be parsed correctly.
In the first row, "hello" cannot be cast to Integer for the age field → Spark sets age=None
In the second row, "20" is a valid integer → age=20
So the output will be:
[Row(name='bambi', age=None), Row(name='alladin', age=20)]
Final Answer C

Question No : 12


A Data Analyst needs to retrieve employees with 5 or more years of tenure.
Which code snippet filters and shows the list?

Answer:
Explanation:
To filter rows based on a condition and display them in Spark, use filter(...).show():
employees_df.filter(employees_df.tenure >= 5).show()
Option A is correct and shows the results.
Option B filters but doesn’t display them.
Option C uses Python’s built-in filter, not Spark.
Option D collects the results to the driver, which is unnecessary if .show() is sufficient.
Final Answer A

Question No : 13


A data engineer observes that an upstream streaming source sends duplicate records, where duplicates share the same key and have at most a 30-minute difference in event_timestamp.
The engineer adds:
dropDuplicatesWithinWatermark("event_timestamp", "30 minutes")
What is the result?

Answer:
Explanation:
The method dropDuplicatesWithinWatermark() in Structured Streaming drops duplicate records based on a specified column and watermark window. The watermark defines the threshold for how late data is considered valid.
From the Spark documentation:
"dropDuplicatesWithinWatermark removes duplicates that occur within the event-time watermark window."
In this case, Spark will retain the first occurrence and drop subsequent records within the 30-minute watermark window.
Final Answer B
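In PySpark 3.5 this call is typically paired with a watermark declaration; a sketch, assuming a streaming DataFrame events with a key column:
deduped = (events
    .withWatermark("event_timestamp", "30 minutes")
    .dropDuplicatesWithinWatermark(["key"])   # keeps the first record per key within the watermark window
)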

Question No : 14


Which UDF implementation calculates the length of strings in a Spark DataFrame?

Answer:
Explanation:
Option B uses Spark’s built-in SQL function length(), which is efficient and avoids the overhead of a Python UDF:
from pyspark.sql.functions import length, col
df.select(length(col("stringColumn")).alias("length"))
Explanation of other options:
Option A is incorrect syntax; spark.udf is not called this way.
Option C registers a UDF but doesn’t apply it in the DataFrame transformation.
Option D is syntactically valid but uses a Python UDF which is less efficient than built-in functions.
Final Answer B
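For contrast, a sketch of the Python UDF alternative (names are illustrative); it is valid but incurs serialization overhead compared with the built-in length():
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# Python UDF: each value is serialized to a Python worker and back.
string_length_udf = udf(lambda s: len(s) if s is not None else None, IntegerType())
df.select(string_length_udf(col("stringColumn")).alias("length"))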

Question No : 15


A Spark application developer wants to identify which operations cause shuffling, leading to a new stage in the Spark execution plan.
Which operation results in a shuffle and a new stage?

Answer:
Explanation:
Operations that trigger data movement across partitions (like groupBy, join, repartition) result in a shuffle and a new stage.
From Spark documentation:
“groupBy and aggregation cause data to be shuffled across partitions to combine rows with the same key.”
Option A (groupBy + agg) → causes shuffle.
Options B, C, and D (filter, withColumn, select) → transformations that do not require shuffling; they are narrow dependencies.
Final Answer A
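A sketch contrasting a wide (shuffling) transformation with narrow ones; the column names are assumptions:
from pyspark.sql.functions import sum as _sum, col

# Wide transformation: rows with the same key must be moved across partitions,
# so this triggers a shuffle and starts a new stage.
agg_df = df.groupBy("customer_id").agg(_sum("amount").alias("total"))

# Narrow transformations: each output partition depends on a single input
# partition, so no shuffle and no new stage.
filtered_df = df.filter(col("amount") > 100).select("customer_id", "amount")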
