Databricks Certified Associate Developer for Apache Spark 3.5 - Python Online Practice
Last updated: December 09, 2025
These online practice questions let you gauge how well you know the Databricks Certified Associate Developer for Apache Spark 3.5 exam material before deciding whether to register for the exam.
To pass the exam and save roughly 35% of your preparation time, consider the Databricks Certified Associate Developer for Apache Spark 3.5 dumps (latest real exam questions), which currently include 135 questions and answers.
Correct answer:
Explanation:
In Apache Spark, broadcast variables are used to efficiently distribute large, read-only data to all worker nodes. However, broadcasting very large datasets can lead to memory issues on executors if the data does not fit into the available memory.
According to the Spark documentation:
"Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. This can greatly reduce the amount of data sent over the network."
However, it also notes:
"Using the broadcast functionality available in SparkContext can greatly reduce the size of each serialized task, and the cost of launching a job over a cluster. If your tasks use any large object from the driver program inside of them (e.g., a static lookup table), consider turning it into a broadcast variable."
But caution is advised when broadcasting large datasets:
"Broadcasting large variables can cause out-of-memory errors if the data does not fit in the memory of each executor."
Therefore, if the broadcasted DataFrame containing millions of rows exceeds the memory capacity of the executors, the job may fail due to memory constraints.
Reference: Spark 3.5.5 Documentation - Tuning
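As a minimal sketch of this pattern (the lookup data and column names are hypothetical), a small read-only object is broadcast once and read from inside tasks; only data that comfortably fits in executor memory should be handled this way:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("broadcast-variable-demo").getOrCreate()

# Small, read-only lookup table cached once per executor.
country_lookup = {"US": "United States", "DE": "Germany"}
bc_lookup = spark.sparkContext.broadcast(country_lookup)

@udf(returnType=StringType())
def country_name(code):
    # Tasks read the cached broadcast value instead of shipping the dict with every task.
    return bc_lookup.value.get(code, "Unknown")

df = spark.createDataFrame([("US",), ("DE",), ("FR",)], ["code"])
df.withColumn("country", country_name("code")).show()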
Correct answer:
Explanation:
Spark Connect introduces a decoupled client-server architecture. Its key feature is enabling Spark job submission and execution from remote clients written in Python, Java, and other supported languages.
From Databricks documentation:
“Spark Connect allows remote clients to connect to a Spark cluster and execute Spark jobs without being co-located with the Spark driver.”
Option A is close, but "any language" is overstated; Spark Connect currently supports Python, Java, and a few other languages, not literally all of them.
Option B refers to a REST API, which is not the mechanism Spark Connect uses.
Option D is incorrect; Spark Connect is not focused on data ingestion.
Final Answer C
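A minimal sketch of connecting through Spark Connect from a remote Python client (the endpoint below is a placeholder):
from pyspark.sql import SparkSession

# The client builds DataFrame operations locally and sends them to the remote
# cluster over the Spark Connect protocol; it is not co-located with the driver.
spark = SparkSession.builder.remote("sc://spark-connect-host:15002").getOrCreate()

spark.range(10).filter("id % 2 = 0").show()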
Correct answer:
Explanation:
Broadcast joins work by sending the smaller DataFrame to all executors, eliminating the shuffle of the larger DataFrame.
From Spark documentation:
“Broadcast joins are efficient when one DataFrame is small enough to fit in memory. Spark avoids shuffling the larger table.”
DataFrame B (1 GB) is the smaller side and should be broadcast; note that 1 GB exceeds the 10 MB default spark.sql.autoBroadcastJoinThreshold, so an explicit broadcast hint (or a raised threshold) is needed.
Broadcasting it eliminates the need to shuffle the large DataFrame A.
Final Answer B
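A sketch of the broadcast hint, using small stand-in DataFrames for the scenario (names are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

df_a = spark.range(1_000_000).withColumnRenamed("id", "key")          # the large side
df_b = spark.createDataFrame([(0, "x"), (1, "y")], ["key", "label"])  # the smaller side

# broadcast() ships df_b to every executor, so df_a is joined without being shuffled.
joined = df_a.join(broadcast(df_b), on="key", how="inner")
joined.explain()  # the physical plan should show a BroadcastHashJoin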
Correct answer:
Explanation:
The approx_percentile function in Spark is a performance-optimized alternative to percentile. It takes an optional accuracy parameter:
approx_percentile(column, percentage, accuracy)
Higher accuracy values → more precise results, but increased memory/computation.
Lower values → faster but less accurate.
From the documentation:
“Increasing the accuracy improves precision but increases memory usage.”
Final Answer D
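For example, the same percentile can be computed at two accuracy settings (column name and values are hypothetical); the higher accuracy gives a tighter error bound at the cost of more memory:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("approx-percentile-demo").getOrCreate()
df = spark.range(100_000).withColumnRenamed("id", "value")

df.select(
    expr("approx_percentile(value, 0.5, 100)").alias("p50_low_accuracy"),    # faster, looser
    expr("approx_percentile(value, 0.5, 10000)").alias("p50_high_accuracy")  # slower, tighter
).show()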
Correct answer:
Explanation:
Exactly-once guarantees in Spark Structured Streaming require micro-batch mode (default), not continuous mode.
Continuous mode (.trigger(continuous=...)) only supports at-least-once semantics and lacks full fault-tolerance.
trigger(availableNow=True) is a batch-style trigger, not suited for low-latency streaming.
So:
Option A uses micro-batching with a tight trigger interval → minimal latency + exactly-once guarantee.
Final Answer A
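A sketch of the micro-batch pattern described in Option A, using a built-in rate source and placeholder paths:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-demo").getOrCreate()

# Rate source stands in for the real input stream.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Micro-batch mode with a short processing-time trigger keeps latency low while
# preserving exactly-once semantics for fault-tolerant sinks; checkpointing is required.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/tmp/events-out")                 # placeholder path
    .option("checkpointLocation", "/tmp/events-ckpt")  # placeholder path
    .trigger(processingTime="1 second")                # not .trigger(continuous=...)
    .start()
)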
Correct answer:
Explanation:
The method saveAsTable() creates a new table and optionally fails if the table exists.
From Spark documentation:
"The mode 'ErrorIfExists' (default) will throw an error if the table already exists."
Thus:
Option A is correct.
Option B (Overwrite) would overwrite existing data, which is not acceptable here.
Option C and D use save(), which doesn't create a managed table with metadata in the metastore.
Final Answer A
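A minimal sketch (the table name is a placeholder); "errorifexists" is the default save mode, so the write fails rather than replacing an existing table:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-table-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Creates a managed table registered in the metastore; raises an error if it already exists.
df.write.mode("errorifexists").saveAsTable("sales_summary")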
Correct answer:
Explanation:
Adaptive Query Execution (AQE) is a Spark 3.x feature that dynamically optimizes query plans at runtime.
One of its core features is:
Dynamically switching join strategies (e.g., from sort-merge to broadcast) based on runtime statistics.
Other AQE capabilities include:
Coalescing shuffle partitions
Skew join handling
Option A is correct.
Option B refers to statistics collection, which is not AQE's primary function.
Option C is too broad and not AQE-specific.
Option D refers to Delta Lake optimizations, unrelated to AQE.
Final Answer A
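The relevant settings can be enabled on the session; these are standard Spark 3.x configuration keys (AQE is on by default since Spark 3.2):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")                     # enable AQE
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions
    .getOrCreate()
)

# With AQE on, Spark can switch a sort-merge join to a broadcast join at runtime
# when post-shuffle statistics show that one side is small enough.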

Correct answer:
Explanation:
To write a structured streaming DataFrame to Parquet files, the correct way to specify the format and output directory is:
.writeStream
.format("parquet")
.option("path", "path/to/destination/dir")
According to Spark documentation:
“When writing to file-based sinks (like Parquet), you must specify the path using the .option("path", ...) method. Unlike batch writes, .save() is not supported.”
Option A incorrectly uses .option("location", ...) (invalid for Parquet sink).
Option B incorrectly sets the format via .option("format", ...), which is not the correct method.
Option C repeats the same issue.
Option D is correct: .format("parquet") + .option("path", ...) is the required syntax.
Final Answer D
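Put together, the pattern from Option D might run as follows (the rate source and paths are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-sink-demo").getOrCreate()

stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (
    stream_df.writeStream
    .format("parquet")                                   # sink format
    .option("path", "path/to/destination/dir")           # output directory
    .option("checkpointLocation", "path/to/checkpoint")  # required for file sinks
    .start()
)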

Correct answer:
Explanation:
To flatten an array of structs into individual rows and access fields within each struct, you must:
Use explode() to expand the array so each struct becomes its own row.
Access the struct fields via dot notation (e.g., record_exploded.sensor_id).
Option C does exactly that:
First, explode the record array column into a new column record_exploded.
Then, access fields of the struct using the dot syntax in select.
This is standard practice in PySpark for nested data transformation.
Final Answer C
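A small sketch of the approach, with a hypothetical "record" array-of-structs column:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.appName("explode-demo").getOrCreate()

data = [(1, [{"sensor_id": "s1", "reading": 0.5}, {"sensor_id": "s2", "reading": 0.7}])]
df = spark.createDataFrame(
    data, "device_id INT, record ARRAY<STRUCT<sensor_id: STRING, reading: DOUBLE>>"
)

# explode() turns each struct in the array into its own row ...
exploded = df.withColumn("record_exploded", explode("record"))

# ... and dot notation reaches into the struct's fields.
exploded.select(
    "device_id",
    col("record_exploded.sensor_id"),
    col("record_exploded.reading"),
).show()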
Correct answer:
Explanation:
In Spark’s client mode, the driver runs on the local machine that submitted the job. If that machine is resource-constrained (e.g., low memory), performance degrades.
From the Spark documentation:
"In cluster mode, the driver runs inside the cluster, benefiting from cluster resources and scalability." Option A is incorrect ― executors do not help the driver directly. Option B might help short-term but does not scale.
Option C is correct ― switching to cluster mode moves the driver to the cluster.
Option D (local mode) is for development/testing, not production.
Final Answer C
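For illustration, moving the driver onto the cluster is a matter of the submission command rather than application code; the application file and memory size below are placeholders:
spark-submit --master yarn --deploy-mode cluster --driver-memory 8g my_app.py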

Correct answer:
Explanation:
In Spark, when a CSV row does not match the provided schema, Spark does not raise an error by default. Instead, it returns null for fields that cannot be parsed correctly.
In the first row, "hello" cannot be cast to Integer for the age field → Spark sets age=None
In the second row, "20" is a valid integer → age=20
So the output will be:
[Row(name='bambi', age=None), Row(name='alladin', age=20)]
Final Answer C
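The behaviour can be reproduced with a small local file (the path is a placeholder); Spark's default PERMISSIVE mode nulls out fields that cannot be parsed against the schema:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("csv-permissive-demo").getOrCreate()

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])

with open("/tmp/people.csv", "w") as f:  # placeholder local path
    f.write("bambi,hello\nalladin,20\n")

df = spark.read.schema(schema).csv("/tmp/people.csv")
print(df.collect())  # [Row(name='bambi', age=None), Row(name='alladin', age=20)]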
Correct answer:
Explanation:
To filter rows based on a condition and display them in Spark, use filter(...).show():
employees_df.filter(employees_df.tenure >= 5).show()
Option A is correct and shows the results.
Option B filters but doesn’t display them.
Option C uses Python’s built-in filter, not Spark.
Option D collects the results to the driver, which is unnecessary if .show() is sufficient.
Final Answer A
Correct answer:
Explanation:
The method dropDuplicatesWithinWatermark() in Structured Streaming drops duplicate records based on a specified column and watermark window. The watermark defines the threshold for how late data is considered valid.
From the Spark documentation:
"dropDuplicatesWithinWatermark removes duplicates that occur within the event-time watermark window."
In this case, Spark will retain the first occurrence and drop subsequent records within the 30-minute watermark window.
Final Answer B
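A sketch with a rate source standing in for the real event stream ("value" is used as the deduplication key):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Rows with the same key arriving within the 30-minute watermark are dropped;
# the first occurrence is kept and emitted.
deduped = (
    events
    .withWatermark("timestamp", "30 minutes")
    .dropDuplicatesWithinWatermark(["value"])
)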
Correct answer:
Explanation:
Option B uses Spark’s built-in SQL function length(), which is efficient and avoids the overhead of a Python UDF:
from pyspark.sql.functions import length, col
df.select(length(col("stringColumn")).alias("length"))
Explanation of other options:
Option A is incorrect syntax; spark.udf is not called this way.
Option C registers a UDF but doesn’t apply it in the DataFrame transformation.
Option D is syntactically valid but uses a Python UDF which is less efficient than built-in functions.
Final Answer B
Correct answer:
Explanation:
Operations that trigger data movement across partitions (like groupBy, join, repartition) result in a shuffle and a new stage.
From Spark documentation:
“groupBy and aggregation cause data to be shuffled across partitions to combine rows with the same key.”
Option A (groupBy + agg) → causes shuffle.
Options B, C, and D (filter, withColumn, select) → transformations that do not require shuffling; they are narrow dependencies.
Final Answer A
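A quick way to see the difference is to compare the physical plans (the column names are made up):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
df = spark.range(1_000).withColumn("key", F.col("id") % 10)

# Narrow transformations: each output partition depends on one input partition,
# so no data moves across the cluster and no new stage is created.
narrow = df.filter(F.col("id") > 100).withColumn("double_id", F.col("id") * 2)
narrow.explain()

# Wide transformation: groupBy + agg repartitions rows by key, which shows up
# as an Exchange (shuffle) in the plan and starts a new stage.
wide = df.groupBy("key").agg(F.count("*").alias("cnt"))
wide.explain()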