Databricks Certified Associate Developer for Apache Spark 3.5 - Python Online Practice
Last updated: December 09, 2025
These online practice questions let you gauge how well you know the Databricks Certified Associate Developer for Apache Spark 3.5 exam material before deciding whether to register for the exam.
To pass the exam and save roughly 35% of your preparation time, consider the Databricks Certified Associate Developer for Apache Spark 3.5 dumps (latest real exam questions), which currently include 135 questions and answers.
Correct answer:
Explanation:
In Apache Spark, broadcast variables are used to efficiently distribute large, read-only data to all worker nodes. However, broadcasting very large datasets can lead to memory issues on executors if the data does not fit into the available memory.
According to the Spark documentation:
"Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. This can greatly reduce the amount of data sent over the network."
However, it also notes:
"Using the broadcast functionality available in SparkContext can greatly reduce the size of each serialized task, and the cost of launching a job over a cluster. If your tasks use any large object from the driver program inside of them (e.g., a static lookup table), consider turning it into a broadcast variable."
But caution is advised when broadcasting large datasets:
"Broadcasting large variables can cause out-of-memory errors if the data does not fit in the memory of each executor."
Therefore, if the broadcasted DataFrame containing millions of rows exceeds the memory capacity of the executors, the job may fail due to memory constraints.
Reference: Spark 3.5.5 Documentation - Tuning
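As a minimal sketch of this pattern (the lookup data and column names are hypothetical), a small read-only object is broadcast once and read from inside tasks; only data that comfortably fits in executor memory should be handled this way:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("broadcast-variable-demo").getOrCreate()

# Small, read-only lookup table cached once per executor.
country_lookup = {"US": "United States", "DE": "Germany"}
bc_lookup = spark.sparkContext.broadcast(country_lookup)

@udf(returnType=StringType())
def country_name(code):
    # Tasks read the cached broadcast value instead of shipping the dict with every task.
    return bc_lookup.value.get(code, "Unknown")

df = spark.createDataFrame([("US",), ("DE",), ("FR",)], ["code"])
df.withColumn("country", country_name("code")).show()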
Correct answer:
Explanation:
Spark Connect introduces a decoupled client-server architecture. Its key feature is enabling Spark job submission and execution from remote clients written in Python, Java, and other supported languages.
From Databricks documentation:
“Spark Connect allows remote clients to connect to a Spark cluster and execute Spark jobs without being co-located with the Spark driver.”
Option A is close, but "any language" is overstated; Spark Connect currently supports Python, Java, and a few other languages, not literally all of them.
Option B refers to a REST API, which is not the mechanism Spark Connect uses.
Option D is incorrect; Spark Connect is not focused on data ingestion.
Final Answer C
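A minimal sketch of connecting through Spark Connect from a remote Python client (the endpoint below is a placeholder):
from pyspark.sql import SparkSession

# The client builds DataFrame operations locally and sends them to the remote
# cluster over the Spark Connect protocol; it is not co-located with the driver.
spark = SparkSession.builder.remote("sc://spark-connect-host:15002").getOrCreate()

spark.range(10).filter("id % 2 = 0").show()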
Correct answer:
Explanation:
Broadcast joins work by sending the smaller DataFrame to all executors, eliminating the shuffle of the larger DataFrame.
From Spark documentation:
“Broadcast joins are efficient when one DataFrame is small enough to fit in memory. Spark avoids shuffling the larger table.”
DataFrame B (1 GB) is the smaller side and should be broadcast; note that 1 GB exceeds the 10 MB default spark.sql.autoBroadcastJoinThreshold, so an explicit broadcast hint (or a raised threshold) is needed.
Broadcasting it eliminates the need to shuffle the large DataFrame A.
Final Answer B
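A sketch of the broadcast hint, using small stand-in DataFrames for the scenario (names are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

df_a = spark.range(1_000_000).withColumnRenamed("id", "key")          # the large side
df_b = spark.createDataFrame([(0, "x"), (1, "y")], ["key", "label"])  # the smaller side

# broadcast() ships df_b to every executor, so df_a is joined without being shuffled.
joined = df_a.join(broadcast(df_b), on="key", how="inner")
joined.explain()  # the physical plan should show a BroadcastHashJoin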
Correct answer:
Explanation:
The approx_percentile function in Spark is a performance-optimized alternative to percentile. It takes an optional accuracy parameter:
approx_percentile(column, percentage, accuracy)
Higher accuracy values → more precise results, but increased memory/computation.
Lower values → faster but less accurate.
From the documentation:
“Increasing the accuracy improves precision but increases memory usage.”
Final Answer D
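For example, the same percentile can be computed at two accuracy settings (column name and values are hypothetical); the higher accuracy gives a tighter error bound at the cost of more memory:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("approx-percentile-demo").getOrCreate()
df = spark.range(100_000).withColumnRenamed("id", "value")

df.select(
    expr("approx_percentile(value, 0.5, 100)").alias("p50_low_accuracy"),    # faster, looser
    expr("approx_percentile(value, 0.5, 10000)").alias("p50_high_accuracy")  # slower, tighter
).show()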
Correct answer:
Explanation:
Exactly-once guarantees in Spark Structured Streaming require micro-batch mode (default), not continuous mode.
Continuous mode (.trigger(continuous=...)) only supports at-least-once semantics and lacks full fault-tolerance.
trigger(availableNow=True) is a batch-style trigger, not suited for low-latency streaming.
So:
Option A uses micro-batching with a tight trigger interval → minimal latency + exactly-once guarantee.
Final Answer A
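A sketch of the micro-batch pattern described in Option A, using a built-in rate source and placeholder paths:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-demo").getOrCreate()

# Rate source stands in for the real input stream.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Micro-batch mode with a short processing-time trigger keeps latency low while
# preserving exactly-once semantics for fault-tolerant sinks; checkpointing is required.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/tmp/events-out")                 # placeholder path
    .option("checkpointLocation", "/tmp/events-ckpt")  # placeholder path
    .trigger(processingTime="1 second")                # not .trigger(continuous=...)
    .start()
)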
Correct answer:
Explanation:
The method saveAsTable() creates a new table and optionally fails if the table exists.
From Spark documentation:
"The mode 'ErrorIfExists' (default) will throw an error if the table already exists."
Thus:
Option A is correct.
Option B (Overwrite) would overwrite existing data, which is not acceptable here.
Option C and D use save(), which doesn't create a managed table with metadata in the metastore.
Final Answer A
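A minimal sketch (the table name is a placeholder); "errorifexists" is the default save mode, so the write fails rather than replacing an existing table:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-table-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Creates a managed table registered in the metastore; raises an error if it already exists.
df.write.mode("errorifexists").saveAsTable("sales_summary")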
Correct answer:
Explanation:
Adaptive Query Execution (AQE) is a Spark 3.x feature that dynamically optimizes query plans at runtime.
One of its core features is:
Dynamically switching join strategies (e.g., from sort-merge to broadcast) based on runtime statistics.
Other AQE capabilities include:
Coalescing shuffle partitions
Skew join handling
Option A is correct.
Option B refers to statistics collection, which is not AQE's primary function.
Option C is too broad and not AQE-specific.
Option D refers to Delta Lake optimizations, unrelated to AQE.
Final Answer A
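The relevant settings can be enabled on the session; these are standard Spark 3.x configuration keys (AQE is on by default since Spark 3.2):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("aqe-demo")
    .config("spark.sql.adaptive.enabled", "true")                     # enable AQE
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")  # merge small shuffle partitions
    .config("spark.sql.adaptive.skewJoin.enabled", "true")            # split skewed partitions
    .getOrCreate()
)

# With AQE on, Spark can switch a sort-merge join to a broadcast join at runtime
# when post-shuffle statistics show that one side is small enough.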

Correct answer:
Explanation:
To write a structured streaming DataFrame to Parquet files, the correct way to specify the format and output directory is:
.writeStream
.format("parquet")
.option("path", "path/to/destination/dir")
According to Spark documentation:
“When writing to file-based sinks (like Parquet), you must specify the path using the .option("path", ...) method. Unlike batch writes, .save() is not supported.”
Option A incorrectly uses .option("location", ...) (invalid for Parquet sink).
Option B incorrectly sets the format via .option("format", ...), which is not the correct method.
Option C repeats the same issue.
Option D is correct: .format("parquet") + .option("path", ...) is the required syntax.
Final Answer D
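Put together, the pattern from Option D might run as follows (the rate source and paths are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-sink-demo").getOrCreate()

stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

query = (
    stream_df.writeStream
    .format("parquet")                                   # sink format
    .option("path", "path/to/destination/dir")           # output directory
    .option("checkpointLocation", "path/to/checkpoint")  # required for file sinks
    .start()
)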

Correct answer:
Explanation:
To flatten an array of structs into individual rows and access fields within each struct, you must:
Use explode() to expand the array so each struct becomes its own row.
Access the struct fields via dot notation (e.g., record_exploded.sensor_id).
Option C does exactly that:
First, explode the record array column into a new column record_exploded.
Then, access fields of the struct using the dot syntax in select.
This is standard practice in PySpark for nested data transformation.
Final Answer C
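A small sketch of the approach, with a hypothetical "record" array-of-structs column:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col

spark = SparkSession.builder.appName("explode-demo").getOrCreate()

data = [(1, [{"sensor_id": "s1", "reading": 0.5}, {"sensor_id": "s2", "reading": 0.7}])]
df = spark.createDataFrame(
    data, "device_id INT, record ARRAY<STRUCT<sensor_id: STRING, reading: DOUBLE>>"
)

# explode() turns each struct in the array into its own row ...
exploded = df.withColumn("record_exploded", explode("record"))

# ... and dot notation reaches into the struct's fields.
exploded.select(
    "device_id",
    col("record_exploded.sensor_id"),
    col("record_exploded.reading"),
).show()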
Correct answer:
Explanation:
In Spark’s client mode, the driver runs on the local machine that submitted the job. If that machine is resource-constrained (e.g., low memory), performance degrades.
From the Spark documentation:
"In cluster mode, the driver runs inside the cluster, benefiting from cluster resources and scalability." Option A is incorrect ― executors do not help the driver directly. Option B might help short-term but does not scale.
Option C is correct ― switching to cluster mode moves the driver to the cluster.
Option D (local mode) is for development/testing, not production.
Final Answer C
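For illustration, moving the driver onto the cluster is a matter of the submission command rather than application code; the application file and memory size below are placeholders:
spark-submit --master yarn --deploy-mode cluster --driver-memory 8g my_app.py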

Correct answer:
Explanation:
In Spark, when a CSV row does not match the provided schema, Spark does not raise an error by default. Instead, it returns null for fields that cannot be parsed correctly.
In the first row, "hello" cannot be cast to Integer for the age field → Spark sets age=None
In the second row, "20" is a valid integer → age=20
So the output will be:
[Row(name='bambi', age=None), Row(name='alladin', age=20)]
Final Answer C
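The behaviour can be reproduced with a small local file (the path is a placeholder); Spark's default PERMISSIVE mode nulls out fields that cannot be parsed against the schema:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("csv-permissive-demo").getOrCreate()

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])

with open("/tmp/people.csv", "w") as f:  # placeholder local path
    f.write("bambi,hello\nalladin,20\n")

df = spark.read.schema(schema).csv("/tmp/people.csv")
print(df.collect())  # [Row(name='bambi', age=None), Row(name='alladin', age=20)]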
Correct answer:
Explanation:
To filter rows based on a condition and display them in Spark, use filter(...).show():
employees_df.filter(employees_df.tenure >= 5).show()
Option A is correct and shows the results.
Option B filters but doesn’t display them.
Option C uses Python’s built-in filter, not Spark.
Option D collects the results to the driver, which is unnecessary if .show() is sufficient.
Final Answer A
Correct answer:
Explanation:
The method dropDuplicatesWithinWatermark() in Structured Streaming drops duplicate records based on a specified column and watermark window. The watermark defines the threshold for how late data is considered valid.
From the Spark documentation:
"dropDuplicatesWithinWatermark removes duplicates that occur within the event-time watermark window."
In this case, Spark will retain the first occurrence and drop subsequent records within the 30-minute watermark window.
Final Answer B
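A sketch with a rate source standing in for the real event stream ("value" is used as the deduplication key):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Rows with the same key arriving within the 30-minute watermark are dropped;
# the first occurrence is kept and emitted.
deduped = (
    events
    .withWatermark("timestamp", "30 minutes")
    .dropDuplicatesWithinWatermark(["value"])
)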
Correct answer:
Explanation:
Option B uses Spark’s built-in SQL function length(), which is efficient and avoids the overhead of a Python UDF:
from pyspark.sql.functions import length, col
df.select(length(col("stringColumn")).alias("length"))
Explanation of other options:
Option A is incorrect syntax; spark.udf is not called this way.
Option C registers a UDF but doesn’t apply it in the DataFrame transformation.
Option D is syntactically valid but uses a Python UDF which is less efficient than built-in functions.
Final Answer B
Correct answer:
Explanation:
Operations that trigger data movement across partitions (like groupBy, join, repartition) result in a shuffle and a new stage.
From Spark documentation:
“groupBy and aggregation cause data to be shuffled across partitions to combine rows with the same key.”
Option A (groupBy + agg) → causes shuffle.
Options B, C, and D (filter, withColumn, select) → transformations that do not require shuffling; they are narrow dependencies.
Final Answer A
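A quick way to see the difference is to compare the physical plans (the column names are made up):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
df = spark.range(1_000).withColumn("key", F.col("id") % 10)

# Narrow transformations: each output partition depends on one input partition,
# so no data moves across the cluster and no new stage is created.
narrow = df.filter(F.col("id") > 100).withColumn("double_id", F.col("id") * 2)
narrow.explain()

# Wide transformation: groupBy + agg repartitions rows by key, which shows up
# as an Exchange (shuffle) in the plan and starts a new stage.
wide = df.groupBy("key").agg(F.count("*").alias("cnt"))
wide.explain()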