Databricks Databricks Certified Professional Data Engineer 시험

Databricks Certified Data Engineer Professional Exam 온라인 연습

최종 업데이트 시간: 2026년03월09일

당신은 온라인 연습 문제를 통해 Databricks Databricks Certified Professional Data Engineer 시험지식에 대해 자신이 어떻게 알고 있는지 파악한 후 시험 참가 신청 여부를 결정할 수 있다.

시험을 100% 합격하고 시험 준비 시간을 35% 절약하기를 바라며 Databricks Certified Professional Data Engineer 덤프 (최신 실제 시험 문제)를 사용 선택하여 현재 최신 207개의 시험 문제와 답을 포함하십시오.

/ 9

Question No : 1

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

A.withWatermark("event_time", "10 minutes")
B.awaitArrival("event_time", "10 minutes")
C.await("event_time + ‘10 minutes'")
D.slidingWindow("event_time", "10 minutes")
E.delayWrite("event_time", "10 minutes")

정답:
Explanation:
The correct answer is
A. withWatermark(“event_time”, “10 minutes”). This is because the question asks for incremental state information to be maintained for 10 minutes for late-arriving data. The withWatermark method is used to define the watermark for late data. The watermark is a timestamp column and a threshold that tells the system how long to wait for late data. In this case, the watermark is set to 10 minutes. The other options are incorrect because they are not valid methods or syntax for watermarking in Structured Streaming.
Reference:
Watermarking: https://docs.databricks.com/spark/latest/structured-streaming/watermarks.html
Windowed aggregations: https://docs.databricks.com/spark/latest/structured-streaming/window-operations.html

Question No : 2

A CHECK constraint has been successfully added to the Delta table named activity_details using the following logic:

A batch job is attempting to insert new records to the table, including a record where latitude = 45.50 and longitude = 212.67.
Which statement describes the outcome of this batch insert?

A.The write will fail when the violating record is reached; any records previously processed will be recorded to the target table.
B.The write will fail completely because of the constraint violation and no records will be inserted into the target table.
C.The write will insert all records except those that violate the table constraints; the violating records will be recorded to a quarantine table.
D.The write will include all records in the target table; any violations will be indicated in the boolean column named valid_coordinates.
E.The write will insert all records except those that violate the table constraints; the violating records will be reported in a warning log.

정답:
Explanation:
The CHECK constraint is used to ensure that the data inserted into the table meets the specified conditions. In this case, the CHECK constraint is used to ensure that the latitude and longitude values are within the specified range. If the data does not meet the specified conditions, the write operation will fail completely and no records will be inserted into the target table. This is because Delta Lake supports ACID transactions, which means that either all the data is written or none of it is written.
Therefore, the batch insert will fail when it encounters a record that violates the constraint, and the target table will not be updated.
Reference: Constraints: https://docs.delta.io/latest/delta-constraints.html
ACID Transactions: https://docs.delta.io/latest/delta-intro.html#acid-transactions

Question No : 3

A junior data engineer has manually configured a series of jobs using the Databricks Jobs UI. Upon reviewing their work, the engineer realizes that they are listed as the "Owner" for each job. They attempt to transfer "Owner" privileges to the "DevOps" group, but cannot successfully accomplish this task.
Which statement explains what is preventing this privilege transfer?

A.Databricks jobs must have exactly one owner; "Owner" privileges cannot be assigned to a group.
B.The creator of a Databricks job will always have "Owner" privileges; this configuration cannot be changed.
C.Other than the default "admins" group, only individual users can be granted privileges on jobs.
D.A user can only transfer job ownership to a group if they are also a member of that group.
E.Only workspace administrators can grant "Owner" privileges to a group.

정답:
Explanation:
The reason why the junior data engineer cannot transfer “Owner” privileges to the “DevOps” group is that Databricks jobs must have exactly one owner, and the owner must be an individual user, not a group. A job cannot have more than one owner, and a job cannot have a group as an owner. The owner of a job is the user who created the job, or the user who was assigned the ownership by another user. The owner of a job has the highest level of permission on the job, and can grant or revoke permissions to other users or groups. However, the owner cannot transfer the ownership to a group, only to another user. Therefore, the junior data engineer’s attempt to transfer “Owner” privileges to the “DevOps” group is not possible.
Reference:
Jobs access control: https://docs.databricks.com/security/access-control/table-acls/index.html
Job permissions: https://docs.databricks.com/security/access-control/table-acls/privileges.html#job-permissions

Question No : 4

The data architect has decided that once data has been ingested from external sources into the
Databricks Lakehouse, table access controls will be leveraged to manage permissions for all production tables and views.
The following logic was executed to grant privileges for interactive queries on a production database to the core engineering group.
GRANT USAGE ON DATABASE prod TO eng;
GRANT SELECT ON DATABASE prod TO eng;
Assuming these are the only privileges that have been granted to the eng group and that these users are not workspace administrators, which statement describes their privileges?

A.Group members have full permissions on the prod database and can also assign permissions to other users or groups.
B.Group members are able to list all tables in the prod database but are not able to see the results of any queries on those tables.
C.Group members are able to query and modify all tables and views in the prod database, but cannot create new tables or views.
D.Group members are able to query all tables and views in the prod database, but cannot create or edit anything in the database.
E.Group members are able to create, query, and modify all tables and views in the prod database, but cannot define custom functions.

정답:
Explanation:
The GRANT USAGE ON DATABASE prod TO eng command grants the eng group the permission to use the prod database, which means they can list and access the tables and views in the database. The GRANT SELECT ON DATABASE prod TO eng command grants the eng group the permission to select data from the tables and views in the prod database, which means they can query the data using SQL or DataFrame API. However, these commands do not grant the eng group any other permissions, such as creating, modifying, or deleting tables and views, or defining custom functions. Therefore, the eng group members are able to query all tables and views in the prod database, but cannot create or edit anything in the database.
Reference: Grant privileges on a database: https://docs.databricks.com/en/security/auth-authz/table-acls/grant-privileges-database.html
Privileges you can grant on Hive metastore objects: https://docs.databricks.com/en/security/auth-authz/table-acls/privileges.html

Question No : 5

The data science team has requested assistance in accelerating queries on free form text from user reviews.
The data is currently stored in Parquet with the below schema:
item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING
The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.
A junior data engineer suggests converting this data to Delta Lake will improve query performance.
Which response to the junior data engineer s suggestion is correct?

A.Delta Lake statistics are not optimized for free text fields with high cardinality.
B.Text data cannot be stored with Delta Lake.
C.ZORDER ON review will need to be run to see performance gains.
D.The Delta log creates a term matrix for free text fields to support selective filtering.
E.Delta Lake statistics are only collected on the first 4 columns in a table.

정답:
Explanation:
Converting the data to Delta Lake may not improve query performance on free text fields with high cardinality, such as the review column. This is because Delta Lake collects statistics on the minimum and maximum values of each column, which are not very useful for filtering or skipping data on free text fields. Moreover, Delta Lake collects statistics on the first 32 columns by default, which may not include the review column if the table has more columns. Therefore, the junior data engineer’s suggestion is not correct. A better approach would be to use a full-text search engine, such as Elasticsearch, to index and query the review column. Alternatively, you can use natural language processing techniques, such as tokenization, stemming, and lemmatization, to preprocess the review column and create a new column with normalized terms that can be used for filtering or skipping data.
Reference:
Optimizations: https://docs.delta.io/latest/optimizations-oss.html
Full-text search with Elasticsearch: https://docs.databricks.com/data/data-sources/elasticsearch.html
Natural language processing: https://docs.databricks.com/applications/nlp/index.html

Question No : 6

Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python Wheel to object storage mounted with the DBFS for use with a production job?

A.configure
B.fs
C.jobs
D.libraries
E.workspace

정답:
Explanation:
The libraries command group allows you to install, uninstall, and list libraries on Databricks clusters. You can use the libraries install command to install a custom Python Wheel on a cluster by specifying the --whl option and the path to the wheel file. For example, you can use the following command to install a custom Python Wheel named mylib-0.1-py3-none-any.whl on a cluster with the id 1234-567890-abcde123:
databricks libraries install --cluster-id 1234-567890-abcde123 --whl dbfs:/mnt/mylib/mylib-0.1-py3-none-any.whl
This will upload the custom Python Wheel to the cluster and make it available for use with a production job. You can also use the libraries uninstall command to uninstall a library from a cluster, and the libraries list command to list the libraries installed on a cluster.
Reference: Libraries CLI (legacy): https://docs.databricks.com/en/archive/dev-tools/cli/libraries-cli.html
Library operations: https://docs.databricks.com/en/dev-tools/cli/commands.html#library-operations
Install or update the Databricks CLI: https://docs.databricks.com/en/dev-tools/cli/install.html

Question No : 7

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directly, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.
The function is displayed below with a blank:

Which response correctly fills in the blank to meet the specified requirements?

A.Option A
B.Option B
C.Option C
D.Option D
E.Option E

정답:
Explanation:
Option B correctly fills in the blank to meet the specified requirements.
Option B uses the “cloudFiles.schemaLocation” option, which is required for the schema detection and evolution functionality of Databricks Auto Loader. Additionally, option B uses the “mergeSchema” option, which is required for the schema evolution functionality of Databricks Auto Loader. Finally, option B uses the “writeStream” method, which is required for the incremental processing of JSON files as they arrive in a source directory. The other options are incorrect because they either omit the required options, use the wrong method, or use the wrong format.
Reference:
Configure schema inference and evolution in Auto Loader: https://docs.databricks.com/en/ingestion/auto-loader/schema.html
Write streaming data: https://docs.databricks.com/spark/latest/structured-streaming/writing-streaming-data.html

Question No : 8

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?

A.The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.
B.A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.
C.The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.
D.An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.
E.An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.

정답:
Explanation:
This code is using the pyspark.sql.functions library to group the silver_customer_sales table by customer_id and then aggregate the data using the minimum sale date, maximum sale total, and sum of distinct order ids. The resulting aggregated data is then written to the gold_customer_lifetime_sales_summary table, overwriting any existing data in that table. This is a batch job that does not use any incremental or streaming logic, and does not perform any merge or update operations. Therefore, the code will overwrite the gold table with the aggregated values from the silver table every time it is executed.
Reference:
https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-python.html
https://docs.databricks.com/spark/latest/dataframes-datasets/transforming-data-with-dataframes.html
https://docs.databricks.com/spark/latest/dataframes-datasets/aggregating-data-with-dataframes.html

Question No : 9

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.
If task A fails during a scheduled run, which statement describes the results of this run?
A. Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
B. Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.
C. Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.
D. Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.
E. Tasks B and C will be skipped; task A will not commit any changes because of stage failure.

정답: D
Explanation:
When a Databricks job runs multiple tasks with dependencies, the tasks are executed in a dependency graph. If a task fails, the downstream tasks that depend on it are skipped and marked as Upstream failed. However, the failed task may have already committed some changes to the Lakehouse before the failure occurred, and those changes are not rolled back automatically. Therefore, the job run may result in a partial update of the Lakehouse. To avoid this, you can use the transactional writes feature of Delta Lake to ensure that the changes are only committed when the entire job run succeeds. Alternatively, you can use the Run if condition to configure tasks to run even when some or all of their dependencies have failed, allowing your job to recover from failures and continue running.
Reference:
transactional writes: https://docs.databricks.com/delta/delta-intro.html#transactional-writes
Run if: https://docs.databricks.com/en/workflows/jobs/conditional-tasks.html

Question No : 10

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records.
In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?

A.Set the configuration delta.deduplicate = true.
B.VACUUM the Delta table after each batch completes.
C.Perform an insert-only merge with a matching condition on a unique key.
D.Perform a full outer join on a unique key and overwrite existing data.
E.Rely on Delta Lake schema enforcement to prevent duplicate records.

정답:
Explanation:
To deduplicate data against previously processed records as it is inserted into a Delta table, you can use the merge operation with an insert-only clause. This allows you to insert new records that do not match any existing records based on a unique key, while ignoring duplicate records that match existing records. For example, you can use the following syntax:
MERGE INTO target_table USING source_table ON target_table.unique_key = source_table.unique_key WHEN NOT MATCHED THEN INSERT *
This will insert only the records from the source table that have a unique key that is not present in the target table, and skip the records that have a matching key. This way, you can avoid inserting duplicate records into the Delta table.
Reference:
https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-using-merge
https://docs.databricks.com/delta/delta-update.html#insert-only-merge

Question No : 11

The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results, however, it takes over 20
minutes to extract and load around 1 GB of data.
Which statement is a possible explanation for this behavior?

A.%sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
B.Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
C.%sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
D.Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
E.%sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.

정답:
Explanation:
https://www.databricks.com/blog/2020/08/31/introducing-the-databricks-web-terminal.html
The code is using %sh to execute shell code on the driver node. This means that the code is not taking advantage of the worker nodes or Databricks optimized Spark. This is why the code is taking longer to execute. A better approach would be to use Databricks libraries and APIs to read and write data from Git and DBFS, and to leverage the parallelism and performance of Spark. For example, you can use the Databricks Connect feature to run your Python code on a remote Databricks cluster, or you can use the Spark Git Connector to read data from Git repositories as Spark DataFrames.

Question No : 12

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as "unmanaged") Delta Lake tables.
Which approach will ensure that this requirement is met?

A.When a database is being created, make sure that the LOCATION keyword is used.
B.When configuring an external data warehouse for all table storage, leverage Databricks for all EL
C.When data is saved to a table, make sure that a full file path is specified alongside the Delta format.
D.When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.
E.When the workspace is being configured, make sure that external cloud object storage has been mounted.

정답:
Explanation:
To create an external or unmanaged Delta Lake table, you need to use the EXTERNAL keyword in the CREATE TABLE statement. This indicates that the table is not managed by the catalog and the data files are not deleted when the table is dropped. You also need to provide a LOCATION clause to specify the path where the data files are stored.
For example:
CREATE EXTERNAL TABLE events ( date DATE, eventId STRING, eventType STRING, data STRING) USING DELTA LOCATION ‘/mnt/delta/events’;
This creates an external Delta Lake table named events that references the data files in the ‘/mnt/delta/events’ path. If you drop this table, the data files will remain intact and you can recreate the table with the same statement.
Reference:
https://docs.databricks.com/delta/delta-batch.html#create-a-table
https://docs.databricks.com/delta/delta-batch.html#drop-a-table

Question No : 13

All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:
key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG
There are 5 unique topics being ingested. Only the "registration" topic contains Personal Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to only retain records containing PII in this table for 14 days after initial ingestion. However, for non-PII information, it would like to retain these records indefinitely.
Which of the following solutions meets the requirements?

A.All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to maintain a history of non-PII information.
B.Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.
C.Because the value field is stored as binary data, this information is not considered PII and no special precautions should be taken.
D.Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level.
E.Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries.

정답:
Explanation:
Partitioning the data by the topic field allows the company to apply different access control policies and retention policies for different topics. For example, the company can use the Table Access Control feature to grant or revoke permissions to the registration topic based on user roles or groups. The company can also use the DELETE command to remove records from the registration topic that are older than 14 days, while keeping the records from other topics indefinitely. Partitioning by the topic field also improves the performance of queries that filter by the topic field, as they can skip reading irrelevant partitions.
Reference:
Table Access Control: https://docs.databricks.com/security/access-control/table-acls/index.html
DELETE: https://docs.databricks.com/delta/delta-update.html#delete-from-a-table

Question No : 14

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

A.Regex
B.Julia
C.pyspsark.ml.feature
D.Scala Datasets
E.C++

정답:
Explanation:
Regex, or regular expressions, are a powerful way of matching patterns in text. They can be used to identify key areas of text when parsing Spark Driver log4j output, such as the log level, the timestamp, the thread name, the class name, the method name, and the message. Regex can be applied in various languages and frameworks, such as Scala, Python, Java, Spark SQL, and Databricks notebooks.
Reference:
https://docs.databricks.com/notebooks/notebooks-use.html#use-regular-expressions
https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html#using-regular-expressions-in-udfs
https://docs.databricks.com/spark/latest/sparkr/functions/regexp_extract.html
https://docs.databricks.com/spark/latest/sparkr/functions/regexp_replace.html

Question No : 15

A Delta Lake table was created with the below query:

Consider the following query:
DROP TABLE prod.sales_by_store -
If this statement is executed by a workspace admin, which result will occur?

A.Nothing will occur until a COMMIT command is executed.
B.The table will be removed from the catalog but the data will remain in storage.
C.The table will be removed from the catalog and the data will be deleted.
D.An error will occur because Delta Lake prevents the deletion of production data.
E.Data will be marked as deleted but still recoverable with Time Travel.

정답:
Explanation:
When a table is dropped in Delta Lake, the table is removed from the catalog and the data is deleted. This is because Delta Lake is a transactional storage layer that provides ACID guarantees. When a table is dropped, the transaction log is updated to reflect the deletion of the table and the data is deleted from the underlying storage.
Reference:
https://docs.databricks.com/delta/quick-start.html#drop-a-table
https://docs.databricks.com/delta/delta-batch.html#drop-table

/ 9

Databricks