Most Recent Databricks-Certified-Professional-Data-Engineer Exam Dumps

Prepare for the Databricks Certified Data Engineer Professional exam with our extensive collection of questions and answers. These practice Q&A are updated according to the latest syllabus, providing you with the tools needed to review and test your knowledge.

QA4Exam focus on the latest syllabus and exam objectives, our practice Q&A are designed to help you identify key topics and solidify your understanding. By focusing on the core curriculum, These Questions & Answers helps you cover all the essential topics, ensuring you're well-prepared for every section of the exam. Each question comes with a detailed explanation, offering valuable insights and helping you to learn from your mistakes. Whether you're looking to assess your progress or dive deeper into complex topics, our updated Q&A will provide the support you need to confidently approach the Databricks-Certified-Professional-Data-Engineer exam and achieve success.

The questions for Databricks-Certified-Professional-Data-Engineer were last updated on May 4, 2025.

Viewing page 1 out of 24 pages.
Viewing questions 1-5 out of 120 questions

Get All 120 Questions & Answers

Question No. 1

Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?

AIn the Executor's log file, by gripping for 'predicate push-down'

BIn the Stage's Detail screen, in the Completed Stages table, by noting the size of data read from the Input column

CIn the Storage Detail screen, by noting which RDDs are not stored on disk

DIn the Delta Lake transaction log. by noting the column statistics

EIn the Query Detail screen, by interpreting the Physical Plan

Show Answer

Correct Answer: E

This is the correct answer because it is where in the Spark UI one can diagnose a performance problem induced by not leveraging predicate push-down. Predicate push-down is an optimization technique that allows filtering data at the source before loading it into memory or processing it further. This can improve performance and reduce I/O costs by avoiding reading unnecessary data. To leverage predicate push-down, one should use supported data sources and formats, such as Delta Lake, Parquet, or JDBC, and use filter expressions that can be pushed down to the source. To diagnose a performance problem induced by not leveraging predicate push-down, one can use the Spark UI to access the Query Detail screen, which shows information about a SQL query executed on a Spark cluster. The Query Detail screen includes the Physical Plan, which is the actual plan executed by Spark to perform the query. The Physical Plan shows the physical operators used by Spark, such as Scan, Filter, Project, or Aggregate, and their input and output statistics, such as rows and bytes. By interpreting the Physical Plan, one can see if the filter expressions are pushed down to the source or not, and how much data is read or processed by each operator. Verified Reference: [Databricks Certified Data Engineer Professional], under ''Spark Core'' section; Databricks Documentation, under ''Predicate pushdown'' section; Databricks Documentation, under ''Query detail page'' section.

Question No. 2

A CHECK constraint has been successfully added to the Delta table named activity_details using the following logic:

A batch job is attempting to insert new records to the table, including a record where latitude = 45.50 and longitude = 212.67.

Which statement describes the outcome of this batch insert?

AThe write will fail when the violating record is reached; any records previously processed will be recorded to the target table.

BThe write will fail completely because of the constraint violation and no records will be inserted into the target table.

CThe write will insert all records except those that violate the table constraints; the violating records will be recorded to a quarantine table.

DThe write will include all records in the target table; any violations will be indicated in the boolean column named valid_coordinates.

EThe write will insert all records except those that violate the table constraints; the violating records will be reported in a warning log.

Show Answer

Correct Answer: B

The CHECK constraint is used to ensure that the data inserted into the table meets the specified conditions. In this case, the CHECK constraint is used to ensure that the latitude and longitude values are within the specified range. If the data does not meet the specified conditions, the write operation will fail completely and no records will be inserted into the target table. This is because Delta Lake supports ACID transactions, which means that either all the data is written or none of it is written. Therefore, the batch insert will fail when it encounters a record that violates the constraint, and the target table will not be updated.Reference:

Constraints: https://docs.delta.io/latest/delta-constraints.html

ACID Transactions: https://docs.delta.io/latest/delta-intro.html#acid-transactions

Question No. 3

A table named user_ltv is being used to create a view that will be used by data analysis on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

An analyze who is not a member of the auditing group executing the following query:

Which result will be returned by this query?

AAll columns will be displayed normally for those records that have an age greater than 18; records not meeting this condition will be omitted.

BAll columns will be displayed normally for those records that have an age greater than 17; records not meeting this condition will be omitted.

CAll age values less than 18 will be returned as null values all other columns will be returned with the values in user_ltv.

DAll records from all columns will be displayed with the values in user_ltv.

Show Answer

Correct Answer: A

Given the CASE statement in the view definition, the result set for a user not in the auditing group would be constrained by the ELSE condition, which filters out records based on age. Therefore, the view will return all columns normally for records with an age greater than 18, as users who are not in the auditing group will not satisfy the is_member('auditing') condition. Records not meeting the age > 18 condition will not be displayed.

Question No. 4

The data engineering team maintains a table of aggregate statistics through batch nightly updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_saies_summary and the schema is as follows:

The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is:

store_id INT, sales_date DATE, total_sales FLOAT

If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?

AImplement the appropriate aggregate logic as a batch read against the daily_store_sales table and overwrite the store_sales_summary table with each Update.

BImplement the appropriate aggregate logic as a batch read against the daily_store_sales table and append new rows nightly to the store_sales_summary table.

CImplement the appropriate aggregate logic as a batch read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

DImplement the appropriate aggregate logic as a Structured Streaming read against the daily_store_sales table and use upsert logic to update results in the store_sales_summary table.

EUse Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update.

Show Answer

Correct Answer: E

The daily_store_sales table contains all the information needed to update store_sales_summary. The schema of the table is:

store_id INT, sales_date DATE, total_sales FLOAT

The daily_store_sales table is implemented as a Type 1 table, which means that old values are overwritten by new values and no history is maintained. The total_sales column might be adjusted after manual data auditing, which means that the data in the table may change over time.

The safest approach to generate accurate reports in the store_sales_summary table is to use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply changes to the aggregates in the store_sales_summary table with each update. Structured Streaming is a scalable and fault-tolerant stream processing engine built on Spark SQL. Structured Streaming allows processing data streams as if they were tables or DataFrames, using familiar operations such as select, filter, groupBy, or join. Structured Streaming also supports output modes that specify how to write the results of a streaming query to a sink, such as append, update, or complete. Structured Streaming can handle both streaming and batch data sources in a unified manner.

The change data feed is a feature of Delta Lake that provides structured streaming sources that can subscribe to changes made to a Delta Lake table. The change data feed captures both data changes and schema changes as ordered events that can be processed by downstream applications or services. The change data feed can be configured with different options, such as starting from a specific version or timestamp, filtering by operation type or partition values, or excluding no-op changes.

By using Structured Streaming to subscribe to the change data feed for daily_store_sales, one can capture and process any changes made to the total_sales column due to manual data auditing. By applying these changes to the aggregates in the store_sales_summary table with each update, one can ensure that the reports are always consistent and accurate with the latest data. Verified Reference: [Databricks Certified Data Engineer Professional], under ''Spark Core'' section; Databricks Documentation, under ''Structured Streaming'' section; Databricks Documentation, under ''Delta Change Data Feed'' section.

Question No. 5

The data governance team has instituted a requirement that all tables containing Personal Identifiable Information (PH) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true.

The following SQL DDL statement is executed to create a new table:

Which command allows manual confirmation that these three requirements have been met?

ADESCRIBE EXTENDED dev.pii test

BDESCRIBE DETAIL dev.pii test

CSHOW TBLPROPERTIES dev.pii test

DDESCRIBE HISTORY dev.pii test

ESHOW TABLES dev

Show Answer

Correct Answer: A

This is the correct answer because it allows manual confirmation that these three requirements have been met. The requirements are that all tables containing Personal Identifiable Information (PII) must be clearly annotated, which includes adding column comments, table comments, and setting the custom table property ''contains_pii'' = true. The DESCRIBE EXTENDED command is used to display detailed information about a table, such as its schema, location, properties, and comments. By using this command on the dev.pii_test table, one can verify that the table has been created with the correct column comments, table comment, and custom table property as specified in the SQL DDL statement. Verified Reference: [Databricks Certified Data Engineer Professional], under ''Lakehouse'' section; Databricks Documentation, under ''DESCRIBE EXTENDED'' section.

Unlock All Questions for Databricks Databricks-Certified-Professional-Data-Engineer Exam

Full Exam Access, Actual Exam Questions, Validated Answers, Anytime Anywhere, No Download Limits, No Practice Limits

Get All 120 Questions & Answers