Prepare for the Databricks Certified Associate Developer for Apache Spark 3.0 exam with our extensive collection of questions and answers. These practice Q&A are updated according to the latest syllabus, providing you with the tools needed to review and test your knowledge.
QA4Exam focuses on the latest syllabus and exam objectives; our practice Q&A are designed to help you identify key topics and solidify your understanding. By focusing on the core curriculum, these Questions & Answers help you cover all the essential topics, ensuring you're well prepared for every section of the exam. Each question comes with a detailed explanation, offering valuable insights and helping you learn from your mistakes. Whether you're looking to assess your progress or dive deeper into complex topics, our updated Q&A will provide the support you need to confidently approach the Databricks-Certified-Associate-Developer-for-Apache-Spark-3.0 exam and achieve success.
Which of the following code blocks shows the structure of a DataFrame in a tree-like way, containing both column names and types?
itemsDf.printSchema()
Correct! Here is an example of what itemsDf.printSchema() shows; you can see the tree-like structure containing both column names and types:
root
|-- itemId: integer (nullable = true)
|-- attributes: array (nullable = true)
| |-- element: string (containsNull = true)
|-- supplier: string (nullable = true)
itemsDf.rdd.printSchema()
No, the DataFrame's underlying RDD does not have a printSchema() method.
spark.schema(itemsDf)
Incorrect, there is no spark.schema command.
print(itemsDf.columns)
print(itemsDf.dtypes)
Wrong. While the output of this code block contains both column names and column types, the information is not arranged in a tree-like way.
itemsDf.print.schema()
No, DataFrame does not have a print method.
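For illustration, here is a minimal sketch (assuming a hypothetical recreation of itemsDf with the columns shown above) that contrasts the tree-like output of printSchema() with the flat output of dtypes:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical recreation of itemsDf, for illustration only
itemsDf = spark.createDataFrame(
    [(1, ["blue", "winter", "cozy"], "Sports Company Inc.")],
    "itemId INT, attributes ARRAY<STRING>, supplier STRING",
)

itemsDf.printSchema()   # tree-like view: column names, types, nullability
print(itemsDf.dtypes)   # flat list of (name, type) tuples -- no tree structure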
Static notebook | Dynamic notebook: See test 3, Question 36 (Databricks import instructions)
Which of the following describes Spark actions?
The driver receives data upon request by actions.
Correct! Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer result data back to the driver.
Actions are Spark's way of exchanging data between executors.
No. In Spark, data is exchanged between executors via shuffles.
Writing data to disk is the primary purpose of actions.
No. The primary purpose of actions is to access data that is stored in Spark's RDDs and return the data, often in aggregated form, back to the driver.
Actions are Spark's way of modifying RDDs.
Incorrect. Firstly, RDDs are immutable -- they cannot be modified. Secondly, Spark generates new RDDs via transformations and not actions.
Stage boundaries are commonly established by actions.
Wrong. A stage boundary is commonly established by a shuffle, for example caused by a wide transformation.
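To see the lazy/eager distinction in practice, here is a minimal sketch (using a hypothetical DataFrame) in which the transformation only builds up the execution plan, while the action at the end triggers execution and returns data to the driver:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame used only for illustration
df = spark.range(10)

# Transformation: lazily evaluated, no job is started here
even = df.filter(col("id") % 2 == 0)

# Action: triggers distributed execution; results are sent back to the driver
print(even.collect())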
Which of the following code blocks returns a single-column DataFrame showing the number of words in column supplier of DataFrame itemsDf?
Sample of DataFrame itemsDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
Output of correct code block:
+----------------------------+
|size(split(supplier,  , -1))|
+----------------------------+
|                           3|
|                           1|
|                           3|
+----------------------------+
This question shows a typical use case for the split command: splitting a string into words. An additional difficulty is that you are asked to count the words. Although it is tempting to use the count method here, the size method (as in: size of an array) is actually the correct one to use. Familiarize yourself with the split and the size methods using the linked documentation below.
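As a hedged sketch, the correct code block could look like the following (the exact phrasing in the answer options may differ; itemsDf is the DataFrame shown above):
from pyspark.sql.functions import size, split

# Split each supplier string on spaces and count the resulting words
itemsDf.select(size(split("supplier", " "))).show()
Since split uses a default limit of -1, the generated column name matches the header size(split(supplier,  , -1)) in the output above.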
More info:
Split method: pyspark.sql.functions.split --- PySpark 3.1.2 documentation
Size method: pyspark.sql.functions.size --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2, Question 29 (Databricks import instructions)
Which of the following describes the characteristics of accumulators?
If an action including an accumulator fails during execution and Spark manages to restart the action and complete it successfully, only the successful attempt will be counted in the accumulator.
Correct, when Spark tries to rerun a failed action that includes an accumulator, it will only update the accumulator if the action succeeded.
Accumulators are immutable.
No. Although accumulators behave like write-only variables towards the executors and can only be read by the driver, they are not immutable.
All accumulators used in a Spark application are listed in the Spark UI.
Incorrect. For Scala, only named, but not unnamed, accumulators are listed in the Spark UI. For PySpark, no accumulators are listed in the Spark UI -- this feature is not yet implemented.
Accumulators are used to pass around lookup tables across the cluster.
Wrong -- this is what broadcast variables do.
Accumulators can be instantiated directly via the accumulator(n) method of the pyspark.RDD module.
Wrong, accumulators are instantiated via the accumulator(n) method of the sparkContext, for example: counter = spark.sparkContext.accumulator(0).
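For reference, a minimal sketch (assuming an active SparkSession named spark) showing how an accumulator is created on the driver and incremented from the executors:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Accumulators are created via the SparkContext, not via the pyspark.RDD module
counter = spark.sparkContext.accumulator(0)

# Executors can only add to the accumulator ("write-only" from their perspective)
spark.sparkContext.parallelize(range(100)).foreach(lambda _: counter.add(1))

# Only the driver can read the accumulated value
print(counter.value)  # 100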
More info:
- python - In Spark, RDDs are immutable, then how Accumulators are implemented? - Stack Overflow
- apache spark - When are accumulators truly reliable? - Stack Overflow
- Spark -- The Definitive Guide, Chapter 14
Which of the following code blocks reads in the JSON file stored at filePath, enforcing the schema expressed in JSON format in variable json_schema, shown in the code block below?
Code block:
json_schema = """
{"type": "struct",
 "fields": [
     {
         "name": "itemId",
         "type": "integer",
         "nullable": true,
         "metadata": {}
     },
     {
         "name": "supplier",
         "type": "string",
         "nullable": true,
         "metadata": {}
     }
 ]
}
"""
Spark provides a way to digest JSON-formatted strings as schemas. However, it is not trivial to use. Although slightly above exam difficulty, this question is beneficial to your exam preparation, since it helps you familiarize yourself with the concept of enforcing schemas on data you are reading in - a topic within the scope of the exam.
The first answer that jumps out here is the one that uses spark.read.schema instead of spark.read.json. Looking at the documentation of spark.read.schema (linked below), we notice that the
operator expects types pyspark.sql.types.StructType or str as its first argument. While variable json_schema is a string, the documentation states that the str should be 'a DDL-formatted string (For
example col0 INT, col1 DOUBLE)'. Variable json_schema does not contain a string in this type of format, so this answer option must be wrong.
With four potentially correct answers to go, we now look at the schema parameter of spark.read.json() (documentation linked below). Here, too, the schema parameter expects an input of type
pyspark.sql.types.StructType or 'a DDL-formatted string (For example col0 INT, col1 DOUBLE)'. We already know that json_schema does not follow this format, so we should focus on how we can
transform json_schema into pyspark.sql.types.StructType. In doing so, we also eliminate the option where schema=json_schema.
The option that includes schema=spark.read.json(json_schema) is also a wrong pick, since spark.read.json returns a DataFrame, and not a pyspark.sql.types.StructType type.
Ruling out the option which includes schema_of_json(json_schema) is rather difficult. The operator's documentation (linked below) states that it '[p]arses a JSON string and infers its schema in DDL
format'. This use case is slightly different from the case at hand: json_schema already is a schema definition, it does not make sense to 'infer' a schema from it. In the documentation you can see
an example use case which helps you understand the difference better. Here, you pass the string '{a: 1}' to schema_of_json() and the method infers a DDL-format schema such as STRUCT<a: BIGINT> from it.
In our case, we would end up with the output of schema_of_json() describing the schema of the JSON schema itself, instead of using the schema it defines. This is not the right answer option.
Now you may consider looking at the StructType.fromJson() method. It returns a variable of type StructType - exactly the type which the schema parameter of spark.read.json expects.
Although we could have looked at the correct answer option earlier, this explanation is kept as exhaustive as necessary to teach you how to systematically eliminate wrong answer options.
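Putting this together, a hedged sketch of how the correct approach could look (json.loads parses the JSON string into the dictionary that StructType.fromJson() expects; spark and filePath are assumed to be defined):
import json
from pyspark.sql.types import StructType

# Build a StructType from the JSON-formatted schema string
schema = StructType.fromJson(json.loads(json_schema))

# Enforce the schema while reading the JSON file
itemsDf = spark.read.json(filePath, schema=schema)
For comparison, a DDL-formatted string such as 'itemId INT, supplier STRING' could have been passed as the schema directly, but json_schema is not in that format.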
More info:
- pyspark.sql.DataFrameReader.schema --- PySpark 3.1.2 documentation
- pyspark.sql.DataFrameReader.json --- PySpark 3.1.2 documentation
- pyspark.sql.functions.schema_of_json --- PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3, Question 51 (Databricks import instructions)