The code block displayed below contains multiple errors. The code block should return a DataFrame that
contains only columns transactionId, predError, value and storeId of DataFrame
transactionsDf. Find the errors.
Code block:
transactionsDf.select([col(productId), col(f)])
Sample of transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+
A. The column names should be listed directly as arguments to the operator and not as a list.
B. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.
C. The select operator should be replaced by a drop operator.
D. The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.
E. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.
Which of the following code blocks returns a DataFrame showing the mean value of column "value" of DataFrame transactionsDf, grouped by its column storeId?
A. transactionsDf.groupBy(col(storeId).avg())
B. transactionsDf.groupBy("storeId").avg(col("value"))
C. transactionsDf.groupBy("storeId").agg(avg("value"))
D. transactionsDf.groupBy("storeId").agg(average("value"))
E. transactionsDf.groupBy("value").average()
Which of the following code blocks returns a new DataFrame in which column attributes of DataFrame itemsDf is renamed to feature0 and column supplier to feature1?
A. itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)
B. 1. itemsDf.withColumnRenamed("attributes", "feature0")
   2. itemsDf.withColumnRenamed("supplier", "feature1")
C. itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))
D. itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")
E. itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")
Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?
A. itemsDf.cache().count()
B. itemsDf.cache(eager=True)
C. cache(itemsDf)
D. itemsDf.cache().filter()
E. itemsDf.rdd.storeCopy()
Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?
A. DataFrame.repartition(12)
B. DataFrame.coalesce(6).shuffle()
C. DataFrame.coalesce(6)
D. DataFrame.coalesce(6, shuffle=True)
E. DataFrame.repartition(6)
Which of the following is one of the big performance advantages that Spark has over Hadoop?
A. Spark achieves great performance by storing data in the DAG format, whereas Hadoop can only use parquet files.
B. Spark achieves higher resiliency for queries since, different from Hadoop, it can be deployed on Kubernetes.
C. Spark achieves great performance by storing data and performing computation in memory, whereas large jobs in Hadoop require a large amount of relatively slow disk I/O operations.
D. Spark achieves great performance by storing data in the HDFS format, whereas Hadoop can only use parquet files.
E. Spark achieves performance gains for developers by extending Hadoop's DataFrames with a user-friendly API.
Which of the following code blocks returns DataFrame transactionsDf sorted in descending order by column predError, showing missing values last?
A. transactionsDf.sort(asc_nulls_last("predError"))
B. transactionsDf.orderBy("predError").desc_nulls_last()
C. transactionsDf.sort("predError", ascending=False)
D. transactionsDf.desc_nulls_last("predError")
E. transactionsDf.orderBy("predError").asc_nulls_last()
The code block shown below should return a DataFrame with two columns, itemId and col. In this DataFrame, for each element in column attributes of DataFrame itemsDf there should be a separate
row in which the column itemId contains the associated itemId from DataFrame itemsDf. The new DataFrame should only contain rows for those rows in DataFrame itemsDf whose column attributes
contains the element cozy.
A sample of DataFrame itemsDf is below.
Code block:
itemsDf.__1__(__2__).__3__(__4__, __5__(__6__))
A. 1. filter 2. array_contains("cozy") 3. select 4. "itemId" 5. explode 6. "attributes"
B. 1. where 2. "array_contains(attributes, 'cozy')" 3. select 4. itemId 5. explode 6. attributes
C. 1. filter 2. "array_contains(attributes, 'cozy')" 3. select 4. "itemId" 5. map 6. "attributes"
D. 1. filter 2. "array_contains(attributes, cozy)" 3. select 4. "itemId" 5. explode 6. "attributes"
E. 1. filter 2. "array_contains(attributes, 'cozy')" 3. select 4. "itemId" 5. explode 6. "attributes"
Which of the following statements about data skew is incorrect?
A. Spark will not automatically optimize skew joins by default.
B. Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.
C. In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.
D. To mitigate skew, Spark automatically disregards null values in keys when joining.
E. Salting can resolve data skew.
The code block shown below should return a DataFrame with all columns of DataFrame transactionsDf, but at most 2 rows in which column productId has a value of at least 2. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__).__3__
A. 1. where 2. "productId" > 2 3. max(2)
B. 1. where 2. transactionsDf[productId] >= 2 3. limit(2)
C. 1. filter 2. productId > 2 3. max(2)
D. 1. filter 2. col("productId") >= 2 3. limit(2)
E. 1. where 2. productId >= 2 3. limit(2)