The code block displayed below contains multiple errors. The code block should return a DataFrame that
contains only columns transactionId, predError, value and storeId of DataFrame
transactionsDf. Find the errors.
Code block:
transactionsDf.select([col(productId), col(f)])
Sample of transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
+-------------+---------+-----+-------+---------+----+
A. The column names should be listed directly as arguments to the operator and not as a list.
B. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all column names should be expressed as strings without being wrapped in a col() operator.
C. The select operator should be replaced by a drop operator.
D. The column names should be listed directly as arguments to the operator and not as a list and following the pattern of how column names are expressed in the code block, columns productId and f should be replaced by transactionId, predError, value and storeId.
E. The select operator should be replaced by a drop operator, the column names should be listed directly as arguments to the operator and not as a list, and all col() operators should be removed.
Which of the following code blocks returns a DataFrame showing the mean value of column "value" of DataFrame transactionsDf, grouped by its column storeId?
A. transactionsDf.groupBy(col(storeId).avg())
B. transactionsDf.groupBy("storeId").avg(col("value"))
C. transactionsDf.groupBy("storeId").agg(avg("value"))
D. transactionsDf.groupBy("storeId").agg(average("value"))
E. transactionsDf.groupBy("value").average()
Which of the following code blocks returns a new DataFrame in which column attributes of DataFrame itemsDf is renamed to feature0 and column supplier to feature1?
A. itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)
B. 1. itemsDf.withColumnRenamed("attributes", "feature0")
   2. itemsDf.withColumnRenamed("supplier", "feature1")
C. itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))
D. itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")
E. itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")
Which of the following code blocks stores a part of the data in DataFrame itemsDf on executors?
A. itemsDf.cache().count()
B. itemsDf.cache(eager=True)
C. cache(itemsDf)
D. itemsDf.cache().filter()
E. itemsDf.rdd.storeCopy()
Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?
A. DataFrame.repartition(12)
B. DataFrame.coalesce(6).shuffle()
C. DataFrame.coalesce(6)
D. DataFrame.coalesce(6, shuffle=True)
E. DataFrame.repartition(6)
Which of the following is one of the big performance advantages that Spark has over Hadoop?
A. Spark achieves great performance by storing data in the DAG format, whereas Hadoop can only use parquet files.
B. Spark achieves higher resiliency for queries since, different from Hadoop, it can be deployed on Kubernetes.
C. Spark achieves great performance by storing data and performing computation in memory, whereas large jobs in Hadoop require a large amount of relatively slow disk I/O operations.
D. Spark achieves great performance by storing data in the HDFS format, whereas Hadoop can only use parquet files.
E. Spark achieves performance gains for developers by extending Hadoop's DataFrames with a user-friendly API.
Which of the following code blocks returns DataFrame transactionsDf sorted in descending order by column predError, showing missing values last?
A. transactionsDf.sort(asc_nulls_last("predError"))
B. transactionsDf.orderBy("predError").desc_nulls_last()
C. transactionsDf.sort("predError", ascending=False)
D. transactionsDf.desc_nulls_last("predError")
E. transactionsDf.orderBy("predError").asc_nulls_last()
The code block shown below should return a DataFrame with two columns, itemId and col. In this DataFrame, for each element in column attributes of DataFrame itemsDf there should be a separate
row in which the column itemId contains the associated itemId from DataFrame itemsDf. The new DataFrame should only contain rows for those rows in DataFrame itemsDf whose column attributes
contains the element cozy.
A sample of DataFrame itemsDf is below.
Code block:
itemsDf.__1__(__2__).__3__(__4__, __5__(__6__))
A. 1. filter 2. array_contains("cozy") 3. select 4. "itemId" 5. explode 6. "attributes"
B. 1. where 2. "array_contains(attributes, 'cozy')" 3. select 4. itemId 5. explode 6. attributes
C. 1. filter 2. "array_contains(attributes, 'cozy')" 3. select 4. "itemId" 5. map 6. "attributes"
D. 1. filter 2. "array_contains(attributes, cozy)" 3. select 4. "itemId" 5. explode 6. "attributes"
E. 1. filter 2. "array_contains(attributes, 'cozy')" 3. select 4. "itemId" 5. explode 6. "attributes"
Which of the following statements about data skew is incorrect?
A. Spark will not automatically optimize skew joins by default.
B. Broadcast joins are a viable way to increase join performance for skewed data over sort-merge joins.
C. In skewed DataFrames, the largest and the smallest partition consume very different amounts of memory.
D. To mitigate skew, Spark automatically disregards null values in keys when joining.
E. Salting can resolve data skew.
The code block shown below should return a DataFrame with all columns of DataFrame transactionsDf, but at most 2 rows in which column productId has a value of at least 2. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__).__3__
A. 1. where 2. "productId" > 2 3. max(2)
B. 1. where 2. transactionsDf[productId] >= 2 3. limit(2)
C. 1. filter 2. productId > 2 3. max(2)
D. 1. filter 2. col("productId") >= 2 3. limit(2)
E. 1. where 2. productId >= 2 3. limit(2)