Which of the following code blocks can be used to save DataFrame transactionsDf to memory only, recalculating partitions that do not fit in memory when they are needed?
A. from pyspark import StorageLevel
   transactionsDf.cache(StorageLevel.MEMORY_ONLY)
B. transactionsDf.cache()
C. transactionsDf.storage_level('MEMORY_ONLY')
D. transactionsDf.persist()
E. transactionsDf.clear_persist()
F. from pyspark import StorageLevel
   transactionsDf.persist(StorageLevel.MEMORY_ONLY)
Which of the following code blocks returns a new DataFrame in which column attributes of DataFrame itemsDf is renamed to feature0 and column supplier to feature1?
A. itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)
B. itemsDf.withColumnRenamed("attributes", "feature0")
   itemsDf.withColumnRenamed("supplier", "feature1")
C. itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))
D. itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")
E. itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")
Which of the following code blocks returns a new DataFrame with the same columns as DataFrame transactionsDf, except for columns predError and value, which should be removed?
A. transactionsDf.drop(["predError", "value"])
B. transactionsDf.drop("predError", "value")
C. transactionsDf.drop(col("predError"), col("value"))
D. transactionsDf.drop(predError, value)
E. transactionsDf.drop("predError and value")
The code block shown below should return the number of columns in the CSV file stored at location filePath. Only lines that do not start with a # character should be read from the CSV file. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
__1__(__2__.__3__.csv(filePath, __4__).__5__)
A. 1. size 2. spark 3. read() 4. escape='#' 5. columns
B. 1. DataFrame 2. spark 3. read() 4. escape='#' 5. shape[0]
C. 1. len 2. pyspark 3. DataFrameReader 4. comment='#' 5. columns
D. 1. size 2. pyspark 3. DataFrameReader 4. comment='#' 5. columns
E. 1. len 2. spark 3. read 4. comment='#' 5. columns
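The task described in the question above (count the columns of a CSV file while ignoring lines that begin with #) can be illustrated without Spark. The following is a minimal plain-Python sketch, not the Spark API; the helper name count_columns and the sample data are hypothetical and exist only for illustration.

```python
import csv
import io


def count_columns(text, comment="#"):
    """Return the number of fields in the first non-comment line of CSV text."""
    # Drop every line that starts with the comment character before parsing.
    data_lines = (line for line in io.StringIO(text) if not line.startswith(comment))
    reader = csv.reader(data_lines)
    first_row = next(reader, None)
    return 0 if first_row is None else len(first_row)


# Hypothetical sample: comment lines mixed into a three-column file.
sample = (
    "# generated for illustration\n"
    "id,name,price\n"
    "# another comment\n"
    "1,widget,9.99\n"
)
n_cols = count_columns(sample)
print(n_cols)
```

The sketch filters comment lines before the parser sees them, so a comment line can never be mistaken for a header or data row.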
Which of the following describes the characteristics of accumulators?
A. Accumulators are used to pass around lookup tables across the cluster.
B. All accumulators used in a Spark application are listed in the Spark UI.
C. Accumulators can be instantiated directly via the accumulator(n) method of the pyspark.RDD module.
D. Accumulators are immutable.
E. If an action including an accumulator fails during execution and Spark manages to restart the action and complete it successfully, only the successful attempt will be counted in the accumulator.
Which of the following statements about storage levels is incorrect?
A. The cache operator on DataFrames is evaluated like a transformation.
B. In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.
C. Caching can be undone using the DataFrame.unpersist() operator.
D. MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.
E. DISK_ONLY will not use the worker node's memory.
Which of the following describes slots?
A. Slots are dynamically created and destroyed in accordance with an executor's workload.
B. To optimize I/O performance, Spark stores data on disk in multiple slots.
C. A Java Virtual Machine (JVM) working as an executor can be considered as a pool of slots for task execution.
D. A slot is always limited to a single core.
E. Slots are the communication interface for executors and are used for receiving commands and sending results to the driver.
The code block shown below should set the number of partitions that Spark uses when shuffling data for joins or aggregations to 100. Choose the answer that correctly fills the blanks in the code block to accomplish this.
spark.sql.shuffle.partitions
__1__.__2__.__3__(__4__, 100)
A. 1. spark 2. conf 3. set 4. "spark.sql.shuffle.partitions"
B. 1. pyspark 2. config 3. set 4. spark.shuffle.partitions
C. 1. spark 2. conf 3. get 4. "spark.sql.shuffle.partitions"
D. 1. pyspark 2. config 3. set 4. "spark.sql.shuffle.partitions"
E. 1. spark 2. conf 3. set 4. "spark.sql.aggregate.partitions"
Which of the following code blocks returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that any row can appear more than once in the returned DataFrame?
A. transactionsDf.resample(0.15, False, 3142)
B. transactionsDf.sample(0.15, False, 3142)
C. transactionsDf.sample(0.15)
D. transactionsDf.sample(0.85, 8429)
E. transactionsDf.sample(True, 0.15, 8261)
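To see why sampling with replacement at a fraction of 0.15 yields "about" 150 rows from 1000 rather than exactly 150, the idea can be modelled in plain Python. This is a sketch under a stated assumption, not Spark's implementation: each row is given an independent Poisson-distributed count with mean equal to the fraction. The helper names poisson_draw and sample_with_replacement are hypothetical.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_with_replacement(rows, fraction, seed):
    """Assumed model: repeat each row a Poisson(fraction) number of times."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        # A count of 2 or more means the same row appears more than once.
        sampled.extend([row] * poisson_draw(rng, fraction))
    return sampled


rows = list(range(1000))
sampled = sample_with_replacement(rows, 0.15, seed=8261)
# Total rows drawn, and how many distinct rows they cover.
print(len(sampled), len(set(sampled)))
```

The size of the result varies with the seed around 150, and the number of distinct rows can be smaller than the total because a row may be drawn several times.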
Which of the following describes a narrow transformation?
A. A narrow transformation is an operation in which data is exchanged across partitions.
B. A narrow transformation is a process in which data from multiple RDDs is used.
C. A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.
D. A narrow transformation is an operation in which data is exchanged across the cluster.
E. A narrow transformation is an operation in which no data is exchanged across the cluster.
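The difference between operations that keep records in their partition and operations that move records between partitions can be sketched with plain Python lists standing in for partitions. This is an illustrative model only, not Spark code; the names narrow_map and wide_group_by_key and the byte-sum partitioning rule are assumptions made for the example.

```python
from collections import defaultdict

# Two hypothetical partitions of (key, value) records.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]


def narrow_map(parts, fn):
    """Each output partition is computed from exactly one input partition."""
    return [[fn(record) for record in part] for part in parts]


def wide_group_by_key(parts, num_out):
    """Records sharing a key must meet in one partition, so records may move."""
    out = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            # Assumed partitioning rule: sum of the key's bytes modulo num_out.
            target = sum(key.encode()) % num_out
            out[target][key].append(value)
    return [dict(d) for d in out]


doubled = narrow_map(partitions, lambda kv: (kv[0], kv[1] * 2))
grouped = wide_group_by_key(partitions, num_out=2)
print(doubled)
print(grouped)
```

In the mapped result every record stays in the partition it started in, while in the grouped result the two records with key "a" end up together even though they began in different partitions.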