Which of the following statements describes a Spark ML estimator?
A. An estimator is a hyperparameter arid that can be used to train a model
B. An estimator chains multiple alqorithms toqether to specify an ML workflow
C. An estimator is a trained ML model which turns a DataFrame with features into a DataFrame with predictions
D. An estimator is an alqorithm which can be fit on a DataFrame to produce a Transformer
E. An estimator is an evaluation tool to assess to the quality of a model
A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model bycomparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model.
Which of the following possible explanations for this difference is invalid?
A. The second model is much more accurate than the first model
B. The data scientist failed to exponentiate the predictions in the second model prior tocomputingthe RMSE
C. The datascientist failed to take the logof the predictions in the first model prior to computingthe RMSE
D. The first model is much more accurate than the second model
E. The RMSE is an invalid evaluation metric for regression problems
Which of the following evaluation metrics is not suitable to evaluate runs in AutoML experiments for regression problems?
A. F1
B. R-squared
C. MAE
D. MSE
Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?
A. pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata
B. pandas API on Spark DataFrames are more performant than Spark DataFrames
C. pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata
D. pandas API on Spark DataFrames are less mutable versions of Spark DataFrames
E. pandas API on Spark DataFrames are unrelated to Spark DataFrames
A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column's median value.
They have developed the following code block to accomplish this task:

The code block is not accomplishing the task.
Which reasons describes why the code block is not accomplishing the imputation task?
A. It does not impute both the training and test data sets.
B. The inputCols and outputCols need to be exactly the same.
C. The fit method needs to be called instead of transform.
D. It does not fit the imputer on the data to create an ImputerModel.
A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:
prediction DOUBLE actual DOUBLE Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

A. Option A
B. Option B
C. Option C
D. Option D
E. Option E
A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective functionobjective_functionand they have defined the search spacesearch_space.
As a result, they have the following code block: Which of the following changes do they need to make to the above code block in order to accomplish the task?

A. Change SparkTrials() to Trials()
B. Reduce num_evals to be less than 10
C. Change fmin() to fmax()
D. Remove the trials=trials argument
E. Remove the algo=tpe.suggest argument
A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.
Which of the following approaches can the data scientist use to accomplish this MLflow run organization?
A. Theycan turn on Databricks Autologging
B. Theycan specify nested=True when startingthe child run for each unique combination of hyperparameter values
C. Theycan start each child run inside the parentrun's indented code block usingmlflow.start runO
D. They can start each child run with the same experiment ID as the parent run
E. They can specify nested=True when starting the parent run for the tuningprocess
Which of the following machine learning algorithms typically uses bagging?
A. Gradient boosted trees B. K-means
C. Random forest
D. Linear regression
E. Decision tree
What is the name of the method that transforms categorical features into a series of binary indicator feature variables?
A. Leave-one-out encoding
B. Target encoding
C. One-hot encoding
D. Categorical
E. String indexing