
DS-200 Online Practice Questions and Answers

Question 4

You are working with a logistic regression model to predict the probability that a user will click on an ad. Your model has hundreds of features, and you're not sure if all of those features are helping your prediction. Which regularization technique should you use to prune features that aren't contributing to the model?

A. Convex

B. Uniform

C. L2

D. L1
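
For reference, the reason L1 regularization is associated with feature pruning is that the L1 penalty can drive the coefficients of uninformative features to exactly zero, whereas an L2 penalty only shrinks coefficients toward zero. A minimal sketch of the effect, assuming scikit-learn and a synthetic dataset (neither is part of the original question):

# Minimal sketch: L1 vs. L2 regularization on a synthetic ad-click-style dataset.
# Assumes scikit-learn is available; the data and parameters are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 200 features, only 10 of which are actually informative.
X, y = make_classification(n_samples=5000, n_features=200, n_informative=10,
                           n_redundant=0, random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

print("non-zero coefficients with L1:", np.sum(l1.coef_ != 0))  # a small subset
print("non-zero coefficients with L2:", np.sum(l2.coef_ != 0))  # typically all 200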

Question 5

You have a directory containing a number of comma-separated files. Each file has three columns and each filename has a .csv extension. You want to have a single tab-separated file (all.tsv) that contains all the rows from all the files.

Which command is guaranteed to produce the desired output if you have more than 20,000 files to process?

A. find . -name '*.csv' -print0 | xargs -0 cat | tr ',' '\t' > all.tsv

B. find . -name '*.csv' | cat | awk 'BEGIN {FS = ","; OFS = "\t"} {print $1, $2, $3}' > all.tsv

C. find . -name '*.csv' | tr ',' '\t' | cat > all.tsv

D. find . -name '*.csv' | cat > all.tsv

E. cat *.csv > all.tsv
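
Whichever shell pipeline you pick, it has to read every file's contents (not just its name) and cope with tens of thousands of files without hitting the shell's argument-length limit. As a rough cross-check, the same task in Python, opening files one at a time (the directory layout and helper choices here are assumptions, not part of the exam question):

# Rough Python equivalent of the desired pipeline: walk the directory tree,
# read every *.csv file's rows, and write them as tab-separated lines to all.tsv.
# Because files are opened one at a time, the 20,000-file case poses no
# argument-list problem (the issue that can break a plain `cat *.csv` when the
# expanded glob gets too long).
import csv
import os

with open("all.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for dirpath, _dirnames, filenames in os.walk("."):
        for name in filenames:
            if name.endswith(".csv"):
                with open(os.path.join(dirpath, name), newline="") as f:
                    for row in csv.reader(f):
                        writer.writerow(row)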

Question 6

What are three benefits of running feature selection analysis before fitting a classification model?

A. Eliminates the need to include a regularization term

B. Reduces the number of subjective decisions required to construct the model

C. Guarantees the optimality of the final model

D. Speeds up the model fitting process

E. Develops an understanding of the importance of different features

F. Improves the predictive performance of the model
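
To make the workflow concrete, a typical pattern is to run a feature-selection step and then fit the classifier only on the surviving features. A minimal sketch assuming scikit-learn; the dataset and the univariate selector are illustrative choices, not prescribed by the question:

# Minimal sketch: univariate feature selection followed by model fitting.
# Fewer input features generally means faster fitting and a model whose
# feature importances are easier to interpret.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=2000, n_features=100, n_informative=8,
                           random_state=0)

model = make_pipeline(
    SelectKBest(score_func=f_classif, k=10),   # keep the 10 strongest features
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)

# Which features survived the selection step?
kept = model.named_steps["selectkbest"].get_support(indices=True)
print("selected feature indices:", kept)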

Question 7

You have acquired a new data source of millions of customer records, and you've loaded this data into HDFS. Prior to analysis, you want to change all customer registration dates to the same date format, make all addresses uppercase, and remove all customer names (for anonymization). Which process will accomplish all three objectives?

A. Adapt the data cleansing module in Mahout to your data, and invoke the Mahout library when you run your analysis

B. Pull this data into an RDBMS using sqoop and scrub records using stored procedures

C. Write a script that receives records on stdin, corrects them, and then writes them to stdout. Then, invoke this script in a map-only Hadoop Streaming Job

D. Write a MapReduce job with a mapper to change words to uppercase and to reduce different forms of dates to a single form
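
Option C describes an ordinary stdin-to-stdout filter that Hadoop Streaming can run as a map-only job. A minimal sketch of such a mapper, assuming a hypothetical comma-delimited layout of name, address, registration date (the field order, input date formats, and target format are assumptions for illustration):

#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper: reads records on stdin, normalizes the
# registration date, uppercases the address, drops the customer name, and
# writes the cleansed record to stdout. Intended to run as a map-only job.
import sys
from datetime import datetime

# Input date formats we expect to encounter (an assumption for this sketch).
DATE_FORMATS = ["%m/%d/%Y", "%d-%m-%Y", "%Y-%m-%d"]

def normalize_date(raw):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw.strip()  # leave unparseable dates untouched

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) < 3:
        continue  # skip malformed records
    _name, address, reg_date = fields[0], fields[1], fields[2]
    print(",".join([address.upper(), normalize_date(reg_date)]))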

Question 8

A company has 20 software engineers working to fix bugs on a project. Over the past week, the team has fixed 100 bugs. Although the average number of bugs fixed per engineer is five, none of the engineers fixed exactly five bugs last week.

You want to understand how productive each engineer is at fixing bugs. What is the best way to visualize the distribution of bug fixes per engineer?

A. A bar chart of engineers vs. number of bugs fixed

B. A scatter plot of engineers vs. number of bugs fixed

C. A normal distribution of the mean and standard deviation of bug fixes per engineer

D. A histogram that groups engineers together based on the number of bugs they fixed
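
A histogram answers the question because it bins engineers by how many bugs each fixed, making gaps in the distribution (such as nobody at exactly five) immediately visible. A minimal matplotlib sketch with made-up counts, since the question does not supply per-engineer numbers:

# Minimal sketch: histogram of bugs fixed per engineer.
# The counts are made up: 20 engineers, 100 bugs in total (mean of 5 per
# engineer), yet nobody fixed exactly 5 -- a fact a histogram makes obvious.
import matplotlib.pyplot as plt

bugs_fixed = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9]

plt.hist(bugs_fixed, bins=range(0, 12), edgecolor="black")
plt.xlabel("Bugs fixed last week")
plt.ylabel("Number of engineers")
plt.title("Distribution of bug fixes per engineer")
plt.show()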

Question 9

You are building a system to perform outlier detection for a large online retailer. You need to detect whether the total dollar value of sales is outside the norm for each U.S. state, as determined from the physical location of the buyer for each purchase. The retailer's data sources are scattered across multiple systems and databases and are unorganized, with little coordination or shared data or keys between the various data sources.

Below are the sources of data available to you. Determine which three will give you the smallest set of data sources but still allow you to implement the outlier detector by state.

A. Database of employees that includes only the employee ID, start date, and department

B. Database of users that contains only their user ID, name, and a list of every item the user has viewed

C. Transaction log that contains only basket ID, basket amount, time of sale completion, and a session ID

D. Database of user sessions that includes only session ID, corresponding user ID, and the corresponding IP address

E. External database mapping IP addresses to geographic locations

F. Database of items that includes only the item name, item ID, and warehouse location

G. Database of shipments that includes only the basket ID, shipment address, shipment date, and shipment method
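
The reasoning behind the minimal combination is that each sale needs a dollar amount, a link to a session, and a way to turn that session's IP address into a state. A rough pandas sketch of the joins with hypothetical schemas and values (none of these column names come from the question):

# Rough sketch: join three minimal sources and aggregate sales by state.
# DataFrame contents and column names are hypothetical.
import pandas as pd

transactions = pd.DataFrame({          # transaction log
    "basket_id": [1, 2, 3],
    "basket_amount": [120.0, 35.5, 990.0],
    "session_id": ["s1", "s2", "s3"],
})
sessions = pd.DataFrame({              # user sessions
    "session_id": ["s1", "s2", "s3"],
    "user_id": ["u1", "u2", "u3"],
    "ip_address": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
})
ip_to_state = pd.DataFrame({           # IP-to-geography mapping
    "ip_address": ["10.0.0.1", "10.0.0.2", "10.0.0.3"],
    "state": ["CA", "NY", "CA"],
})

sales_by_state = (
    transactions
    .merge(sessions, on="session_id")
    .merge(ip_to_state, on="ip_address")
    .groupby("state")["basket_amount"]
    .sum()
)
print(sales_by_state)  # per-state totals to feed the outlier detector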

Question 10

Consider the following sample from a distribution that contains a continuous X and label Y that is either A or B:

Which is the best choice of cut points for X if you want to discretize these values into three buckets in a way that minimizes the sum of chi-square values?

A. X 5 and X 8

B. X 4 and X 6

C. X 3 and X 8

D. X 3 and X 6

E. X 2 and X 9
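
The sample the question refers to is not reproduced here, so only the mechanics can be sketched: under a ChiMerge-style reading, each candidate pair of cut points splits X into three buckets, a chi-square statistic is computed for each pair of adjacent buckets against the A/B labels, and the statistics are summed. A sketch assuming SciPy and entirely made-up data and cut points:

# Sketch of chi-square-based scoring of candidate cut points.
# The x values, labels, and candidates below are NOT the exam's sample.
import numpy as np
from scipy.stats import chi2_contingency

x = np.array([1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 9, 9])
y = np.array(list("AAAABBBBAAAA"))      # one label (A or B) per value of x

def chi_square_sum(cuts):
    # Sum of chi-square statistics over the two adjacent pairs of the
    # three buckets induced by the two cut points.
    buckets = np.digitize(x, sorted(cuts))   # bucket index 0, 1, or 2
    total = 0.0
    for b in (0, 1):                         # adjacent pairs: (0, 1) and (1, 2)
        mask = np.isin(buckets, [b, b + 1])
        table = np.array([
            [np.sum((buckets[mask] == g) & (y[mask] == label))
             for label in ("A", "B")]
            for g in (b, b + 1)
        ])
        total += chi2_contingency(table)[0]
    return total

# Hypothetical candidate cut points, scored by the summed chi-square.
for cuts in [(5, 8), (4, 6), (3, 8)]:
    print(cuts, round(chi_square_sum(cuts), 3))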

Question 11

You want to understand more about how users browse your public website. For example, you want to know which pages they visit prior to placing an order. You have a server farm of 200 web servers hosting your website. Which is the most efficient process to gather these web servers' access logs into your Hadoop cluster for analysis?

A. Sample the web server logs from the web servers and copy them into HDFS using curl

B. Channel these click streams into Hadoop using Hadoop Streaming

C. Write a MapReduce job with the web servers for mappers and the Hadoop cluster nodes for reducers

D. Import all user clicks from your OLTP databases into Hadoop using Sqoop

E. Ingest the server web logs into HDFS using Flume

Question 12

Given the following sample of numbers from a distribution: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89

What are two benefits of using the five-number summary of sample percentiles to summarize a data set?

A. You can calculate unbiased estimators for the parameters of the distribution

B. It's robust to outliers

C. It's well-defined for any probability distribution

D. You can calculate it quickly using a relational database like MySQL, even when you have a large sample
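
For reference, the five-number summary of the sample given above is its minimum, lower quartile, median, upper quartile, and maximum. A quick NumPy computation (note that the quartile values depend on the interpolation method a given tool uses):

# Five-number summary of the sample from the question.
import numpy as np

sample = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
summary = np.percentile(sample, [0, 25, 50, 75, 100])
print(summary)  # [ 1.   2.5  8.  27.5 89. ] with NumPy's default (linear) interpolation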

Question 13

You've built a model that has ten different variables with complicated independence relationships between them, including both continuous and discrete variables with complicated, multi-parameter distributions. Computing the joint probability distribution is complex, but it turns out that computing the conditional probabilities for the variables is easy. What is the most computationally efficient method for computing the expected value?

A. Method of moments

B. Markov Chain Monte Carlo

C. Gibbs sampling

D. Numerical quadrature
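
Gibbs sampling fits this situation because it only ever draws from the easy full conditionals, cycling through the variables and averaging the draws to estimate an expectation, without evaluating the awkward joint distribution. A minimal two-variable sketch, assuming a standard bivariate normal with correlation rho (purely illustrative, not the model described in the question):

# Minimal Gibbs sampler for a standard bivariate normal with correlation rho.
# Each full conditional is itself normal, so sampling it is easy even though
# we never touch the joint density directly; averaging the draws estimates E[X].
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
n_iter, burn_in = 20000, 1000

x, y = 0.0, 0.0
samples = []
for i in range(n_iter):
    # Full conditionals: X | Y=y ~ N(rho*y, 1-rho^2), and symmetrically for Y.
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    if i >= burn_in:
        samples.append(x)

print("estimated E[X]:", np.mean(samples))  # close to the true value of 0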

Exam Code: DS-200
Exam Name: Data Science Essentials
Last Update: Apr 27, 2024
Questions: 60