Practice Free D-DS-FN-23 Exam Online Questions
You have created a Logistic Regression model to predict customer churn for your company. The company’s Marketing department wants to use your model to identify at-risk customers and offer incentives to keep them from leaving.
Using two different thresholds for the model provides the two confusion matrices shown in the graphic. Marketing understands the relative costs of missing at-risk customers versus offering incentives to customers who are not at risk. Therefore, you need their advice on how to set the appropriate threshold on the churn model.
You are meeting with the Marketing team. In the meeting, you plan to state: “Raising the threshold from 0.5 to 0.75 reduces the number of unnecessary incentives offered, at the cost of missing more of the customers who churned.”
What is the most appropriate visual to reinforce this statement?
(Options A through D are images that are not reproduced here.)
- A . Option A
- B . Option B
- C . Option C
- D . Option D
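The trade-off described in the planned statement can be sketched in a few lines of Python. This is a minimal illustration only: the scores and labels below are hypothetical values invented for the example, not the data behind the exam graphic, and `confusion_matrix` is a helper defined here rather than a library function.

```python
def confusion_matrix(scores, labels, threshold):
    """Count TP, FP, FN, TN when churn is predicted for score >= threshold."""
    tp = fp = fn = tn = 0
    for score, churned in zip(scores, labels):
        predicted = score >= threshold
        if predicted and churned:
            tp += 1  # at-risk customer correctly flagged
        elif predicted and not churned:
            fp += 1  # unnecessary incentive
        elif not predicted and churned:
            fn += 1  # missed churner
        else:
            tn += 1
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical predicted churn probabilities and true outcomes (1 = churned).
scores = [0.95, 0.85, 0.80, 0.70, 0.65, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

low = confusion_matrix(scores, labels, 0.5)
high = confusion_matrix(scores, labels, 0.75)
print("threshold 0.50:", low)
print("threshold 0.75:", high)
```

On this made-up data, raising the threshold lowers the false positives (unnecessary incentives) and raises the false negatives (missed churners), which is the pattern the statement describes.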
Refer to the exhibit.
In the exhibit, the table shows the values for the input Boolean attributes "A", "B", and "C". It also shows the values for the output attribute "class".
Which decision tree is valid for the data?
- A . Tree B
- B . Tree A
- C . Tree C
- D . Tree D
Refer to the exhibit.
In association rules, for itemsets X and Y, which expression defines leverage?
- A . a
- B . b
- C . c
- D . d
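The exhibit with the candidate expressions is not reproduced here. As a reference point, leverage is commonly defined as Support(X ∧ Y) − Support(X) × Support(Y), that is, how much more often X and Y appear together than would be expected if they were independent. The sketch below computes it on a small, hypothetical set of transactions; the function names and data are assumptions made for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def leverage(transactions, x, y):
    """Support of X and Y together minus the product of their supports."""
    joint = support(transactions, set(x) | set(y))
    return joint - support(transactions, x) * support(transactions, y)


# Hypothetical market-basket data, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"butter"},
    {"bread", "butter"},
]

value = leverage(transactions, {"bread"}, {"milk"})
print(round(value, 4))
```

A leverage of zero would mean the two itemsets co-occur exactly as often as independence predicts; a positive value means they co-occur more often than that.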
You submit a MapReduce job to a Hadoop cluster and notice that although the job was successfully submitted, it is not completing.
What should you do?
- A . Ensure that the TaskTracker is running.
- B . Ensure that the JobTracker is running.
- C . Ensure that the NameNode is running.
- D . Ensure that a DataNode is running.
When is a Naïve Bayesian Classifier model preferred over a Logistic Regression model for classification?
- A . When using several categorical input variables with over 1000 possible values each
- B . When an estimate of the probability of an outcome is needed, not just which class it is in
- C . When all input variables are numerical
- D . When some of the input variables might be correlated
What is the purpose of the process step "parsing" in text analysis?
- A . Imposes a structure on the unstructured/semi-structured text for downstream analysis
- B . Performs the search and/or retrieval in finding a specific topic or an entity in a document
- C . Executes the clustering and classification to organize the contents
- D . Computes the TF-IDF values for all keywords and indices
You are using k-means clustering to classify heart patients for a hospital. You have chosen Patient Sex, Height, Weight, Age and Income as measures and have used 3 clusters.
When you create a pair-wise plot of the clusters, you notice that there is significant overlap between the clusters.
What should you do?
- A . Identify additional measures to add to the analysis
- B . Remove one of the measures
- C . Decrease the number of clusters
- D . Increase the number of clusters
Which statement about linear regression is correct?
- A . All input variables must be continuous
- B . All input variables must be discrete
- C . Outcome variable is discrete
- D . Outcome variable is continuous
You do a Student’s t-test to compare the average test scores of sample groups from populations A and B. Group A averaged 10 points higher than group B. You find that this difference is significant, with a p-value of 0.03.
What does that mean?
- A . There is a 3% chance that you have identified a difference between the populations when in reality there is none.
- B . The difference in scores between a sample from population A and a sample from population B will tend to be within 3% of 10 points.
- C . There is a 3% chance that a sample group from population A will score 10 points higher than a sample group from population B.
- D . There is a 97% chance that a sample group from population A will score 10 points higher than a sample group from population B.
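To make the idea of a p-value concrete, the sketch below uses an exact permutation test rather than Student's t-test (a deliberate substitution, since it needs only the standard library). It asks: if group labels carried no information, what fraction of all possible relabelings would produce a difference in means at least as large as the one observed? The scores are hypothetical and the function name is an assumption made for the example.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """One-sided exact permutation p-value for mean(A) - mean(B)."""
    pooled = group_a + group_b
    n_a = len(group_a)
    n_b = len(group_b)
    total = sum(pooled)
    observed = sum(group_a) / n_a - sum(group_b) / n_b

    extreme = 0
    count = 0
    # Enumerate every way of choosing which pooled scores are labeled "A".
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = sum_a / n_a - (total - sum_a) / n_b
        if diff >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical test scores, invented for illustration.
group_a = [82, 85, 88, 90, 95]
group_b = [70, 72, 75, 78, 80]

p = permutation_p_value(group_a, group_b)
print(p)
```

Because every score in the hypothetical group A exceeds every score in group B, only the observed labeling out of the 252 possible splits reaches the observed difference, so the p-value here is 1/252.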
A data scientist plans to classify the sentiment polarity of 10,000 product reviews collected from the Internet.
Assuming labeled training data is available, what is the most appropriate model to use?
- A . Naïve Bayesian classifier
- B . Linear regression
- C . Logistic regression
- D . K-means clustering
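For readers who have not seen one of these models applied to labeled text, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one (Laplace) smoothing. The four training reviews, the whitespace tokenization, and the function names are all assumptions made for illustration; a real project would more likely use an established library and far more data.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs is a list of (text, label) pairs; returns the fitted counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the score.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled reviews, invented for illustration.
training = [
    ("great product works well", "pos"),
    ("love it great value", "pos"),
    ("terrible quality broke fast", "neg"),
    ("awful waste of money", "neg"),
]
model = train(training)
print(predict(model, "great value"))
print(predict(model, "terrible waste of money"))
```

The classifier treats each word as conditionally independent given the label, which is the "naïve" assumption that keeps training and prediction this simple.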