Practice Free MLS-C01 Exam Online Questions
A gaming company has launched an online game where people can start playing for free, but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users.
The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features, including user age, device, location, and play patterns.
Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory.
Which of the following approaches should the Data Science team take to mitigate this issue? (Select TWO.)
- A . Add more deep trees to the random forest to enable the model to learn more features.
- B . Include a copy of the samples in the test dataset in the training dataset.
- C . Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data.
- D . Change the cost function so that false negatives have a higher impact on the cost value than false positives.
- E . Change the cost function so that false positives have a higher impact on the cost value than false negatives.
C, D
Explanation:
The Data Science team is facing a problem of imbalanced data, where the positive class (paid users) is much less frequent than the negative class (non-paid users). This can cause the random forest model to be biased towards the majority class and have poor performance on the minority class. To mitigate this issue, the Data Science team can try the following approaches:
C) Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data. This is a technique called data augmentation, which can help increase the size and diversity of the training data for the minority class. This can help the random forest model learn more features and patterns from the positive class and reduce the imbalance ratio.
D) Change the cost function so that false negatives have a higher impact on the cost value than false positives. This is a technique called cost-sensitive learning, which can assign different weights or costs to different classes or errors. By assigning a higher cost to false negatives (predicting non-paid when the user is actually paid), the random forest model can be more sensitive to the minority class and try to minimize the misclassification of the positive class.
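The following is a minimal scikit-learn sketch of how approaches C and D could be combined, offered only as an illustration rather than the team's actual implementation. It assumes `X` is the 200-column feature matrix and `y` the binary labels (1 = paid within 1 year); the oversampling ratio, noise scale, and class weights are illustrative and should be tuned on a validation set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: (n_samples, 200) feature matrix, y: labels (1 = paid within 1 year, 0 = did not pay) -- assumed inputs
rng = np.random.default_rng(42)

# Approach C: oversample the minority class by duplicating positives and adding a small amount of noise
pos_idx = np.flatnonzero(y == 1)
n_copies = 20  # illustrative oversampling factor
X_aug = np.repeat(X[pos_idx], n_copies, axis=0)
X_aug += rng.normal(scale=0.01 * X.std(axis=0), size=X_aug.shape)
X_train = np.vstack([X, X_aug])
y_train = np.concatenate([y, np.ones(len(X_aug), dtype=int)])

# Approach D: cost-sensitive learning -- penalize false negatives more by weighting the positive class
clf = RandomForestClassifier(
    n_estimators=200,
    class_weight={0: 1, 1: 50},  # illustrative weights; class_weight="balanced" is another option
    random_state=42,
)
clf.fit(X_train, y_train)
```

Evaluating with precision, recall, or F1 on a held-out set (rather than accuracy) gives a more honest picture of performance on the minority class.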
References:
Bagging and Random Forest for Imbalanced Classification
Surviving in a Random Forest with Imbalanced Datasets
machine learning – random forest for imbalanced data? – Cross Validated
Biased Random Forest For Dealing With the Class Imbalance Problem
A Machine Learning Specialist is deciding between building a naive Bayesian model or a full Bayesian network for a classification problem. The Specialist computes the Pearson correlation coefficients between each feature and finds that their absolute values range between 0.1 to 0.95.
Which model describes the underlying data in this situation?
- A . A naive Bayesian model, since the features are all conditionally independent.
- B . A full Bayesian network, since the features are all conditionally independent.
- C . A naive Bayesian model, since some of the features are statistically dependent.
- D . A full Bayesian network, since some of the features are statistically dependent.
D
Explanation:
A naive Bayesian model assumes that the features are conditionally independent given the class label. This means that the joint probability of the features and the class can be factorized as the product of the class prior and the feature likelihoods. A full Bayesian network, on the other hand, does not make this assumption and allows for modeling arbitrary dependencies between the features and the class using a directed acyclic graph. In this case, the joint probability of the features and the class is given by the product of the conditional probabilities of each node given its parents in the graph. If the features are statistically dependent, meaning that their correlation coefficients are not close to zero, then a naive Bayesian model would not capture these dependencies and would likely perform worse than a full Bayesian network that can account for them. Therefore, a full Bayesian network describes the underlying data better in this situation.
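As a quick check before choosing between the two models, the Specialist could compute the pairwise Pearson correlations and inspect the strongest ones. A minimal pandas sketch, assuming the features live in a hypothetical DataFrame `df` with one column per feature:

```python
import numpy as np
import pandas as pd

# df is a hypothetical DataFrame with one numeric column per candidate feature
corr = df.corr(method="pearson").abs()

# Keep only the upper triangle so each feature pair appears once, then list the strongest correlations.
# Absolute values far from zero (e.g., 0.95) indicate statistically dependent features,
# which violates the naive Bayes conditional-independence assumption.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)
print(pairs.head(10))
```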
References:
Naive Bayes and Text Classification I
Bayesian Networks
An engraving company wants to automate its quality control process for plaques. The company performs the process before mailing each customized plaque to a customer. The company has created an Amazon S3 bucket that contains images of defects that should cause a plaque to be rejected. Low-confidence predictions must be sent to an internal team of reviewers who are using Amazon Augmented AI (Amazon A2I).
Which solution will meet these requirements?
- A . Use Amazon Textract for automatic processing. Use Amazon A2I with Amazon Mechanical Turk for manual review.
- B . Use Amazon Rekognition for automatic processing. Use Amazon A2I with a private workforce option for manual review.
- C . Use Amazon Transcribe for automatic processing. Use Amazon A2I with a private workforce option for manual review.
- D . Use AWS Panorama for automatic processing. Use Amazon A2I with Amazon Mechanical Turk for manual review.
B
Explanation:
Amazon Rekognition is a service that provides computer vision capabilities for image and video analysis, such as object, scene, and activity detection, face and text recognition, and custom label detection. Amazon Rekognition can be used to automate the quality control process for plaques by comparing the images of the plaques with the images of defects in the Amazon S3 bucket and returning a confidence score for each defect. Amazon A2I is a service that enables human review of machine learning predictions, such as low-confidence predictions from Amazon Rekognition. Amazon A2I can be integrated with a private workforce option, which allows the engraving company to use its own internal team of reviewers to manually inspect the plaques that are flagged by Amazon Rekognition. This solution meets the requirements of automating the quality control process, sending low-confidence predictions to an internal team of reviewers, and using Amazon A2I for manual review.
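A minimal boto3 sketch of how such a pipeline could be wired together, assuming a trained Rekognition Custom Labels model and an Amazon A2I flow definition (private workforce) already exist; the ARNs, bucket, and confidence threshold below are placeholders, not values from the question:

```python
import json
import uuid
import boto3

rekognition = boto3.client("rekognition")
a2i = boto3.client("sagemaker-a2i-runtime")

# Placeholders -- replace with real resources
PROJECT_VERSION_ARN = "arn:aws:rekognition:region:account:project-version/plaque-defects/version/1"
FLOW_DEFINITION_ARN = "arn:aws:sagemaker:region:account:flow-definition/plaque-review"
CONFIDENCE_THRESHOLD = 80.0

def check_plaque(bucket: str, key: str) -> None:
    """Run the custom defect model; route low-confidence results to the private review workforce."""
    response = rekognition.detect_custom_labels(
        ProjectVersionArn=PROJECT_VERSION_ARN,
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
    )
    labels = response["CustomLabels"]
    if not labels or max(label["Confidence"] for label in labels) < CONFIDENCE_THRESHOLD:
        # Low confidence: start an Amazon A2I human loop for the internal reviewers
        a2i.start_human_loop(
            HumanLoopName=f"plaque-{uuid.uuid4()}",
            FlowDefinitionArn=FLOW_DEFINITION_ARN,
            HumanLoopInput={
                "InputContent": json.dumps({"s3Uri": f"s3://{bucket}/{key}", "labels": labels})
            },
        )
```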
References:
1: Amazon Rekognition documentation
2: Amazon A2I documentation
3: Amazon Rekognition Custom Labels documentation
4: Amazon A2I Private Workforce documentation
An online retailer collects the following data on customer orders: demographics, behaviors, location, shipment progress, and delivery time. A data scientist joins all the collected datasets. The result is a single dataset that includes 980 variables.
The data scientist must develop a machine learning (ML) model to identify groups of customers who are likely to respond to a marketing campaign.
Which combination of algorithms should the data scientist use to meet this requirement? (Select TWO.)
- A . Latent Dirichlet Allocation (LDA)
- B . K-means
- C . Semantic segmentation
- D . Principal component analysis (PCA)
- E . Factorization machines (FM)
B, D
Explanation:
The data scientist should use K-means and principal component analysis (PCA) to meet this requirement. K-means is a clustering algorithm that can group customers based on their similarity in the feature space. PCA is a dimensionality reduction technique that can transform the original 980 variables into a smaller set of uncorrelated variables that capture most of the variance in the data.
This can help reduce the computational cost and noise in the data, and improve the performance of the clustering algorithm.
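A minimal scikit-learn sketch of this two-step approach, assuming the 980 joined variables are available as numeric columns in a hypothetical DataFrame `df`; the variance threshold and number of clusters are illustrative:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# df is a hypothetical DataFrame containing the 980 joined variables (numeric, already encoded)
X = StandardScaler().fit_transform(df)

# Reduce the 980 correlated variables to the components that retain ~95% of the variance
pca = PCA(n_components=0.95, random_state=42)
X_reduced = pca.fit_transform(X)

# Cluster customers on the reduced representation; choose k with the elbow method or silhouette score
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df["segment"] = kmeans.fit_predict(X_reduced)
```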
References:
Clustering – Amazon SageMaker
Dimensionality Reduction – Amazon SageMaker
A company is building a predictive maintenance model based on machine learning (ML). The data is stored in a fully private Amazon S3 bucket that is encrypted at rest with AWS Key Management Service (AWS KMS) CMKs. An ML specialist must run data preprocessing by using an Amazon SageMaker Processing job that is triggered from code in an Amazon SageMaker notebook. The job should read data from Amazon S3, process it, and upload it back to the same S3 bucket. The preprocessing code is stored in a container image in Amazon Elastic Container Registry (Amazon ECR). The ML specialist needs to grant permissions to ensure a smooth data preprocessing workflow.
Which set of actions should the ML specialist take to meet these requirements?
- A . Create an IAM role that has permissions to create Amazon SageMaker Processing jobs, S3 read and write access to the relevant S3 bucket, and appropriate KMS and ECR permissions. Attach the role to the SageMaker notebook instance. Create an Amazon SageMaker Processing job from the notebook.
- B . Create an IAM role that has permissions to create Amazon SageMaker Processing jobs. Attach the role to the SageMaker notebook instance. Create an Amazon SageMaker Processing job with an IAM role that has read and write permissions to the relevant S3 bucket, and appropriate KMS and ECR permissions.
- C . Create an IAM role that has permissions to create Amazon SageMaker Processing jobs and to access Amazon ECR. Attach the role to the SageMaker notebook instance. Set up both an S3 endpoint and a KMS endpoint in the default VPC. Create Amazon SageMaker Processing jobs from the notebook.
- D . Create an IAM role that has permissions to create Amazon SageMaker Processing jobs. Attach the role to the SageMaker notebook instance. Set up an S3 endpoint in the default VPC. Create Amazon SageMaker Processing jobs with the access key and secret key of the IAM user with appropriate KMS and ECR permissions.
B
Explanation:
The correct solution for granting permissions for data preprocessing is to use the following steps:
Create an IAM role that has permissions to create Amazon SageMaker Processing jobs. Attach the role to the SageMaker notebook instance. This role allows the ML specialist to run Processing jobs from the notebook code [1].
Create an Amazon SageMaker Processing job with an IAM role that has read and write permissions to the relevant S3 bucket, and appropriate KMS and ECR permissions. This role allows the Processing job to access the data in the encrypted S3 bucket, decrypt it with the KMS CMK, and pull the container image from ECR [2][3].
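A minimal sketch of Option B using the SageMaker Python SDK from the notebook, assuming the processing container and preprocessing script already exist; the role ARN, image URI, bucket, and KMS key below are placeholders:

```python
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# Placeholders -- the processing role carries the S3, KMS, and ECR permissions described in Option B
PROCESSING_ROLE_ARN = "arn:aws:iam::123456789012:role/SageMakerProcessingRole"
IMAGE_URI = "123456789012.dkr.ecr.us-east-1.amazonaws.com/preprocessing:latest"
BUCKET = "s3://my-encrypted-bucket"
KMS_KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/example-key-id"

processor = ScriptProcessor(
    image_uri=IMAGE_URI,
    command=["python3"],
    role=PROCESSING_ROLE_ARN,      # role assumed by the Processing job itself
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_kms_key=KMS_KEY_ARN,    # keep outputs encrypted with the same CMK
)

processor.run(
    code="preprocess.py",
    inputs=[ProcessingInput(source=f"{BUCKET}/raw/", destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output", destination=f"{BUCKET}/processed/")],
)
```

The notebook instance role only needs permission to create the Processing job and pass the processing role, keeping the two sets of permissions separate.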
The other options are incorrect because they either miss some permissions or use unnecessary steps. For example:
Option A uses a single IAM role for both the notebook instance and the Processing job. This role may have more permissions than necessary for the notebook instance, which violates the principle of least privilege [4].
Option C sets up both an S3 endpoint and a KMS endpoint in the default VPC. These endpoints are not required for the Processing job to access the data in the encrypted S3 bucket. They are only needed if the Processing job runs in network isolation mode, which is not specified in the question.
Option D uses the access key and secret key of the IAM user with appropriate KMS and ECR permissions. This is not a secure way to pass credentials to the Processing job. It also requires the ML specialist to manage the IAM user and the keys.
References:
1: Create an Amazon SageMaker Notebook Instance – Amazon SageMaker
2: Create a Processing Job – Amazon SageMaker
3: Use AWS KMS-Managed Encryption Keys – Amazon Simple Storage Service
4: IAM Best Practices – AWS Identity and Access Management
Network Isolation – Amazon SageMaker
Understanding and Getting Your Security Credentials – AWS General Reference
An ecommerce company is automating the categorization of its products based on images. A data scientist has trained a computer vision model using the Amazon SageMaker image classification algorithm. The images for each product are classified according to specific product lines. The accuracy of the model is too low when categorizing new products. All of the product images have the same dimensions and are stored within an Amazon S3 bucket. The company wants to improve the model so it can be used for new products as soon as possible.
Which steps would improve the accuracy of the solution? (Choose three.)
- A . Use the SageMaker semantic segmentation algorithm to train a new model to achieve improved accuracy.
- B . Use the Amazon Rekognition DetectLabels API to classify the products in the dataset.
- C . Augment the images in the dataset. Use open-source libraries to crop, resize, flip, rotate, and adjust the brightness and contrast of the images.
- D . Use a SageMaker notebook to implement the normalization of pixels and scaling of the images. Store the new dataset in Amazon S3.
- E . Use Amazon Rekognition Custom Labels to train a new model.
- F . Check whether there are class imbalances in the product categories, and apply oversampling or undersampling as required. Store the new dataset in Amazon S3.
C, E, F
Explanation:
Option C is correct because augmenting the images in the dataset can help the model learn more features and generalize better to new products. Image augmentation is a common technique to increase the diversity and size of the training data.
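A minimal Pillow sketch of the kind of augmentation Option C describes; the specific transformations and parameters are illustrative, and the file path is a placeholder:

```python
from PIL import Image, ImageEnhance, ImageOps

def augment(path: str) -> list:
    """Create a few augmented variants of one product image (flip, rotate, crop, brightness, contrast)."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    return [
        ImageOps.mirror(img),                                                    # horizontal flip
        img.rotate(15, expand=True).resize((w, h)),                              # small rotation, resized back
        img.crop((w // 10, h // 10, w - w // 10, h - h // 10)).resize((w, h)),   # center crop + resize
        ImageEnhance.Brightness(img).enhance(1.3),                               # brighter
        ImageEnhance.Contrast(img).enhance(0.8),                                 # lower contrast
    ]
```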
Option E is correct because Amazon Rekognition Custom Labels can train a custom model to detect specific objects and scenes that are relevant to the business use case. It can also leverage the existing models from Amazon Rekognition that are trained on tens of millions of images across many categories.
Option F is correct because class imbalance can affect the performance and accuracy of the model, as it can cause the model to be biased towards the majority class and ignore the minority class. Applying oversampling or undersampling can help balance the classes and improve the model’s ability to learn from the data.
Option A is incorrect because the semantic segmentation algorithm is used to assign a label to every pixel in an image, not to classify the whole image into a category. Semantic segmentation is useful for applications such as autonomous driving, medical imaging, and satellite imagery analysis.
Option B is incorrect because the DetectLabels API is a general-purpose image analysis service that can detect objects, scenes, and concepts in an image, but it cannot be customized to the specific product lines of the ecommerce company. The DetectLabels API is based on the pre-trained models from Amazon Rekognition, which may not cover all the categories that the company needs.
Option D is incorrect because normalizing the pixels and scaling the images are preprocessing steps that should be done before training the model, not after. These steps can help improve the model’s convergence and performance, but they are not sufficient to increase the accuracy of the model on new products.
References:
Image Augmentation – Amazon SageMaker
Amazon Rekognition Custom Labels Features
Handling Imbalanced Datasets in Machine Learning
Semantic Segmentation – Amazon SageMaker
DetectLabels – Amazon Rekognition
Image Classification – MXNet – Amazon SageMaker
https://towardsdatascience.com/handling-imbalanced-datasets-in-machine-learning-7a0e84220f28
https://docs.aws.amazon.com/sagemaker/latest/dg/semantic-segmentation.html
https://docs.aws.amazon.com/rekognition/latest/dg/API_DetectLabels.html
https://docs.aws.amazon.com/sagemaker/latest/dg/image-classification.html
A manufacturing company has a production line with sensors that collect hundreds of quality metrics. The company has stored sensor data and manual inspection results in a data lake for several months. To automate quality control, the machine learning team must build an automated mechanism that determines whether the produced goods are good quality, replacement market quality, or scrap quality based on the manual inspection results.
Which modeling approach will deliver the MOST accurate prediction of product quality?
- A . Amazon SageMaker DeepAR forecasting algorithm
- B . Amazon SageMaker XGBoost algorithm
- C . Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm
- D . A convolutional neural network (CNN) and ResNet
D
Explanation:
A convolutional neural network (CNN) is a type of deep learning model that can learn to extract features from images and perform tasks such as classification, segmentation, and detection1. ResNet is a popular CNN architecture that uses residual connections to overcome the problem of vanishing gradients and enable very deep networks2. For the task of predicting product quality based on sensor data, a CNN and ResNet approach can leverage the spatial structure of the data and learn complex patterns that distinguish different quality levels.
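A minimal PyTorch sketch of adapting a ResNet backbone for the three quality classes; it assumes the sensor readings have been arranged as image-like 2D inputs, which is an assumption beyond what the question states, and the batch below is dummy data:

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 3  # good quality, replacement market quality, scrap quality

# Start from a standard ResNet backbone and replace the final layer for 3-way classification
model = models.resnet18()
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on a dummy batch of image-like sensor maps (batch, 3, 224, 224)
x = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))
optimizer.zero_grad()
loss = criterion(model(x), labels)
loss.backward()
optimizer.step()
```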
References:
Convolutional Neural Networks (CNNs / ConvNets)
PyTorch ResNet: The Basics and a Quick Tutorial
A Marketing Manager at a pet insurance company plans to launch a targeted marketing campaign on social media to acquire new customers.
Currently, the company has the following data in Amazon Aurora:
• Profiles for all past and existing customers
• Profiles for all past and existing insured pets
• Policy-level information
• Premiums received
• Claims paid
What steps should be taken to implement a machine learning model to identify potential new customers on social media?
- A . Use regression on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.
- B . Use clustering on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.
- C . Use a recommendation engine on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.
- D . Use a decision tree classifier engine on customer profile data to understand key characteristics of consumer segments. Find similar profiles on social media.
B
Explanation:
Clustering is a machine learning technique that groups data points based on their similarity or proximity. It can reveal the underlying structure and patterns in the data, identify outliers, and, most relevantly here, support customer segmentation: dividing customers into groups based on their characteristics, behaviors, preferences, or needs. Understanding the key features of each segment makes it possible to design and implement targeted marketing campaigns for that segment.
In this case, the Marketing Manager can apply clustering to the customer profile data in Aurora (demographics, pet types, policy preferences, premiums paid, claims made, and so on) to understand the key characteristics of each consumer segment, as in the sketch below. The Manager can then find similar profiles on social media platforms such as Facebook, Twitter, and Instagram by using those segment characteristics as filters or targeting criteria, and reach these potential new customers with personalized, relevant ads or offers that match each segment’s needs and interests.
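A minimal scikit-learn sketch of the segmentation step, assuming the Aurora tables have been exported and encoded into a hypothetical numeric DataFrame `profiles`; the number of clusters is illustrative:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# profiles is a hypothetical DataFrame of numeric customer-profile features exported from Aurora
# (e.g., customer age, number of pets, premium paid, claims count)
X = StandardScaler().fit_transform(profiles)

kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)
profiles["segment"] = kmeans.fit_predict(X)

# Summarize each segment to understand its key characteristics, then use those
# characteristics to target lookalike audiences on social media
print(profiles.groupby("segment").mean().round(2))
```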
A financial services company wants to automate its loan approval process by building a machine learning (ML) model. Each loan data point contains credit history from a third-party data source and demographic information about the customer. Each loan approval prediction must come with a report that contains an explanation of why the customer was approved or denied for a loan. The company will use Amazon SageMaker to build the model.
Which solution will meet these requirements with the LEAST development effort?
- A . Use SageMaker Model Debugger to automatically debug the predictions, generate the explanation, and attach the explanation report.
- B . Use AWS Lambda to provide feature importance and partial dependence plots. Use the plots to generate and attach the explanation report.
- C . Use SageMaker Clarify to generate the explanation report. Attach the report to the predicted results.
- D . Use custom Amazon CloudWatch metrics to generate the explanation report. Attach the report to the predicted results.
C
Explanation:
The best solution for this scenario is to use SageMaker Clarify to generate the explanation report and attach it to the predicted results. SageMaker Clarify provides tools to help explain how machine learning (ML) models make predictions using a model-agnostic feature attribution approach based on SHAP values. It can also detect and measure potential bias in the data and the model. SageMaker Clarify can generate explanation reports during data preparation, model training, and model deployment. The reports include metrics, graphs, and examples that help understand the model behavior and predictions. The reports can be attached to the predicted results using the SageMaker SDK or the SageMaker API.
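A minimal sketch of running a Clarify explainability job with the SageMaker Python SDK, assuming the model is already deployed as a SageMaker model; the role, bucket, column names, model name, and SHAP baseline below are placeholders:

```python
from sagemaker import clarify

# Placeholders -- substitute the real role, bucket, and SageMaker model name
ROLE_ARN = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
BUCKET = "s3://loan-approval-bucket"

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=ROLE_ARN,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

data_config = clarify.DataConfig(
    s3_data_input_path=f"{BUCKET}/validation/loans.csv",
    s3_output_path=f"{BUCKET}/clarify-explainability/",
    label="approved",
    headers=["approved", "credit_score", "income", "age", "loan_amount"],  # illustrative columns
    dataset_type="text/csv",
)

model_config = clarify.ModelConfig(
    model_name="loan-approval-model",  # hypothetical SageMaker model
    instance_type="ml.m5.xlarge",
    instance_count=1,
    accept_type="text/csv",
)

shap_config = clarify.SHAPConfig(
    baseline=[[650, 50000, 35, 10000]],  # illustrative baseline record (features only, no label)
    num_samples=100,
    agg_method="mean_abs",
)

# Produces a SHAP-based explainability report in the S3 output path
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config,
)
```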
The other solutions are less optimal because they require more development effort and additional services. Using SageMaker Model Debugger would require modifying the training script to save the model output tensors and writing custom rules to debug and explain the predictions. Using AWS Lambda would require writing code to invoke the ML model, compute the feature importance and partial dependence plots, and generate and attach the explanation report. Using custom Amazon CloudWatch metrics would require writing code to publish the metrics, create dashboards, and generate and attach the explanation report.
References:
Bias Detection and Model Explainability – Amazon SageMaker Clarify – AWS
Amazon SageMaker Clarify Model Explainability
Amazon SageMaker Clarify: Machine Learning Bias Detection and Explainability
GitHub – aws/amazon-sagemaker-clarify: Fairness Aware Machine Learning
A Machine Learning Specialist is building a model to predict future employment rates based on a wide range of economic factors. While exploring the data, the Specialist notices that the magnitudes of the input features vary greatly. The Specialist does not want variables with a larger magnitude to dominate the model.
What should the Specialist do to prepare the data for model training?
- A . Apply quantile binning to group the data into categorical bins to keep any relationships in the data by replacing the magnitude with distribution
- B . Apply the Cartesian product transformation to create new combinations of fields that are independent of the magnitude
- C . Apply normalization to ensure each field will have a mean of 0 and a variance of 1 to remove any significant magnitude
- D . Apply the orthogonal sparse bigram (OSB) transformation to apply a fixed-size sliding window to generate new features of a similar magnitude.
C
Explanation:
Normalization is a data preprocessing technique that can be used to scale the input features to a common range, such as [-1, 1] or [0, 1]. Normalization can help reduce the effect of outliers, improve the convergence of gradient-based algorithms, and prevent variables with a larger magnitude from dominating the model. One common method of normalization is standardization, which transforms each feature to have a mean of 0 and a variance of 1. This can be done by subtracting the mean and dividing by the standard deviation of each feature. Standardization can be useful for models that assume the input features are normally distributed, such as linear regression, logistic regression, and support vector machines.
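A minimal scikit-learn sketch of standardization; the economic indicator values below are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# X is a hypothetical array of economic indicators with very different magnitudes
X = np.array([
    [150_000_000, 3.2, 0.045],   # e.g., population, unemployment rate, interest rate
    [ 60_000_000, 5.1, 0.030],
    [ 10_000_000, 7.8, 0.015],
])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Each column now has mean ~0 and variance ~1, so no feature dominates by magnitude alone
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```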
References:
Data normalization and standardization: A video that explains the concept and benefits of data normalization and standardization.
Standardize or Normalize?: A blog post that compares different methods of scaling the input features.