Practice Free MLS-C01 Exam Online Questions
A company builds computer-vision models that use deep learning for the autonomous vehicle industry. A machine learning (ML) specialist uses an Amazon EC2 instance that has a CPU:GPU ratio of 12:1 to train the models.
The ML specialist examines the instance metric logs and notices that the GPU is idle half of the time. The ML specialist must reduce training costs without increasing the duration of the training jobs.
Which solution will meet these requirements?
- A . Switch to an instance type that has only CPUs.
- B . Use a heterogeneous cluster that has two different instance groups.
- C . Use memory-optimized EC2 Spot Instances for the training jobs.
- D . Switch to an instance type that has a CPU:GPU ratio of 6:1.
D
Explanation:
Switching to an instance type that has a CPU:GPU ratio of 6:1 will reduce the training costs by using fewer CPUs per GPU while keeping the same GPU capacity. The idle GPU time shows that the current 12:1 instance is over-provisioned for this workload, so moving to a more balanced 6:1 ratio improves overall utilization and lowers cost without lengthening the training jobs. A lower CPU:GPU ratio also means less overhead for inter-process communication and synchronization between the CPU and GPU processes.
References:
Optimizing GPU utilization for AI/ML workloads on Amazon EC2
Analyze CPU vs. GPU Performance for AWS Machine Learning
A data scientist must build a custom recommendation model in Amazon SageMaker for an online retail company. Due to the nature of the company’s products, customers buy only 4-5 products every 5-10 years. So, the company relies on a steady stream of new customers. When a new customer signs up, the company collects data on the customer’s preferences.
Below is a sample of the data available to the data scientist.
How should the data scientist split the dataset into a training and test set for this use case?
- A . Shuffle all interaction data. Split off the last 10% of the interaction data for the test set.
- B . Identify the most recent 10% of interactions for each user. Split off these interactions for the test set.
- C . Identify the 10% of users with the least interaction data. Split off all interaction data from these users for the test set.
- D . Randomly select 10% of the users. Split off all interaction data from these users for the test set.
D
Explanation:
The best way to split the dataset into a training and test set for this use case is to randomly select 10% of the users and split off all interaction data from these users for the test set. This is because the company relies on a steady stream of new customers, so the test set should reflect the behavior of new customers who have not been seen by the model before. The other options are not suitable because they either mix old and new customers in the test set (A and B) or bias the test set toward the users with the least interaction data (C).
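As an illustration, a user-level holdout can be written in a few lines of pandas. This is only a minimal sketch; the interactions file and its user_id column are assumed for the example.

```python
import pandas as pd

# Assumed example data: one row per user-item interaction, with a "user_id" column.
interactions = pd.read_csv("interactions.csv")

# Randomly select 10% of the users (option D).
all_users = interactions["user_id"].unique()
test_users = pd.Series(all_users).sample(frac=0.10, random_state=42)

# Keep ALL interactions of the held-out users for the test set,
# so the model is evaluated on users it has never seen during training.
is_test_user = interactions["user_id"].isin(test_users)
test_set = interactions[is_test_user]
train_set = interactions[~is_test_user]
```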
References:
Amazon SageMaker Developer Guide: Train and Test Datasets
Amazon Personalize Developer Guide: Preparing and Importing Data
A Machine Learning Specialist previously trained a logistic regression model using scikit-learn on a local machine, and the Specialist now wants to deploy it to production for inference only.
What steps should be taken to ensure Amazon SageMaker can host a model that was trained locally?
- A . Build the Docker image with the inference code. Tag the Docker image with the registry hostname and upload it to Amazon ECR.
- B . Serialize the trained model so the format is compressed for deployment. Tag the Docker image with the registry hostname and upload it to Amazon S3.
- C . Serialize the trained model so the format is compressed for deployment. Build the image and upload it to Docker Hub.
- D . Build the Docker image with the inference code. Configure Docker Hub and upload the image to Amazon ECR.
A
Explanation:
To deploy a model that was trained locally to Amazon SageMaker, the steps are:
Build the Docker image with the inference code. The inference code should include the model loading, data preprocessing, prediction, and postprocessing logic. The Docker image should also include the dependencies and libraries required by the inference code and the model.
Tag the Docker image with the registry hostname and upload it to Amazon ECR. Amazon ECR is a fully managed container registry that makes it easy to store, manage, and deploy container images. The registry hostname is the Amazon ECR registry URI for your account and Region. You can use the AWS CLI or the Amazon ECR console to tag and push the Docker image to Amazon ECR.
Create a SageMaker model entity that points to the Docker image in Amazon ECR and the model artifacts in Amazon S3. The model entity is a logical representation of the model that contains the information needed to deploy the model for inference. The model artifacts are the files generated by the model training process, such as the model parameters and weights. You can use the AWS CLI, the SageMaker Python SDK, or the SageMaker console to create the model entity.
Create an endpoint configuration that specifies the instance type and number of instances to use for hosting the model. The endpoint configuration also defines the production variants, which are the different versions of the model that you want to deploy. You can use the AWS CLI, the SageMaker Python SDK, or the SageMaker console to create the endpoint configuration.
Create an endpoint that uses the endpoint configuration to deploy the model. The endpoint is a web service that exposes an HTTP API for inference requests. You can use the AWS CLI, the SageMaker Python SDK, or the SageMaker console to create the endpoint.
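For illustration only, the sketch below uses the SageMaker Python SDK to register a bring-your-own-container model and deploy it to an endpoint in one flow. The ECR image URI, S3 artifact path, and IAM role are placeholders, not values from the question.

```python
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()

# Placeholder values -- substitute your own image, artifact path, and role.
image_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/sklearn-inference:latest"
model_data = "s3://my-bucket/models/logreg/model.tar.gz"
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

# Model entity: points to the inference image in Amazon ECR
# and the serialized model artifacts in Amazon S3.
model = Model(
    image_uri=image_uri,
    model_data=model_data,
    role=role,
    sagemaker_session=session,
)

# deploy() creates the endpoint configuration and the endpoint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)
```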
References:
AWS Machine Learning Specialty Exam Guide
AWS Machine Learning Training – Deploy a Model on Amazon SageMaker
AWS Machine Learning Training – Use Your Own Inference Code with Amazon SageMaker Hosting Services
A logistics company needs a forecast model to predict next month’s inventory requirements for a single item in 10 warehouses. A machine learning specialist uses Amazon Forecast to develop a forecast model from 3 years of monthly data. There is no missing data. The specialist selects the DeepAR+ algorithm to train a predictor. The predictor’s mean absolute percentage error (MAPE) is much larger than the MAPE produced by the current human forecasters.
Which changes to the CreatePredictor API call could improve the MAPE? (Choose two.)
- A . Set PerformAutoML to true.
- B . Set ForecastHorizon to 4.
- C . Set ForecastFrequency to W for weekly.
- D . Set PerformHPO to true.
- E . Set FeaturizationMethodName to filling.
A, D
Explanation:
The MAPE of the predictor could be improved by making the following changes to the CreatePredictor API call:
Set PerformAutoML to true. This will allow Amazon Forecast to automatically evaluate different algorithms and choose the one that minimizes the objective function, which is the mean of the weighted losses over the forecast types. By default, these are the p10, p50, and p90 quantile losses [1]. This option can help find a better algorithm than DeepAR+ for the given data.
Set PerformHPO to true. This will enable hyperparameter optimization (HPO), which is the process of finding the optimal values for the algorithm-specific parameters that affect the quality of the forecasts. HPO can improve the accuracy of the predictor by tuning the hyperparameters based on the training data [2].
The other options are not likely to improve the MAPE of the predictor. Setting ForecastHorizon to 4 will reduce the number of time steps that the model predicts, which may not match the business requirement of predicting next month’s inventory. Setting ForecastFrequency to W for weekly will change the granularity of the forecasts, which may not be appropriate for the monthly data. Setting FeaturizationMethodName to filling will not have any effect, since there is no missing data in the dataset.
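For reference, these fields appear directly in the CreatePredictor request. The boto3 sketch below is illustrative only; the predictor names and ARNs are placeholders, and the AutoML and HPO settings are shown as two separate calls rather than combined.

```python
import boto3

forecast = boto3.client("forecast")

# Placeholder dataset group ARN for illustration.
dataset_group_arn = "arn:aws:forecast:us-east-1:123456789012:dataset-group/inventory"

# Option A: let Forecast evaluate multiple algorithms and pick the best one.
forecast.create_predictor(
    PredictorName="inventory_automl",
    PerformAutoML=True,
    ForecastHorizon=1,  # predict next month's requirement from monthly data
    InputDataConfig={"DatasetGroupArn": dataset_group_arn},
    FeaturizationConfig={"ForecastFrequency": "M"},
)

# Option D: keep DeepAR+ but tune its hyperparameters with HPO.
forecast.create_predictor(
    PredictorName="inventory_deepar_hpo",
    AlgorithmArn="arn:aws:forecast:::algorithm/Deep_AR_Plus",
    PerformHPO=True,
    ForecastHorizon=1,
    InputDataConfig={"DatasetGroupArn": dataset_group_arn},
    FeaturizationConfig={"ForecastFrequency": "M"},
)
```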
References:
CreatePredictor – Amazon Forecast
HPOConfig – Amazon Forecast
A pharmaceutical company performs periodic audits of clinical trial sites to quickly resolve critical findings. The company stores audit documents in text format. Auditors have requested help from a data science team to quickly analyze the documents. The auditors need to discover the 10 main topics within the documents to prioritize and distribute the review work among the auditing team members. Documents that describe adverse events must receive the highest priority.
A data scientist will use statistical modeling to discover abstract topics and to provide a list of the top words for each category to help the auditors assess the relevance of the topic.
Which algorithms are best suited to this scenario? (Choose two.)
- A . Latent Dirichlet allocation (LDA)
- B . Random Forest classifier
- C . Neural topic modeling (NTM)
- D . Linear support vector machine
- E . Linear regression
A, C
Explanation:
The algorithms that are best suited to this scenario are latent Dirichlet allocation (LDA) and neural topic modeling (NTM), as they are both unsupervised learning methods that can discover abstract topics from a collection of text documents. LDA and NTM can provide a list of the top words for each topic, as well as the topic distribution for each document, which can help the auditors assess the relevance and priority of the topic [1][2].
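Outside of AWS, the same idea can be prototyped quickly with scikit-learn's LDA implementation. The sketch below only illustrates extracting the top words per topic; the documents and topic count are made up for the example.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical audit documents for illustration.
documents = [
    "site reported an adverse event during the trial",
    "consent forms were missing signatures at the site",
    "temperature excursions were recorded in drug storage",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term_matrix = vectorizer.fit_transform(documents)

# Discover abstract topics (the auditors want 10; 2 is used for this tiny sample).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term_matrix)

# Print the top words for each topic so auditors can judge its relevance.
vocab = vectorizer.get_feature_names_out()
for topic_idx, word_weights in enumerate(lda.components_):
    top_words = [vocab[i] for i in word_weights.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {', '.join(top_words)}")
```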
The other options are not suitable because:
Option B: A random forest classifier is a supervised learning method that can perform classification or regression tasks by using an ensemble of decision trees. It is not suitable for discovering abstract topics from text documents, as it requires labeled data and predefined classes [3].
Option D: A linear support vector machine is a supervised learning method that can perform classification or regression tasks by using a linear function that separates the data into different classes. It is not suitable for discovering abstract topics from text documents, as it requires labeled data and predefined classes [4].
Option E: Linear regression is a supervised learning method that performs regression tasks by using a linear function that models the relationship between a dependent variable and one or more independent variables. It is not suitable for discovering abstract topics from text documents, as it requires labeled data and a continuous output variable [5].
References:
1: Latent Dirichlet Allocation
2: Neural Topic Modeling
3: Random Forest Classifier
4: Linear Support Vector Machine
5: Linear Regression
A machine learning specialist is developing a regression model to predict rental rates from rental listings. A variable named Wall_Color represents the most prominent exterior wall color of the property.
The following is the sample data, excluding all other variables:
The specialist chose a model that needs numerical input data.
Which feature engineering approaches should the specialist use to allow the regression model to learn from the Wall_Color data? (Choose two.)
- A . Apply integer transformation and set Red = 1, White = 5, and Green = 10.
- B . Add new columns that store one-hot representation of colors.
- C . Replace the color name string by its length.
- D . Create three columns to encode the color in RGB format.
- E . Replace each color name by its training set frequency.
B, D
Explanation:
In this scenario, the specialist should use one-hot encoding and RGB encoding to allow the regression model to learn from the Wall_Color data. One-hot encoding is a technique used to convert categorical data into numerical data. It creates new columns that store one-hot representation of colors.
For example, a variable named color with three categories (red, green, and blue) becomes three binary columns after one-hot encoding: red is encoded as (1, 0, 0), green as (0, 1, 0), and blue as (0, 0, 1).
One-hot encoding can capture the presence or absence of a color, but it cannot capture the intensity or hue of a color. RGB encoding is a technique used to represent colors in a digital image. It creates three columns to encode the color in RGB format.
For example, after RGB encoding the same three categories become three numeric columns: red is encoded as (255, 0, 0), green as (0, 255, 0), and blue as (0, 0, 255).
RGB encoding can capture the intensity and hue of a color, but it may also introduce correlation among the three columns. Therefore, using both one-hot encoding and RGB encoding can provide more information to the regression model than using either one alone.
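As a rough sketch, both encodings can be produced with pandas. The Wall_Color column name comes from the question; the sample rows and the RGB lookup values are illustrative assumptions.

```python
import pandas as pd

# Illustrative sample; the real listing data contains many more variables.
df = pd.DataFrame({"Wall_Color": ["Red", "White", "Green"]})

# Option B: one-hot encoding -- one binary column per color.
one_hot = pd.get_dummies(df["Wall_Color"], prefix="Wall_Color")

# Option D: RGB encoding -- three numeric columns for the color value (assumed mapping).
rgb_lookup = {
    "Red": (255, 0, 0),
    "White": (255, 255, 255),
    "Green": (0, 128, 0),
}
rgb = pd.DataFrame(
    df["Wall_Color"].map(rgb_lookup).tolist(),
    columns=["Wall_Color_R", "Wall_Color_G", "Wall_Color_B"],
)

encoded = pd.concat([df, one_hot, rgb], axis=1)
print(encoded)
```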
References:
Feature Engineering for Categorical Data
How to Perform Feature Selection with Categorical Data
A Machine Learning Specialist is configuring automatic model tuning in Amazon SageMaker.
When using the hyperparameter optimization feature, which of the following guidelines should be followed to improve optimization?
- A . Choose the maximum number of hyperparameters supported by Amazon SageMaker to search the largest number of combinations possible
- B . Specify a very large hyperparameter range to allow Amazon SageMaker to cover every possible value.
- C . Use log-scaled hyperparameters to allow the hyperparameter space to be searched as quickly as possible
- D . Execute only one hyperparameter tuning job at a time and improve tuning through successive rounds of experiments
C
Explanation:
Using log-scaled hyperparameters is a guideline that can improve automatic model tuning in Amazon SageMaker. Log-scaled hyperparameters are hyperparameters whose values span several orders of magnitude, such as the learning rate, the regularization parameter, or the number of hidden units. They can be specified with a log-uniform distribution, which assigns equal probability to each order of magnitude within a range; for example, a log-uniform distribution between 0.001 and 1000 can sample values such as 0.001, 0.01, 0.1, 1, 10, 100, or 1000 with equal probability. This lets the hyperparameter optimization feature search the hyperparameter space more efficiently and effectively, because it explores different scales of values instead of concentrating samples at one end of the range, and it helps avoid numerical issues, such as underflow or overflow, that may occur with linear-scaled hyperparameters. Log scaling is enabled by setting the ScalingType parameter to Logarithmic when defining the hyperparameter ranges in Amazon SageMaker [1][2].
The other options are not valid guidelines for improving automatic model tuning in Amazon SageMaker.
Choosing the maximum number of hyperparameters supported by Amazon SageMaker to search the largest number of combinations possible is not a good practice, because it increases the time and cost of the tuning job and makes it harder to find the optimal values. Amazon SageMaker supports up to 20 hyperparameters for tuning, but it is recommended to tune only the most important and influential hyperparameters for the model and algorithm, and to use default or fixed values for the rest [3].
Specifying a very large hyperparameter range to allow Amazon SageMaker to cover every possible value is not a good practice, because it can result in sampling values that are irrelevant or impractical for the model and algorithm and can waste the tuning budget. It is recommended to specify a reasonable and realistic hyperparameter range based on prior knowledge and experience with the model and algorithm, and to refine the range using the results of the tuning job if needed [4].
Executing only one hyperparameter tuning job at a time and improving tuning through successive rounds of experiments is not a good practice, because it limits the exploration and exploitation of the hyperparameter space and makes the tuning process slower and less efficient. It is recommended to run multiple training jobs in parallel and to leverage the Bayesian optimization algorithm that Amazon SageMaker uses to guide the search for the best hyperparameter values [5].
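As an illustration of log scaling with the SageMaker Python SDK: the training image, role, hyperparameter names, metric, and S3 paths below are placeholders, not values from the question.

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

# Placeholder training container and role.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Logarithmic scaling samples each order of magnitude with equal probability.
hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(1e-5, 1e-1, scaling_type="Logarithmic"),
    "l2_regularization": ContinuousParameter(1e-6, 1e-2, scaling_type="Logarithmic"),
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",
    metric_definitions=[
        {"Name": "validation:auc", "Regex": "validation-auc: ([0-9\\.]+)"}
    ],
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=4,  # several training jobs run concurrently
)

tuner.fit({"training": "s3://my-bucket/train/"})  # placeholder S3 input
```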
Which of the following metrics should a Machine Learning Specialist generally use to compare/evaluate machine learning classification models against each other?
- A . Recall
- B . Misclassification rate
- C . Mean absolute percentage error (MAPE)
- D . Area Under the ROC Curve (AUC)
D
Explanation:
Area Under the ROC Curve (AUC) is a metric that measures the performance of a binary classifier across all possible thresholds. It is also known as the probability that a randomly chosen positive example will be ranked higher than a randomly chosen negative example by the classifier. AUC is a good metric to compare different classification models because it is independent of the class distribution and the decision threshold. It also captures both the sensitivity (true positive rate) and the specificity (true negative rate) of the model.
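For instance, with scikit-learn the AUC is computed from the model's predicted scores rather than from hard class labels; the labels and scores below are made up for the example.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical ground-truth labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 0, 1]
y_scores = [0.10, 0.35, 0.80, 0.65, 0.40, 0.90]

# AUC is threshold-independent: it ranks positive/negative score pairs
# instead of applying a single decision cutoff.
auc = roc_auc_score(y_true, y_scores)
print(f"AUC: {auc:.3f}")
```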
References:
AWS Machine Learning Specialty Exam Guide
AWS Machine Learning Specialty Sample Questions
A medical imaging company wants to train a computer vision model to detect areas of concern on patients’ CT scans. The company has a large collection of unlabeled CT scans that are linked to each patient and stored in an Amazon S3 bucket. The scans must be accessible to authorized users only. A machine learning engineer needs to build a labeling pipeline.
Which set of steps should the engineer take to build the labeling pipeline with the LEAST effort?
- A . Create a workforce with AWS Identity and Access Management (IAM). Build a labeling tool on Amazon EC2. Queue images for labeling by using Amazon Simple Queue Service (Amazon SQS). Write the labeling instructions.
- B . Create an Amazon Mechanical Turk workforce and manifest file. Create a labeling job by using the built-in image classification task type in Amazon SageMaker Ground Truth. Write the labeling instructions.
- C . Create a private workforce and manifest file. Create a labeling job by using the built-in bounding box task type in Amazon SageMaker Ground Truth. Write the labeling instructions.
- D . Create a workforce with Amazon Cognito. Build a labeling web application with AWS Amplify. Build a labeling workflow backend using AWS Lambda. Write the labeling instructions.
C
Explanation:
The engineer should create a private workforce and manifest file, and then create a labeling job by using the built-in bounding box task type in Amazon SageMaker Ground Truth. This will allow the engineer to build the labeling pipeline with the least effort.
A private workforce is a group of workers that you manage and who have access to your labeling tasks. You can use a private workforce to label sensitive data that requires confidentiality, such as medical images. You can create a private workforce by using Amazon Cognito and inviting workers by email. You can also use AWS Single Sign-On or your own authentication system to manage your private workforce.
A manifest file is a JSON file that lists the Amazon S3 locations of your input data. You can use a manifest file to specify the data objects that you want to label in your labeling job. You can create a manifest file by using the AWS CLI, the AWS SDK, or the Amazon SageMaker console.
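A minimal input manifest is a file of JSON lines, one object per S3 image. The sketch below builds and uploads one with boto3; the bucket name and key prefixes are made up for the example.

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "my-ct-scan-bucket"  # placeholder bucket name

# List the unlabeled scans and write one JSON line per object ("source-ref" format).
objects = s3.list_objects_v2(Bucket=bucket, Prefix="scans/")
manifest_lines = [
    json.dumps({"source-ref": f"s3://{bucket}/{obj['Key']}"})
    for obj in objects.get("Contents", [])
]

# Upload the manifest; it is then referenced in the Ground Truth labeling job's input config.
s3.put_object(
    Bucket=bucket,
    Key="manifests/ct-scans.manifest",
    Body="\n".join(manifest_lines).encode("utf-8"),
)
```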
A labeling job is a process that sends your input data to workers for labeling. You can use the Amazon SageMaker console to create a labeling job and choose from several built-in task types, such as image classification, text classification, semantic segmentation, and bounding box. A bounding box task type allows workers to draw boxes around objects in an image and assign labels to them. This is suitable for object detection tasks, such as identifying areas of concern on CT scans.
References:
Create and Manage Workforces – Amazon SageMaker
Use Input and Output Data – Amazon SageMaker
Create a Labeling Job – Amazon SageMaker
Bounding Box Task Type – Amazon SageMaker
A company is running an Amazon SageMaker training job that will access data stored in its Amazon S3 bucket. A compliance policy requires that the data never be transmitted across the internet.
How should the company set up the job?
- A . Launch the notebook instances in a public subnet and access the data through the public S3 endpoint
- B . Launch the notebook instances in a private subnet and access the data through a NAT gateway
- C . Launch the notebook instances in a public subnet and access the data through a NAT gateway
- D . Launch the notebook instances in a private subnet and access the data through an S3 VPC endpoint.
D
Explanation:
A private subnet is a subnet that does not have a route to the internet gateway, which means that the resources in the private subnet cannot access the internet or be accessed from the internet. An S3 VPC endpoint is a gateway endpoint that allows the resources in the VPC to access the S3 service without going through the internet. By launching the notebook instances in a private subnet and accessing the data through an S3 VPC endpoint, the company can set up the job in a secure and compliant way, as the data never leaves the AWS network and is not exposed to the internet. This can also improve the performance and reliability of the data transfer, as the traffic does not depend on the internet bandwidth or availability.
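For illustration, a gateway endpoint for S3 can be created with boto3 as shown below; the VPC ID, route table ID, and Region are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder identifiers -- substitute the VPC and route table used by the private subnet.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```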
References:
Amazon VPC Endpoints – Amazon Virtual Private Cloud
Endpoints for Amazon S3 – Amazon Virtual Private Cloud
Connect to SageMaker Within your VPC – Amazon SageMaker
Working with VPCs and Subnets – Amazon Virtual Private Cloud