MLS-C01 Exam - Amazon Actual Exam Questions and Answers - 104 Questions

Question No : 1

A Data Science team is designing a dataset repository where it will store a large amount of training data commonly used in its machine learning models. As Data Scientists may create an arbitrary number of new datasets every day, the solution has to scale automatically and be cost-effective. Also, it must be possible to explore the data using SQL.
Which storage scheme is MOST suitable for this scenario?

A.Store datasets as files in Amazon S3.
B.Store datasets as files in an Amazon EBS volume attached to an Amazon EC2 instance.
C.Store datasets as tables in a multi-node Amazon Redshift cluster.
D.Store datasets as global tables in Amazon DynamoDB.

Answer:

Question No : 2

A Data Scientist is developing a machine learning model to predict future patient outcomes based on information collected about each patient and their treatment plans. The model should output a continuous value as its prediction. The data available includes labeled outcomes for a set of 4,000 patients. The study was conducted on a group of individuals over the age of 65 who have a particular disease that is known to worsen with age.
Initial models have performed poorly. While reviewing the underlying data, the Data Scientist notices that, out of 4,000 patient observations, there are 450 where the patient age has been input as 0. The other features for these observations appear normal compared to the rest of the sample population.
How should the Data Scientist correct this issue?

A.Drop all records from the dataset where age has been set to 0.
B.Replace the age field value for records with a value of 0 with the mean or median value from the dataset.
C.Drop the age feature from the dataset and train the model using the rest of the features.
D.Use k-means clustering to handle missing features.

Answer:
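
Option B refers to imputing the implausible zero ages with a central value. The following is a minimal sketch of median imputation, assuming a hypothetical list of ages in which 0 marks a missing entry. The sample values and the function name are illustrative and do not come from the study.

```python
from statistics import median

def impute_zero_ages(ages):
    """Replace ages recorded as 0 with the median of the valid ages."""
    valid = [a for a in ages if a > 0]
    fill = median(valid)
    return [a if a > 0 else fill for a in ages]

# Hypothetical sample: real patients are over 65, and 0 marks a missing age.
ages = [70, 0, 82, 66, 0, 75]
imputed = impute_zero_ages(ages)
print(imputed)
```

The median is computed from the valid ages only, so the zero entries do not pull the fill value down.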

Question No : 3

A gaming company has launched an online game where people can start playing for free, but they need to pay if they choose to use certain features. The company needs to build an automated system to predict whether or not a new user will become a paid user within 1 year. The company has gathered a labeled dataset from 1 million users.
The training dataset consists of 1,000 positive samples (from users who ended up paying within 1 year) and 999,000 negative samples (from users who did not use any paid features). Each data sample consists of 200 features including user age, device, location, and play patterns.
Using this dataset for training, the Data Science team trained a random forest model that converged with over 99% accuracy on the training set. However, the prediction results on a test dataset were not satisfactory.
Which of the following approaches should the Data Science team take to mitigate this issue? (Choose two.)

A.Add more deep trees to the random forest to enable the model to learn more features.
B.Include a copy of the samples in the test dataset in the training dataset.
C.Generate more positive samples by duplicating the positive samples and adding a small amount of noise to the duplicated data.
D.Change the cost function so that false negatives have a higher impact on the cost value than false positives.
E.Change the cost function so that false positives have a higher impact on the cost value than false negatives.

Answer:
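
Options D and E both describe a cost function that weights the two error types differently. The sketch below shows one possible form of this, a binary cross-entropy with separate weights for errors on positive and negative samples. The function name, the weights, and the toy data are assumptions made for illustration. This is not the loss a random forest uses internally.

```python
import math

def weighted_log_loss(y_true, y_prob, fn_weight=1.0, fp_weight=1.0):
    """Binary cross-entropy with separate weights per class of error."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() finite
        if y == 1:
            total += -fn_weight * math.log(p)      # low score on a positive
        else:
            total += -fp_weight * math.log(1 - p)  # high score on a negative
    return total / len(y_true)

# Toy data: one positive among negatives, model scores everyone as unlikely.
y_true = [1, 0, 0, 0]
y_prob = [0.1, 0.1, 0.1, 0.1]
plain = weighted_log_loss(y_true, y_prob)
fn_heavy = weighted_log_loss(y_true, y_prob, fn_weight=10.0)
print(plain, fn_heavy)
```

Raising `fn_weight` increases the cost of the missed positive, so a model that ignores the rare class is penalized more heavily.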

Question No : 4

A Machine Learning Specialist is working with a media company to perform classification on popular articles from the company's website. The company is using random forests to classify how popular an article will be before it is published.
A sample of the data being used is below.

Given the dataset, the Specialist wants to convert the Day_Of_Week column to binary values.
What technique should be used to convert this column to binary values?

A.Binarization
B.One-hot encoding
C.Tokenization
D.Normalization transformation

Answer:
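
Option B names one-hot encoding. The data sample is not reproduced here, so the following sketch assumes that Day_Of_Week holds English weekday names. It maps each value to a 7-element vector of 0s and 1s. The `DAYS` list and the function name are hypothetical.

```python
# Assumed category values; the original data sample is not available here.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def one_hot_day(day):
    """Return a 0/1 vector with a single 1 at the position of the given day."""
    if day not in DAYS:
        raise ValueError(f"unknown day: {day}")
    return [1 if d == day else 0 for d in DAYS]

encoded = one_hot_day("Wednesday")
print(encoded)
```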

Question No : 5

A monitoring service generates 1 TB of scale metrics record data every minute. A Research team performs queries on this data using Amazon Athena. The queries run slowly due to the large volume of data, and the team requires better performance.
How should the records be stored in Amazon S3 to improve query performance?

A.CSV files
B.Parquet files
C.Compressed JSON
D.RecordIO

Answer:

Question No : 6

When submitting Amazon SageMaker training jobs using one of the built-in algorithms, which common parameters MUST be specified? (Choose three.)

A.The training channel identifying the location of training data on an Amazon S3 bucket.
B.The validation channel identifying the location of validation data on an Amazon S3 bucket.
C.The IAM role that Amazon SageMaker can assume to perform tasks on behalf of the users.
D.Hyperparameters in a JSON array as documented for the algorithm used.
E.The Amazon EC2 instance class specifying whether training will be run using CPU or GPU.
F.The output path specifying where on an Amazon S3 bucket the trained model will persist.

Answer:
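
To make the options concrete, the sketch below assembles a request in the general shape of the SageMaker CreateTrainingJob API. It only builds a dictionary and does not call AWS. The job name, image URI, role ARN, bucket paths, instance type, and hyperparameter are hypothetical placeholders. The sketch does not indicate which fields the exam treats as mandatory.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri):
    """Assemble keyword arguments shaped like a CreateTrainingJob request."""
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,  # role SageMaker assumes on the user's behalf
        "InputDataConfig": [{
            "ChannelName": "train",  # training channel
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # placeholder instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        "HyperParameters": {"num_round": "50"},  # values are strings
    }

# All argument values below are made-up placeholders.
request = build_training_job_request(
    "demo-job",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/example:latest",
    "arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    "s3://example-bucket/train/",
    "s3://example-bucket/output/",
)
print(sorted(request))
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("sagemaker").create_training_job(**request)`. That call is deliberately left out of the sketch.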

Question No : 7

An insurance company is developing a new device for vehicles that uses a camera to observe drivers’ behavior and alert them when they appear distracted. The company created approximately 10,000 training images in a controlled environment that a Machine Learning Specialist will use to train and evaluate machine learning models.
During the model evaluation, the Specialist notices that the training error rate diminishes faster as the number of epochs increases and the model is not accurately inferring on the unseen test images.
Which of the following should be used to resolve this issue? (Choose two.)

A.Add vanishing gradient to the model.
B.Perform data augmentation on the training data.
C.Make the neural network architecture complex.
D.Use gradient checking in the model.
E.Add L2 regularization to the model.

Answer:
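
Options B and E mention data augmentation and L2 regularization. The sketch below shows the simplest form of each: a horizontal flip of an image stored as a list of pixel rows, and an L2 penalty computed from a weight vector. The tiny image, the weights, and the coefficient `lam` are assumptions chosen for illustration.

```python
def horizontal_flip(image):
    """Mirror a 2-D image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]

def l2_penalty(weights, lam):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

image = [[1, 2, 3],
         [4, 5, 6]]
flipped = horizontal_flip(image)  # an extra training sample
penalty = l2_penalty([0.5, -2.0, 1.0], lam=0.01)  # added to the training loss
print(flipped, penalty)
```

Each flipped image adds a training example without new data collection, and the penalty term grows with the size of the weights.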

Question No : 8

A company is using Amazon Polly to translate plaintext documents to speech for automated company announcements. However, company acronyms are being mispronounced in the current documents.
How should a Machine Learning Specialist address this issue for future documents?

A.Convert current documents to SSML with pronunciation tags.
B.Create an appropriate pronunciation lexicon.
C.Output speech marks to guide in pronunciation.
D.Use Amazon Lex to preprocess the text files for pronunciation.

Answer:
Explanation:
Reference: https://docs.aws.amazon.com/polly/latest/dg/ssml.html
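
Option B refers to a pronunciation lexicon. Amazon Polly lexicons use the W3C Pronunciation Lexicon Specification (PLS) XML format. The sketch below builds such a document in which each acronym is aliased to its spoken form, then parses it to confirm that it is well-formed. The acronym "W3C" and its expansion are example values, and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

def build_lexicon(aliases, lang="en-US"):
    """Build a PLS lexicon that maps each acronym to its spoken form."""
    lexemes = "".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(acronym)}</grapheme>\n"
        f"    <alias>{escape(spoken)}</alias>\n"
        "  </lexeme>\n"
        for acronym, spoken in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NS}" '
        f'alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}"
        "</lexicon>\n"
    )

# Example entry only; a real lexicon would list the company's own acronyms.
lexicon_xml = build_lexicon({"W3C": "World Wide Web Consortium"})
root = ET.fromstring(lexicon_xml.encode("utf-8"))  # raises if malformed
print(lexicon_xml)
```

A lexicon like this would typically be uploaded with the Polly `PutLexicon` operation and then referenced by name through the `LexiconNames` parameter of `SynthesizeSpeech`. Those calls are not made in the sketch.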

Question No : 9

A Machine Learning Specialist is creating a new natural language processing application that processes a dataset comprised of 1 million sentences. The aim is to then run Word2Vec to generate embeddings of the sentences and enable different types of predictions.
Here is an example from the dataset:
"The quck BROWN FOX jumps over the lazy dog."
Which of the following are the operations the Specialist needs to perform to correctly sanitize and prepare the data in a repeatable manner? (Choose three.)

A.Perform part-of-speech tagging and keep the action verb and the nouns only.
B.Normalize all words by making the sentence lowercase.
C.Remove stop words using an English stopword dictionary.
D.Correct the typography on "quck" to "quick."
E.One-hot encode all words in the sentence.
F.Tokenize the sentence into words.

Answer:
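
Several of the options describe text-cleaning steps. The sketch below chains four of them on the example sentence: lowercasing, tokenizing, dictionary-based typo correction, and stop-word removal. The stop-word set and the correction dictionary are tiny hypothetical stand-ins for real resources. The sketch does not indicate which steps the question expects.

```python
import re

STOP_WORDS = {"the", "over", "a", "an"}  # tiny illustrative stop-word list
CORRECTIONS = {"quck": "quick"}          # hypothetical typo dictionary

def preprocess(sentence):
    """Lowercase, tokenize, correct known typos, and drop stop words."""
    lowered = sentence.lower()
    tokens = re.findall(r"[a-z']+", lowered)  # split into word tokens
    fixed = [CORRECTIONS.get(t, t) for t in tokens]
    return [t for t in fixed if t not in STOP_WORDS]

tokens = preprocess("The quck BROWN FOX jumps over the lazy dog.")
print(tokens)
```

Because every step is a pure function of the input sentence, the same pipeline produces the same output on every run.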

Question No : 10

A Machine Learning Specialist kicks off a hyperparameter tuning job for a tree-based ensemble model using Amazon SageMaker with Area Under the ROC Curve (AUC) as the objective metric. This workflow will eventually be deployed in a pipeline that retrains and tunes hyperparameters each night to model click-through on data that goes stale every 24 hours.
With the goal of decreasing the amount of time it takes to train these models, and ultimately to decrease costs, the Specialist wants to reconfigure the input hyperparameter range(s).
Which visualization will accomplish this?

A.A histogram showing whether the most important input feature is Gaussian.
B.A scatter plot with points colored by target variable that uses t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the large number of input variables in an easier-to-read dimension.
C.A scatter plot showing the performance of the objective metric over each training iteration.
D.A scatter plot showing the correlation between maximum tree depth and the objective metric.

Answer:

Question No : 11

A company wants to classify user behavior as either fraudulent or normal. Based on internal research, a Machine Learning Specialist would like to build a binary classifier based on two features: age of account and transaction month. The class distribution for these features is illustrated in the figure provided.

Based on this information, which model would have the HIGHEST recall with respect to the fraudulent class?

A.Decision tree
B.Linear support vector machine (SVM)
C.Naive Bayesian classifier
D.Single Perceptron with sigmoidal activation function

Answer:

Question No : 12

A Machine Learning Specialist trained a regression model, but the first iteration needs optimizing. The Specialist needs to understand whether the model is more frequently overestimating or underestimating the target.
What option can the Specialist use to determine whether it is overestimating or underestimating the target value?

A.Root Mean Square Error (RMSE)
B.Residual plots
C.Area under the curve
D.Confusion matrix

Answer:
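
Option B mentions residual plots. A residual is the actual value minus the predicted value, and its sign shows the direction of the error. The sketch below computes the residuals that such a plot would display and counts the sign of each. The sample values are made up for illustration.

```python
def residual_summary(actual, predicted):
    """Return residuals plus counts of over- and under-estimates."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    over = sum(1 for r in residuals if r < 0)   # prediction above actual
    under = sum(1 for r in residuals if r > 0)  # prediction below actual
    return residuals, over, under

# Hypothetical targets and model predictions.
actual = [10.0, 12.0, 9.0, 15.0]
predicted = [11.0, 13.5, 8.0, 16.0]
residuals, over, under = residual_summary(actual, predicted)
print(residuals, over, under)
```

A single aggregate such as RMSE squares the errors and so discards their sign, which is the information this question asks about.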

Question No : 13

A Machine Learning Specialist is building a convolutional neural network (CNN) that will classify 10 types of animals. The Specialist has built a series of layers in a neural network that will take an input image of an animal, pass it through a series of convolutional and pooling layers, and then finally pass it through a dense and fully connected layer with 10 nodes. The Specialist would like to get an output from the neural network that is a probability distribution of how likely it is that the input image belongs to each of the 10 classes.
Which function will produce the desired output?

A.Dropout
B.Smooth L1 loss
C.Softmax
D.Rectified linear units (ReLU)

Answer:
Explanation:
Reference: https://towardsdatascience.com/building-a-convolutional-neural-network-cnn-in-keras-329fbbadc5f5
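
Option C names softmax, which maps a vector of raw scores to non-negative values that sum to 1. The sketch below implements a numerically stable version, assuming 10 hypothetical output scores (one per animal class).

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical outputs of the final 10-node dense layer.
logits = [2.0, 1.0, 0.1, -1.0, 0.5, 0.0, -0.5, 1.5, 0.3, -2.0]
probs = softmax(logits)
print(probs)
```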

Question No : 14

A retail chain has been ingesting purchasing records from its network of 20,000 stores to Amazon S3 using Amazon Kinesis Data Firehose. To support training an improved machine learning model, training records will require new but simple transformations, and some attributes will be combined. The model needs to be retrained daily.
Given the large number of stores and the legacy data ingestion, which change will require the LEAST amount of development effort?

A.Require the stores to switch to capturing their data locally on AWS Storage Gateway for loading into Amazon S3, then use AWS Glue to do the transformation.
B.Deploy an Amazon EMR cluster running Apache Spark with the transformation logic, and have the cluster run each day on the accumulating records in Amazon S3, outputting new/transformed records to Amazon S3.
C.Spin up a fleet of Amazon EC2 instances with the transformation logic, have them transform the data records accumulating on Amazon S3, and output the transformed records to Amazon S3.
D.Insert an Amazon Kinesis Data Analytics stream downstream of the Kinesis Data Firehose stream that transforms raw record attributes into simple transformed values using SQL.

Answer:

Question No : 15

A Machine Learning Specialist is configuring Amazon SageMaker so multiple Data Scientists can access notebooks, train models, and deploy endpoints. To ensure the best operational performance, the Specialist needs to be able to track how often the Scientists are deploying models, GPU and CPU utilization on the deployed SageMaker endpoints, and all errors that are generated when an endpoint is invoked.
Which services are integrated with Amazon SageMaker to track this information? (Choose two.)

A.AWS CloudTrail
B.AWS Health
C.AWS Trusted Advisor
D.Amazon CloudWatch
E.AWS Config

Answer:
Explanation:
Reference: https://aws.amazon.com/sagemaker/faqs/
