Practice Free Professional Data Engineer Exam Online Questions
The Dataflow SDKs have been recently transitioned into which Apache service?
- A . Apache Spark
- B . Apache Hadoop
- C . Apache Kafka
- D . Apache Beam
D
Explanation:
The Dataflow SDKs have been transitioned to Apache Beam, per Google's announcement.
Reference: https://cloud.google.com/dataflow/docs/
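For context, the programming model the Dataflow SDKs converged on is Apache Beam. A minimal sketch in Python, assuming the apache-beam package is installed; the same code can target Dataflow by switching the runner and adding Google Cloud options:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Minimal Apache Beam pipeline: runs locally with the DirectRunner,
# or on Dataflow by setting runner="DataflowRunner" plus GCP options.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Create" >> beam.Create(["alpha", "beta", "alpha"])
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerKey" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```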
You are designing the architecture to process your data from Cloud Storage to BigQuery by using Dataflow. The network team provided you with the Shared VPC network and subnetwork to be used by your pipelines. You need to enable the deployment of the pipeline on the Shared VPC network.
What should you do?
- A . Assign the compute.networkUser role to the Dataflow service agent.
- B . Assign the compute.networkUser role to the service account that executes the Dataflow pipeline.
- C . Assign the dataflow.admin role to the Dataflow service agent.
- D . Assign the dataflow.admin role to the service account that executes the Dataflow pipeline.
B
Explanation:
To use a Shared VPC network for a Dataflow pipeline, you need to specify the subnetwork parameter with the full URL of the subnetwork, and grant the service account that executes the pipeline the compute.networkUser role in the host project. This role allows the service account to use the subnetworks in the Shared VPC network. The Dataflow service agent does not need this role, as it only creates and manages the resources for the pipeline but does not execute it. The dataflow.admin role is not related to network access; it grants permissions to create and delete Dataflow jobs and resources.
Reference: Specify a network and subnetwork | Cloud Dataflow | Google Cloud.
How to config dataflow Pipeline to use a Shared VPC?
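As a rough illustration of the explanation above, a Beam/Dataflow pipeline can be pointed at the shared subnetwork by its full URL. The project, region, bucket, subnetwork, and service account names below are hypothetical; the service account running the job still needs roles/compute.networkUser on that subnetwork in the host project.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical names; replace with your service project, host project,
# region, staging bucket, shared subnetwork, and worker service account.
options = PipelineOptions(
    runner="DataflowRunner",
    project="service-project-id",
    region="us-central1",
    temp_location="gs://example-staging-bucket/temp",
    # Full URL of the subnetwork in the Shared VPC *host* project.
    subnetwork=(
        "https://www.googleapis.com/compute/v1/projects/host-project-id/"
        "regions/us-central1/subnetworks/shared-subnet"
    ),
    # Workers run as this service account, which must hold
    # roles/compute.networkUser on the shared subnetwork.
    service_account_email="dataflow-runner@service-project-id.iam.gserviceaccount.com",
)
```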
You are operating a Cloud Dataflow streaming pipeline. The pipeline aggregates events from a Cloud Pub/Sub subscription source, within a window, and sinks the resulting aggregation to a Cloud Storage bucket. The source has consistent throughput. You want to monitor and alert on the behavior of the pipeline with Cloud Stackdriver to ensure that it is processing data.
Which Stackdriver alerts should you create?
- A . An alert based on a decrease of subscription/num_undelivered_messages for the source and a rate of change increase of instance/storage/used_bytes for the destination
- B . An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/used_bytes for the destination
- C . An alert based on a decrease of instance/storage/used_bytes for the source and a rate of change increase of subscription/num_undelivered_messages for the destination
- D . An alert based on an increase of instance/storage/used_bytes for the source and a rate of change decrease of subscription/num_undelivered_messages for the destination
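For reference, the subscription/num_undelivered_messages metric named in the options can be read with the Cloud Monitoring client library. A minimal sketch, assuming a hypothetical project ID and default application credentials:

```python
import time
from google.cloud import monitoring_v3

# Hypothetical project ID.
project_name = "projects/example-project-id"

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
)

# Pub/Sub backlog over the last 10 minutes; a sustained increase suggests
# the pipeline has stopped consuming from the subscription.
series = client.list_time_series(
    request={
        "name": project_name,
        "filter": (
            'metric.type = '
            '"pubsub.googleapis.com/subscription/num_undelivered_messages"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in series:
    print(ts.resource.labels["subscription_id"], ts.points[0].value.int64_value)
```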
You plan to deploy Cloud SQL using MySQL. You need to ensure high availability in the event of a zone failure.
What should you do?
- A . Create a Cloud SQL instance in one zone, and create a failover replica in another zone within the same region.
- B . Create a Cloud SQL instance in one zone, and create a read replica in another zone within the same region.
- C . Create a Cloud SQL instance in one zone, and configure an external read replica in a zone in a different region.
- D . Create a Cloud SQL instance in a region, and configure automatic backup to a Cloud Storage bucket in the same region.
You have a Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update.
What should you do?
- A . Update the current pipeline and use the drain flag.
- B . Update the current pipeline and provide the transform mapping JSON object.
- C . Create a new pipeline that has the same Cloud Pub/Sub subscription and cancel the old pipeline.
- D . Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline.
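For background on the drain option mentioned in option A: draining stops the job from pulling new data while it finishes processing data already in flight. A rough sketch of requesting a drain through the Dataflow v1b3 REST API via the Google API client library; the project, region, and job ID are hypothetical, and the requestedState value follows that API's job-state enum.

```python
from googleapiclient.discovery import build

# Hypothetical identifiers.
PROJECT = "example-project-id"
REGION = "us-central1"
JOB_ID = "2024-01-01_00_00_00-1234567890"

dataflow = build("dataflow", "v1b3")

# Request a drain: the job stops reading from Pub/Sub but completes
# work on in-flight data before terminating.
dataflow.projects().locations().jobs().update(
    projectId=PROJECT,
    location=REGION,
    jobId=JOB_ID,
    body={"requestedState": "JOB_STATE_DRAINED"},
).execute()
```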
Which of the following are examples of hyperparameters? (Select 2 answers.)
- A . Number of hidden layers
- B . Number of nodes in each hidden layer
- C . Biases
- D . Weights
AB
Explanation:
If model parameters are variables that get adjusted by training with existing data, your hyperparameters are the variables about the training process itself. For example, part of setting up a deep neural network is deciding how many "hidden" layers of nodes to use between the input layer and the output layer, as well as how many nodes each layer should use. These variables are not directly related to the training data at all. They are configuration variables. Another difference is that parameters change during a training job, while the hyperparameters are usually constant during a job.
Weights and biases are variables that get adjusted during the training process, so they are not hyperparameters.
Reference: https://cloud.google.com/ml-engine/docs/hyperparameter-tuning-overview
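To make the distinction concrete, here is a small sketch (using Keras purely as an illustration): the number of hidden layers and nodes per layer are hyperparameters chosen before training, while the weights and biases inside each layer are parameters learned during training.

```python
import tensorflow as tf

# Hyperparameters: configuration chosen before training
# (often searched by a tuning service).
NUM_HIDDEN_LAYERS = 2
NODES_PER_LAYER = 64
LEARNING_RATE = 0.001

model = tf.keras.Sequential([tf.keras.Input(shape=(10,))])
for _ in range(NUM_HIDDEN_LAYERS):
    model.add(tf.keras.layers.Dense(NODES_PER_LAYER, activation="relu"))
model.add(tf.keras.layers.Dense(1))

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss="mse",
)

# Parameters: the weights and biases in each Dense layer,
# adjusted automatically during model.fit().
print(model.count_params())
```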
You are administering a BigQuery dataset that uses a customer-managed encryption key (CMEK). You need to share the dataset with a partner organization that does not have access to your CMEK.
What should you do?
- A . Create an authorized view that contains the CMEK to decrypt the data when accessed.
- B . Provide the partner organization a copy of your CMEKs to decrypt the data.
- C . Copy the tables you need to share to a dataset without CMEKs. Create an Analytics Hub listing for this dataset.
- D . Export the tables to Parquet files in a Cloud Storage bucket and grant the storageinsights.viewer role on the bucket to the partner organization.
C
Explanation:
If you want to share a BigQuery dataset that uses a customer-managed encryption key (CMEK) with a partner organization that does not have access to your CMEK, you cannot use an authorized view or provide them a copy of your CMEK, because these options would violate the security and privacy of your data. Instead, you can copy the tables you need to share to a dataset without CMEKs, and then create an Analytics Hub listing for this dataset. Analytics Hub is a service that allows you to securely share and discover data assets across your organization and with external partners. By creating an Analytics Hub listing, you can grant the partner organization access to the copied dataset without CMEKs, and also control the level of access and the duration of the sharing.
Reference: Customer-managed Cloud KMS keys
[Authorized views]
[Analytics Hub overview]
[Creating an Analytics Hub listing]
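A rough sketch of the first step in option C with the BigQuery Python client; the project, dataset, and table names are hypothetical. Re-materializing the table with a query into a dataset that has no default CMEK is one way to produce a copy encrypted with Google-managed keys; the Analytics Hub listing itself would then be created in the console or via the Analytics Hub API.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project-id")  # hypothetical project

# 1. Create a destination dataset with no default CMEK
#    (tables in it use Google-managed encryption by default).
shared = bigquery.Dataset("example-project-id.shared_for_partner")
shared.location = "US"
client.create_dataset(shared, exists_ok=True)

# 2. Re-materialize the CMEK-protected table into the non-CMEK dataset.
job_config = bigquery.QueryJobConfig(
    destination="example-project-id.shared_for_partner.orders",
    write_disposition="WRITE_TRUNCATE",
)
client.query(
    "SELECT * FROM `example-project-id.cmek_dataset.orders`",
    job_config=job_config,
).result()
```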
You are building a new data pipeline to share data between two different types of applications: job generators and job runners. Your solution must scale with increases in usage and must accommodate the addition of new applications without negatively affecting the performance of existing ones.
What should you do?
- A . Create an API using App Engine to receive and send messages to the applications
- B . Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them
- C . Create a table on Cloud SQL, and insert and delete rows with the job information
- D . Create a table on Cloud Spanner, and insert and delete rows with the job information
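For context on the publish/subscribe pattern described in option B, a minimal sketch with the Pub/Sub Python client; the project, topic, and subscription names, and the job payload, are hypothetical.

```python
import json
from google.cloud import pubsub_v1

PROJECT = "example-project-id"   # hypothetical project
TOPIC = "jobs"                   # hypothetical topic used by job generators
SUBSCRIPTION = "jobs-runner"     # hypothetical subscription used by job runners

# Job generator: publish a job message to the topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)
payload = json.dumps({"job_id": 42, "action": "resize"}).encode("utf-8")
publisher.publish(topic_path, payload).result()

# Job runner: pull messages from its own subscription and acknowledge them.
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
response = subscriber.pull(request={"subscription": sub_path, "max_messages": 10})
for msg in response.received_messages:
    print(json.loads(msg.message.data))
if response.received_messages:
    subscriber.acknowledge(
        request={
            "subscription": sub_path,
            "ack_ids": [m.ack_id for m in response.received_messages],
        }
    )
```

New runner applications can be added by creating additional subscriptions on the same topic, without changing the generators or affecting existing runners.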