Practice Free Professional Data Engineer Exam Online Questions
You are designing a Dataflow pipeline for a batch processing job. You want to mitigate multiple zonal failures at job submission time.
What should you do?
- A . Specify a worker region by using the --region flag.
- B . Set the pipeline staging location as a regional Cloud Storage bucket.
- C . Submit duplicate pipelines in two different zones by using the --zone flag.
- D . Create an Eventarc trigger to resubmit the job in case of zonal failure when submitting the job.
A
Explanation:
By specifying a worker region, you can run your Dataflow pipeline across multiple zones within that region, which provides higher availability and resilience in case of zonal failures. The --region flag specifies the regional endpoint for your pipeline, which determines the location of the Dataflow service and the default location of the Compute Engine worker resources. If you do not pin a zone with the --zone flag, Dataflow automatically selects a healthy zone within the region for your job workers. This option is recommended over submitting duplicate pipelines in two different zones, which would incur additional cost and complexity. Setting the pipeline staging location to a regional Cloud Storage bucket does not affect the availability of your pipeline, because the staging location only stores the pipeline code and dependencies. Creating an Eventarc trigger to resubmit the job on zonal failure is not a reliable solution, as it depends on the availability of the Eventarc service and of zonal resources at the time of resubmission.
Reference: Regional endpoints | Cloud Dataflow | Google Cloud; Pipeline troubleshooting and debugging | Cloud Dataflow | Google Cloud
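A minimal sketch of submitting a Beam Python pipeline pinned to a worker region rather than a zone (the project, bucket, and region values are placeholders, not from the question):

```python
# Minimal Apache Beam (Python SDK) submission sketch: pin a worker region
# rather than a zone so Dataflow can pick a healthy zone itself.
# Project, bucket, and region values below are illustrative placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder project ID
    region="us-central1",                 # worker region; no --zone is set
    temp_location="gs://my-bucket/tmp",
    staging_location="gs://my-bucket/staging",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.Create(["a", "b", "c"])
        | "Upper" >> beam.Map(str.upper)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/result")
    )
```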
You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence. To accomplish the design of separating storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediary data. However, you noticed that one Hadoop job runs very slowly with Cloud Dataproc, when compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue.
What should you do?
- A . Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory
- B . Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS
- C . Allocate more CPU cores of the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up
- D . Allocate additional network interface cards (NICs), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage
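For reference, provisioning Dataproc workers with larger persistent disks so that a disk-I/O-heavy job's intermediate data can stay on the cluster's local HDFS might look like the sketch below, using the google-cloud-dataproc client library (project, region, machine types, and disk sizes are illustrative assumptions):

```python
# Sketch (not an official answer key): provisioning Dataproc workers with
# larger persistent disks so a disk-I/O-heavy job can keep its intermediate
# data on the cluster's local HDFS instead of Cloud Storage.
# Project, region, and sizing values are illustrative assumptions.
from google.cloud import dataproc_v1

project_id = "my-project"   # placeholder
region = "us-central1"      # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "hadoop-etl-cluster",
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-highmem-8",
            "disk_config": {"boot_disk_size_gb": 500},
        },
        "worker_config": {
            "num_instances": 8,
            "machine_type_uri": "n1-highmem-8",
            # Large boot disks back HDFS, where the I/O-intensive job's
            # intermediate data can live instead of Cloud Storage.
            "disk_config": {"boot_disk_size_gb": 1000},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()  # block until the cluster is ready
```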
Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance.
How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?
- A . Use a row key of the form <timestamp>.
- B . Use a row key of the form <sensorid>.
- C . Use a row key of the form <timestamp>#<sensorid>.
- D . Use a row key of the form <sensorid>#<timestamp>.
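Bigtable's schema guidance favors a row key that leads with a well-distributed value such as the sensor ID and appends the timestamp, so writes spread across tablets instead of hotspotting on a monotonically increasing prefix. A minimal write sketch with the google-cloud-bigtable client (project, instance, table, and column family names are placeholders):

```python
# Sketch: writing sensor readings with a <sensorid>#<timestamp> row key so
# writes from many sensors spread across tablets instead of hotspotting on
# a timestamp prefix. Instance/table/column-family names are placeholders.
import time
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
instance = client.instance("sensor-instance")
table = instance.table("sensor-readings")

def write_reading(sensor_id: str, value: float) -> None:
    ts_ms = int(time.time() * 1000)
    row_key = f"{sensor_id}#{ts_ms}".encode()   # sensor ID first, then time
    row = table.direct_row(row_key)
    row.set_cell("readings", "value", str(value).encode())
    row.commit()

write_reading("sensor-042", 21.7)
```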
You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs.
How should you organize your data in BigQuery and store your backups?
- A . Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage.
- B . Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.
- C . Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery.
- D . Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.
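For reference, exporting one monthly table to compressed objects in Cloud Storage with the BigQuery client library might look like this sketch (project, dataset, table, and bucket names are placeholders):

```python
# Sketch: exporting one monthly table to compressed files in Cloud Storage
# as a low-cost backup. Dataset, table, and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,
)

extract_job = client.extract_table(
    "my-project.analytics.events_2024_01",           # one month's table
    "gs://my-backup-bucket/events_2024_01/*.csv.gz",  # wildcard for sharded output
    job_config=job_config,
)
extract_job.result()  # wait for the export to finish
```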
In order to securely transfer web traffic data from your computer’s web browser to the Cloud Dataproc cluster you should use a(n) _____.
- A . VPN connection
- B . Special browser
- C . SSH tunnel
- D . FTP connection
C
Explanation:
To connect to the web interfaces, it is recommended to use an SSH tunnel to create a secure connection to the master node.
Reference: https://cloud.google.com/dataproc/docs/concepts/cluster-web-interfaces#connecting_to_the_web_interfaces
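The referenced guide opens the tunnel with gcloud compute ssh to the cluster's master node, using SSH dynamic port forwarding as a SOCKS proxy. A small Python wrapper sketch (cluster name, zone, and local port are placeholders; gcloud must be installed and authenticated):

```python
# Sketch: open an SSH tunnel (SOCKS proxy) to a Dataproc master node with
# gcloud, as described in the referenced guide. Cluster name, zone, and the
# local port are placeholders; gcloud must be installed and authenticated.
import subprocess

cluster_master = "my-cluster-m"   # Dataproc master node is "<cluster>-m"
zone = "us-central1-b"            # placeholder zone
local_port = 1080                 # local SOCKS proxy port

# -D opens dynamic port forwarding (SOCKS proxy); -N skips a remote command.
subprocess.run(
    [
        "gcloud", "compute", "ssh", cluster_master,
        f"--zone={zone}",
        "--",
        "-D", str(local_port),
        "-N",
    ],
    check=True,
)
# Point a browser configured to use localhost:1080 as a SOCKS proxy at the
# master node's web interfaces (for example, YARN on port 8088).
```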
You are working on a niche product in the image recognition domain. Your team has developed a model that is dominated by custom C++ TensorFlow ops your team has implemented. These ops are used inside your main training loop and are performing bulky matrix multiplications. It currently takes up to several days to train a model. You want to decrease this time significantly and keep the cost low by using an accelerator on Google Cloud.
What should you do?
- A . Use Cloud TPUs without any additional adjustment to your code.
- B . Use Cloud TPUs after implementing GPU kernel support for your custom ops.
- C . Use Cloud GPUs after implementing GPU kernel support for your custom ops.
- D . Stay on CPUs, and increase the size of the cluster you’re training your model on.
You have an Oracle database deployed in a VM as part of a Virtual Private Cloud (VPC) network. You want to replicate and continuously synchronize 50 tables to BigQuery. You want to minimize the need to manage infrastructure.
What should you do?
- A . Create a Datastream service from Oracle to BigQuery, use a private connectivity configuration to the same VPC network, and a connection profile to BigQuery.
- B . Create a Pub/Sub subscription to write to BigQuery directly. Deploy the Debezium Oracle connector to capture changes in the Oracle database, and sink them to the Pub/Sub topic.
- C . Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle Change Data Capture (CDC), and Dataflow to stream the Kafka topic to BigQuery.
- D . Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle change data capture (CDC), and the Kafka Connect Google BigQuery Sink Connector.
A
Explanation:
Datastream is a serverless, scalable, and reliable service that enables you to stream data changes from Oracle and MySQL databases to Google Cloud services such as BigQuery, Cloud SQL, Google Cloud Storage, and Cloud Pub/Sub. Datastream captures and streams database changes using change data capture (CDC) technology. Datastream supports private connectivity to the source and destination systems using VPC networks. Datastream also provides a connection profile to BigQuery, which simplifies the configuration and management of the data replication.
Reference: Datastream overview
Creating a Datastream stream
Using Datastream with BigQuery
Your company receives both batch- and stream-based event data. You want to process the data using Google Cloud Dataflow over a predictable time period. However, you realize that in some instances data can arrive late or out of order.
How should you design your Cloud Dataflow pipeline to handle data that is late or out of order?
- A . Set a single global window to capture all the data.
- B . Set sliding windows to capture all the lagged data.
- C . Use watermarks and timestamps to capture the lagged data.
- D . Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.
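A hedged sketch of handling late and out-of-order events in Beam's Python SDK, attaching event-time timestamps and combining windowing with a watermark-based trigger and allowed lateness (the window size, lateness, and element format are illustrative assumptions):

```python
# Sketch: handle late/out-of-order events by attaching event-time timestamps,
# windowing on event time, and allowing late data past the watermark.
# Window size, lateness, and the element format are illustrative assumptions.
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (
    AfterWatermark, AfterCount, AccumulationMode,
)

def with_event_time(event):
    # Assume each event dict carries its own event-time epoch seconds.
    return window.TimestampedValue(event, event["event_ts"])

with beam.Pipeline() as p:
    (
        p
        | "Events" >> beam.Create([
            {"key": "sensor-1", "value": 3, "event_ts": 1_700_000_000},
            {"key": "sensor-1", "value": 5, "event_ts": 1_700_000_030},
        ])
        | "Stamp" >> beam.Map(with_event_time)
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                     # 1-minute windows
            trigger=AfterWatermark(late=AfterCount(1)),  # re-fire per late element
            allowed_lateness=600,                        # accept data up to 10 min late
            accumulation_mode=AccumulationMode.ACCUMULATING,
        )
        | "KV" >> beam.Map(lambda e: (e["key"], e["value"]))
        | "Sum" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```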
By default, which of the following windowing behavior does Dataflow apply to unbounded data sets?
- A . Windows at every 100 MB of data
- B . Single, Global Window
- C . Windows at every 1 minute
- D . Windows at every 10 minutes
B
Explanation:
Dataflow’s default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections.
Reference: https://cloud.google.com/dataflow/model/pcollection
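To illustrate, a PCollection with no WindowInto applied behaves as if it had been placed in one global window; the sketch below contrasts that default with an explicit fixed window (element values are arbitrary):

```python
# Sketch: the default windowing (single global window) versus explicitly
# overriding it with fixed windows. Element values are arbitrary.
import time
import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:
    events = p | beam.Create([("k", 1), ("k", 2), ("k", 3)])

    # Default: all elements share one global window (equivalent to applying
    # window.GlobalWindows() explicitly), so this sums everything together.
    default_sum = events | "SumGlobal" >> beam.CombinePerKey(sum)

    # Explicitly overriding the default with 60-second fixed windows.
    fixed = (
        events
        | "Stamp" >> beam.Map(lambda kv: window.TimestampedValue(kv, time.time()))
        | "Fixed" >> beam.WindowInto(window.FixedWindows(60))
        | "SumFixed" >> beam.CombinePerKey(sum)
    )

    default_sum | "PrintGlobal" >> beam.Map(print)
    fixed | "PrintFixed" >> beam.Map(print)
```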