Practice Free Professional Data Engineer Exam Online Questions
You are on the data governance team and are implementing security requirements to deploy resources. You need to ensure that resources are limited to only the europe-west3 region. You want to follow Google-recommended practices.
What should you do?
- A . Deploy resources with Terraform and implement a variable validation rule to ensure that the region is set to the europe-west3 region for all resources.
- B . Set the constraints/gcp.resourceLocations organization policy constraint to in:eu-locations.
- C . Create a Cloud Function to monitor all resources created and automatically destroy the ones created outside the europe-west3 region.
- D . Set the constraints/gcp.resourceLocations organization policy constraint to in:europe-west3-locations.
D
Explanation:
To ensure that resources are limited to only the europe-west3 region, you should set the organization policy constraint constraints/gcp.resourceLocations to in:europe-west3-locations. This policy restricts the deployment of resources to the specified locations, which in this case is the europe-west3 region. By setting this policy, you enforce location compliance across your Google Cloud resources, aligning with the best practices for data governance and regulatory compliance.
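For illustration only, here is a minimal Python sketch of setting this constraint programmatically, assuming the google-cloud-org-policy (orgpolicy_v2) client library; the organization ID is a placeholder and the exact type paths should be verified against the library version you use. The same policy can also be set in the Console or with gcloud.

```python
# Hypothetical sketch: restrict resource locations to europe-west3 at the organization level.
# Assumes the google-cloud-org-policy package (orgpolicy_v2); verify class paths against
# your installed version before relying on this.
from google.cloud import orgpolicy_v2

ORG_ID = "123456789012"  # placeholder organization ID

client = orgpolicy_v2.OrgPolicyClient()

policy = orgpolicy_v2.Policy(
    name=f"organizations/{ORG_ID}/policies/gcp.resourceLocations",
    spec=orgpolicy_v2.PolicySpec(
        rules=[
            orgpolicy_v2.PolicySpec.PolicyRule(
                values=orgpolicy_v2.PolicySpec.PolicyRule.StringValues(
                    allowed_values=["in:europe-west3-locations"]
                )
            )
        ]
    ),
)

# Creates the policy; use update_policy() instead if one already exists.
client.create_policy(
    request={"parent": f"organizations/{ORG_ID}", "policy": policy}
)
```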
Reference: Professional Data Engineer Certification Exam Guide | Learn – Google Cloud
Preparing for Google Cloud Certification: Cloud Data Engineer
Professional Data Engineer Certification | Learn | Google Cloud
You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data.
What should you do?
- A . Add a SideInput that returns a Boolean if the element is corrupt.
- B . Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
- C . Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
- D . Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.
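For context, and independent of which option is correct, the following is a minimal Apache Beam (Python) sketch of a ParDo that discards corrupt elements; is_corrupt() and the field names are hypothetical placeholders for whatever validation your pipeline actually needs.

```python
# Minimal sketch of a ParDo-style filter in Apache Beam (Python SDK).
# is_corrupt() is a hypothetical placeholder for your own validation logic.
import apache_beam as beam


def is_corrupt(element: dict) -> bool:
    # Example check: treat records missing required fields as corrupt.
    return not all(k in element for k in ("device_id", "timestamp", "reading"))


class DropCorrupt(beam.DoFn):
    def process(self, element):
        # Emit only elements that pass validation; corrupt ones are discarded.
        if not is_corrupt(element):
            yield element


with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.Create([
            {"device_id": "d1", "timestamp": 1, "reading": 2.5},
            {"bad": True},
        ])
        | "FilterCorrupt" >> beam.ParDo(DropCorrupt())
        | "Print" >> beam.Map(print)
    )
```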
Which role must be assigned to a service account used by the virtual machines in a Dataproc cluster so they can execute jobs?
- A . Dataproc Worker
- B . Dataproc Viewer
- C . Dataproc Runner
- D . Dataproc Editor
A
Explanation:
Service accounts used with Cloud Dataproc must have the Dataproc Worker role (or have all the permissions granted by the Dataproc Worker role).
Reference: https://cloud.google.com/dataproc/docs/concepts/service-accounts#important_notes
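As a reference point, here is a hedged Python sketch of creating a Dataproc cluster whose VMs run as a custom service account (which must hold the Dataproc Worker role); the project, region, and service account email are placeholders.

```python
# Sketch only: create a Dataproc cluster whose VMs run as a custom service account.
# That service account must be granted roles/dataproc.worker. Names are placeholders.
from google.cloud import dataproc_v1

PROJECT_ID = "my-project"  # placeholder
REGION = "europe-west3"    # placeholder
SERVICE_ACCOUNT = "dataproc-sa@my-project.iam.gserviceaccount.com"  # placeholder

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT_ID,
    "cluster_name": "example-cluster",
    "config": {
        "gce_cluster_config": {"service_account": SERVICE_ACCOUNT},
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT_ID, "region": REGION, "cluster": cluster}
)
operation.result()  # block until the cluster is created
```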
Your globally distributed auction application allows users to bid on items. Occasionally, users place identical bids at nearly identical times, and different application servers process those bids. Each bid event contains the item, amount, user, and timestamp. You want to collate those bid events into a single location in real time to determine which user bid first.
What should you do?
- A . Create a file on a shared file system and have the application servers write all bid events to that file. Process the file with Apache Hadoop to identify which user bid first.
- B . Have each application server write the bid events to Cloud Pub/Sub as they occur. Push the events from Cloud Pub/Sub to a custom endpoint that writes the bid event information into Cloud SQL.
- C . Set up a MySQL database for each application server to write bid events into. Periodically query each of those distributed MySQL databases and update a master MySQL database with bid event information.
- D . Have each application server write the bid events to Google Cloud Pub/Sub as they occur. Use a pull subscription to pull the bid events using Google Cloud Dataflow. Give the bid for each item to the user in the bid event that is processed first.
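To make the streaming approach concrete, below is a hedged Apache Beam (Python) sketch that reads bid events from a Pub/Sub subscription and keeps the earliest bid per item; the subscription path, window size, and event field names are assumptions for illustration only.

```python
# Sketch: pick the earliest bid per item from a Pub/Sub stream with Beam (Python).
# Subscription path and event fields (item, user, timestamp) are illustrative assumptions.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import FixedWindows

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadBids" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/bids-sub"  # placeholder
        )
        | "Parse" >> beam.Map(json.loads)
        | "KeyByItem" >> beam.Map(lambda bid: (bid["item"], bid))
        | "Window" >> beam.WindowInto(FixedWindows(60))
        | "EarliestPerItem" >> beam.CombinePerKey(
            lambda bids: min(bids, key=lambda b: b["timestamp"])
        )
        | "Print" >> beam.Map(print)
    )
```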
You are designing a real-time system for a ride-hailing app that identifies areas with high demand for rides to effectively reroute available drivers to meet the demand. The system ingests data from multiple sources into Pub/Sub, processes the data, and stores the results for visualization and analysis in real-time dashboards. The data sources include driver location updates every 5 seconds and app-based booking events from riders. The data processing involves real-time aggregation of supply and demand data over the last 30 seconds, every 2 seconds, and storing the results in a low-latency system for visualization.
What should you do?
- A . Group the data by using a tumbling window in a Dataflow pipeline, and write the aggregated data to Memorystore.
- B . Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to Memorystore.
- C . Group the data by using a session window in a Dataflow pipeline, and write the aggregated data to BigQuery.
- D . Group the data by using a hopping window in a Dataflow pipeline, and write the aggregated data to BigQuery.
B
Explanation:
A hopping window is a type of sliding window that advances by a fixed period of time, producing overlapping windows. This is suitable for the scenario where the system needs to aggregate data for the last 30 seconds, every 2 seconds, and provide real-time updates. A Dataflow pipeline can implement the hopping window logic using Apache Beam, and process both streaming and batch data sources. Memorystore is a low-latency, in-memory data store that can serve the aggregated data to the visualization layer. BigQuery is not a good choice for this scenario, as it is not optimized for low-latency queries and frequent updates.
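Here is a minimal Beam (Python) sketch of the windowing described above, assuming events keyed by a grid-cell ID; SlidingWindows(size=30, period=2) is Beam's name for a hopping window.

```python
# Sketch: 30-second windows emitted every 2 seconds (a hopping/sliding window) in Beam.
# Keys (grid-cell IDs) and the counting logic are illustrative assumptions.
import apache_beam as beam
from apache_beam.transforms.window import SlidingWindows

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Events" >> beam.Create([("cell-1", 1), ("cell-1", 1), ("cell-2", 1)])
        | "HoppingWindow" >> beam.WindowInto(SlidingWindows(size=30, period=2))
        | "CountPerCell" >> beam.CombinePerKey(sum)
        # In the real pipeline this result would be written to Memorystore
        # (e.g. via a custom DoFn using a Redis client) rather than printed.
        | "Print" >> beam.Map(print)
    )
```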
You work for an airline and you need to store weather data in a BigQuery table. Weather data will be used as input to a machine learning model. The model only uses the last 30 days of weather data. You want to avoid storing unnecessary data and minimize costs.
What should you do?
- A . Create a BigQuery table where each record has an ingestion timestamp. Run a scheduled query to delete all the rows with an ingestion timestamp older than 30 days.
- B . Create a BigQuery table partitioned by ingestion time. Set up partition expiration to 30 days.
- C . Create a BigQuery table partitioned by the datetime value of the weather date. Set up partition expiration to 30 days.
- D . Create a BigQuery table with a datetime column for the day the weather data refers to. Run a scheduled query to delete rows with a datetime value older than 30 days.
B
Explanation:
Partitioning a table by ingestion time means that the data is divided into partitions based on the time when the data was loaded into the table. This allows you to delete or archive old data by setting a partition expiration policy. You can specify the number of days to keep the data in each partition, and BigQuery automatically deletes the data when it expires. This way, you can avoid storing unnecessary data and minimize costs.
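For reference, a hedged Python sketch using the google-cloud-bigquery client to create an ingestion-time partitioned table with a 30-day partition expiration; the table ID and schema are placeholders.

```python
# Sketch: ingestion-time partitioned BigQuery table with 30-day partition expiration.
# Table ID and schema are placeholders for illustration.
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.weather.observations",  # placeholder table ID
    schema=[
        bigquery.SchemaField("station_id", "STRING"),
        bigquery.SchemaField("temperature", "FLOAT"),
        bigquery.SchemaField("observed_at", "TIMESTAMP"),
    ],
)

# Partition by ingestion time and expire each partition after 30 days.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    expiration_ms=30 * 24 * 60 * 60 * 1000,
)

client.create_table(table)
```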
You are running a pipeline in Cloud Dataflow that receives messages from a Cloud Pub/Sub topic and writes the results to a BigQuery dataset in the EU. Currently, your pipeline is located in europe-west4 and has a maximum of 3 workers, instance type n1-standard-1. You notice that during peak periods, your pipeline is struggling to process records in a timely fashion, when all 3 workers are at maximum CPU utilization.
Which two actions can you take to increase performance of your pipeline? (Choose two.)
- A . Increase the number of max workers
- B . Use a larger instance type for your Cloud Dataflow workers
- C . Change the zone of your Cloud Dataflow pipeline to run in us-central1
- D . Create a temporary table in Cloud Bigtable that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Bigtable to BigQuery
- E . Create a temporary table in Cloud Spanner that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Spanner to BigQuery
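As an illustration of how options A and B would be expressed in code, here is a hedged sketch of Beam (Python) pipeline options that raise the worker ceiling and use a larger machine type; the values are placeholders and the flag spellings should be confirmed against your Beam SDK version.

```python
# Sketch: Dataflow pipeline options raising max workers and using a larger machine type.
# Values are illustrative; confirm option names against your Beam SDK version.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                 # placeholder
    region="europe-west4",
    temp_location="gs://my-bucket/temp",  # placeholder
    streaming=True,
    max_num_workers=10,                   # allow autoscaling beyond 3 workers
    machine_type="n1-standard-4",         # larger workers than n1-standard-1
)
```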
You have a data pipeline with a Cloud Dataflow job that aggregates and writes time series metrics to Cloud Bigtable. This data feeds a dashboard used by thousands of users across the organization. You need to support additional concurrent users and reduce the amount of time required to write the data.
Which two actions should you take? (Choose two.)
- A . Configure your Cloud Dataflow pipeline to use local execution
- B . Increase the maximum number of Cloud Dataflow workers by setting maxNumWorkers in PipelineOptions
- C . Increase the number of nodes in the Cloud Bigtable cluster
- D . Modify your Cloud Dataflow pipeline to use the Flatten transform before writing to Cloud Bigtable
- E . Modify your Cloud Dataflow pipeline to use the CoGroupByKey transform before writing to Cloud Bigtable
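For context on the Cloud Bigtable side of this scenario, here is a hedged Python sketch of resizing a Bigtable cluster with the admin client; the project, instance, cluster IDs, and node count are placeholders.

```python
# Sketch: increase the node count of a Cloud Bigtable cluster (admin client).
# Project, instance ID, cluster ID, and node count are placeholders.
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)  # placeholder project
instance = client.instance("metrics-instance")              # placeholder instance ID
cluster = instance.cluster("metrics-cluster-c1")            # placeholder cluster ID

cluster.reload()         # fetch current settings
cluster.serve_nodes = 6  # scale up from the current node count
cluster.update()         # apply the change
```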
Cloud Bigtable is Google’s ______ Big Data database service.
- A . Relational
- B . mySQL
- C . NoSQL
- D . SQL Server
C
Explanation:
Cloud Bigtable is Google’s NoSQL Big Data database service. It is the same database that Google uses for services, such as Search, Analytics, Maps, and Gmail.
It is used for requirements that are low latency and high throughput including Internet of Things (IoT), user analytics, and financial data analysis.
Reference: https://cloud.google.com/bigtable/
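To ground this, a minimal Python sketch of writing and reading a time-series row with the google-cloud-bigtable client; the project, instance, table, column family, and row-key scheme are assumptions for illustration.

```python
# Sketch: write and read one time-series cell in Cloud Bigtable.
# Project, instance, table, column family, and row-key scheme are illustrative assumptions.
import datetime

from google.cloud import bigtable

client = bigtable.Client(project="my-project")   # placeholder project
instance = client.instance("metrics-instance")   # placeholder instance ID
table = instance.table("sensor-readings")        # placeholder table ID

# Row key combines a sensor ID and a timestamp, a common time-series key layout.
row_key = b"sensor-42#2024-01-01T00:00:00Z"

row = table.direct_row(row_key)
row.set_cell("cf1", b"temperature", b"21.5",
             timestamp=datetime.datetime.now(datetime.timezone.utc))
row.commit()

# Read the row back and print the stored value.
read = table.read_row(row_key)
cell = read.cells["cf1"][b"temperature"][0]
print(cell.value)
```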