Practice Free Professional Data Engineer Exam Online Questions
Which methods can be used to reduce the number of rows processed by BigQuery?
- A . Splitting tables into multiple tables; putting data in partitions
- B . Splitting tables into multiple tables; putting data in partitions; using the LIMIT clause
- C . Putting data in partitions; using the LIMIT clause
- D . Splitting tables into multiple tables; using the LIMIT clause
A
Explanation:
If you split a table into multiple tables (such as one table for each day), then you can limit your query to the data in specific tables (such as for particular days). A better method is to use a partitioned table, as long as your data can be separated by the day.
If you use the LIMIT clause, BigQuery will still process the entire table.
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables
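To make the difference concrete, here is a minimal sketch (the dataset, table, and column names are hypothetical): filtering on the partitioning column lets BigQuery prune partitions and process fewer rows, while LIMIT only truncates the result after the whole table has been scanned.
-- Hypothetical date-partitioned table.
CREATE TABLE mydataset.sensor_events (
  event_id STRING,
  event_ts TIMESTAMP,
  temperature FLOAT64
)
PARTITION BY DATE(event_ts);
-- Scans only the partition for the requested day, reducing the rows processed.
SELECT event_id, temperature
FROM mydataset.sensor_events
WHERE DATE(event_ts) = '2024-01-15';
-- Returns only 10 rows, but BigQuery still scans the entire table.
SELECT event_id, temperature
FROM mydataset.sensor_events
LIMIT 10;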
Enable each department to own and share the data of their data lakes.
Explanation:
Implementing a data mesh approach involves treating data as a product and enabling decentralized data ownership and architecture. The steps outlined in option C support this approach by creating separate projects for each department, which aligns with the principle of domain-oriented decentralized data ownership. By allowing departments to create their own Cloud Storage buckets and BigQuery datasets, it promotes autonomy and self-service. Publishing the data in Analytics Hub facilitates data sharing and discovery across departments, enabling a collaborative environment where data can be easily accessed and utilized by different parts of the organization.
Reference: Architecture and functions in a data mesh – Google Cloud
Professional Data Engineer Certification Exam Guide | Learn – Google Cloud
Build a Data Mesh with Dataplex | Google Cloud Skills Boost
You need to copy millions of sensitive patient records from a relational database to BigQuery. The total size of the database is 10 TB. You need to design a solution that is secure and time-efficient.
What should you do?
- A . Export the records from the database as an Avro file. Upload the file to GCS using gsutil, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
- B . Export the records from the database as an Avro file. Copy the file onto a Transfer Appliance and send it to Google, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
- C . Export the records from the database into a CSV file. Create a public URL for the CSV file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the CSV file into BigQuery using the BigQuery web UI in the GCP Console.
- D . Export the records from the database as an Avro file. Create a public URL for the Avro file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
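Whichever transfer path is chosen, the final load step does not have to go through the web UI; it can also be expressed in SQL. A minimal sketch, assuming the Avro file has already landed in a Cloud Storage bucket (the bucket, dataset, and table names are hypothetical):
-- Load an Avro file from Cloud Storage into a BigQuery table.
-- Avro is self-describing, so no explicit schema is required.
LOAD DATA INTO clinical_data.patient_records
FROM FILES (
  format = 'AVRO',
  uris = ['gs://example-secure-bucket/patient_records/*.avro']
);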
Scaling a Cloud Dataproc cluster typically involves ____.
- A . increasing or decreasing the number of worker nodes
- B . increasing or decreasing the number of master nodes
- C . moving memory to run more applications on a single node
- D . deleting applications from unused nodes periodically
A
Explanation:
After creating a Cloud Dataproc cluster, you can scale the cluster by increasing or decreasing the number of worker nodes in the cluster at any time, even when jobs are running on the cluster.
Cloud Dataproc clusters are typically scaled to:
1) increase the number of workers to make a job run faster
2) decrease the number of workers to save money
3) increase the number of nodes to expand available Hadoop Distributed Filesystem (HDFS) storage
Reference: https://cloud.google.com/dataproc/docs/concepts/scaling-clusters
You are deploying 10,000 new Internet of Things devices to collect temperature data in your warehouses globally. You need to process, store and analyze these very large datasets in real time.
What should you do?
- A . Send the data to Google Cloud Datastore and then export to BigQuery.
- B . Send the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.
- C . Send the data to Cloud Storage and then spin up an Apache Hadoop cluster as needed in Google Cloud Dataproc whenever analysis is required.
- D . Export logs in batch to Google Cloud Storage and then spin up a Google Cloud SQL instance, import the data from Cloud Storage, and run an analysis as needed.
You are creating a data model in BigQuery that will hold retail transaction data. Your two largest tables, sales_transaction_header and sales_transaction_line, have a tightly coupled immutable relationship. These tables are rarely modified after load and are frequently joined when queried. You need to model the sales_transaction_header and sales_transaction_line tables to improve the performance of data analytics queries.
What should you do?
- A . Create a sales_transaction table that stores the sales_transaction_header and sales_transaction_line data as a JSON data type.
- B . Create a sales_transaction table that holds the sales_transaction_header information as rows and the sales_transaction_line rows as nested and repeated fields.
- C . Create a sales_transaction table that holds the sales_transaction_header and sales_transaction_line information as rows, duplicating the sales_transaction_header data for each line.
- D . Create separate sales_transaction_header and sales_transaction_line tables and, when querying, specify the sales_transaction_line table first in the WHERE clause.
B
Explanation:
BigQuery supports nested and repeated fields, which are complex data types that can represent hierarchical and one-to-many relationships within a single table. By using nested and repeated fields, you can denormalize your data model and reduce the number of joins required for your queries. This can improve the performance and efficiency of your data analytics queries, as joins can be expensive and require shuffling data across nodes. Nested and repeated fields also preserve the data integrity and avoid data duplication. In this scenario, the sales_transaction_header and sales_transaction_line tables have a tightly coupled immutable relationship, meaning that each header row corresponds to one or more line rows, and the data is rarely modified after load. Therefore, it makes sense to create a single sales_transaction table that holds the sales_transaction_header information as rows and the sales_transaction_line rows as nested and repeated fields. This way, you can query the sales transaction data without joining two tables, and use dot notation or array functions to access the nested and repeated fields.
For example, the sales_transaction table could have a schema along the following lines (a rough sketch; field names other than id, quantity, and price are assumptions):
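-- Illustrative DDL: header fields are top-level columns, line items are a nested, repeated field.
CREATE TABLE retail.sales_transaction (
  id STRING,
  transaction_date DATE,
  customer_id STRING,
  line_items ARRAY<STRUCT<
    product_id STRING,
    quantity INT64,
    price NUMERIC
  >>
);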
To query the total amount of each order, you could use the following SQL statement:
-- UNNEST expands the repeated line_items field so each line item can be aggregated per order.
SELECT id, SUM(li.quantity * li.price) AS total_amount
FROM sales_transaction, UNNEST(line_items) AS li
GROUP BY id;
Reference: Use nested and repeated fields
BigQuery explained: Working with joins, nested & repeated data
Arrays in BigQuery ― How to improve query performance and optimise storage
You work for a car manufacturer and have set up a data pipeline using Google Cloud Pub/Sub to capture anomalous sensor events. You are using a push subscription in Cloud Pub/Sub that calls a custom HTTPS endpoint that you have created to take action on these anomalous events as they occur. Your custom HTTPS endpoint keeps getting an inordinate number of duplicate messages.
What is the most likely cause of these duplicate messages?
- A . The message body for the sensor event is too large.
- B . Your custom endpoint has an out-of-date SSL certificate.
- C . The Cloud Pub/Sub topic has too many messages published to it.
- D . Your custom endpoint is not acknowledging messages within the acknowledgement deadline.
You have uploaded 5 years of log data to Cloud Storage. A user reported that some data points in the log data are outside of their expected ranges, which indicates errors. You need to address this issue and be able to run the process again in the future while keeping the original data for compliance reasons.
What should you do?
- A . Import the data from Cloud Storage into BigQuery. Create a new BigQuery table, and skip the rows with errors.
- B . Create a Compute Engine instance and create a new copy of the data in Cloud Storage. Skip the rows with errors.
- C . Create a Cloud Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to a new dataset in Cloud Storage.
- D . Create a Cloud Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to the same dataset in Cloud Storage.
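For context, the cleansing step that the Dataflow options describe (detect values outside an expected range and substitute a default) is straightforward to express. Here is a minimal SQL sketch of that logic only, not of the Dataflow pipeline itself; the table names, column names, and range bounds are hypothetical:
-- Write cleaned records to a new table, leaving the original raw data untouched.
CREATE TABLE logs.cleaned_readings AS
SELECT
  log_id,
  log_ts,
  -- Keep in-range values; replace out-of-range readings with a default of 0.
  IF(reading BETWEEN -40 AND 60, reading, 0) AS reading
FROM logs.raw_readings;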