Practice Free Professional Data Engineer Exam Online Questions
You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference data from BigQuery. The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to BigQuery for analysis.
Which job type and transforms should this pipeline use?
- A . Batch job, PubSubIO, side-inputs
- B . Streaming job, PubSubIO, JdbcIO, side-outputs
- C . Streaming job, PubSubIO, BigQueryIO, side-inputs
- D . Streaming job, PubSubIO, BigQueryIO, side-outputs
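For context on the transforms named in these options, the following is a minimal, hypothetical sketch (Beam Python SDK) of a streaming pipeline that reads from Pub/Sub, enriches each element against a small BigQuery reference table materialized as a side input, and writes results back to BigQuery. All project, topic, table, and field names are placeholders, not values from the question.

```python
# Hypothetical sketch of a streaming enrichment pipeline with a side input.
# Project, topic, table, and field names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    # Small, static reference data read once from BigQuery and kept in
    # memory as a side input (it fits on a single worker per the question).
    reference = (
        p
        | "ReadReference" >> beam.io.ReadFromBigQuery(
            query="SELECT id, label FROM `my_project.my_dataset.reference`",
            use_standard_sql=True)
        | "ToKeyValue" >> beam.Map(lambda row: (row["id"], row["label"]))
    )

    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my_project/topics/events")
        # Placeholder parsing: assumes each message body is a numeric id.
        | "Parse" >> beam.Map(lambda msg: {"id": int(msg.decode("utf-8"))})
        | "Enrich" >> beam.Map(
            lambda event, ref: {**event, "label": ref.get(event["id"], "unknown")},
            ref=beam.pvalue.AsDict(reference))
        | "WriteResults" >> beam.io.WriteToBigQuery(
            "my_project:my_dataset.enriched",
            schema="id:INTEGER,label:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```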
What is the general recommendation when designing your row keys for a Cloud Bigtable schema?
- A . Include multiple time series values within the row key
- B . Keep the row key as an 8-bit integer
- C . Keep your row key reasonably short
- D . Keep your row key as long as the field permits
C
Explanation:
A general guide is to keep your row keys reasonably short. Long row keys take up additional memory and storage and increase the time it takes to get responses from the Cloud Bigtable server.
Reference: https://cloud.google.com/bigtable/docs/schema-design#row-keys
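For illustration, a minimal sketch with the google-cloud-bigtable Python client that writes a row under a short composite key; the project, instance, table, and column-family names are placeholders.

```python
# Hypothetical example of writing a row with a short, composite row key.
# Project, instance, table, and column-family names are placeholders.
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=False)
table = client.instance("my-instance").table("sensor-data")

# Short composite key, e.g. "dev42#20240101". Keeping keys compact avoids
# the extra memory, storage, and response-time overhead of long keys.
row_key = "dev42#20240101".encode("utf-8")

row = table.direct_row(row_key)
row.set_cell("measurements", b"temp_c", b"21.5")
row.commit()
```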
How would you query specific partitions in a BigQuery table?
- A . Use the DAY column in the WHERE clause
- B . Use the EXTRACT(DAY) clause
- C . Use the _PARTITIONTIME pseudo-column in the WHERE clause
- D . Use DATE BETWEEN in the WHERE clause
C
Explanation:
Partitioned tables include a pseudo column named _PARTITIONTIME that contains a date-based timestamp for data loaded into the table. To limit a query to particular partitions (such as Jan 1st and 2nd of 2017), use a clause similar to this:
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-01-02')
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables#the_partitiontime_pseudo_column
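A minimal sketch of running such a partition-limited query with the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders.

```python
# Hypothetical example: limit a query to specific partitions via _PARTITIONTIME.
# Project, dataset, and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

query = """
    SELECT *
    FROM `my-project.my_dataset.events`
    WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-01-02')
"""

for row in client.query(query).result():
    print(dict(row))
```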
Your company produces 20,000 files every hour. Each data file is formatted as a comma-separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as-is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low.
You are told that due to seasonality, your company expects the number of files to double for the next three months.
Which two actions should you take? (Choose two.)
- A . Introduce data compression for each file to increase the rate of file transfer.
- B . Contact your internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.
- C . Redesign the data ingestion process to use gsutil tool to send the CSV files to a storage bucket in parallel.
- D . Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.
- E . Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
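For background on sending many small files to a Cloud Storage bucket in parallel (the approach named in option C), here is a hedged sketch using the google-cloud-storage Python client with a thread pool; the bucket name and local directory are placeholders, and `gsutil -m cp` provides the same parallelism from the command line.

```python
# Hypothetical sketch: upload many small CSV files to a bucket in parallel.
# Bucket name and local directory are placeholders.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from google.cloud import storage

client = storage.Client(project="my-project")
bucket = client.bucket("my-ingest-bucket")

def upload(path: Path) -> None:
    # One blob per CSV file; many uploads run concurrently so the available
    # bandwidth is used despite the 200 ms round-trip latency.
    bucket.blob(f"incoming/{path.name}").upload_from_filename(str(path))

files = list(Path("/data/csv").glob("*.csv"))
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(upload, files))
```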
When you store data in Cloud Bigtable, what is the recommended minimum amount of stored data?
- A . 500 TB
- B . 1 GB
- C . 1 TB
- D . 500 GB
C
Explanation:
Cloud Bigtable is not a relational database. It does not support SQL queries, joins, or multi-row transactions. It is not a good solution for less than 1 TB of data.
Reference: https://cloud.google.com/bigtable/docs/overview#title_short_and_other_storage_options
MJTelco needs you to create a schema in Cloud Bigtable that will allow for the historical analysis of the last 2 years of records. Records arrive every 15 minutes; each contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day.
Which schema should you use?
- A . Rowkey: date#device_id; Column data: data_point
- B . Rowkey: date; Column data: device_id, data_point
- C . Rowkey: device_id; Column data: date, data_point
- D . Rowkey: data_point; Column data: device_id, date
- E . Rowkey: date#data_point; Column data: device_id
You used Cloud Dataprep to create a recipe on a sample of data in a BigQuery table. You want to reuse this recipe on a daily upload of data with the same schema, after the load job with variable execution time completes.
What should you do?
- A . Create a cron schedule in Cloud Dataprep.
- B . Create an App Engine cron job to schedule the execution of the Cloud Dataprep job.
- C . Export the recipe as a Cloud Dataprep template, and create a job in Cloud Scheduler.
- D . Export the Cloud Dataprep job as a Cloud Dataflow template, and incorporate it into a Cloud Composer job.
Data Analysts in your company have the Cloud IAM Owner role assigned to them in their projects so that they can work with multiple GCP products. Your organization requires that all BigQuery data access logs be retained for 6 months. You need to ensure that only audit personnel in your company can access the data access logs for all projects.
What should you do?
- A . Enable data access logs in each Data Analyst’s project. Restrict access to Stackdriver Logging via Cloud IAM roles.
- B . Export the data access logs via a project-level export sink to a Cloud Storage bucket in the Data Analysts’ projects. Restrict access to the Cloud Storage bucket.
- C . Export the data access logs via a project-level export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project with the exported logs.
- D . Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs.
You have data stored in BigQuery. The data in the BigQuery dataset must be highly available. You need to define a storage, backup, and recovery strategy for this data that minimizes cost.
How should you configure the BigQuery table?
- A . Set the BigQuery dataset to be regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
- B . Set the BigQuery dataset to be regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.
- C . Set the BigQuery dataset to be multi-regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
- D . Set the BigQuery dataset to be multi-regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.
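For the scheduled-copy approach described in options B and D, here is a hedged sketch using the google-cloud-bigquery Python client to copy a table to a time-suffixed backup table; the project, dataset, and table names are placeholders.

```python
# Hypothetical sketch: copy a table to a timestamp-suffixed backup table,
# as described by the scheduled-copy options. Names are placeholders.
from datetime import datetime, timezone

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

source = "my-project.analytics.events"
suffix = datetime.now(timezone.utc).strftime("%Y%m%d")
destination = f"my-project.analytics_backup.events_{suffix}"

# copy_table returns a CopyJob; result() blocks until the copy finishes.
client.copy_table(source, destination).result()
```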