Practice Free Professional Data Engineer Exam Online Questions
You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required.
You need to analyze the data by querying against individual fields.
Which three databases meet your requirements? (Choose three.)
- A . Redis
- B . HBase
- C . MySQL
- D . MongoDB
- E . Cassandra
- F . HDFS with Hive
You want to migrate an on-premises Hadoop system to Cloud Dataproc. Hive is the primary tool in
use, and the data format is Optimized Row Columnar (ORC). All ORC files have been successfully copied to a Cloud Storage bucket. You need to replicate some data to the cluster’s local Hadoop Distributed File System (HDFS) to maximize performance.
What are two ways to start using Hive in Cloud Dataproc? (Choose two.)
- A . Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to HDFS. Mount the Hive tables locally.
- B . Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally.
- C . Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them to HDFS. Mount the Hive tables from HDFS.
- D . Leverage Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables.
Replicate external Hive tables to the native ones (a minimal sketch of this approach follows the list).
- E . Load the ORC files into BigQuery. Leverage BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables. Replicate external Hive tables to the native ones.
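Option D corresponds to ordinary Hive DDL once the Cloud Storage connector (pre-installed on Dataproc clusters) lets Hadoop address gs:// paths. A minimal sketch, with hypothetical bucket, table, and column names:
HiveQL
-- External table over the ORC files already copied to Cloud Storage.
-- Bucket path and schema are illustrative placeholders.
CREATE EXTERNAL TABLE telemetry_orc (
  device_id STRING,
  event_time TIMESTAMP,
  reading DOUBLE
)
STORED AS ORC
LOCATION 'gs://example-bucket/orc/telemetry/';

-- Replicate into a native, HDFS-backed table for maximum local performance.
CREATE TABLE telemetry_local
STORED AS ORC
AS SELECT * FROM telemetry_orc;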
Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing.
What should you do first?
- A . Use Google Stackdriver Audit Logs to review data access.
- B . Get the Identity and Access Management (IAM) policy of each table
- C . Use Stackdriver Monitoring to see the usage of BigQuery query slots.
- D . Use the Google Cloud Billing API to see what account the warehouse is being billed to.
You store and analyze your relational data in BigQuery on Google Cloud with all data that resides in US regions. You also have a variety of object stores across Microsoft Azure and Amazon Web Services (AWS), also in US regions. You want to query all your data in BigQuery daily with as little movement of data as possible.
What should you do?
- A . Load files from AWS and Azure to Cloud Storage with Cloud Shell gsutil rsync commands.
- B . Create a Dataflow pipeline to ingest files from Azure and AWS to BigQuery.
- C . Use the BigQuery Omni functionality and BigLake tables to query files in Azure and AWS.
- D . Use BigQuery Data Transfer Service to load files from Azure and AWS into BigQuery.
C
Explanation:
BigQuery Omni is a multi-cloud analytics solution that lets you use the BigQuery interface to analyze data stored in other public clouds, such as AWS and Azure, without moving or copying the data. BigLake tables are a type of external table that let you query structured data in external data stores with access delegation. By using BigQuery Omni and BigLake tables, you can query data in AWS and Azure object stores directly from BigQuery, with minimal data movement and consistent performance.
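As an illustration, once a BigQuery Omni connection to AWS exists, a BigLake table over an S3 bucket is just DDL; the connection, dataset, bucket, and file format below are hypothetical placeholders:
SQL
-- Assumes an existing BigQuery Omni connection in the AWS region (all names are placeholders).
CREATE EXTERNAL TABLE `aws_dataset.orders`
WITH CONNECTION `aws-us-east-1.my_aws_connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://example-aws-bucket/orders/*']
);

-- The data stays in S3; BigQuery queries it in place.
SELECT COUNT(*) AS order_count FROM `aws_dataset.orders`;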
Reference: Introduction to BigLake tables
Deep dive on how BigLake accelerates query performance
BigQuery Omni and BigLake (Analytics Data Federation on GCP)
Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster.
What should you do?
- A . Create a Google Cloud Dataflow job to process the data.
- B . Create a Google Cloud Dataproc cluster that uses persistent disks for HDFS.
- C . Create a Hadoop cluster on Google Compute Engine that uses persistent disks.
- D . Create a Cloud Dataproc cluster that uses the Google Cloud Storage connector.
- E . Create a Hadoop cluster on Google Compute Engine that uses Local SSD disks.
Government regulations in the banking industry mandate the protection of clients' personally identifiable information (PII). Your company requires PII to be access controlled, encrypted, and compliant with major data protection standards. In addition to using Cloud Data Loss Prevention (Cloud DLP), you want to follow Google-recommended practices and use service accounts to control access to PII.
What should you do?
- A . Assign the required Identity and Access Management (IAM) roles to every employee, and create a single service account to access protected resources
- B . Use one service account to access a Cloud SQL database and use separate service accounts for each human user
- C . Use Cloud Storage to comply with major data protection standards. Use one service account shared by all users
- D . Use Cloud Storage to comply with major data protection standards. Use multiple service accounts attached to IAM groups to grant the appropriate access to each group
If a dataset contains rows with individual people and columns for year of birth, country, and income, how many of the columns are continuous and how many are categorical?
- A . 1 continuous and 2 categorical
- B . 3 categorical
- C . 3 continuous
- D . 2 continuous and 1 categorical
D
Explanation:
The columns can be grouped into two types, categorical and continuous:
A column is called categorical if its value can only be one of the categories in a finite set. For example, the native country of a person (U.S., India, Japan, etc.) or the education level (high school, college, etc.) are categorical columns.
A column is called continuous if its value can be any numerical value in a continuous range. For example, the capital gain of a person (e.g. $14,084) is a continuous column.
Year of birth and income are continuous columns. Country is a categorical column.
You could use bucketization to turn year of birth and/or income into categorical features, but the raw columns are continuous.
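For example, a continuous column such as year of birth can be bucketized into a categorical feature with ordinary SQL; the table and column names below are hypothetical:
SQL
-- Turn the continuous year_of_birth column into a categorical decade cohort.
SELECT
  country,                                   -- already categorical
  income,                                    -- continuous
  CASE
    WHEN year_of_birth < 1970 THEN 'pre_1970'
    WHEN year_of_birth < 1990 THEN '1970_1989'
    ELSE '1990_or_later'
  END AS birth_cohort                        -- bucketized (categorical) feature
FROM people;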
Reference: https://www.tensorflow.org/tutorials/wide#reading_the_census_data
You have data pipelines running on BigQuery, Cloud Dataflow, and Cloud Dataproc. You need to perform health checks and monitor their behavior, and then notify the team managing the pipelines if they fail. You also need to be able to work across multiple projects. Your preference is to use managed products or features of the platform.
What should you do?
- A . Export the information to Cloud Stackdriver, and set up an Alerting policy
- B . Run a Virtual Machine in Compute Engine with Airflow, and export the information to Stackdriver
- C . Export the logs to BigQuery, and set up App Engine to read that information and send emails if you find a failure in the logs
- D . Develop an App Engine application to consume logs using GCP API calls, and send emails if you find a failure in the logs
You have created an external table for Apache Hive partitioned data that resides in a Cloud Storage bucket, which contains a large number of files. You notice that queries against this table are slow. You want to improve the performance of these queries.
What should you do?
- A . Migrate the Hive partitioned data objects to a multi-region Cloud Storage bucket.
- B . Create an individual external table for each Hive partition by using a common table name prefix. Use wildcard table queries to reference the partitioned data.
- C . Change the storage class of the Hive partitioned data objects from Coldline to Standard.
- D . Upgrade the external table to a BigLake table. Enable metadata caching for the table.
D
Explanation:
BigLake is a Google Cloud service that allows you to query structured data in external data stores such as Cloud Storage, Amazon S3, and Azure Blob Storage with access delegation and governance. BigLake tables extend the capabilities of BigQuery to data lakes and enable a flexible, open lakehouse architecture. By upgrading an external table to a BigLake table, you can improve the performance of your queries by leveraging the BigQuery storage API, which supports predicate pushdown, column projection, and metadata caching. Metadata caching stores file and partition metadata that would otherwise have to be listed from Cloud Storage on every query, which is especially costly for Hive-partitioned data with a large number of files. Metadata caching on a BigLake table is controlled with the metadata_cache_mode and max_staleness table options, along the lines of:
SQL
ALTER TABLE hive_partitioned_data
SET OPTIONS (
  -- Assumes the table has already been upgraded to a BigLake table (associated with a connection).
  metadata_cache_mode = 'AUTOMATIC',
  max_staleness = INTERVAL 4 HOUR
);
Reference: Introduction to BigLake tables
Upgrade an external table to BigLake
BigQuery storage API
You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day.
Which tool should you use?
- A . cron
- B . Cloud Composer
- C . Cloud Scheduler
- D . Workflow Templates on Cloud Dataproc