Practice Free Professional Data Engineer Exam Online Questions
You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat.
Here is some of the information you need to store:
The user profile: what the user likes and doesn’t like to eat
The user account information: Name, address, preferred meal times
The order information: When orders are made, from where, to whom
The database will be used to store all the transactional data of the product.
You want to optimize the data schema.
Which Google Cloud Platform product should you use?
- A . BigQuery
- B . Cloud SQL
- C . Cloud Bigtable
- D . Cloud Datastore
Your team is working on a binary classification problem. You have trained a support vector machine (SVM) classifier with default parameters and received an area under the curve (AUC) of 0.87 on the validation set. You want to increase the AUC of the model.
What should you do?
- A . Perform hyperparameter tuning
- B . Train a classifier with deep neural networks, because neural networks would always beat SVMs
- C . Deploy the model and measure the real-world AUC; it’s always higher because of generalization
- D . Scale predictions you get out of the model (tune a scaling factor as a hyperparameter) in order to get the highest AUC
A
Explanation:
https://towardsdatascience.com/understanding-hyperparameters-and-its-optimisation-techniques-f0debba07568
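Below is a minimal sketch of what hyperparameter tuning could look like for this scenario, using scikit-learn's GridSearchCV and optimizing directly for AUC. The dataset, parameter grid, and train/validation split are illustrative assumptions, not part of the question.
```python
# Minimal sketch: tuning an SVM's hyperparameters to improve validation AUC.
# The dataset, parameter grid, and split below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {
    "C": [0.1, 1, 10, 100],           # regularization strength
    "gamma": ["scale", 0.01, 0.001],  # RBF kernel width
    "kernel": ["rbf"],
}

# Optimize the search for AUC instead of the default accuracy.
search = GridSearchCV(SVC(), param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Validation AUC:", search.score(X_val, y_val))
```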
You work for an advertising company, and you’ve developed a Spark ML model to predict click-through rates at advertisement blocks. You’ve been developing everything at your on-premises data center, and now your company is migrating to Google Cloud. Your data warehouse will be migrated to BigQuery. You periodically retrain your Spark ML models, so you need to migrate existing training pipelines to Google Cloud.
What should you do?
- A . Use Cloud ML Engine for training existing Spark ML models
- B . Rewrite your models on TensorFlow, and start using Cloud ML Engine
- C . Use Cloud Dataproc for training existing Spark ML models, but start reading data directly from BigQuery
- D . Spin up a Spark cluster on Compute Engine, and train Spark ML models on the data exported from BigQuery
C
Explanation:
https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml
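For illustration, here is a hedged PySpark sketch of how an existing Spark ML training job running on Dataproc might read its training data directly from BigQuery via the spark-bigquery connector. The table name, feature columns, label column, and output path are assumptions, and the connector is assumed to be available on the cluster.
```python
# Sketch: a Spark ML training job on Dataproc reading directly from BigQuery.
# Table, columns, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ctr-training").getOrCreate()

# Read training data straight from BigQuery instead of exported files.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.ads.click_events")
    .load()
)

assembler = VectorAssembler(
    inputCols=["ad_position", "user_age", "page_category_id"],
    outputCol="features",
)
train_df = assembler.transform(df).select("features", "clicked")

model = LogisticRegression(labelCol="clicked", featuresCol="features").fit(train_df)
model.write().overwrite().save("gs://my-bucket/models/ctr")
```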
Which of these are examples of a value in a sparse vector? (Select 2 answers.)
- A . [0, 5, 0, 0, 0, 0]
- B . [0, 0, 0, 1, 0, 0, 1]
- C . [0, 1]
- D . [1, 0, 0, 0, 0, 0, 0]
CD
Explanation:
Categorical features in linear models are typically translated into a sparse vector in which each possible value has a corresponding index or id. For example, if there are only three possible eye colors you can represent ‘eye_color’ as a length 3 vector: ‘brown’ would become [1, 0, 0], ‘blue’ would become [0, 1, 0] and ‘green’ would become [0, 0, 1]. These vectors are called "sparse" because they may be very long, with many zeros, when the set of possible values is very large (such as all English words).
[0, 0, 0, 1, 0, 0, 1] does not fit this representation because it contains two 1s; a one-hot (sparse) encoding of a single categorical value contains exactly one 1.
[0, 5, 0, 0, 0, 0] does not fit either, because this encoding contains only 0s and 1s, never other values such as 5.
Reference: https://www.tensorflow.org/tutorials/linear#feature_columns_and_transformations
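A small Python sketch of the one-hot encoding described above; the eye-color vocabulary comes from the explanation, while the helper function name is ours for illustration.
```python
# One-hot encoding a categorical feature into a sparse-style vector, as in the
# eye-color example above. Each encoded value contains exactly one 1.
EYE_COLORS = ["brown", "blue", "green"]

def one_hot(value, vocabulary):
    """Return a list with a single 1 at the index of `value` in `vocabulary`."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1
    return vec

print(one_hot("brown", EYE_COLORS))  # [1, 0, 0]
print(one_hot("blue", EYE_COLORS))   # [0, 1, 0]
print(one_hot("green", EYE_COLORS))  # [0, 0, 1]
```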
You work for an economic consulting firm that helps companies identify economic trends as they happen. As part of your analysis, you use Google BigQuery to correlate customer data with the average prices of the 100 most common goods sold, including bread, gasoline, milk, and others. The average prices of these goods are updated every 30 minutes. You want to make sure this data stays up to date so you can combine it with other data in BigQuery as cheaply as possible.
What should you do?
- A . Load the data every 30 minutes into a new partitioned table in BigQuery.
- B . Store and update the data in a regional Google Cloud Storage bucket and create a federated data source in BigQuery
- C . Store the data in Google Cloud Datastore. Use Google Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Cloud Datastore
- D . Store the data in a file in a regional Google Cloud Storage bucket. Use Cloud Dataflow to query BigQuery and combine the data programmatically with the data stored in Google Cloud Storage.
What is the recommended way to switch between SSD and HDD storage for your Google Cloud Bigtable instance?
- A . create a third instance and sync the data from the two storage types via batch jobs
- B . export the data from the existing instance and import the data into a new instance
- C . run parallel instances where one is HDD and the other is SSD
- D . the selection is final and you must resume using the same storage type
B
Explanation:
When you create a Cloud Bigtable instance and cluster, your choice of SSD or HDD storage for the cluster is permanent. You cannot use the Google Cloud Platform Console to change the type of storage that is used for the cluster.
If you need to convert an existing HDD cluster to SSD, or vice-versa, you can export the data from the existing instance and import the data into a new instance. Alternatively, you can write a Cloud Dataflow or Hadoop MapReduce job that copies the data from one instance to another.
Reference: https://cloud.google.com/bigtable/docs/choosing-ssd-hdd
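As a rough illustration, the Python Admin client sketch below creates the new instance with the target storage type before the export/import or Dataflow copy. Project, instance, cluster IDs, and the zone are placeholders, and the data copy itself is a separate step not shown here.
```python
# Sketch: create a new Cloud Bigtable instance with SSD storage as the target
# for the exported data. IDs and zone are placeholders; the actual copy
# (export/import or a Dataflow job) happens afterwards and is not shown.
from google.cloud import bigtable
from google.cloud.bigtable import enums

client = bigtable.Client(project="my-project", admin=True)

instance = client.instance(
    "analytics-ssd",
    display_name="Analytics (SSD)",
    instance_type=enums.Instance.Type.PRODUCTION,
)
cluster = instance.cluster(
    "analytics-ssd-c1",
    location_id="us-central1-b",
    serve_nodes=3,
    default_storage_type=enums.StorageType.SSD,
)

operation = instance.create(clusters=[cluster])
operation.result(timeout=300)  # wait for the new instance to be ready
```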
What is the HBase Shell for Cloud Bigtable?
- A . The HBase shell is a GUI based interface that performs administrative tasks, such as creating and deleting tables.
- B . The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables.
- C . The HBase shell is a hypervisor based shell that performs administrative tasks, such as creating and deleting new virtualized instances.
- D . The HBase shell is a command-line tool that performs only user account management functions to grant access to Cloud Bigtable instances.
B
Explanation:
The HBase shell is a command-line tool that performs administrative tasks, such as creating and deleting tables. The Cloud Bigtable HBase client for Java makes it possible to use the HBase shell to connect to Cloud Bigtable.
Reference: https://cloud.google.com/bigtable/docs/installing-hbase-shell
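For comparison, here is a hedged sketch of the same administrative tasks (creating and deleting a table) performed with the Cloud Bigtable Python client rather than the HBase shell. The project, instance, table, and column-family names are placeholders.
```python
# Sketch: the kind of administrative tasks the HBase shell performs, done here
# with the Cloud Bigtable Python client. Names are placeholders.
from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")

table = instance.table("orders")
# Roughly equivalent to `create 'orders', 'cf1'` in the HBase shell.
table.create(column_families={"cf1": column_family.MaxVersionsGCRule(1)})

print([t.table_id for t in instance.list_tables()])  # like `list` in the shell

# Roughly equivalent to `disable 'orders'` followed by `drop 'orders'`.
table.delete()
```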
You are using BigQuery and Data Studio to design a customer-facing dashboard that displays large quantities of aggregated data. You expect a high volume of concurrent users. You need to optimize the dashboard to provide quick visualizations with minimal latency.
What should you do?
- A . Use BigQuery BI Engine with materialized views
- B . Use BigQuery BI Engine with streaming data.
- C . Use BigQuery BI Engine with authorized views
- D . Use BigQuery BI Engine with logical views
Which of these is not a supported method of putting data into a partitioned table?
- A . If you have existing data in a separate file for each day, then create a partitioned table and upload each file into the appropriate partition.
- B . Run a query to get the records for a specific day from an existing table and for the destination table, specify a partitioned table ending with the day in the format "$YYYYMMDD".
- C . Create a partitioned table and stream new records to it every day.
- D . Use ORDER BY to put a table’s rows into chronological order and then change the table’s type to "Partitioned".
D
Explanation:
You cannot change an existing table into a partitioned table. You must create a partitioned table from scratch. Then you can either stream data into it every day and the data will automatically be put in the right partition, or you can load data into a specific partition by using "$YYYYMMDD" at the end of the table name.
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables
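A hedged Python sketch of two of the supported approaches mentioned above: creating the partitioned table from scratch, then loading an existing daily file into a specific partition with the "$YYYYMMDD" decorator. The project, dataset, table, schema, and bucket path are placeholders.
```python
# Sketch: create an ingestion-time partitioned table, then load one day's file
# into a specific partition using the "$YYYYMMDD" decorator. All names and
# paths are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Create the partitioned table up front (an existing table cannot be converted).
schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("amount", "FLOAT"),
]
table = bigquery.Table("my-project.sales.daily_orders", schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
client.create_table(table, exists_ok=True)

# Load an existing daily file into the partition for that day.
dataset_ref = bigquery.DatasetReference("my-project", "sales")
partition_ref = bigquery.TableReference(dataset_ref, "daily_orders$20240115")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/orders/2024-01-15.csv",
    partition_ref,
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```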
Use a Cloud Storage object trigger to launch a Cloud Function that triggers the DAG
Explanation:
This option is the most efficient and maintainable workflow for your use case, as it allows you to process each table independently and trigger the DAGs only when new data arrives in the Cloud Storage bucket. By using the Dataproc and BigQuery operators, you can easily orchestrate the load and transformation jobs for each table, and leverage the scalability and performance of these services [1, 2]. By creating a separate DAG for each table, you can customize the transformation logic and parameters for each table, and avoid the complexity and overhead of a single shared DAG [3]. By using a Cloud Storage object trigger, you can launch a Cloud Function that triggers the DAG for the corresponding table, ensuring that the data is processed as soon as possible and reducing the idle time and cost of running the DAGs on a fixed schedule [4].
Option A is not efficient, as it runs the DAG hourly regardless of the data arrival, and it uses a single shared DAG for all tables, which makes it harder to maintain and debug.
Option C is also not efficient, as it runs the DAGs hourly and does not leverage the Cloud Storage object trigger.
Option D is not maintainable, as it uses a single shared DAG for all tables, and it does not use the Cloud Storage operator, which can simplify the data ingestion from the bucket.
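A hedged Python sketch of the Cloud Storage-triggered Cloud Function described in the correct option follows. The Airflow web server URL, the DAG-naming convention, and the folder-per-table bucket layout are assumptions, and the details of authenticating to the Airflow REST API vary by Composer version.
```python
# Sketch: a Cloud Function triggered by a new object in the Cloud Storage
# bucket, which starts the DAG for the corresponding table via the Airflow
# REST API. URL, DAG naming, and bucket layout are assumptions; authentication
# details differ between Composer versions.
import google.auth
from google.auth.transport.requests import AuthorizedSession
import functions_framework

AIRFLOW_WEB_SERVER = "https://example-composer-web-server"  # placeholder

@functions_framework.cloud_event
def trigger_table_dag(cloud_event):
    data = cloud_event.data
    object_name = data["name"]            # e.g. "orders/2024-01-15.avro"
    table = object_name.split("/")[0]     # assumed folder-per-table layout
    dag_id = f"load_{table}"              # assumed one DAG per table

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    session = AuthorizedSession(credentials)

    response = session.post(
        f"{AIRFLOW_WEB_SERVER}/api/v1/dags/{dag_id}/dagRuns",
        json={"conf": {"source_object": object_name}},
    )
    response.raise_for_status()
```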
Reference: 1: Dataproc Operator | Cloud Composer | Google Cloud
2: BigQuery Operator | Cloud Composer | Google Cloud
3: Choose Workflows or Cloud Composer for service orchestration | Workflows | Google Cloud
4: Cloud Storage Object Trigger | Cloud Functions Documentation | Google Cloud
5: Triggering DAGs | Cloud Composer | Google Cloud
6: Cloud Storage Operator | Cloud Composer | Google Cloud