Practice Free Databricks Certified Data Analyst Associate Exam Online Questions – Page 2

Question #11

A business analyst has been asked to create a data entity/object called sales_by_employee. It should always stay up-to-date when new data are added to the sales table. The new entity should have the columns sales_person, which will be the name of the employee from the employees table, and sales, which will be all sales for that particular sales person. Both the sales table and the employees table have an employee_id column that is used to identify the sales person.

Which of the following code blocks will accomplish this task?

A)

A . Option
B . Option
C . Option
D . Option

Reveal Solution Hide Solution

Question #12

In which of the following situations should a data analyst use higher-order functions?

A . When custom logic needs to be applied to simple, unnested data
B . When custom logic needs to be converted to Python-native code
C . When custom logic needs to be applied at scale to array data objects
D . When built-in functions are taking too long to perform tasks
E . When built-in functions need to run through the Catalyst Optimizer

Reveal Solution Hide Solution

Question #13

A data engineering team has created a Structured Streaming pipeline that processes data in micro-batches and populates gold-level tables. The microbatches are triggered every minute.

A data analyst has created a dashboard based on this gold-level data. The project stakeholders want to see the results in the dashboard updated within one minute or less of new data becoming available within the gold-level tables.

Which of the following cautions should the data analyst share prior to setting up the dashboard to

complete this task?

A . The required compute resources could be costly
B . The gold-level tables are not appropriately clean for business reporting
C . The streaming data is not an appropriate data source for a dashboard
D . The streaming cluster is not fault tolerant
E . The dashboard cannot be refreshed that quickly

Reveal Solution Hide Solution

Correct Answer: A
A

Explanation:

A Structured Streaming pipeline that processes data in micro-batches and populates gold-level tables every minute requires a high level of compute resources to handle the frequent data ingestion, processing, and writing. This could result in a significant cost for the organization, especially if the data volume and velocity are large. Therefore, the data analyst should share this caution with the project stakeholders before setting up the dashboard and evaluate the trade-offs between the desired refresh rate and the available budget. The other options are not valid cautions because:

B) The gold-level tables are assumed to be appropriately clean for business reporting, as they are the final output of the data engineering pipeline. If the data quality is not satisfactory, the issue should be addressed at the source or silver level, not at the gold level.

C) The streaming data is an appropriate data source for a dashboard, as it can provide near real-time insights and analytics for the business users. Structured Streaming supports various sources and sinks for streaming data, including Delta Lake, which can enable both batch and streaming queries on the same data.

D) The streaming cluster is fault tolerant, as Structured Streaming provides end-to-end exactly-once fault-tolerance guarantees through checkpointing and write-ahead logs. If a query fails, it can be restarted from the last checkpoint and resume processing.

E) The dashboard can be refreshed within one minute or less of new data becoming available in the gold-level tables, as Structured Streaming can trigger micro-batches as fast as possible (every few seconds) and update the results incrementally. However, this may not be necessary or optimal for the business use case, as it could cause frequent changes in the dashboard and consume more resources.

Reference: Streaming on Databricks, Monitoring Structured Streaming queries on Databricks, A look at the new Structured Streaming UI in Apache Spark 3.0, Run your first Structured Streaming workload

Question #14

A data engineer is working with a nested array column products in table transactions. They want to expand the table so each unique item in products for each row has its own row where the transaction_id column is duplicated as necessary.

They are using the following incomplete command:

Which of the following lines of code can they use to fill in the blank in the above code block so that it successfully completes the task?

A . array distinct(produces)
B . explode(produces)
C . reduce(produces)
D . array(produces)
E . flatten(produces)

Reveal Solution Hide Solution

1 2

Exams