From the course: Azure Spark Databricks Essential Training


Understand data engineering workload steps

- [Instructor] In this section I'm going to share some of the work that I've been doing in production scaling Apache Spark workloads. To set context for that, let's review a typical data engineering pipeline. You can see the steps represented here. You'll have to design for ingest of data, then exploration of data, using the notebook interface in our case. If we're using machine learning, as we are, that's followed by training the ML model, evaluating its effectiveness, and eventually persisting it so that people outside of this environment can have access to it. In my case, those will be genomic researchers. Now, to put a more practical spin on this engineering pipeline, there are actually sub-steps that I have found while working in production with this particular use case. Because Spark is often used in situations with both massive sizes of data and massive complexity, often using machine learning for compute, what is typically done is an iterative process to get the…
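The pipeline stages described above (ingest, explore, train, evaluate, persist) can be sketched in plain Python. This is a minimal illustration, not code from the course: the function names are hypothetical, the "model" is a trivial threshold rule standing in for a real ML model, and in a real Databricks job each stage would operate on Spark DataFrames read from cloud storage rather than an in-memory list.

```python
def ingest():
    # Load raw records. Here a hard-coded sample stands in for
    # reading from cloud storage with spark.read.
    return [{"sample_id": 1, "value": 0.9},
            {"sample_id": 2, "value": 0.1}]

def explore(records):
    # Notebook-style exploration: compute basic summary statistics.
    values = [r["value"] for r in records]
    return {"count": len(values), "mean": sum(values) / len(values)}

def train(records, threshold=0.5):
    # "Train" a trivial threshold classifier standing in for an
    # ML model fit on the ingested data.
    return lambda r: 1 if r["value"] >= threshold else 0

def evaluate(model, records):
    # Evaluate effectiveness: fraction of records labeled positive.
    return sum(model(r) for r in records) / len(records)

def persist(model, path="model.txt"):
    # In production this would write to a model registry or storage
    # so researchers outside the environment can access it.
    return f"persisted to {path}"

records = ingest()
stats = explore(records)
model = train(records)
score = evaluate(model, records)
location = persist(model)
```

The point of the sketch is the ordering: each stage consumes the output of the previous one, which is why the pipeline is typically iterated end to end rather than tuned stage by stage in isolation.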
