From the course: Data Science on Google Cloud Platform: Building Data Pipelines


PCollections


- [Narrator] A PCollection represents a working data set in the data pipeline. Think of it as in-memory storage, like a Java Collection or a Python data frame. A PCollection is created by operations in the pipeline and is consumed by downstream stages. A PCollection is a distributed, multi-element data set. When the pipeline is executed on a parallel processing framework, like Spark or Cloud Dataflow, the execution engine takes care of managing and distributing these PCollections across processing nodes for parallel processing. A PCollection is a holder of data moving through the pipeline between processing steps. It serves as both the input and the output of the processing transforms in the pipeline. Transforms use intermediate PCollections to hold data and pass it to downstream transforms. In the pipeline diagram, you can see two PCollections, PCollection 1 and PCollection 2. They serve as intermediate data holders between processing steps…
