From the course: Data Science on Google Cloud Platform: Building Data Pipelines
PCollections - Google Cloud Tutorial
- [Narrator] A PCollection represents a working data set in the data pipeline. Think of it as an in-memory data store, like a Java Collection or a pandas DataFrame. A PCollection is created by operations in the pipeline and is consumed by downstream stages. It is a distributed, multi-element data set: when the pipeline executes on a parallel processing framework, like Spark or Cloud Dataflow, the execution engine takes care of managing and distributing these PCollections across processing nodes for parallel processing. A PCollection is a holder of data moving through the pipeline between processing steps. It serves as both input and output for the processing transforms in the pipeline. Transforms use intermediate PCollections to cache data and pass it to downstream transforms. In the pipeline diagram, you can see two PCollections, PCollection 1 and PCollection 2. They serve as intermediate data holders between processing steps…