From the course: Data Science on Google Cloud Platform: Building Data Pipelines

ParDo

- [Kumaran] ParDo is a general-purpose transform available in Apache Beam. ParDo takes as input a row of data from a PCollection. Then, it can perform any custom-defined operation on that row of data. This includes cleansing, formatting, filtering, or transformations, written as code specific to the programming language in use. The output of ParDo will be another PCollection. In this example, every row of input results in one row of output, although in general a ParDo can emit zero, one, or several rows per input. Custom operations can be implemented as functions in Python or Java, and they can be as simple or as complex as you want them to be. In our code example, we are going to use ParDo to extract the product type and price from the "transactions" PCollection, and create a new PCollection called "prodTypePrice". The input and output data is shown here. Let's look at the code now. The pipeline for this example is on line 59. It starts with the "transactions" PCollection, and executes a class called "ExtractProdTypePrice()" for every line in the PCollection…
