From the course: Data Engineering Foundations


Distributed computing


- [Instructor] Handling petabytes of data requires distributed, or parallel, computing, so it's crucial to understand the concept. In any data pipeline, we have to collect data from various sources, join them together, clean them, and aggregate them. Parallel computing forms the basis of almost all modern data processing tools. However, why has it become so important in the world of big data? The main reason is memory and processing power. When big data processing tools perform a processing task, they split it up into several smaller subtasks. The processing tools then distribute these subtasks over several computers. These are usually commodity computers, which means they are widely available and relatively inexpensive. Individually, each of the computers would take a long time to process the complete task. However, since all the computers work in parallel on smaller subtasks, the task in…
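The split-distribute-combine idea described above can be sketched in plain Python. This is a minimal illustration, not any particular big data tool: the names `subtask` and `parallel_sum` are invented for this example, and worker processes on one machine stand in for the separate commodity computers.

```python
from multiprocessing import Pool


def subtask(chunk):
    # Each worker aggregates only its own slice of the data.
    return sum(chunk)


def parallel_sum(data, workers=4):
    # Split the full task into smaller subtasks (chunks of the data).
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Distribute the subtasks over a pool of worker processes.
    with Pool(workers) as pool:
        partials = pool.map(subtask, chunks)
    # Combine the partial results into the final answer.
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum(list(range(10))))  # → 45, same as sum(range(10))
```

Real frameworks differ mainly in scale: the chunks live on different machines, and the framework handles scheduling, data movement, and failures, but the pattern of splitting, distributing, and recombining is the same.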
