From the course: Data Engineering Foundations

Introduction to data engineering

- [Instructor] If you have ever heard of data science, there's a good chance you have heard of data engineering as well. As the data space has matured, data engineering has emerged as a separate role to help organizations solve the problem of laying out data in efficient big data systems. In this video, we will try to understand what data engineering means at a data-driven organization.

Imagine that you have been hired as a data scientist at a young data-driven organization. You are tasked with developing a model to flag fraudulent transactions. You want to use a fancy machine learning technique that you have been honing for years. However, after digging around for a couple of hours, you realize all of your data is scattered across many databases. Additionally, the data resides in tables that are optimized for running applications, not for carrying out analysis. To make matters worse, some legacy code has corrupted a lot of the data. At your previous company, you never really had this problem because all of the data was available to you in an orderly fashion. You're getting desperate, and then the data engineer comes to the rescue.

It is the data engineer's task to make your life as a data scientist easier. If you need data that comes from several sources, the data engineer will extract it from those sources and load it into a single database, ready to use. At the same time, they will optimize the database schema so it becomes faster to query. They also monitor the data pipelines to make sure that there is no corrupt data, repair a pipeline whenever there's an issue, and schedule or automate tasks to avoid the errors that come with manual work.

By definition, data engineering is a type of software engineering that focuses on designing, developing, testing, and maintaining architectures, such as databases and large-scale processing systems.

Now, data engineers should have the following skills and knowledge. They need to know Linux and be comfortable using the command line. They should have experience programming in at least one language, such as Python, Scala, or Java. They should know SQL: how to write queries, how to extract data, and how to create a database schema. They need some understanding of distributed systems in general and how they differ from traditional storage and processing systems. They need a deep understanding of the data ecosystem, including ingestion, processing frameworks, and storage engines. They should know the strengths and weaknesses of each tool and what it is best used for. They need to know how to access and process data. In that sense, the data engineer is one of the most valuable people in a data-driven organization that wants to scale up.
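
To make the extract-and-load work described in the transcript a little more concrete, here is a minimal sketch in Python. It is not taken from the course. The database names (`payments_db`, `orders_db`), the table layout, and the rule that a missing or negative amount counts as corrupt are all assumptions made for illustration, and in-memory SQLite stands in for real source systems.

```python
import sqlite3


def make_source(rows):
    """Create a hypothetical in-memory source database with a transactions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO transactions VALUES (?, ?)", rows)
    conn.commit()
    return conn


def extract(conn, source_name):
    """Read every row from one source and tag it with the source it came from."""
    cursor = conn.execute("SELECT id, amount FROM transactions")
    return [(source_name, row_id, amount) for row_id, amount in cursor]


def is_valid(record):
    """Assumed corruption rule: a row needs an id and a non-negative amount."""
    _, row_id, amount = record
    return row_id is not None and amount is not None and amount >= 0


def load(target, records):
    """Write clean records into one table and index it for faster queries."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS all_transactions "
        "(source TEXT, id INTEGER, amount REAL)"
    )
    target.executemany("INSERT INTO all_transactions VALUES (?, ?, ?)", records)
    target.execute(
        "CREATE INDEX IF NOT EXISTS idx_source ON all_transactions (source)"
    )
    target.commit()


def run_pipeline(sources):
    """Extract from every source, drop corrupt rows, and load the rest."""
    target = sqlite3.connect(":memory:")
    clean, rejected = [], 0
    for name, conn in sources.items():
        for record in extract(conn, name):
            if is_valid(record):
                clean.append(record)
            else:
                rejected += 1  # counted so the pipeline can be monitored
    load(target, clean)
    return target, rejected


# Two made-up sources, each containing some corrupt rows.
sources = {
    "payments_db": make_source([(1, 20.0), (2, None)]),
    "orders_db": make_source([(1, 35.5), (2, -4.0), (3, 12.25)]),
}
warehouse, rejected = run_pipeline(sources)
loaded = warehouse.execute("SELECT COUNT(*) FROM all_transactions").fetchone()[0]
print(f"loaded={loaded} rejected={rejected}")
```

In practice a scheduler would run a job like this automatically, and the rejected count would feed into monitoring rather than a print statement.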
