Before joining JENGA School, I had never heard of the concept of data pipelines. It was first introduced to me in the recently concluded second module of the Data Science Core Program. I have to admit that this was one of the many times I learned something new during the semester. Here is what I learned about data pipelines.
Simply put, a data pipeline is a system that automatically moves raw data from one or more sources, through a series of steps, to a target destination (such as a data warehouse or a data lake) for storage, analysis, and insight mining. Various ETL (extract, transform, load) techniques may be applied to the data along the pipeline, so that by the end it is processed and ready for analysis.
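To make the idea concrete, here is a minimal, hypothetical sketch of an ETL pipeline in Python. The records, field names, and the list standing in for a data warehouse are all illustrative assumptions, not a real system.

```python
# A minimal ETL sketch: extract raw records, transform them,
# and load them into a target store (a plain list standing in
# for a data warehouse).

def extract():
    # In a real pipeline this would read from a database, an API, or log files.
    return [{"name": " Alice ", "age": "34"}, {"name": "Bob", "age": "29"}]

def transform(records):
    # Clean and normalise each record along the way.
    return [{"name": r["name"].strip(), "age": int(r["age"])} for r in records]

def load(records, target):
    # In practice this would write to a warehouse table.
    target.extend(records)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
```

Each stage is just a function here, but the same shape scales up: the extract step talks to the sources, the transform step applies the cleanup logic, and the load step writes to the destination.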
Why Data Pipelines?
Data Pipelines are essential in data analytics for several reasons.
First, data analytics is a computationally demanding task which, if performed on the production environment (where the data is created), can impair the performance of that system as well as slow down the analysis.
Secondly, data needs to be aggregated in ways that make sense. An example is where you have one system that stores user files and another that captures events. Having a separate system for analytics ensures that you can combine the various data types without degrading your production system. Moreover, it is a lot less risky to make changes to data in a separate system.
Other justifications for data pipelines include data privacy: for example, you may not want data analysts to have access to all of your organization's data.
What Does a Data Pipeline Look Like?
Moving data from one system to another requires many steps, each usually requiring separate software. The general architecture of a data pipeline includes the following parts and processes:
Sources
This is where the information comes from. Data can originate from relational databases, Apache servers, APIs, NoSQL stores, etc.
Joins
This is the stage where the criteria and logic for how data from the different sources is combined are defined.
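As a hypothetical illustration of join logic, suppose one source holds user profiles and another captures events, keyed on a shared `user_id` field (all names and records here are made up):

```python
# Two sources: user profiles and event records, joined on user_id.
users = [{"user_id": 1, "name": "Alice"}, {"user_id": 2, "name": "Bob"}]
events = [{"user_id": 1, "event": "login"},
          {"user_id": 1, "event": "upload"},
          {"user_id": 2, "event": "login"}]

# Index users by id so each event can be enriched with the user's name.
users_by_id = {u["user_id"]: u for u in users}

combined = [{"name": users_by_id[e["user_id"]]["name"], "event": e["event"]}
            for e in events]
print(combined)
```

The join criterion (matching on `user_id`) is exactly the kind of logic this stage defines; in practice it would be expressed in SQL or a dataframe library rather than plain dictionaries.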
Extraction
Here, the required values are extracted from the data. Sometimes specific data found inside larger fields is needed; a good example is extracting area codes from the telephone number contact field.
Standardization
Standardization is where you ensure all data follows the same measurement units and is presented in a consistent format.
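A small, hypothetical standardization step might convert mixed measurement units into one. The records and the kilometre convention below are assumptions for illustration:

```python
# Distances arrive in mixed units; standardize everything to kilometres.
raw = [{"distance": 5, "unit": "km"}, {"distance": 3, "unit": "mi"}]

MI_TO_KM = 1.609344  # exact international mile-to-kilometre factor

def standardize(record):
    if record["unit"] == "mi":
        return {"distance": round(record["distance"] * MI_TO_KM, 3),
                "unit": "km"}
    return record

standardized = [standardize(r) for r in raw]
print(standardized)
```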
Correction
In this stage, errors and corrupt records are detected and removed.
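A correction step often amounts to a validity check applied to every record. The records and validity rule below are made-up examples:

```python
# Drop records with missing or impossible values.
records = [{"name": "Alice", "age": 34},
           {"name": "Bob", "age": None},   # corrupt: missing value
           {"name": "Eve", "age": -5}]     # corrupt: impossible value

def is_valid(record):
    age = record.get("age")
    return isinstance(age, int) and 0 <= age <= 130

clean = [r for r in records if is_valid(r)]
print(clean)
```

Real pipelines usually log or quarantine the rejected records rather than silently discarding them, so errors can be investigated later.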
Loading
Once the data is cleaned up, it is loaded into the analysis system, such as a data warehouse or a Hadoop framework.
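As a minimal stand-in for loading into a warehouse, the cleaned records can be written into an SQLite table (an in-memory database here; table and column names are illustrative):

```python
import sqlite3

# Cleaned records ready to be loaded.
records = [("Alice", 34), ("Bob", 29)]

# An in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", records)

rows = conn.execute("SELECT name, age FROM users ORDER BY name").fetchall()
print(rows)
conn.close()
```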
Along the different stages, a data pipeline automates processes such as error detection and monitoring.
As the world becomes more and more data-driven, many companies will continue to find the need for data pipelines. Therefore, building a data pipeline is a skill that many data scientists and engineers need to be well versed in.