Data Pipeline – a series of tools and processes used to organize and move data between different storage and analysis systems. It automates the ETL (Extract, Transform, Load) process. A big data pipeline usually includes the stages of ingestion, data lake, preparation & computation, data warehouse, and presentation.
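As a rough illustration of the ETL flow described above, the sketch below extracts rows from a CSV file, transforms them, and loads them into a SQLite table. The file name raw_orders.csv, the orders table, and the field names are assumptions made only for this example, not part of any particular product.

```python
import csv
import sqlite3

RAW_FILE = "raw_orders.csv"    # hypothetical extract source (assumed to exist)
WAREHOUSE_DB = "warehouse.db"  # hypothetical load target

def extract(path):
    """Read raw rows from a CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Normalize types and drop incomplete records."""
    cleaned = []
    for row in rows:
        if not row.get("order_id") or not row.get("amount"):
            continue  # skip rows missing required fields
        cleaned.append((row["order_id"], float(row["amount"])))
    return cleaned

def load(records, db_path):
    """Write transformed records into a warehouse table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", records)

if __name__ == "__main__":
    load(transform(extract(RAW_FILE)), WAREHOUSE_DB)
```

In practice each stage would be a separate, scheduled job (e.g. an orchestrator task), but the extract → transform → load shape stays the same.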
Data Lake vs Data Warehouse
A “data lake” stores raw data whose purpose has usually not yet been determined; the data is highly accessible and quick to change. A “data warehouse” stores processed data for specific purposes that are typically in use right now. Accessing data in a data warehouse is generally more involved than accessing data in a data lake, and updating it is more costly.
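To make the contrast concrete, here is a minimal sketch in which a local directory stands in for the data lake and SQLite stands in for the data warehouse (both are stand-ins chosen for illustration; real systems would use object storage and a dedicated warehouse engine). The raw event is dumped unchanged into the lake, while only the fields needed for a known purpose are written to the warehouse table. All paths, table names, and fields are assumptions.

```python
import json
import sqlite3
from pathlib import Path

LAKE_DIR = Path("data_lake/events")  # hypothetical lake location
WAREHOUSE_DB = "warehouse.db"        # hypothetical warehouse

def land_in_lake(event: dict) -> None:
    """Store the raw event as-is; no schema is imposed yet."""
    LAKE_DIR.mkdir(parents=True, exist_ok=True)
    (LAKE_DIR / f"{event['event_id']}.json").write_text(json.dumps(event))

def load_into_warehouse(event: dict) -> None:
    """Store only the fields needed for a known, specific purpose."""
    with sqlite3.connect(WAREHOUSE_DB) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions (customer_id TEXT, duration_s REAL)"
        )
        conn.execute(
            "INSERT INTO sessions VALUES (?, ?)",
            (event["customer_id"], event["duration_s"]),
        )

event = {"event_id": "e1", "customer_id": "c42", "duration_s": 310.0}
land_in_lake(event)         # raw, flexible, purpose not yet fixed
load_into_warehouse(event)  # processed, schema-bound, ready for analysis
```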
A good example of a data pipeline is a telecom service provider or ISP that collects information about customers' devices, locations, and session durations, and tracks their purchases and interactions with customer service, in order to generate actionable insights it can use to improve customer experience, such as in the sketch below.
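As a hedged sketch of what such an "actionable insight" might look like, the snippet below aggregates a few made-up session records and flags customers with a high call-drop rate; the records, field names, and threshold are invented purely for illustration.

```python
from statistics import mean

# Invented session records, shaped roughly like the output of the warehouse stage.
sessions = [
    {"customer_id": "c42", "duration_s": 310.0, "dropped": False},
    {"customer_id": "c42", "duration_s": 12.0,  "dropped": True},
    {"customer_id": "c7",  "duration_s": 540.0, "dropped": False},
]

# Group sessions by customer.
by_customer = {}
for s in sessions:
    by_customer.setdefault(s["customer_id"], []).append(s)

# One simple insight: flag customers whose calls drop unusually often.
for customer, recs in by_customer.items():
    drop_rate = mean(1.0 if r["dropped"] else 0.0 for r in recs)
    if drop_rate > 0.3:  # illustrative threshold
        print(f"{customer}: high drop rate ({drop_rate:.0%}), follow up proactively")
```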