Creating a Data Pipeline from Scratch
A basic data engineering project, end to end.

Introduction
Data Engineering has always been an area of interest to me, but I never really had the time to create a project, since I have to split my time among work, family, and everything else that needs my attention. So I gave myself a tough challenge: build a data pipeline from scratch in just a couple of days.
Wow, I mean, that sounds like a lot. But it also sounds doable.
With just an idea in mind and more experience navigating the Data Science space than the Data Engineering one, I knew this was going to be tough, but still: challenge accepted.
In this post, we will go over the following project (GitHub):
Create a data pipeline that:
(1) Gets finance datasets for telecommunication stocks, economic indicators, and a Dow Jones index for the Telecommunications (Telco) sector;
(2) Gives the data an initial treatment to validate it;
(3) Cleans and organizes the data;
(4) Makes it ready for consumption by analysts and clients in a PostgreSQL database; and
(5) Presents a Power BI report as an end result, with some insights.
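Before going further, here is a minimal sketch of what such a pipeline skeleton can look like in Python. The API endpoint, column names, table name, and connection string below are placeholders I am assuming for illustration; they are not the project's actual code.

```python
import requests
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical endpoint standing in for the real finance APIs used in the project
API_URL = "https://example.com/api/telco-stocks"


def fetch_raw() -> list[dict]:
    """Step 1: pull raw JSON records from the finance API."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()  # fail fast if the request did not succeed
    return response.json()


def validate(records: list[dict]) -> pd.DataFrame:
    """Step 2: initial treatment and basic validation of the raw records."""
    df = pd.DataFrame(records)
    assert not df.empty, "API returned no rows"
    return df.drop_duplicates()


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Step 3: clean and organize (assumed 'date' and 'close' columns)."""
    df = df.dropna(subset=["close"])
    df["date"] = pd.to_datetime(df["date"])
    return df.sort_values("date")


def load(df: pd.DataFrame) -> None:
    """Step 4: make the result available to analysts in PostgreSQL."""
    engine = create_engine("postgresql://user:password@localhost:5432/finance")
    df.to_sql("telco_stocks", engine, if_exists="replace", index=False)


if __name__ == "__main__":
    load(clean(validate(fetch_raw())))
```

Step (5), the Power BI report, is not part of the pipeline code itself; it simply connects to the PostgreSQL table the pipeline produces.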
Let’s dive in.
Architecture
This project was designed around the Databricks medallion lakehouse architecture, in which data is ingested into an initial raw layer and, as it goes through more and more refinement, moves into increasingly valuable layers (Raw > Bronze > Silver > Gold).
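To make that layer progression concrete, here is a minimal PySpark sketch of how data could move from Raw to Gold in a Databricks notebook. The storage paths, column names, and aggregation are assumptions for illustration, not the project's actual tables; on Databricks, the Delta format and the `spark` session come preconfigured.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Databricks notebook

# Raw: JSON files as they arrive from the APIs (path is an illustrative assumption)
raw_df = spark.read.json("/mnt/pipeline/raw/telco_stocks/")

# Bronze: persist the raw data as a Delta table, unchanged except for an ingestion timestamp
bronze_df = raw_df.withColumn("ingested_at", F.current_timestamp())
bronze_df.write.format("delta").mode("overwrite").save("/mnt/pipeline/bronze/telco_stocks/")

# Silver: enforce data types and drop records the downstream analysis cannot use
silver_df = (
    spark.read.format("delta").load("/mnt/pipeline/bronze/telco_stocks/")
    .withColumn("close", F.col("close").cast("double"))
    .withColumn("date", F.to_date("date"))
    .dropna(subset=["date", "close"])
)
silver_df.write.format("delta").mode("overwrite").save("/mnt/pipeline/silver/telco_stocks/")

# Gold: aggregate into a consumption-ready table for analysts (assumed 'symbol' column)
gold_df = silver_df.groupBy("symbol", "date").agg(F.avg("close").alias("avg_close"))
gold_df.write.format("delta").mode("overwrite").save("/mnt/pipeline/gold/telco_daily_close/")
```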
The project follows this architecture, shown in the picture below.

The architecture is basically:
Fetch data from APIs >> dump it into the cloud (a Databricks folder) >> start cleaning the data: transform the JSON into data frames, correct data types, drop unwanted observations…
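The JSON-to-DataFrame step deserves a closer look, since finance APIs usually return nested payloads. Below is a small pandas sketch of that transformation; the payload shape and field names are made up for illustration rather than taken from the actual APIs.

```python
import pandas as pd

# Made-up example of a nested payload, standing in for a real finance API response
payload = {
    "symbol": "VZ",
    "prices": [
        {"date": "2023-01-02", "close": "39.40", "volume": "18200000"},
        {"date": "2023-01-03", "close": None, "volume": "17100000"},  # unwanted row
    ],
}

# Flatten the nested list of prices into one row per observation
df = pd.json_normalize(payload, record_path="prices", meta=["symbol"])

# Correct data types: the API delivers every field as a string
df["date"] = pd.to_datetime(df["date"])
df["close"] = pd.to_numeric(df["close"], errors="coerce")
df["volume"] = pd.to_numeric(df["volume"], errors="coerce").astype("Int64")

# Drop observations that cannot be used downstream (e.g. missing closing price)
df = df.dropna(subset=["close"])
print(df.dtypes)
```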