Creating a Data Pipeline from Scratch
A basic data engineering project, end to end.

Introduction
Data Engineering has always been an area of interest to me, but I never really had the time to create a project, since I have to split my time among work, family, and everything else that needs my attention. So I gave myself a tough challenge: build a data pipeline from scratch in just a couple of days.
Wow, I mean, that sounds like a lot. But it also sounds doable.
With just an idea in mind and more experience navigating the Data Science space than the Data Engineering one, I knew this was going to be tough, but still: challenge accepted.
In this post, we will go over the following project (GitHub):
Create a data pipeline that:
(1) Gets finance datasets for telecommunication stocks, economic indicators, and a Dow Jones index for the Telecommunications (Telco) sector;
(2) Gives the data an initial treatment to validate it;
(3) Cleans and organizes the data;
(4) Makes it ready for consumption by analysts and clients in a PostgreSQL database; and
(5) Presents a Power BI report as an end result, with some insights.
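Before going further, here is a minimal sketch of what such a pipeline skeleton can look like in Python. The API endpoint, column names, table name, and connection string below are placeholders I am assuming for illustration; they are not the project's actual code.

```python
import requests
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical endpoint standing in for the real finance APIs used in the project
API_URL = "https://example.com/api/telco-stocks"


def fetch_raw() -> list[dict]:
    """Step 1: pull raw JSON records from the finance API."""
    response = requests.get(API_URL, timeout=30)
    response.raise_for_status()  # fail fast if the request did not succeed
    return response.json()


def validate(records: list[dict]) -> pd.DataFrame:
    """Step 2: initial treatment and basic validation of the raw records."""
    df = pd.DataFrame(records)
    assert not df.empty, "API returned no rows"
    return df.drop_duplicates()


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Step 3: clean and organize (assumed 'date' and 'close' columns)."""
    df = df.dropna(subset=["close"])
    df["date"] = pd.to_datetime(df["date"])
    return df.sort_values("date")


def load(df: pd.DataFrame) -> None:
    """Step 4: make the result available to analysts in PostgreSQL."""
    engine = create_engine("postgresql://user:password@localhost:5432/finance")
    df.to_sql("telco_stocks", engine, if_exists="replace", index=False)


if __name__ == "__main__":
    load(clean(validate(fetch_raw())))
```

Step (5), the Power BI report, is not part of the pipeline code itself; it simply connects to the PostgreSQL table the pipeline produces.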
Let’s dive in.
Architecture
This project was designed around the Databricks medallion lakehouse architecture, in which data is ingested into an initial raw layer and, as it goes through more and more refinement, moves into increasingly valuable layers (Raw > Bronze > Silver > Gold).
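To make that layer progression concrete, here is a minimal PySpark sketch of how data could move from Raw to Gold in a Databricks notebook. The storage paths, column names, and aggregation are assumptions for illustration, not the project's actual tables; on Databricks, the Delta format and the `spark` session come preconfigured.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in a Databricks notebook

# Raw: JSON files as they arrive from the APIs (path is an illustrative assumption)
raw_df = spark.read.json("/mnt/pipeline/raw/telco_stocks/")

# Bronze: persist the raw data as a Delta table, unchanged except for an ingestion timestamp
bronze_df = raw_df.withColumn("ingested_at", F.current_timestamp())
bronze_df.write.format("delta").mode("overwrite").save("/mnt/pipeline/bronze/telco_stocks/")

# Silver: enforce data types and drop records the downstream analysis cannot use
silver_df = (
    spark.read.format("delta").load("/mnt/pipeline/bronze/telco_stocks/")
    .withColumn("close", F.col("close").cast("double"))
    .withColumn("date", F.to_date("date"))
    .dropna(subset=["date", "close"])
)
silver_df.write.format("delta").mode("overwrite").save("/mnt/pipeline/silver/telco_stocks/")

# Gold: aggregate into a consumption-ready table for analysts (assumed 'symbol' column)
gold_df = silver_df.groupBy("symbol", "date").agg(F.avg("close").alias("avg_close"))
gold_df.write.format("delta").mode("overwrite").save("/mnt/pipeline/gold/telco_daily_close/")
```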
The project follows this architecture, shown in the picture below.

The architecture is basically:
Fetch data from APIs >> dump it into the cloud (a Databricks folder) >> start cleaning the data: transform the JSON into data frames, correct data types, drop unwanted observations…
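The JSON-to-DataFrame step deserves a closer look, since finance APIs usually return nested payloads. Below is a small pandas sketch of that transformation; the payload shape and field names are made up for illustration rather than taken from the actual APIs.

```python
import pandas as pd

# Made-up example of a nested payload, standing in for a real finance API response
payload = {
    "symbol": "VZ",
    "prices": [
        {"date": "2023-01-02", "close": "39.40", "volume": "18200000"},
        {"date": "2023-01-03", "close": None, "volume": "17100000"},  # unwanted row
    ],
}

# Flatten the nested list of prices into one row per observation
df = pd.json_normalize(payload, record_path="prices", meta=["symbol"])

# Correct data types: the API delivers every field as a string
df["date"] = pd.to_datetime(df["date"])
df["close"] = pd.to_numeric(df["close"], errors="coerce")
df["volume"] = pd.to_numeric(df["volume"], errors="coerce").astype("Int64")

# Drop observations that cannot be used downstream (e.g. missing closing price)
df = df.dropna(subset=["close"])
print(df.dtypes)
```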