Learn how to use the PCA algorithm to find variables that vary together.

Principal Component Analysis

Principal Component Analysis, or PCA for short, is a mathematical transformation based on covariance calculations.

Many beginner data scientists first meet the algorithm as a tool for dimensionality reduction, meaning that when we have a wide dataset with many variables, we can use PCA to…

Learn how to use the .factorize() method from Pandas.

This post will only take a minute of your time, but I believe it can save you a few minutes of data transformation.


.factorize() is a Pandas method that helps you to quickly transform your data from text to numbers.

Encode the object as an enumerated type or categorical variable. …
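As a minimal sketch of the text-to-numbers transformation described above, the snippet below calls `pd.factorize()` on a small Series. The `colors` data is a made-up example, not taken from the original post.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
colors = pd.Series(["red", "blue", "red", "green", "blue"])

# factorize() returns integer codes plus the unique values,
# numbered in order of first appearance
codes, uniques = pd.factorize(colors)

print(codes.tolist())    # [0, 1, 0, 2, 1]
print(uniques.tolist())  # ['red', 'blue', 'green']
```

The codes can replace the text column, while `uniques` keeps the mapping back to the original labels.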

Use Google’s API to translate texts directly from your Python script.

Google Translate

Google Translate is possibly the most popular tool for translating texts. …

Understanding the effect of the hyperparameters in a Random Forest ML model

About Random Forest

The Decision Tree is a widely used algorithm for solving problems. It tries to simulate the human thinking process by turning each step of the decision into a binary question. So, at each step, the algorithm chooses between True and False to move forward.
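To make the True/False idea concrete, here is a hand-written toy tree. The function name, the two questions, and the outcome are all hypothetical; a real Decision Tree learns its questions from data instead of having them hard-coded.

```python
def should_play_outside(is_raining: bool, is_windy: bool) -> bool:
    """Toy decision tree: every node asks one True/False question."""
    if is_raining:      # first split
        return False
    if is_windy:        # second split, reached only when it is not raining
        return False
    return True         # leaf: not raining and not windy


print(should_play_outside(is_raining=False, is_windy=False))  # True
print(should_play_outside(is_raining=True, is_windy=False))   # False
```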

That algorithm is simple, yet very powerful, thus widely applied in machine…

A practical introduction to statistical tests in Python.

Statistical Tests

During an exploratory data analysis (the famous EDA), some questions about differences between groups of data can pop up fairly easily. Similarly, if you’re sharing the initial insights with a client or executive, those same questions can come up:

Understand how this type of collection works.

The Python deque is an uncommon type of collection. What we hear about most from Python users are lists, dictionaries, and tuples.

But the deque (pronounced “deck”) is also an interesting type. What makes it different from other objects is…
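The preview is cut off here, so the snippet below is not necessarily the point the author goes on to make. It is a minimal sketch of two well-known properties of `collections.deque`: adding and removing items at both ends, and an optional `maxlen` that discards the oldest items.

```python
from collections import deque

queue = deque([1, 2, 3])
queue.append(4)           # add on the right
queue.appendleft(0)       # add on the left
last = queue.pop()        # remove from the right
first = queue.popleft()   # remove from the left
print(first, last, list(queue))  # 0 4 [1, 2, 3]

# With maxlen set, appending to a full deque drops items from the other end
recent = deque(maxlen=3)
for n in range(5):
    recent.append(n)
print(list(recent))  # [2, 3, 4]
```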

Using statistics in real-life problems. A use case for confidence intervals.

Recently I’ve been brushing up on my statistics skills, as a good part of Data Science and ML flows through those concepts.

The study subject this time was confidence intervals.

A confidence interval is a range of numbers that contains…
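The definition above is truncated, so the following is only a rough sketch of one common case: a 95% confidence interval for a mean, using the normal approximation from the standard library. The `sample` values are invented, and for a sample this small a t-distribution would usually be the better choice.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical measurements, used only for illustration
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
confidence = 0.95

# Two-sided critical value of the standard normal (about 1.96 for 95%)
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

m = mean(sample)
margin = z * stdev(sample) / sqrt(len(sample))  # z times the standard error
lower, upper = m - margin, m + margin

print(f"{confidence:.0%} interval for the mean: {lower:.2f} to {upper:.2f}")
```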

A guide to the essential basics of working with dates and times in Python.

Working with datetime objects in Python can be complicated. Believe me, I’ve been through it.

I could point to at least a dozen times when I spent several minutes looking for the code snippets…

Learn the essential code snippets to deal with datetime in Python.

Working with datetime objects in Python can be tricky. Believe me, been there, done that.

I could point to at least a dozen times I spent many minutes searching for the right code snippets to transform or format dates in…
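As an illustration of the kind of snippet the teaser refers to (not the author's own code), this sketch parses a date string, reformats it, and shifts it by a week. The `raw` value and the chosen formats are assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical input string
raw = "2021-03-15 14:30:00"

# Text -> datetime: the format codes must match the input layout
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# datetime -> text in a different layout
formatted = parsed.strftime("%d/%m/%Y")

# Date arithmetic with timedelta
next_week = parsed + timedelta(days=7)

print(formatted)                      # 15/03/2021
print(next_week.date().isoformat())   # 2021-03-22
```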

An end-to-end Data Science project: a regression model to predict car prices.

The Project

  • Programming language: Python
  • Algorithm: Supervised learning Random Forest Regression
  • Goal: My goal for this project was to create a model to estimate car prices in Brazil. That model was deployed to a web app created with Streamlit…

Gustavo Santos

Data Scientist. I extract insights from data to help people and companies make better, data-driven decisions.
