Survival Analysis - Part 1

Suppose that you have conducted a five-year medical study in which patients have been treated for cancer. We would like to fit a model to predict a patient’s survival time, using features such as baseline health measurements, type of treatment, and tumour dimensions. At first glance, this may sound like an ordinary regression problem, but there is an important complication: what if the study ends while some patients are still alive, so we don’t know their true survival time? Such patients’ survival times are said to be “censored”. We do not want to discard this subset of surviving patients, because the fact that they survived for at least five years is valuable information. However, it is not obvious how to use this information with traditional, well-known statistical techniques.
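
To make the idea of censoring concrete, here is a minimal sketch of the Kaplan-Meier estimator, one standard way to use censored observations instead of discarding them. The estimator is not named in the excerpt above, and the function name, variable names and toy data are assumptions made up for illustration.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: follow-up time for each patient (hypothetical units: years)
    observed:  True if the event (death) was seen, False if censored
    """
    deaths = Counter(t for t, e in zip(durations, observed) if e)
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients who survive time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        # Everyone whose follow-up ends at t (died or censored) leaves the risk set.
        at_risk -= sum(1 for x in durations if x == t)
    return curve


# Toy data (invented): four patients are still alive when the study ends at year 5.
durations = [1, 2, 2, 3, 5, 5, 5, 5]
observed = [True, True, False, True, False, False, False, False]
curve = kaplan_meier(durations, observed)
for time, prob in curve:
    print(f"t={time}: S(t)={prob:.3f}")
```

Censored patients never count as deaths, but they stay in the risk set until their follow-up ends, which is how their partial information enters the estimate.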

Feature stores: the full story

The whole story started when Uber introduced the feature store concept in 2017 with the launch of its Michelangelo platform. A feature store is essentially a purpose-built data platform for machine learning features.

Multicollinearity in linear regression

A key goal of fitting a linear regression model is to isolate the relationship between each independent variable (predictor) and the dependent variable (outcome), but multicollinearity can be a serious obstacle to reaching that goal.
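
One common way to quantify multicollinearity is the variance inflation factor (VIF). The sketch below is illustrative only and is restricted to the two-predictor case, where the VIF reduces to 1 / (1 - r²), with r the Pearson correlation between the predictors. The function names and toy data are assumptions, not taken from the article.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def vif_two_predictors(x1, x2):
    """VIF of either predictor in a model that contains exactly x1 and x2."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Toy data (invented): x2 is almost exactly 2 * x1, x3 is uncorrelated with x1.
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]
x3 = [1, -2, 2, -2, 1]

vif_collinear = vif_two_predictors(x1, x2)
vif_independent = vif_two_predictors(x1, x3)
print(f"VIF with near-duplicate predictor: {vif_collinear:.1f}")
print(f"VIF with uncorrelated predictor:   {vif_independent:.1f}")
```

A VIF close to 1 means the predictor shares little information with the other one. Large values (a frequently quoted rule of thumb is above 5 or 10) suggest that the individual coefficient estimates will be unstable.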

ML Models Deployment Strategies

After training and testing a learning algorithm, the best way to deploy it is not to simply place it in a production environment, turn it on, and hope that nothing will go wrong, because in reality a lot of things will go wrong!

Data Drifts and Concept drifts

No model lasts forever. Even if the data quality is fine, the model's performance may start degrading in production, indicating that something went wrong. Throughout this article, I am going to describe two usual suspects: data drift and concept drift.

Linear Regression Assumptions

Linear regression is one of the most important models in machine learning. It is also a very useful statistical method for understanding the relationship between two variables (X and Y).