Foldercase Blog – Machine learning: an introduction

Nowadays, there is so much information on machine learning available online that it is hard to know where to start. And the field is advancing so quickly that it is difficult to catch up.

Since you’ve come here, chances are that you’re looking for an introduction to machine learning that is easy to understand and provides a solid basis for further study — without requiring you to read an entire book.

You’ve come to the right place.

To give you the most important information in as few words as possible, let's get started straight away.

What is machine learning?

Machine learning aims to learn dependencies from data. There are two primary types of machine learning:

Supervised machine learning. You build a model that uses input data to predict a target variable (e.g. diagnosis, age, symptom severity).
Unsupervised machine learning. You build a model that identifies structure within the input data without predicting a predefined outcome.

Most real-world applications belong to supervised machine learning, which is why this post focuses on this domain.

How do you build a supervised machine learning model?

One of the simplest examples is linear regression, which predicts a continuous outcome as a weighted sum of the input features plus an intercept.

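The model parameters (the weights and the intercept) are estimated by minimizing the prediction error on the training data, typically the sum of squared differences between predicted and observed outcomes. This principle generalizes to most supervised machine learning models.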
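As a minimal sketch of estimating parameters by minimizing prediction error, the following Python code fits a linear regression with a single input feature. It assumes squared error as the error measure and uses made-up toy data; the function names are chosen for illustration only and do not come from any particular library.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) that minimize the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for one input feature.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x):
    """Predict the outcome for a single input value."""
    return intercept + slope * x


# Toy training data that follows y = 2x + 1 exactly (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_simple_linear_regression(xs, ys)
print(intercept, slope)
print(predict(intercept, slope, 5.0))
```

With several input features the same idea applies, but the solution is usually computed with a linear algebra library rather than by hand.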

Some examples of supervised machine learning models

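Linear discriminant analysis. Learns a linear boundary between groups by maximizing the separation between the group means relative to the variance within each group.
Regression and classification trees. Recursively partition the data with a sequence of feature-based splits and can capture non-linear relationships.
Support vector machines. Place the decision boundary so that its margin to the closest training observations (the support vectors) is as large as possible. With kernel functions, they can also model non-linear patterns.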
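To make the partitioning idea behind classification trees more concrete, here is a minimal sketch of a single split (often called a decision stump) on one feature. A full tree would apply such splits repeatedly to the resulting branches. The function names, the use of the misclassification count as the split criterion, and the toy data are assumptions for illustration; real tree implementations typically use criteria such as Gini impurity.

```python
def majority_class(labels):
    """Return the more frequent of the classes 0 and 1 (ties go to 1)."""
    return 1 if sum(labels) * 2 >= len(labels) else 0


def best_split(xs, labels):
    """Find the threshold on one feature that misclassifies the fewest points.

    Returns a tuple (errors, threshold, left_class, right_class).
    """
    values = sorted(set(xs))
    best = None
    # Candidate thresholds are midpoints between neighbouring feature values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [l for x, l in zip(xs, labels) if x <= threshold]
        right = [l for x, l in zip(xs, labels) if x > threshold]
        left_class = majority_class(left)
        right_class = majority_class(right)
        errors = sum(l != left_class for l in left)
        errors += sum(l != right_class for l in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_class, right_class)
    return best


# Toy data: small feature values belong to class 0, large ones to class 1.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
labels = [0, 0, 0, 1, 1, 1]

errors, threshold, left_class, right_class = best_split(xs, labels)
print(errors, threshold, left_class, right_class)
```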

How good is a machine learning model?

Assessing model performance is essential to determine whether predictions will generalize to unseen data.

Error rate: Fraction of incorrectly classified observations.
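Sensitivity and specificity: Fraction of actual positives and fraction of actual negatives, respectively, that are classified correctly. More informative than the error rate for imbalanced datasets.
ROC-AUC: Area under the ROC curve. Summarizes performance across all classification thresholds.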
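The measures above can be sketched in a few lines of Python. This is an illustrative implementation for binary labels coded as 0 and 1, with made-up data and a threshold of 0.5 chosen as an assumption. The ROC-AUC is computed here through its rank interpretation: the probability that a randomly chosen positive observation receives a higher score than a randomly chosen negative one, with ties counting as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def error_rate(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (fp + fn) / len(y_true)


def sensitivity(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def specificity(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)


def roc_auc(y_true, scores):
    """Fraction of positive/negative pairs that the scores rank correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 positives, 5 negatives, scores from a hypothetical model.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(error_rate(y_true, y_pred))
print(sensitivity(y_true, y_pred), specificity(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Note that the error rate, sensitivity, and specificity depend on the chosen threshold, whereas the ROC-AUC is computed from the scores directly.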

Importantly, the data used to train a model must never be used to assess its performance, because this yields overly optimistic estimates. Independent validation or cross-validation is required.

Why do you need cross-validation?

Cross-validation evaluates model performance on data not seen during training. It repeatedly splits the data into training and test folds, so that each observation is used for testing exactly once.

This provides a more honest estimate of how a model will perform in real-world scenarios.
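The splitting procedure can be sketched as follows. This is a minimal illustration of k-fold cross-validation, assuming a one-feature linear regression as the model and mean squared error as the performance measure. All function names and the toy data are hypothetical; in practice you would usually rely on an established library instead.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and distribute them over k test folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_mse(xs, ys, k=5, seed=0):
    """Average test-fold mean squared error over k train/test splits."""
    folds = k_fold_indices(len(xs), k, seed)
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the training folds only.
        intercept, slope = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        # Evaluate on the held-out fold only.
        squared = [(ys[i] - (intercept + slope * xs[i])) ** 2
                   for i in test_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


# Toy data: a line with a small alternating disturbance added.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.5 if i % 2 == 0 else -0.5)
      for i, x in enumerate(xs)]

folds = k_fold_indices(len(xs), k=5)
cv_mse = cross_validated_mse(xs, ys, k=5)
print(cv_mse)
```

Because the model is refit from scratch for every fold, no observation ever contributes to both the fitting and the evaluation of the same model.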

How do you build good machine learning models?

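Avoid confounded data, i.e. data in which a variable unrelated to the question of interest is associated with both the inputs and the outcome.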
Use training data close to the intended application scenario.
Prefer simpler models when possible.
Use feature selection for high-dimensional data.
Validate models on independent datasets.
Check predictions for residual confounding.

Why can machine learning be challenging?

Low signal-to-noise ratio in input data.
Training data not representative of real-world use.
Noisy outcome variables.
Data access and privacy constraints.

How do you get started with machine learning?

Hands-on practice is essential. R and Python are widely used languages with extensive machine learning libraries and strong community support.

By gaining experience with real data and proper validation strategies, you can build a solid foundation for applying machine learning responsibly.