Introduction to Machine Learning - Classification and Clustering Models

https://ubc-library-rc.github.io/ml-classification-clustering

Land Acknowledgement

UBC Vancouver is located on the traditional, ancestral, and unceded territory of the xʷməθkʷəy̓əm (Musqueam) peoples.

Use the Zoom toolbar to engage

Participants window

CCDHHN Certificate

This workshop contributes towards the Canadian Certificate in Digital Humanities.

Learning Objectives

Understand classification and clustering models and, their applications in machine learning
Train classification and clustering models for real-world datasets
Interpret and analyze classification and clustering model results.

Pre-workshop setup

GOOGLE COLAB (https://colab.research.google.com/)

What is Machine Learning?

“Field of study that gives computers the capability to learn without being explicitly programmed" - Arthur Samuel

Exercise 1: Replace the question mark

X	Y
0.1	A
0.4	A
4.3	B
4.2	B
3.2	?

Exercise 2: Group the points in two sets

X
11
10
21
22
9

Dataset Example

From colemanm.org

Building a Machine Learning Model

- **Data Collection:**: gathering and preparing data, he quality of the data used in the model directly impacts its accuracy and effectiveness - **Data Preprocessing:** cleaned, transformed, and formatted into a suitable format - **Model Training:** Once the data is ready, it is used to train a machine learning model. - **Model Evaluation:** After the model has been trained, it needs to be evaluated to ensure that it is performing accurately. This is done using a test set of data that was not used during the training process. - **Model Deployment:** Once the model has been trained and evaluated, it can be deployed for use in a production environment. This involves integrating the model into a larger system and ensuring that it can handle real-world data. This step involves choosing a suitable employment platform. - **Model Improvement:** Machine learning is an iterative process, so the model may need to be improved over time. This can involve retraining or fine-tuning its parametersthe with new data or continues stream of data.

Types of Machine Learning

From javatpoint

Some other types:

Reinforcement Learning
Transfer Learning

There are different types of machine learning, including supervised learning, unsupervised learning, and reinforcement learning, each with its own set of algorithms and techniques. * Supervised Learning: an algorithm is trained on a labeled dataset, meaning that the dataset has input features (X) and corresponding output labels (Y). The goal of supervised learning is to learn a function that maps the input features to the output labels. Once the model is trained, it can be used to make predictions on new data. Examples of supervised learning tasks include image classification, speech recognition, and regression analysis. * Unsupervised Learning: Unsupervised learning is a type of machine learning where the algorithm is trained on an unlabeled dataset, meaning that there are no output labels (Y) associated with the input features (X). The goal of unsupervised learning is to learn patterns and structure in the data without the help of a labeled dataset. Examples of unsupervised learning tasks include clustering, anomaly detection, and dimensionality reduction. * Reinforcement Learning: Reinforcement learning is a type of machine learning that involves an agent interacting with an environment to learn how to make decisions that maximize a reward. The agent receives feedback from the environment in the form of rewards or penalties, and its goal is to learn a policy that maximizes the expected long-term reward. Reinforcement learning is often used in robotics, game playing, and control systems. * Transfer Learning: Transfer learning refers to a technique in which a pre-trained model, typically trained on a large dataset, is used as a starting point for training a new model on a smaller dataset or a different but related task. The idea is that the knowledge learned from the pre-trained model can be transferred to the new task, allowing the new model to start with some level of knowledge or "transfer" from the previous task, which can potentially improve its performance and reduce the amount of training data required.

Cake Analogy

From medium.com

Methods and Algorithms

Data Preparations

Types of Features (continuous/categorical)
Handling missing values
Feature scaling
Feature selection

Model Evaluation

From geeksforgeeks

Algorithms and Methods

From mathworks

Algorithms and Methods

From mathworks

Python Libraries

Classification and Clustering

From miro account on Medium

Evaluation - Classification

From Anuganti Suresh on Medium

Density-based Clustering

From ResearchGate

Hierarchical Clustering

From analyticsvidhya.com

A 2003 research team used hierarchical clustering to “support the idea that many…breast tumor subtypes represent biologically distinct disease entities.” To the human eye, the original data looked like noise, but the algorithm was able to find patterns.

Limits of Machine Learning

Garbage In = Garbage Out
Data Limitation
Generalization and overfitting
Inability to explain answers
Ethics and Bias Limitations
Computational Limitations

Some of the differentiation factors of machine learning models also count as their limits: * Garbage In = Garbage Out In machine learning, the quality of the output model is directly dependent on the quality of the input data used to train it. If the input data is incomplete, noisy, or biased, the resulting model may be inaccurate or unreliable. For example, suppose a machine learning model is being developed to predict which loan applications are likely to be approved by a bank. If the training dataset only contains loan applications from a particular demographic group or geographic region, the resulting model may be biased towards that group or region and may not generalize well to other groups or regions. This could lead to discrimination and unfair lending practices. * Data Limitation Machine learning algorithms are only as good as the data they are trained on. If the data is biased, incomplete, or noisy, the algorithm may not be able to learn the underlying patterns or may learn incorrect patterns. Also, machine learning models require large amounts of labeled data for training, which can be expensive and time-consuming to obtain. * Generalization and overfitting Machine learning models are typically trained on a specific dataset, and their ability to generalize to new data outside of that dataset may be limited. Overfitting can occur if the model is too complex or if it is trained on a small dataset, causing it to perform well on the training data but poorly on new data. When a model is overfitting, it is essentially memorizing the training data rather than learning the underlying patterns in the data. * Inability to explain answers Machine learning models can be complex and difficult to interpret, making it challenging to understand why they make certain predictions or decisions. This can be a problem in domains such as healthcare or finance where it is important to be able to understand the rationale behind a decision. * Ethics and Bias Limitations Machine learning algorithms can amplify existing biases in the data they are trained on, leading to unfair or discriminatory outcomes. There is a risk of unintended consequences when using machine learning algorithms in sensitive areas such as criminal justice, hiring decisions, and loan applications. One example of bias in machine learning is in facial recognition technology. Studies have shown that facial recognition systems are less accurate in identifying people with darker skin tones and women. This bias can lead to misidentification, which can have serious consequences, such as wrongful arrest or discrimination in hiring. In the context of healthcare, machine learning algorithms can also perpetuate bias and discrimination. For example, if the algorithm is trained on biased data, it may make less accurate predictions for certain demographic groups, such as racial minorities or people with disabilities. * Computational Limitations Machine learning algorithms can be computationally expensive and require a lot of computing power to train and run. This can be a barrier to adoption in applications where real-time or low-power processing is required. One example of computational limitations in machine learning is training deep neural networks. Deep neural networks are a type of machine learning algorithm that can learn complex patterns in data by using many layers of interconnected nodes. However, training these models can be computationally expensive, requiring significant computing power and memory. For example, training a state-of-the-art natural language processing model like BERT (Bidirectional Encoder Representations from Transformers) can take weeks or even months on a large cluster of GPUs. This limits the ability of smaller organizations or individuals with limited computing resources to develop or use these models.

Open Jupyter Notebooks

Ethics

Image from: Lepri, Bruno, Nuria Oliver, and Alex Pentland. "Ethical machines: The human-centric use of artificial intelligence." IScience 24.3 (2021): 102249.

Where to go from here?

Learn math, programming, and software development: Digital Scholarship workshops at UBC Library
Online courses such as Supervised Machine Learning course on Coursera
Youtube Tutorials such as Python Machine Learning Tutorial Series
Find similar examples in Scikit-learn documentation
Join communities such as Kaggle or AMS Data Science Club

Future workshops

Title	Series 1
Regression models	Tue, Mar 19, 2024 (1:00pm to 3:00pm) - Past
Classification and clustering models	Tue, Mar 26, 2024 (1:00pm to 3:00pm) - Today
Neural networks	Tue, Apr 2, 2024 (1:00pm to 3:00pm)
LLMs	Tue, Apr 9, 2024 (1:00pm to 3:00pm)

More from the Research Commons at (UBC-V)

And from the Center for Scholarly Communication (UBC-O)