Glossary

This page is a quick reference for terms used in these workshops.

Term	Definition
Large Language Model (LLM)	A large machine-learning model that generates text (language) based on patterns learned from its training data.
Token	The small units of text an LLM reads and writes. Longer chats use more tokens; models also have a context limit.
Context window	The amount of text (tokens) an LLM can consider at once, including your current prompt and previous messages.
Training data	Data used to train the LLM. For models such as ChatGPT, much of this is text collected from the web.
Testing data	Data held back from training and used to test the accuracy of the LLM.
Bias	A systematic error in the LLM's output. It is typically introduced during the training phase, for example through unbalanced training data.
Hallucinations	Answers from the LLM that sound plausible but are factually incorrect or made up.
Prompt	Text input you give the LLM.
Prompt engineering	The practice of designing and refining prompts so the model returns useful, accurate results.
Integrated development environment (IDE)	Software for writing and running code in one place (e.g. Visual Studio Code with GitHub Copilot).
Jupyter notebook (`.ipynb`)	An interactive document that combines code, output, plots, and notes in one file.
Kernel	The Python runtime used by a notebook to execute code cells. Choosing the right kernel ensures your packages are available.
Virtual environment (venv)	An isolated Python environment for a project so dependencies do not conflict with other projects.
Comma-separated values (CSV)	A plain-text table format (often `.csv`) in which each line is a row and values are separated by commas. The Palmer Penguins dataset in this series is stored as a CSV file.
Dataframe	A table-like data structure made of rows and columns, where each column can hold a different type of data (for example, numbers or text).
Data type	The kind of data stored in a variable or column. For example, numbers are numeric data and text is character (string) data.
Data structures	The shapes or formats in which a dataset is stored, such as a dataframe, a matrix, a list, or a vector. Statistical analyses and plotting functions often require data in a specific structure.
Dummy data	A fake dataset that has the same data types and structure as your real data. It is used to check that code works while avoiding the privacy and ownership concerns of using real data.
Exploratory Data Analysis (EDA)	Early-stage analysis used to understand data shape, quality, patterns, and possible relationships before formal modeling.
Drop missing values (`dropna`)	A pandas method that removes rows (or columns) containing missing values.
Summary statistics	Numeric summaries such as count, mean, min, and max used to quickly describe a dataset.
Histogram	A chart showing the distribution of a numeric variable by binning values into ranges.
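
Several of the terms above (CSV, dataframe, data type, dummy data, `dropna`, summary statistics, histogram) can be seen together in one short pandas sketch. This is only an illustration, not code from the workshops. The values are made up, the column names are assumed to resemble those in the Palmer Penguins dataset, and the bin edges are arbitrary.

```python
import io

import pandas as pd

# Dummy data: made-up values, not real Palmer Penguins measurements.
# Empty fields between commas are missing values.
csv_text = """species,bill_length_mm,body_mass_g
Adelie,39.1,3750
Adelie,,3800
Gentoo,46.5,5000
Gentoo,50.0,
Chinstrap,48.7,3700
"""

# Read the CSV text into a dataframe (rows and columns).
df = pd.read_csv(io.StringIO(csv_text))

# Data types: pandas infers one type per column.
print(df.dtypes)

# Drop missing values: keep only rows that have no missing entries.
clean = df.dropna()

# Summary statistics for one numeric column.
summary = clean["body_mass_g"].describe()
print(summary[["count", "mean", "min", "max"]])

# Histogram counts: bin body mass into ranges and count rows per bin.
# The bin edges are arbitrary choices for this example.
bins = [3000, 4000, 5000, 6000]
counts = pd.cut(clean["body_mass_g"], bins=bins).value_counts(sort=False)
print(counts)
```

Printing the bin counts gives the same information a histogram would draw as bars. In a notebook, a plotting library would normally be used for the chart itself.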

There is also a list of terms here. That page goes into a bit more detail on each term than this glossary does.