Glossary
This page is a reference for common terms surrounding LLMs.
| Term | Definition |
|------|------------|
| Bias | A systematic error in an LLM's output. Bias is introduced into the LLM during the training phase. |
| Dataframe | A data structure organized into rows and columns that can contain many different types of data (numbers and characters). |
| Data type | The kind of data in your dataset. For example, numbers are numeric data and letters are character data. |
| Data structures | The shapes/formats a dataset is stored as, such as a dataframe, a matrix, a list, or a vector. Statistical analyses and graph generation often require data in a specific format. |
| Dummy data | A fake dataset that has the same data types and structure as your real data. Dummy datasets are used to verify that code works while avoiding the data privacy and ownership concerns of using real data. |
| Hallucinations | Factually incorrect or fabricated answers produced by an LLM, often stated confidently. |
| Large Language Model (LLM) | A large computer model that outputs text (language) based on patterns learned from its training data. |
| Prompt | The text input you give an LLM. |
| Prompt engineering | The practice of designing, testing, and refining prompts to improve an LLM's output. This is important for LLM testing; a person who does this is a prompt engineer. |
| Training data | The data used to build an LLM. For models like ChatGPT, much of this is text gathered by crawling the web. |
| Testing data | Data held out from training and used to evaluate the accuracy of an LLM. |
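To make the "dummy data" idea above concrete, here is a minimal sketch in Python using pandas. The column names and values are entirely hypothetical; the point is only that the fake dataset mirrors the data types and structure of a real (private) dataset without containing any real records.

```python
import numpy as np
import pandas as pd

# Hypothetical real dataset: 10 rows with a numeric ID, a numeric age,
# and a character/categorical diagnosis column.
rng = np.random.default_rng(seed=0)

dummy = pd.DataFrame({
    "patient_id": np.arange(1, 11),                    # numeric data
    "age": rng.integers(18, 90, size=10),              # numeric data
    "diagnosis": rng.choice(["A", "B", "C"], size=10), # character data
})

# The dummy dataframe has the same shape and data types as the real data,
# so code written and tested against it should also run on the real data.
print(dummy.shape)   # (10, 3)
print(dummy.dtypes)
```

Because the structure matches, you can safely share this dummy dataframe with an LLM (for example, when asking it to debug your analysis code) without exposing the real data.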
There is also a list of terms here; that page goes into a bit more detail for each term than this workshop's glossary.