Difference between Instruction Tuning vs Non Instruction Tuning Large Language Models

What is the difference between instruction tuning and normal fine-tuning for large language models?

Also the instruction-tuning I'm referring to isn't the in-context/prompt one.

All the recent papers about fine-tuning seem to be about instruction tuning.

I have looked at a couple of papers about fine-tuning/instruction tuning (e.g. FLAN) and none really describe the difference between instruction tuning and the alternatives (whatever the alternatives are).

I understand instruction-tuning is a form of fine-tuning but with an instruction dataset. But are all datasets not instruction datasets? What other kinds are there?

Piker answered 11/6, 2023 at 15:37 Comment(1)
If my answer solves your question, could you mark it as solution, please? And otherwise, could you please give feedback on what is missing?Sabaean

As you said, instruction tuning is a form of (supervised) fine-tuning, so the two are not mutually exclusive. There is no distinguishing feature of fine-tuning that differentiates it from instruction tuning, only the other way around. So the answer to your first question is "No" (I read it as "Is every dataset an instruction dataset?", not as "Do instruction datasets even exist?").

What is special about instruction tuning is that the model is fine-tuned for an instruction-following task: the input instructs the model to perform another task, i.e. you have a second "level" of tasks (e.g. "Split the following number into digits") that is defined only in the instructions, which are part of the model's input sequence.

In classical types of supervised fine-tuning, you have no instructions, but directly tune the model to perform a single downstream task, e.g. to split an input number into digits, without being explicitly told to do so in the model input. (However, there are also hybrid approaches that involve both fine-tuning and explicit instructions.)
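To make the contrast concrete, here is a minimal sketch of how the digit-splitting example from above might be represented in each setting (the field names and the helper function are hypothetical, not from any particular library):

```python
# Classical supervised fine-tuning: the task is fixed at training time,
# so each record contains only the problem instance and its answer.
finetune_example = {
    "input": "4721",
    "output": "4 7 2 1",
}

# Instruction tuning: the task is stated inside the input itself,
# so the same model can be steered to other tasks at inference time.
instruction_example = {
    "instruction": "Split the following number into digits.",
    "input": "4721",
    "output": "4 7 2 1",
}

def to_model_input(example: dict) -> str:
    """Concatenate the fields the model actually sees as its input sequence."""
    parts = [example.get("instruction", ""), example["input"]]
    return "\n".join(p for p in parts if p)

print(to_model_input(finetune_example))     # only the problem instance
print(to_model_input(instruction_example))  # instruction + problem instance
```

In the first case the model can only ever learn the one task it was tuned on; in the second, the instruction string is what tells the model which task to perform.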

So although the word "task" is often used to refer to either, it is essential to conceptually distinguish between:

  • the task the model is fine-tuned to perform (if any),
  • the task the end-user wants the model to perform,
  • the way inputs for either of these tasks are presented to the model, and
  • the corresponding datasets and statistical distributions.

In summary, one could say that in instruction following, the actual task is determined dynamically, at inference time, while in the classical fine-tuning approach without instructions or similar devices, the actual task is determined statically, at training time.

Your confusion might be connected to the fact that prompting, which is another widespread adaptation technique, can involve an abstract description of the task (e.g. in zero-shot prompting), which can be formulated as an instruction.

But again, this is not necessary: Few-shot prompting does not necessarily involve an abstract description of the task, but the prompt may consist only of input-output examples of the task, plus the input for which the model should predict the output.
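For example, a few-shot prompt for the digit-splitting task can be built purely from input-output pairs, with no instruction anywhere (a toy sketch; the `Input:`/`Output:` formatting is one common convention, not a fixed standard):

```python
# Few-shot prompt: only input-output examples of the task, followed by the
# query input for which the model should predict the output. No abstract
# task description or instruction appears anywhere in the prompt.
shots = [("4721", "4 7 2 1"), ("90", "9 0")]
query = "385"

prompt = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in shots)
prompt += f"\nInput: {query}\nOutput:"
print(prompt)
```

The model must infer the task from the examples alone, which is exactly what distinguishes this style of prompting from an instruction.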

To answer your second question: You can find many datasets/benchmarks on the Hugging Face Hub. If you click through a few of them, you will see in the preview that most of them don't contain any instructions.

EDIT: I forgot to mention one important aspect of instruction tuning: Depending on the application or research question, it often is a goal of instruction tuning to generalize instruction following across tasks. That is, the model should learn to follow instructions based on the implicit knowledge it accumulated during pre-training, and not only based on the instructions it saw during instruction tuning. To measure this cross-task generalization capability, instruction datasets are often divided into multiple tasks. Some of these tasks (not only some split of each task) are held out during instruction tuning and they are used during evaluation only.
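The held-out-task evaluation described above can be sketched as follows (a toy illustration with hypothetical records and a hypothetical helper; real instruction-tuning pipelines differ in detail):

```python
import random

# Each record is tagged with the task it belongs to. To measure cross-task
# generalization, whole tasks (not just a split of each task's examples)
# are held out from instruction tuning and used only for evaluation.
records = [
    {"task": "digit_splitting", "instruction": "Split the number into digits.",
     "input": "47", "output": "4 7"},
    {"task": "reversal", "instruction": "Reverse the string.",
     "input": "abc", "output": "cba"},
    {"task": "uppercasing", "instruction": "Uppercase the string.",
     "input": "hi", "output": "HI"},
    {"task": "summation", "instruction": "Add the two numbers.",
     "input": "2 3", "output": "5"},
]

def split_by_task(records, held_out_fraction=0.25, seed=0):
    """Hold out entire tasks, so evaluation tasks are unseen during tuning."""
    tasks = sorted({r["task"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(tasks)
    n_held_out = max(1, int(len(tasks) * held_out_fraction))
    held_out = set(tasks[:n_held_out])
    train = [r for r in records if r["task"] not in held_out]
    evaluation = [r for r in records if r["task"] in held_out]
    return train, evaluation

train, evaluation = split_by_task(records)
```

Note that this differs from the usual random train/test split, which would leak every task's instructions into training.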

Sabaean answered 12/6, 2023 at 9:40 Comment(3)
So is it fair to say that instruction datasets have the task supplied as a string, and then input? So it's kind of like a multitask training dataset?Lightless
@Lightless It's not the same, but it is connected. It is possible to have an instruction dataset where the formulation of the instructions varies but the end-user task is always the same, because there are many ways to formulate an instruction to solve a given problem.Sabaean
Got it, that makes sense. So really, it's just a sequence-to-sequence dataset, where the instruction or task must be picked up from the input or associated task string.Lightless

I think this description from this blog entry may help you:

The main difference between instruction tuning and standard supervised fine-tuning lies in the data that the model is trained on. Whereas supervised fine-tuning trains models on input examples and their corresponding outputs, instruction tuning augments input-output examples with instructions, which enables instruction-tuned models to generalize more easily to new tasks.

And this illustrative comparison can also be very helpful:

[Figure: comparison of (A) pretrain-finetune, (B) prompting, and (C) instruction tuning. Source: Finetuned Language Models are Zero-Shot Learners]

Normal fine-tuning is part (A) above, and instruction tuning is part (C).

Gantrisin answered 5/10, 2023 at 4:13 Comment(0)

@Bernards' answer is right. This is a point of confusion for many people new to LLMs.

I understand instruction-tuning is a form of fine-tuning but with an instruction dataset. But are all datasets not instruction datasets? What other kinds are there?

An LLM is trained on a sequence of words where the next word is the implicit label/ground truth. This is unsupervised (more precisely, self-supervised) learning, and it is the bedrock of the magic of all LLMs.

Example of training data of an LLM

"To boil an egg you need to first take a vessel and fill it with water. Then place an egg in the vessel and heat the vessel till the water boils."

Training with this type of data is called unsupervised training. The next token is the implicit label: after "To boil an", if the LLM predicts the next token as "apple", its loss is high, because the ground truth present in the training data is "egg".

Almost everyone knows this now. The power of LLMs is due to the vast amount of such data available on the web: a vast amount of data that is automatically labelled.
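The "implicit label" idea can be shown in a few lines: every prefix of the token sequence is an input, and the token that follows it is the ground truth (a toy sketch using whole words as tokens; real tokenizers work on subwords):

```python
# Next-token prediction: each prefix of the sequence is a training input
# and the token that follows it is the implicit label - no manual
# annotation is needed.
tokens = ["To", "boil", "an", "egg", "you", "need", "a", "vessel"]

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# After the context ["To", "boil", "an"], the ground-truth label is "egg";
# predicting "apple" there would incur a high loss.
print(pairs[2])
```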

Now coming back to the question: taking this data set, the instruction-tuned data set will be something like this -

"How to boil an egg? To boil an egg you need to first take a vessel and fill it with water. Then place an egg in the vessel and heat the vessel till the water boils."

The internal training within the LLM is the same: it predicts the next token, compares the generated token with the expected token to calculate the loss, and backpropagates that loss.
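That loss is just the negative log-probability the model assigns to the expected token. A toy sketch of the idea (a hypothetical three-word vocabulary and hand-written probabilities standing in for a real model's output):

```python
import math

# Toy next-token loss: cross-entropy between the model's predicted
# distribution over the vocabulary and the expected (ground-truth) token.
vocab = ["egg", "apple", "water"]

def next_token_loss(predicted_probs, expected_token):
    """Negative log-probability assigned to the ground-truth token."""
    return -math.log(predicted_probs[vocab.index(expected_token)])

# A model that puts high probability on "egg" after "To boil an" gets a
# low loss; a model that favours "apple" gets a high loss.
good = next_token_loss([0.9, 0.05, 0.05], "egg")
bad = next_token_loss([0.05, 0.9, 0.05], "egg")
```

Whether the training sequence is the raw passage or the instruction-framed version, this same objective is what gets backpropagated.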

The subtle difference is the framing of the training data set. When a user asks a similar question, the output aligns more closely with the desired response.

So on top of an already trained foundation model like Llama 2 or Mistral, you can do instruction tuning with these types of instruction data sets and quickly align the model to a specific domain.

Here is a Colab notebook where this is illustrated

  1. Unsupervised training and evaluation using a small medical dataset - Colab notebook
  2. Training and evaluation with an instruction dataset generated from the above small medical data set - Colab notebook. You can see that the output of the instruction-tuned model is more aligned to the specific domain.

Note - You can use an LLM itself to create the Instruction training data by prompting it and feeding it chunks of the original data. Colab notebook

More details here in my medium post - https://alexcpn.medium.com/exploring-large-language-models-8fed99a5a139

Dorettadorette answered 27/3 at 11:4 Comment(0)

This is my understanding:

In zero-shot or few-shot prompting, the task is given as text inside the prompt context window, whereas in instruction fine-tuning a dataset is used that contains the instruction, the dialog, and the output (summary, sentiment, etc.), meaning it does not rely on the prompt context window.

Marleenmarlen answered 13/2 at 15:25 Comment(1)
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Frisby

© 2022 - 2024 — McMap. All rights reserved.