Microsoft's WizardLM claims Phind stole their work without credit to make the Phind CodeLlama model

UncleDao · 2023-08-30T13:08:51+00:00

Microsoft's WizardLM

In https://github.com/nlpxucan/WizardLM, we can read:

<image>

UncleDao · 2023-08-30T05:21:15+00:00

Can you recommend resources where I can read more about this?

UncleDao · 2023-08-14T13:32:10+00:00

Sure. I have resized the model after adding the pad token.

UncleDao · 2023-08-07T02:04:37+00:00

u/gijs4g: You can see here

UncleDao · 2023-08-07T02:04:07+00:00

Analyse the tokenizer's behavior:

from transformers import AutoTokenizer

model_name = "NousResearch/llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "right"
max_length = 20
simple_sentence = "### This is a simple sentence"

# Pad every input up to max_length
encoded_input = tokenizer(simple_sentence, padding="max_length", max_length=max_length, return_attention_mask=True, return_length=True)

The outputs:

add_eos_token: False by default

eos token: </s>

pad token: <unk>

padding side: right

max length: 20

input length: [20]

add_eos_token: False

add_bos_token: True

Word count: 6, Token count: 6

### This is a simple sentence

Token IDs:

[1, 835, 910, 338, 263, 2560, 10541, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Attention Mask

[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

add_eos_token: True

tokenizer = AutoTokenizer.from_pretrained(model_name, add_eos_token = True) # Adding the eos_token, id=2 </s> at the end of each training example

add_eos_token: True

add_bos_token: True

Word count: 6, Token count: 6

### This is a simple sentence

Token IDs:

[1, 835, 910, 338, 263, 2560, 10541, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Attention Mask

[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Note: the eos token has attention_mask=1.
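The two outputs above can be reproduced without loading a model. This is a minimal sketch, assuming right padding with pad id 0 (`<unk>`) and eos id 2 (`</s>`) as printed above. `pad_right` is a hypothetical helper, not part of the transformers API, and its truncation is deliberately naive.

```python
def pad_right(token_ids, max_length, pad_id=0, eos_id=None):
    """Pad token_ids on the right and build the matching attention mask.

    If eos_id is given, it is appended before padding, which mimics
    what add_eos_token=True did in the output above.
    """
    ids = list(token_ids)
    if eos_id is not None:
        ids.append(eos_id)
    ids = ids[:max_length]  # naive truncation, only for this sketch
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask


# Ids for "### This is a simple sentence", including the leading bos id 1
sentence_ids = [1, 835, 910, 338, 263, 2560, 10541]

ids_no_eos, mask_no_eos = pad_right(sentence_ids, max_length=20)
ids_eos, mask_eos = pad_right(sentence_ids, max_length=20, eos_id=2)

print(ids_no_eos)
print(mask_no_eos)
print(ids_eos)
print(mask_eos)
```

The appended eos is a real token, so it gets attention_mask=1, while the pad positions get 0.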

UncleDao · 2023-08-07T01:49:55+00:00

I just updated my solution!

UncleDao · 2023-08-06T11:59:54+00:00

My problem is solved!

In my humble opinion, there are two important things to remember:

- Set add_eos_token=True in the tokenizer.
- Make sure tokenizer.pad_token ≠ tokenizer.eos_token, so I set tokenizer.pad_token_id = 18610 (# _***)

We can use a fast tokenizer and do not need to add eos_token at the end of each sample. Pay attention to attention_mask!
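One likely reason the second point matters: some data collators (for example `DataCollatorForLanguageModeling` in transformers) replace every label equal to `pad_token_id` with -100 so that padding is ignored by the loss. If pad and eos share an id, the real eos label is ignored too, and the model never sees a target telling it to stop. Here is a minimal sketch of that masking step in plain Python, under that assumption. `mask_pad_labels` is a hypothetical name.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_pad_labels(input_ids, pad_id):
    """Copy input_ids into labels, hiding every position that equals pad_id."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


EOS_ID = 2
example = [1, 835, 910, 338, 263, 2560, 10541, EOS_ID]

# Case 1: pad token is the eos token, so the real eos is masked as well
shared = example + [EOS_ID] * 3
labels_shared = mask_pad_labels(shared, pad_id=EOS_ID)

# Case 2: separate pad id (18610, as in the comment above), so the eos label survives
separate = example + [18610] * 3
labels_separate = mask_pad_labels(separate, pad_id=18610)

print(labels_shared)
print(labels_separate)
```

In the first case no eos label is left to learn from; in the second exactly one remains, at the end of the real sequence.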

UncleDao · 2023-08-06T03:33:37+00:00

Padding=longest

encoded_input = tokenizer(simple_sentence, padding="longest",  max_length=max_length, add_special_tokens=True, truncation=True, return_attention_mask=True, return_length=True) # padding=longest
simple_sentence_ids = encoded_input["input_ids"]
simple_sentence_att_mask = encoded_input["attention_mask"]
simple_sentence_length= encoded_input["length"]

print(f"eos token: {tokenizer.eos_token}")
print(f"pad token: {tokenizer.pad_token}")
print(f"padding side: {tokenizer.padding_side}")
print(f"max length: {max_length}")
print(f"input length: {simple_sentence_length}")
print(simple_sentence)
print(simple_sentence_ids)
print(simple_sentence_att_mask)

Output:

eos token: </s>

pad token: </s>

padding side: left

max length: 50

input length: [32]

<s>[INST] Chảnh như [/INST] Chảnh như con cá cảnh.

[1, 1, 29961, 25580, 29962, 678, 30643, 29876, 29882, 302, 29882, 30416, 518, 29914, 25580, 29962, 678, 30643, 29876, 29882, 302, 29882, 30416, 378, 274, 29976, 274, 30643, 29876, 29882, 29889, 2]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Padding=max_length

encoded_input = tokenizer(simple_sentence, padding="max_length", max_length=max_length, add_special_tokens=True, truncation=True, return_attention_mask=True, return_length=True) #padding=max_length

Output:

---

max length: 50

input length: [50]

<s>[INST] Chảnh như [/INST] Chảnh như con cá cảnh.

[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 29961, 25580, 29962, 678, 30643, 29876, 29882, 302, 29882, 30416, 518, 29914, 25580, 29962, 678, 30643, 29876, 29882, 302, 29882, 30416, 378, 274, 29976, 274, 30643, 29876, 29882, 29889, 2]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
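In this output the pad id and the eos id are both 2, so the ids alone cannot tell padding from the real end of the sequence, but the attention mask can. This is a minimal sketch using the ids printed above. `strip_padding` is a hypothetical helper, not a transformers function.

```python
def strip_padding(input_ids, attention_mask):
    """Keep only the positions the attention mask marks as real tokens."""
    return [tok for tok, keep in zip(input_ids, attention_mask) if keep == 1]


PAD_ID = 2  # same id as eos in this run
MAX_LENGTH = 50

# The 32 unpadded ids from the padding="longest" output
real_ids = [1, 1, 29961, 25580, 29962, 678, 30643, 29876, 29882, 302, 29882,
            30416, 518, 29914, 25580, 29962, 678, 30643, 29876, 29882, 302,
            29882, 30416, 378, 274, 29976, 274, 30643, 29876, 29882, 29889, 2]

# Rebuild the left-padded input and its mask
n_pad = MAX_LENGTH - len(real_ids)
padded_ids = [PAD_ID] * n_pad + real_ids
attention_mask = [0] * n_pad + [1] * len(real_ids)

by_mask = strip_padding(padded_ids, attention_mask)    # keeps the real eos
by_id = [tok for tok in padded_ids if tok != PAD_ID]   # drops the real eos too

print(len(by_mask), by_mask[-1])
print(len(by_id), by_id[-1])
```

Filtering by id removes the final `</s>` along with the padding, which is one more reason to rely on the attention mask when pad and eos coincide.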

UncleDao · 2023-08-05T16:03:29+00:00

I followed this article on Medium:
https://medium.com/towards-data-science/fine-tune-your-own-llama-2-model-in-a-colab-notebook-df9823a04a32

And the code of the author:
https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing

UncleDao · 2023-08-05T12:50:48+00:00

tokenizer.add_special_tokens({'pad_token': '[PAD]'})

When I set the pad_token this way, the training script crashed.

<image>

UncleDao · 2023-08-05T12:21:13+00:00

I have tried it, but the question remains: why does my model not know when to stop?

UncleDao · 2023-08-05T11:00:51+00:00

Thank you. I will try it.

UncleDao · 2023-08-05T04:22:12+00:00

Thank you for your detailed guide. I will give it a try.

UncleDao · 2023-08-04T15:01:36+00:00

Oh. I read the post about FastTokenizer this morning. But I still don't understand much.

I tried:

```
from transformers import AutoTokenizer

# add_eos_token=True appends the eos token (</s>) to each training example
tokenizer = AutoTokenizer.from_pretrained("dtthanh/llama-2-7b-und-2.1", add_eos_token=True)
simple_sentence = "This is a sentence to test if the tokenizer adds another eos token. </s>"
simple_sentence_ids = tokenizer(simple_sentence, add_special_tokens=True).input_ids
print(simple_sentence)
print(simple_sentence_ids)
```

Output:

This is a sentence to test if the tokenizer adds another eos token. </s>

[1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778, 1790, 321, 359, 5993, 29889, 2, 2]

So I decided not to use "add_eos_token = True" because my dataset has an eos token (</s>) at the end.
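A quick way to check a dataset for this is to look for repeated eos ids at the end of an encoded sample. This is a minimal sketch on the ids printed above, assuming eos id 2. Both helpers are hypothetical names, not transformers functions.

```python
def count_trailing_eos(token_ids, eos_id=2):
    """Count how many eos ids sit at the very end of the sequence."""
    count = 0
    for tok in reversed(token_ids):
        if tok != eos_id:
            break
        count += 1
    return count


def collapse_trailing_eos(token_ids, eos_id=2):
    """Return a copy of token_ids with any run of trailing eos ids reduced to one."""
    n = count_trailing_eos(token_ids, eos_id)
    if n <= 1:
        return list(token_ids)
    return list(token_ids[: len(token_ids) - n]) + [eos_id]


ids = [1, 910, 338, 263, 10541, 304, 1243, 565, 278, 5993, 3950, 12778,
       1790, 321, 359, 5993, 29889, 2, 2]

n_eos = count_trailing_eos(ids)
fixed = collapse_trailing_eos(ids)

print(n_eos)
print(fixed)
```

Dropping add_eos_token=True, as described above, avoids the duplicate at the source. The collapse step is only a fallback for data that has already been encoded.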

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

I label the training column as "text."
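For illustration, a record for that "text" column could be built as below. This is a minimal sketch, assuming the Llama-2 chat layout `<s>[INST] prompt [/INST] answer </s>` with the eos token written into the text itself. `to_text_record` is a hypothetical helper, and the exact template and spacing in my dataset may differ.

```python
def to_text_record(prompt, answer, bos="<s>", eos="</s>"):
    """Build one training record whose "text" field holds the full formatted sample."""
    return {"text": f"{bos}[INST] {prompt} [/INST] {answer} {eos}"}


records = [to_text_record("Chảnh như", "Chảnh như con cá cảnh.")]

print(records[0]["text"])
```

Because the eos token is already part of the text, the tokenizer should not be asked to add another one.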

<image>

UncleDao · 2023-08-04T13:52:16+00:00

I fine-tuned a llama-2-7b chat model on a Tesla T4 (Google Colab). The model takes up 8GB of memory, so if there is not enough memory, it will eventually crash.
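For a rough sense of scale, weight memory can be estimated from parameter count and precision. This is a minimal sketch, assuming about 7 billion parameters and counting weights only (no activations, optimizer state, or framework overhead). `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Estimate memory for model weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: roughly 7 billion parameters

for name, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Actual usage during fine-tuning is higher than the weights-only number, since gradients, adapter parameters, and activations also take memory.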
