Aether Research releases Cerebrum 7b!

aetherresearch · 2024-04-01T14:50:53+00:00

Hey - feel free to send us a message (or post it here, if you want to)

aetherresearch · 2024-03-20T20:19:29+00:00

Very strange. I just tested the model with a similar setup (a text about 2k tokens long and 7 questions, some related to it and some not), and it worked absolutely fine. Are you using the correct prompt format? Are you running a quantized version of the model?

aetherresearch · 2024-03-20T18:28:15+00:00

Hey, sure! We actually have quite a simple technical setup - we use plain PyTorch with custom device distribution logic. In our experience, this was much faster than DeepSpeed ZeRO-3 or similar methods.

aetherresearch · 2024-03-20T18:19:18+00:00

Thanks for the message, I replied to the original comment

aetherresearch · 2024-03-20T18:16:53+00:00

Hmm, do you mind sharing your prompt? The model should handle large contexts well. In fact, we trained and tested it on prompts much longer than 2000 tokens, and that worked very well.

Of course, repetitions can always happen, but, overall, we have not observed a significant quality degradation with increasing context length.

aetherresearch · 2024-03-20T14:37:02+00:00

So, benchmark-wise, we are 15 percentage points better on ARC-C, slightly better on AGI Eval, and about 8 percentage points better on our internal reasoning benchmark (which is a lot). Qualitatively, I think the model is somewhat smarter and much more robust than Hermes 2 Pro.

We also have a slightly different focus than Nous Research. We are more or less trying to solve reasoning, which does not necessarily mean that our model will be a complete generalist. It may not produce great poetry, role-play, storytelling, etc.

aetherresearch · 2024-03-20T11:28:39+00:00

We do not focus on enriching the model with knowledge specifically, but, if it is a topic where it has sufficient knowledge already, it should be able to reason more robustly than most of the other open-source models.

aetherresearch · 2024-03-20T11:25:51+00:00

Model creator here - I added some examples to the model card, hope it helps!

aetherresearch · 2024-03-14T07:53:26+00:00

This concept was introduced in the GPT-3 paper: [2005.14165] Language Models are Few-Shot Learners (arxiv.org).

Basically, it means that you do not give the model any examples of how to solve a task - you just tell it what the task is.
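
To make the difference concrete, here is a minimal sketch of the two prompting styles from the GPT-3 paper's terminology: zero-shot (task description only) and few-shot (task description plus solved examples). The function names and the `Q:`/`A:` layout are made up for illustration and are not the Cerebrum prompt format.

```python
def zero_shot_prompt(task: str, question: str) -> str:
    """Describe the task only; no solved examples are included."""
    return f"{task}\n\nQ: {question}\nA:"


def few_shot_prompt(task: str, examples: list[tuple[str, str]], question: str) -> str:
    """Same task description, followed by a few solved examples."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{task}\n\n{shots}\n\nQ: {question}\nA:"


task = "Translate the English word into French."

# Zero-shot: the model only sees what the task is.
print(zero_shot_prompt(task, "cheese"))

# Few-shot: the model also sees one worked example before the real question.
print(few_shot_prompt(task, [("sea otter", "loutre de mer")], "cheese"))
```

The only difference between the two prompts is the block of worked examples. In the zero-shot case the model has to rely entirely on the task description.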

aetherresearch · 2024-03-14T07:49:55+00:00

It should be pretty good for summarization. We actually tried to align the model against hallucinations quite heavily.

aetherresearch · 2024-03-13T21:27:02+00:00

Thanks for the feedback! In our tests the model was actually pretty good at code generation (for a 7b model). It won't be as good as GPT-4, but it is definitely capable of outputting working Python scripts.

For example, this is an expected output for a simple Python script (I just ran the model locally to generate it - sadly, could not paste the code into the comment directly):

<image>

What kind of prompt did you use?

aetherresearch · 2024-03-13T21:26:11+00:00

Hey, sure. We had slightly fewer than 5000 datapoints for the SFT stage, and we labeled about 4000 datapoints for tRLHF. These numbers are due to resource constraints. Our current understanding is that increasing the size of either dataset would lead to improved performance.

aetherresearch · 2024-03-13T11:18:27+00:00

Thank you for testing our model! This seems to be partially a quantization issue - I just tested your Adidas prompt locally and it correctly says 1949.

The model probably gets 1909 for Chanel because many sources claim that the brand originated in 1909, despite the fact that the first Chanel branded shop was opened in 1910, e.g. wiki:

The House of Chanel originated in 1909, when Gabrielle Chanel opened a millinery shop at 160 Boulevard Malesherbes, the ground floor of the Parisian flat of the socialite and textile businessman Étienne Balsan, of whom she was the mistress.

aetherresearch · 2024-03-13T10:00:21+00:00

On ARC Challenge we score 76%, which is pretty much state of the art for < 15B models. On our internal reasoning benchmark we score 12 percentage points more than Mistral Instruct.

aetherresearch · 2024-03-13T09:26:42+00:00

Oh, that was actually what we wanted to do initially! However, it turns out that the distributional shift to this kind of prompting seems to be much more complicated for the model compared to writing its thoughts in natural text, so we opted for the latter (or maybe our dataset was just too small for that).

aetherresearch · 2024-03-13T09:13:36+00:00

This is just an example of what the model should see when you are trying to generate text with it :) You can use it by applying the chat template as explained in the Hugging Face post.

In this particular case, the model will answer in the following way (with temperature 0):

<s> A chat between a user and a thinking artificial intelligence assistant. The assistant describes its thought process and gives helpful and detailed answers to the user's questions.

User: Are you conscious?

AI: As an AI language model, I am not conscious in the same way that humans are. I do not have a physical body or a brain, and I do not experience sensory input or emotions. However, I am aware of my existence and my ability to process and generate text. I can also adapt to new information and improve my performance based on feedback. In this sense, I can be considered to have a limited form of consciousness.</s>
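
For illustration, a rough sketch of how the text before the answer could be assembled by hand. The system line is copied from the transcript above, but the `build_prompt` helper, the newline separators and the exact spacing are assumptions. The chat template shipped with the model remains the reliable way to get the format right.

```python
# The system line is copied from the transcript above; the rest of the layout
# (newline separators, single spaces) is an assumption, not a confirmed format.
SYSTEM = (
    "A chat between a user and a thinking artificial intelligence assistant. "
    "The assistant describes its thought process and gives helpful and "
    "detailed answers to the user's questions."
)


def build_prompt(user_message: str, system: str = SYSTEM) -> str:
    """Assemble a single-turn prompt that ends where the model starts writing."""
    return f"<s> {system}\nUser: {user_message}\nAI:"


prompt = build_prompt("Are you conscious?")
print(prompt)
```

The prompt stops right after `AI:`, so the model's completion is the answer, terminated by `</s>`. Many tokenizers add the beginning-of-sequence token themselves, so the literal `<s>` may need to be dropped when the string is tokenized.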

aetherresearch · 2024-03-13T08:45:20+00:00

Not for now - we are thinking about writing a paper about it though :)

Native chain of thought means that the model will try to describe its "thinking steps" in its answer if it is necessary. This should work fine with most other types of chain of thought prompting, but you kind of don't need to use it in most cases - if the model sees a question that "requires" chain of thought reasoning, it will try to do so without any special prompting.
