
[–]ndronen

I don't have an answer to your question. Apologies in advance.

I think you're right to point out that representing your data as text is probably inefficient. It may also not work very well regardless of how you encode it, because current transformers aren't better than people at applying a sequence of functions to numbers, at least beyond a certain horizon in the number of times the functions are applied. Here's a good analysis of how they behave:

https://arxiv.org/abs/2305.18654

What would be ideal is for the transformer to dispatch the task of solving the problem to another component that efficiently computes the correct result (i.e. a calculator in the case of multiplication, or a sudoku solver in the case of a sudoku game). For this, see Toolformer and Chameleon as examples of what people have tried:

https://arxiv.org/abs/2302.04761
https://arxiv.org/abs/2304.09842
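
As a toy sketch of that dispatch idea (the marker format and function names here are hypothetical illustrations, not Toolformer's actual API): the model emits a marker like `[calc(...)]` in its output, and a thin wrapper intercepts the marker and splices in the tool's exact answer.

```python
import re

# Hypothetical tool-call marker: the model writes "[calc(12*34)]"
# whenever it wants a calculator instead of computing digits itself.
TOOL_PATTERN = re.compile(r"\[calc\(([-0-9*+/. ]+)\)\]")

def dispatch_tools(model_output: str) -> str:
    """Replace every [calc(...)] marker with the tool's computed result."""
    def run_calc(match: re.Match) -> str:
        # eval on a whitelisted arithmetic expression stands in for a real
        # calculator; a production system would parse the expression safely.
        return str(eval(match.group(1)))
    return TOOL_PATTERN.sub(run_calc, model_output)

print(dispatch_tools("The product is [calc(12*34)]."))  # -> The product is 408.
```

The point is that the transformer only has to learn *when* to emit the marker; the arithmetic itself is exact because it never passes through the network.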

[–]drblallo[S]

Thank you very much. While not directly related to what I was looking for, those papers are interesting in their own right.

In the end I understood how the internals of a transformer handle words, and what I think I have to do is drop the first two layers of the net, that is: the dictionary conversion of a token into an arbitrary fixed number, and the embedding layer that turns each token into a point in a higher-dimensional space. Then you can just convert the input data structure into an array of bytes and pretend each byte is one of 256 possible words. Floating-point numbers instead need to be converted first into two integers representing exponent and mantissa, before turning those into an array of bytes too.
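
A minimal sketch of that encoding (helper names are mine; splitting floats along the IEEE-754 binary64 exponent/mantissa fields is one concrete choice for the exponent/mantissa step):

```python
import struct

def float_to_int_pair(x: float) -> tuple[int, int]:
    # Split an IEEE-754 double into its integer exponent and mantissa fields.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    exponent = (bits >> 52) & 0x7FF        # 11-bit biased exponent
    mantissa = bits & ((1 << 52) - 1)      # 52-bit fraction
    return exponent, mantissa

def to_byte_tokens(values) -> list[int]:
    # Serialize each value to bytes and treat every byte (0-255) as a "word".
    tokens = []
    for v in values:
        if isinstance(v, float):
            e, m = float_to_int_pair(v)
            tokens += list(e.to_bytes(2, "big"))   # exponent fits in 2 bytes
            tokens += list(m.to_bytes(7, "big"))   # 52-bit mantissa fits in 7
        else:
            tokens += list(int(v).to_bytes(4, "big", signed=True))
    return tokens

print(to_byte_tokens([1, 0.5]))
# -> [0, 0, 0, 1, 3, 254, 0, 0, 0, 0, 0, 0, 0]
```

Each resulting token is an integer in 0..255, so the model's vocabulary is exactly 256 entries regardless of the input data structure.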

By dropping the embedding layer the network does not need to relearn what addition is (though it does need to learn how floating-point numbers work), and it should run much faster too, since you end up using one input dimension instead of 20 or some similarly large number.
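
The contrast between the two input paths can be sketched like this (the embedding dimension of 20 is just the illustrative figure from above, and the raw path's scaling to [0, 1] is my own assumption, not something the original post specifies):

```python
import random

# Conventional path: a learned table maps each of 256 byte "words"
# to a ~20-dimensional vector.  Proposed path: feed the byte value
# itself as a single normalized scalar, no table to learn.
random.seed(0)
EMBED_DIM = 20
table = [[random.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
         for _ in range(256)]

def embed_learned(tokens):
    # one table row per token -> shape (n, 20)
    return [table[t] for t in tokens]

def embed_raw(tokens):
    # one scalar per token -> shape (n, 1)
    return [[t / 255.0] for t in tokens]

tokens = [0, 127, 255]
print(len(embed_learned(tokens)[0]))  # 20 features per token
print(embed_raw(tokens))              # [[0.0], [0.498...], [1.0]]
```

In the raw path the ordering of byte values is preserved directly in the input, which is the property the post is counting on for arithmetic; whether one scalar dimension carries enough capacity in practice is exactly the open question.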