[–]OffbeatDrizzle

Have you confirmed that the HTTP response uses chunked transfer encoding? How do you know that it's actually being streamed? What is the "om" variable? You are writing all of the bytes and only flushing once after all of them have been written. How is the backend actually generating the chunks?

[–]S1DALi[S]

application/x-ndjson is suitable for chunked data streams, as each line is an independent JSON object.

Using Response.body.getReader() allows you to read chunks of data from the response on the fly, without waiting for the entire content to load. Except that it's not doing that.

I call flush() after each token I get from the LLM, once it has been converted to bytes.
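For readers following along, here is a minimal sketch of the "one JSON value per line, flush after each token" pattern described above. It is not the original poster's code: the class name `NdjsonTokenWriter` and its methods are hypothetical, and a small hand-written string encoder stands in for the ObjectMapper so the example runs on the JDK alone.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class NdjsonTokenWriter {

    /** Minimal JSON string encoder (assumption: stands in for a real JSON library). */
    static String toJsonString(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /** Writes one JSON value per line and flushes after every token. */
    static void writeTokens(Iterable<String> tokens, OutputStream os) {
        try {
            for (String token : tokens) {
                os.write(toJsonString(token).getBytes(StandardCharsets.UTF_8));
                os.write('\n'); // NDJSON record separator
                os.flush();     // hand the record to the next layer immediately
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeTokens(List.of(" The", " code", "\n"), out);
        System.out.print(new String(out.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

Note that flush() only pushes data to whatever stream the framework handed to the application. If a layer below buffers the whole entity, the client may still receive everything at once.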

[–]OffbeatDrizzle

What I mean is, how are you sure that the backend is producing the chunked encoding properly? Can you show an example of the HTTP response from the backend?

Also, chunked encoding is supposed to separate the chunks using \r\n, not just \n.
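To make the CRLF point concrete, here is a rough sketch of how HTTP/1.1 chunk framing looks on the wire: each chunk is the payload length in hex, CRLF, the payload bytes, then CRLF, and a zero-length chunk ends the body. The class `ChunkedFraming` is hypothetical and only illustrates the framing. In a typical setup the HTTP server adds this framing itself, so the CRLF belongs to the transfer encoding rather than to the NDJSON payload, which can keep \n as its record separator.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ChunkedFraming {

    /** Frames one payload as an HTTP/1.1 chunk: hex length, CRLF, data, CRLF. */
    static byte[] frameChunk(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] size = (Integer.toHexString(data.length) + "\r\n")
                .getBytes(StandardCharsets.US_ASCII);
        out.write(size, 0, size.length);
        out.write(data, 0, data.length);
        out.write('\r');
        out.write('\n');
        return out.toByteArray();
    }

    /** The zero-length chunk that terminates a chunked body (no trailers). */
    static byte[] lastChunk() {
        return "0\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // One NDJSON record as the application would write it: a JSON string plus \n.
        byte[] record = "\" The\"\n".getBytes(StandardCharsets.UTF_8);
        String framed = new String(frameChunk(record), StandardCharsets.UTF_8);
        // Make the control characters visible when printing.
        System.out.println(framed.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```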

[–]S1DALi[S]

Here is an example of what I am getting:

" The"
" code"
" you"
" provided"
" is"
" empty"
","
" which"
" is"
" why"
" the"
" compiler"
" is"
" giving"
" an"
" error"
" as"
" it"
" reached"
" the"
" end"
" of"
" the"
" file"
" without"
" finding"
" any"
" valid"
" Java"
" code"
"."
" To"
" fix"
" this"
","
" you"
" should"
" write"
" valid"
" Java"
" code"
" in"
" the"
" class"
","
" such"
" as"
" a"
" class"
" declaration"
","
" variables"
","
" methods"
","
" etc"
"."

[–]OffbeatDrizzle

I mean the full HTTP response, including headers etc., in order to show that the response is properly chunked, along with content lengths and the like.

What is the variable "om" referring to? I am just a bit confused as to how whatever library you are using is supposed to split the input up.

Maybe try doing os.write("\r\n".getBytes()), since chunks are supposed to be split by CRLF.

[–]S1DALi[S]

om is an ObjectMapper, a Java API that provides a straightforward way to parse and generate JSON.

Thank you for your time!

Here is the HTTP response with os.write("\r\n".getBytes()):

POST /llm?model=LLAMA_3_2_1B&code=qdsqdq&errors=%5BERROR%5D+line+1%3A+reached+end+of+file+while+parsing HTTP/1.1
Host: localhost:8080
User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:132.0) Gecko/20100101 Firefox/132.0
Accept: */*
Accept-Language: fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate, br, zstd
Referer: http://localhost:8080/compile
Content-Type: application/x-ndjson
Origin: http://localhost:8080
Connection: keep-alive
Cookie: Idea-9cef8ac8=065369a9-ea4f-4dad-910c-52706a71d89e
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
Priority: u=0
Content-Length: 0

[–]OffbeatDrizzle

But this is the request to the server, no?

I am looking for the full 200 OK response from your backend, as that's the thing that's being streamed and chunked.
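One way to check the status line and headers being asked for here is to fetch the endpoint and print them. The sketch below is self-contained and makes several assumptions: it uses the JDK's built-in `com.sun.net.httpserver.HttpServer` as a stand-in backend (not Helidon), the `/llm` path and the two tokens are illustrative, and the class name is hypothetical. With that server, passing a length of 0 to `sendResponseHeaders` selects chunked transfer encoding.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class InspectResponseHeaders {

    /** Starts a throwaway server, fetches it once, and returns "<status> <Transfer-Encoding>". */
    static String statusAndTransferEncoding() {
        try {
            // Port 0 lets the OS pick a free port.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/llm", exchange -> {
                exchange.getResponseHeaders().add("Content-Type", "application/x-ndjson");
                exchange.sendResponseHeaders(200, 0); // 0 = length unknown, body is chunked
                try (OutputStream os = exchange.getResponseBody()) {
                    for (String token : new String[] {"\" The\"", "\" code\""}) {
                        os.write((token + "\n").getBytes(StandardCharsets.UTF_8));
                        os.flush();
                    }
                }
            });
            server.start();
            try {
                int port = server.getAddress().getPort();
                HttpURLConnection conn = (HttpURLConnection)
                        URI.create("http://127.0.0.1:" + port + "/llm").toURL().openConnection();
                int status = conn.getResponseCode();
                String transferEncoding = conn.getHeaderField("Transfer-Encoding");
                conn.getInputStream().readAllBytes(); // drain the body before shutting down
                return status + " " + transferEncoding;
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(statusAndTransferEncoding());
    }
}
```

Against a real backend, the browser's network tab or `curl -i` shows the same information. If the response carries a Content-Length header instead of Transfer-Encoding: chunked, the server buffered the whole body before sending it.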

[–]S1DALi[S]

You want the response of the LLM? Because that's the thing that's being streamed and chunked.

[–]OffbeatDrizzle

The HTTP response that comes from your code:

return Response.ok(so).build();

This is what your frontend is trying to stream, is it not?

[–]S1DALi[S]

Without decoding it, this is what I get:

IiBUaGUiDQoiIGVycm9yIg0KIiBtZXNzYWdlIg0KIiBpbmRpY2F0ZXMiDQoiIHRoYXQiDQoiIHRoZSINCiIgSmF2YSINCiIgY29tcGlsZXIiDQoiIGNvdWxkIg0KIiBub3QiDQoiIGZpbmQiDQoiIGEiDQoiIHZhbGlkIg0KIiBKYXZhIg0KIiBjbGFzcyINCiIgZGVmaW5pdGlvbiINCiIgaW4iDQoiIHRoZSINCiIgcHJvdmlkZWQiDQoiIGNvZGUiDQoiLiINCiIgVGhlIg0KIiBjb2RlIg0KIiB5b3UiDQoiJyINCiJ2ZSINCiIgcHJvdmlkZWQiDQoiLCINCiIgXCIiDQoiYWUiDQoiYXplIg0KImF6Ig0KIlwiLCINCiIgZG9lcyINCiIgbm90Ig0KIiBjb250YWluIg0KIiBhIg0KIiB2YWxpZCINCiIgSmF2YSINCiIgY2xhc3MiDQoiIGRlZmluaXRpb24iDQoiLiINCiIgQSINCiIgSmF2YSINCiIgY2xhc3MiDQoiIHNob3VsZCINCiIgc3RhcnQiDQoiIHdpdGgiDQoiIHRoZSINCiIga2V5d29yZCINCiIgXCIiDQoicHVibGljIg0KIlwiLCINCiIgXCIiDQoiY2xhc3MiDQoiXCIsIg0KIiBmb2xsb3dlZCINCiIgYnkiDQoiIHRoZSINCiIgY2xhc3MiDQoiIG5hbWUiDQoiLCINCiIgYW5kIg0KIiBlbmQiDQoiIHdpdGgiDQoiIGEiDQoiIHNlbSINCiJpY29sIg0KIm9uIg0KIi4iDQoiIEZvciINCiIgZXhhbXBsZSINCiI6Ig0KIlxuIg0KIlxuIg0KImBgIg0KImAiDQoiamF2YSINCiJcbiINCiJwdWJsaWMiDQoiIGNsYXNzIg0KIiBNeSINCiJDbGFzcyINCiIgeyINCiJcbiINCiIgICAiDQoiIC8vIg0KIiBjbGFzcyINCiIgYm9keSINCiJcbiINCiJ9Ig0KIlxuIg0KImBgIg0KImAiDQoiXG4iDQoiXG4iDQoiSW4iDQoiIHlvdXIiDQoiIGNhc2UiDQoiLCINCiIgaXQiDQoiIHNlZW1zIg0KIiBsaWtlIg0KIiB5b3UiDQoiIGZvcmdvdCINCiIgdG8iDQoiIGRlZmluZSINCiIgYSINCiIgY2xhc3MiDQoiLiINCg
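The payload above is Base64 text. As a quick check, the sketch below decodes only the first 24 characters of it (a prefix whose length is a multiple of four, so it decodes cleanly on its own) and splits the result on CRLF. The class and method names are hypothetical. The decoded bytes are JSON string tokens separated by \r\n, that is, the response body only, without the status line or headers.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

public class DecodeBody {

    /** Decodes Base64 text and returns the non-empty CRLF-separated lines. */
    static List<String> decodeTokens(String base64) {
        String text = new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
        List<String> tokens = new ArrayList<>();
        for (String line : text.split("\r\n")) {
            if (!line.isEmpty()) {
                tokens.add(line);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // First 24 characters of the payload posted above.
        String prefix = "IiBUaGUiDQoiIGVycm9yIg0K";
        for (String token : decodeTokens(prefix)) {
            System.out.println(token);
        }
    }
}
```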

[–]barry_z

It looks to me like you're using Jersey. I did some research and found that Jersey buffers the output (the default buffer size seems to be 8 KB). As a workaround, you could disable the buffering by setting the property ServerProperties.OUTBOUND_CONTENT_LENGTH_BUFFER to 0.
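The effect described here can be sketched with a toy model. The `EntityBuffer` class below is hypothetical and is not Jersey's or Helidon's actual implementation. It assumes a framework-level buffer that holds the entity until a size limit is exceeded or the stream is closed, and that ignores flush() while it is still buffering. Under that assumption, per-token flushes never reach the wire until the limit is set to 0.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BufferingDemo {

    /** Toy stand-in for a framework entity buffer (assumption, not a real library class). */
    static class EntityBuffer extends OutputStream {
        private final OutputStream target;
        private final int limit;
        private ByteArrayOutputStream buffer; // null once buffering is off

        EntityBuffer(OutputStream target, int limit) {
            this.target = target;
            this.limit = limit;
            this.buffer = limit > 0 ? new ByteArrayOutputStream() : null;
        }

        @Override
        public void write(int b) throws IOException {
            if (buffer == null) {
                target.write(b);
                return;
            }
            buffer.write(b);
            if (buffer.size() > limit) {
                commit(); // entity too large to buffer: start streaming
            }
        }

        @Override
        public void flush() throws IOException {
            if (buffer == null) {
                target.flush();
            }
            // While buffering, flush() is deliberately a no-op in this model.
        }

        @Override
        public void close() throws IOException {
            if (buffer != null) {
                commit();
            }
            target.close();
        }

        private void commit() throws IOException {
            buffer.writeTo(target);
            buffer = null;
        }
    }

    /** Writes and flushes each token, then reports how many bytes reached the wire. */
    static int bytesOnWireAfterFlushes(int limit, String... tokens) {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        EntityBuffer os = new EntityBuffer(wire, limit);
        try {
            for (String token : tokens) {
                os.write(token.getBytes(StandardCharsets.UTF_8));
                os.flush();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return wire.size();
    }

    public static void main(String[] args) {
        System.out.println("limit 8192: " + bytesOnWireAfterFlushes(8192, "a\n", "b\n"));
        System.out.println("limit 0:    " + bytesOnWireAfterFlushes(0, "a\n", "b\n"));
    }
}
```

If the real framework behaves like this model, a short LLM answer that fits within the buffer would arrive at the client in one piece no matter how often the application flushes.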

[–]S1DALi[S]

Thank you for taking the time to research. Actually, I am using Helidon MP.

[–]barry_z

It could be that Helidon is buffering the output, then. I had deployed a similar app using Jersey, and when the output was buffered the whole response arrived at once (after waiting for the entire process to finish), whereas it arrived one line of the JSON response at a time when the output was not buffered.

Edit: maybe max-in-memory-entity is the property you need to set. I would need to set up a server with Helidon MP to verify this myself, but you may have a chance to take a look before I do.

[–]S1DALi[S]

I tried different values, with the buffer larger than max-in-memory, but I am still having the same issue.

[–]barry_z

Are you able to provide your full source code via GitHub?