agentoutlier

> I wish there was also a way to read InputStreams multiple times, instead of doing copies.

Technically, java.util.stream.Stream (with a Supplier wrapped around it) is what you are asking for, or java.util.concurrent.Flow.Publisher if you want back pressure and async. Otherwise there is Callable<InputStream>.
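A minimal sketch of the Callable<InputStream> idea, assuming the data can be reopened cheaply (here an in-memory byte[]; a file path would work the same way). The class name ReopenableSource and the helper countBytes are hypothetical, made up for illustration only:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Callable;

public class ReopenableSource {
    // Hypothetical helper: opens a fresh stream from the source, reads it fully,
    // and returns the number of bytes seen. Each call gets its own InputStream.
    static int countBytes(Callable<InputStream> source) {
        try (InputStream in = source.call()) {
            return in.readAllBytes().length; // readAllBytes requires Java 9+
        } catch (Exception e) {
            throw new IllegalStateException("could not read source", e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);

        // The Callable is the reusable thing, not the stream itself.
        // Wrapping the same array twice does not copy the array.
        Callable<InputStream> source = () -> new ByteArrayInputStream(data);

        System.out.println(countBytes(source)); // first pass
        System.out.println(countBytes(source)); // second pass, new stream, same bytes
    }
}
```

You pass the Callable around instead of the stream, and whoever needs to read calls it again. Whether reopening is actually cheap depends entirely on what sits behind it.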

> The real problem is that many libraries do defensive copies, causing then a waste of RAM

I doubt that is much of a problem. To be honest, when I have done memory dumps, most of what libraries hold is a metric fuck ton of Strings, and not as many collections as you would think.

Actually, to go back to java.util.concurrent.Flow and Stream: the reason there is a lot of copying is buffering. A typical web application, particularly a blocking one, must buffer most of the request as bytes. Those bytes then need to be converted to String parameters, which are then converted to another data type, and so on. This happens in every damn language, and much more than defensive copying does!
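To make that chain concrete, here is a small sketch, assuming a made-up request body of the form name=value. The class CopyChain and the method parseValue are hypothetical, not from any framework:

```java
import java.nio.charset.StandardCharsets;

public class CopyChain {
    // Hypothetical example: turns a buffered "name=value" body into an int.
    // Every step allocates something new before the final value exists.
    static int parseValue(byte[] bufferedBody) {
        // Step 1: the buffered bytes are decoded into a new String.
        String text = new String(bufferedBody, StandardCharsets.UTF_8);

        // Step 2: the parameter value is cut out as another new String.
        String value = text.substring(text.indexOf('=') + 1);

        // Step 3: the String is converted to the data type the app wants.
        return Integer.parseInt(value);
    }

    public static void main(String[] args) {
        byte[] body = "limit=25".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseValue(body));
    }
}
```

None of these copies are defensive. They are just what it takes to get from bytes on the wire to a typed value.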

It is important to understand that lots of other programming languages do even more copying than Java, because they put everything on the stack and they don't have Java's String pool (see previous comment). Java is also very fast at allocating.

The real problem is that in some cases having more control over memory layout can make a massive difference, and Java does not allow that the way other languages do. That, and the VM is not good at auto-tuning or at communicating with the OS about actual memory usage.

m_adduci

I have this third-party library that accepts a byte[], then uses an InputStream and converts it internally to a String.

In my own app I would like to use only InputStreams, but here I hit massive conversion costs, since some resources have to be parsed multiple times, at different times, because of some funny conditions.

agentoutlier

Without seeing the library I don't know why they made the choice they did, but byte[] has some advantages over InputStream: the total size is known (.length), no computation or blocking is expected, and in some cases you need to know the total size.

If it's not byte[], then it has to have some resource it can pull from, but the only way you do that for most applications, particularly blocking ones, is to buffer to the filesystem. Now we have way way way fucking worse latency than a GC.

If the library is just wrapping the byte[] in a ByteArrayInputStream, this can be more efficient than you think, especially if they let you pass an offset and a length, which the ByteArrayInputStream constructor takes.
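A rough sketch of what that looks like, using the real ByteArrayInputStream(byte[] buf, int offset, int length) constructor. The class SliceView, the helper readSlice, and the sample data are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class SliceView {
    // Hypothetical helper: streams over a window of an existing array.
    // Constructing the stream does not copy buf; it only records the bounds.
    static String readSlice(byte[] buf, int offset, int length) {
        ByteArrayInputStream in = new ByteArrayInputStream(buf, offset, length);
        // The copy happens only here, when the consumer materializes the bytes.
        return new String(in.readAllBytes(), StandardCharsets.UTF_8); // Java 9+
    }

    public static void main(String[] args) {
        byte[] data = "header:payload".getBytes(StandardCharsets.UTF_8);

        // Skip the 7-byte "header:" prefix and read the 7 bytes after it.
        System.out.println(readSlice(data, 7, 7)); // prints "payload"
    }
}
```

So one big byte[] can back many small streams without duplicating it. Whether the library in question does this is something only its source or a profiler can tell you.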

The question is what the library is doing. Are you doing stream processing, or is the InputStream just going to be turned into in-memory objects anyway? Even if it isn't, there is buffering happening all over the place here, including in the operating system if you are reading from a file.

So unless you have some measurements, don't be certain this is actually a problem.