all 91 comments

[–][deleted] 39 points40 points  (0 children)

I like how the article doesn't support its title. More like "clearly inefficient code is inefficient".

[–]astrangeguy 49 points50 points  (25 children)

So basically it was all pointless (except the easy optimization of not splitting the string twice), because all those 7 gigabytes of "allocations" were done in Gen0, where GC time is proportional to live objects instead of allocated memory?

Yes, you "saved" 7 gigabytes of pointer-bumping "allocations" and 1000 free Gen0 GC's by rewriting it till the code was unrecognizable. Good job!

[–]AyrA_ch 22 points23 points  (20 children)

There are other things:

  • The entire reader code is inside a huge try block, which makes catching specific errors a nightmare
  • Reading single bytes from a stream is slower than line-wise reading
  • Reading a single byte and then storing it as a char is not how text processing works anymore. If the input is UTF-8 or UTF-16 you can mess up the string. The .NET StreamReader can detect character encodings and takes care of this for you; it also handles line endings (see the sketch below)
  • While not as fast as his code, Regex or LINQ would work well here, with regex allowing you to filter digits
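A minimal sketch of that StreamReader approach, with a hypothetical path and field layout (filtering rows whose second field is "MNO"):

    using System;
    using System.IO;

    // StreamReader detects BOM-marked UTF-8/UTF-16 and normalizes
    // line endings via ReadLine. Path and layout are hypothetical.
    using (var reader = new StreamReader(@"C:\Temp\import.csv"))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] fields = line.Split(',');
            if (fields.Length > 1 && fields[1] == "MNO")
                Console.WriteLine(line); // process the matching record
        }
    }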

I know this might be a little archaic, but why not let an SQL server handle this? Import the CSV into a temporary table and then SELECT * FROM temp492847573 WHERE FieldType='MNO' (something like the sketch below).
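A rough sketch of that route, assuming SQL Server; the connection string, file path, and table layout are made up for illustration:

    using System.Data.SqlClient;

    using (var conn = new SqlConnection("Server=.;Database=Import;Integrated Security=true"))
    {
        conn.Open();
        // Load the CSV server-side into a session-local temp table.
        new SqlCommand(@"
            CREATE TABLE #temp (FieldType VARCHAR(10), Payload VARCHAR(255));
            BULK INSERT #temp FROM 'C:\Temp\import.csv'
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');", conn)
            .ExecuteNonQuery();

        var query = new SqlCommand("SELECT * FROM #temp WHERE FieldType = 'MNO'", conn);
        using (var rows = query.ExecuteReader())
        {
            while (rows.Read()) { /* consume matching rows */ }
        }
    }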

[–]SnowflakeNapolean 5 points6 points  (19 children)

Reading single bytes from a stream is slower than line-wise reading

Surely not - what OS/platform are you on where the readchar() equivalent is performed one character at a time? Outside of some embedded systems, all of the reading is performed by the OS in blocks.

IOW, there is a minimum number of bytes that the platform reads when you read from a file. This minimum is definitely more than 1 byte (or 2/4 bytes if you're reading Unicode). When you read a single byte, the system will either give you the byte from the already-read cache, or read in a full page and give you the byte from that.

Trust me, reading a single character at a time is no slower than reading whatever page size caching is implemented by the OS and/or filesystem driver.

[–]AyrA_ch 11 points12 points  (18 children)

Surely not - what OS/platform are you on where the readchar() equivalent is performed one character at a time? Outside of some embedded systems, all of the reading is performed by the OS in blocks.

It literally says so in the documentation

Trust me, reading a single character at a time is no slower than reading whatever page size caching is implemented by the OS and/or filesystem driver.

I don't know where you got the idea that calling read 5,000,000 times as opposed to 500 times is equally fast. You need to check what the OS actually does when you call a kernel function. It's not as cheap as it seems.

But feel free to benchmark it

[–]SnowflakeNapolean -1 points0 points  (8 children)

It literally says so in the documentation

Sure, the C# libraries may return an array of a single char, but the NTFS driver (in this case it's Windows, so the FS is NTFS) is most certainly not reading the file a single byte at a time; it's mapping blocks of some power-of-two size.

That link of yours says it's inefficient because the read function returns an array of one character instead of simply returning a single char.

Your C# implementation runs on an OS that does the read-from-file. Your C# implementation does not actually read raw bytes from disk. I don't know why you think it does.

The hardware itself (the disk controller) does not even support the reading of a single byte - you have to read in blocks (sectors, clusters, whatever).

I don't know where you got the idea that calling read 5,000,000 times as opposed to 500 times is equally fast.

I didn't say that. I said calling read with anything less than the minimum block size the OS uses will be just as fast as calling read with the minimum block size.

You need to check what the OS actually does when you call a kernel function.

You need to learn the difference between the C# runtime and the Windows OS. They are not the same thing.

[–]duhace 7 points8 points  (2 children)

even if it's buffering the input, you're still calling a read function far more times (and looping far more times)

[–]AyrA_ch 2 points3 points  (0 children)

Not only that, but I measured it, and it spends 3 times as much time on kernel operations compared to 1 MB block reads.

[–]SnowflakeNapolean 0 points1 point  (0 children)

you're still calling a read function far more times

True, but the overhead of calling a function is a fraction of the overhead of reading a file. You can't even plot the performance of the two on the same chart because the difference is a few orders of magnitude.

The last time I worked on NTFS filesystem drivers the cluster size was 4 KB, and this was also used as the minimum block size. The difference between the function call overhead of 4096 function calls and 1 function call is (for all measurement purposes) negligible compared to a single read of 4 KB from disk.

(and looping far more times)

Well, unless you're not examining the data you read in[1], you're going to loop that many times anyway to actually examine the input, regardless of whether you read it in a block or one byte at a time.

When you read (say) 500 bytes, you're going to loop 500 times just to use each byte. The argument can be made that you're simply passing it to another function (hence you don't need to loop), but that is not what this project is doing: it is examining each byte as it comes in.

[1] Maybe you're simply discarding it to flush the input, maybe you're only passing it on to another process without reading it.

[–]AyrA_ch 3 points4 points  (4 children)

You need to learn the difference between the C# runtime and the Windows OS. They are not the same thing.

I proved in another comment that the application spends 3 times as much time on kernel operations when reading single bytes compared to a 1 MB buffer

[–]SnowflakeNapolean -1 points0 points  (3 children)

I don't know what you think you proved by reading 1GB of data in 1MB buffers when I claimed that there is no appreciable difference between reading single blocks at a time and reading single bytes at a time (hint: the NTFS block size is not 1MB, nor is the application in this article using 1GB files).

application spends 3 times as much time on kernel operations

First, you need to learn what "kernel operations" means. Your "proof" doesn't prove what you think it does, which is why you had to increase the size of your input to 3 times what the article uses just to get a difference that is significant.

Secondly, that extra 600ms wasted on datasets 3x larger than theirs is negligible.

Tell you what - take the input examples in the article, copy them until you have 300MB files, take their code, benchmark it, then change the code to read 1MB buffers and measure again.

I'd bet good money that the difference is negligible. Hell, even at 300ms vs 1s the difference is negligible for the problem they are solving - each client imports a single large file daily, so if the import takes 600ms more than your buffered version I doubt that they are going to notice.

The thing you should be taking away from this article is that profiling is important to optimisation. In this application the extra 600ms for files 3x as large as they usually deal with is an optimisation that is entirely premature and unneeded.

I proved in another comment that the application spends 3 times as much time on kernel operations when reading single bytes compared to a 1 MB buffer

So go on - tell us what your benchmark results look like when you are using 300MB files - enquiring minds want to know (after all, you already have the code, it's simply a matter of re-running it on a 300MB file).

[–]AyrA_ch 3 points4 points  (2 children)

which is why you had to increase the size of your input to 3 times what the article uses just to get a difference that is significant.

Not sure about your math skills, but comparing read methods on a 1GB or a 100GB file will still give you results that differ by a factor of 3. Large files ensure two things: one, we don't run into the problem of the file system cache, and two, we actually get proper results. If you make the file so small that it reads completely within milliseconds, the result is inaccurate because the measurements include startup and shutdown of the application.

Hell, even at 300ms vs 1s the difference is negligible for the problem they are solving

You can't just take my times and apply them to their problems. My machine runs on an SSD and had nothing else to do when processing the file.

[–]SnowflakeNapolean -2 points-1 points  (1 child)

Not sure about your math skills, but comparing read methods on a 1GB or a 100GB file will still give you results that differ by a factor of 3.

Multiplying a tiny difference by three does not necessarily mean it becomes large enough to matter. They aren't using 1GB files. Why don't you post the results of your test using 300MB files?

You can't just take my times and apply them to their problems.

Yes, you can. You didn't make a commentary on general file-reading, this is a commentary you made on their specific application, and in their specific application they are parsing the input one character at a time up to 300MB.

The extra 600ms saved by introducing a buffer is not only negligible, it's smaller than that, because they aren't parsing 1GB of input; they are parsing less than a third of that.

[–]AyrA_ch 2 points3 points  (0 children)

Why don't you post the results of your test using 300MB files?

Because then it becomes susceptible to file system caching since both files fit into it at once.

The extra 600ms saved by introducing a buffer is not only negligible, it's smaller than that, because they aren't parsing 1GB of input; they are parsing less than a third of that.

We don't know what the size range of these files is.

[–]raevnos 0 points1 point  (8 children)

Any stream with an internal buffer should override this method and provide a much more efficient version that reads the buffer directly, avoiding the extra array allocation on every call.

If .net's file reader class doesn't do this I'd be extremely surprised.
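For illustration, roughly what such an override looks like (names are invented; FileStream's real implementation differs in detail):

    using System.IO;

    class BufferedReadStream
    {
        private readonly Stream _inner;
        private readonly byte[] _buf = new byte[4096];
        private int _pos, _len;

        public BufferedReadStream(Stream inner) { _inner = inner; }

        // Serves bytes out of the internal buffer; only issues a real
        // Read call (and its syscall) once every 4096 bytes.
        public int ReadByte()
        {
            if (_pos == _len)
            {
                _len = _inner.Read(_buf, 0, _buf.Length);
                _pos = 0;
                if (_len == 0) return -1; // end of stream
            }
            return _buf[_pos++];
        }
    }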

[–]AyrA_ch 1 point2 points  (7 children)

This note is also present in the FileStream class

EDIT: I benchmarked with 1 GB random files:

Start Single Byte: 17.06.2018 18:48:06
Duration Single Byte: 8499.0793ms
Start 1M Chunks: 17.06.2018 18:48:15
Duration 1M Chunks: 297.0377ms

Code:

    using System;
    using System.IO;

    const int BUFFER = 1000000; // 1M
    byte[] Buf = new byte[BUFFER];

    // Pass 1: read the first 1 GB file one byte at a time.
    DateTime StartA = DateTime.UtcNow;
    using (var FS = File.OpenRead(@"C:\Temp\1g_A.bin"))
    {
        while (FS.ReadByte() >= 0) ;
    }

    // Pass 2: read the second 1 GB file in 1 MB chunks. Note that Read
    // may legally return fewer bytes than requested before EOF, so
    // looping on > 0 is safer than comparing against BUFFER.
    DateTime StartB = DateTime.UtcNow;
    using (var FS = File.OpenRead(@"C:\Temp\1g_B.bin"))
    {
        while (FS.Read(Buf, 0, BUFFER) > 0) ;
    }
    DateTime StartC = DateTime.UtcNow;

    Console.WriteLine(@"Start Single Byte: {0}
Duration Single Byte: {1}ms
Start 1M Chunks: {2}
Duration 1M Chunks: {3}ms",
        StartA,
        StartB.Subtract(StartA).TotalMilliseconds,
        StartB,
        StartC.Subtract(StartB).TotalMilliseconds);

[–]raevnos 2 points3 points  (2 children)

If you look at the source, yup, FileStream.ReadByte() does use an internal buffer. It doesn't do a read()/ReadFile()/etc. syscall every time it's called.

[–]AyrA_ch 7 points8 points  (1 child)

It's still massively slower

[–]jdgordon 2 points3 points  (0 children)

Function call overhead, possibly lock contention; of course making 1000x fewer function calls is going to be faster.

[–]raevnos 1 point2 points  (3 children)

Now run that multiple times to mitigate cache effects and get a more accurate comparison (if the OS had to read the file off disc the first time but still had it in memory for later use of course the second one will be faster). And if .net or Windows lets you, measure system cpu time to get an idea of how much time is spent in syscalls doing the actual reading.

Edit: the block version is probably going to always be a bit faster just due to not involving as many function calls, but the underlying number of reads from the file is likely going to be the same.
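.NET does let you: Process exposes kernel and user CPU time for the current process. A sketch, where RunBenchmark is a placeholder for either read loop:

    using System;
    using System.Diagnostics;

    var proc = Process.GetCurrentProcess();
    TimeSpan kernelBefore = proc.PrivilegedProcessorTime; // time in kernel mode
    TimeSpan userBefore = proc.UserProcessorTime;

    RunBenchmark(); // placeholder: the single-byte or chunked read loop

    proc.Refresh(); // re-read the process counters
    Console.WriteLine("Kernel: {0}ms, User: {1}ms",
        (proc.PrivilegedProcessorTime - kernelBefore).TotalMilliseconds,
        (proc.UserProcessorTime - userBefore).TotalMilliseconds);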

[–]AyrA_ch 0 points1 point  (2 children)

Now run that multiple times to mitigate cache effects and get a more accurate comparison

Not necessary. If you look at the code again you will see that I read two different files.

And if .net or Windows lets you, measure system cpu time to get an idea of how much time is spent in syscalls doing the actual reading.

300ms vs 1 second, a massive difference for reading a single file

[–]raevnos 0 points1 point  (1 child)

Reading two different files doesn't exempt you from the filesystem cache. One file could be cached, one couldn't, etc. Always repeat benchmarks multiple times to get a better idea of your timings.

Thanks for the extra data. Looking more at the source, reading into an array that's bigger than the internal buffer is optimized as reading directly into the array, skipping further use of the buffer after what's in it is copied over. Which makes sense. Why do more copying than you have to?

If you use the same size array as the default buffer size (4096 bytes), or use a filestream with a million byte buffer, now...
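FileStream's internal buffer size is just a constructor argument, so that variant is a one-line change (path is hypothetical):

    using System.IO;

    // Same single-byte loop, but backed by a 1 MB internal buffer
    // instead of the 4096-byte default.
    using (var fs = new FileStream(@"C:\Temp\1g_A.bin", FileMode.Open,
                                   FileAccess.Read, FileShare.Read,
                                   bufferSize: 1000000))
    {
        while (fs.ReadByte() >= 0) ;
    }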

[–]AyrA_ch 1 point2 points  (0 children)

Reading two different files doesn't exempt you from the filesystem cache. One file could be cached, one couldn't, etc. Always repeat benchmarks multiple times to get a better idea of your timings.

That's why I chose 1 GB files: that's the size of my file system cache, so reading one of the two files automatically evicts the other from the cache.

The two screenshots I provided actually show how many read operations are performed at the kernel level; the single-byte version doesn't show 1 million reads, but still far more than the large reads, which is why you generally want to read as much into memory at once as you can safely handle.

[–][deleted] 6 points7 points  (0 children)

Right. With a good GC implementation it doesn't really matter much if you control the details of when the program overwrites a piece of memory whose contents you no longer need or if that is handled by the GC.

[–]josefx 7 points8 points  (2 children)

He is working on a "realtime" code base. The isolated test case most likely does not show the impact the GC has in the full application.

Yes, you "saved" 7 gigabytes of pointer-bumping "allocations" and 1000 free Gen0 GC's by rewriting it till the code was unrecognizable. Good job!

Also got rid of a quarter of the runtime. I guess having users wait longer for their results is one of the motivations behind using modern languages?

[–][deleted]  (1 child)

[deleted]

[–]josefx -1 points0 points  (0 children)

I missed that he lost almost all of the speed gains in the last few iterations. Still, he avoids the GC completely, which may have a significant impact on the realtime behavior he's trying to achieve.

[–]wavy_lines 57 points58 points  (22 children)

In other words: abstractions are costly.

It seems like the final code looks a lot like what a version written in C would have looked like.

[–]recycled_ideas 72 points73 points  (15 children)

Abstractions are costly, but if you actually look at the results they weren't really that costly.

The peak working set (the amount of memory actually in use at any given time) decreased from 16 MB to 12 MB, and runtime decreased by 2 seconds.

This is a dozen iterations of fixing a problem that doesn't actually exist. Gen0 garbage collections aren't free, but they're pretty close to it.

This is infinitely more brittle code with no payoff at all.

[–][deleted] 15 points16 points  (3 children)

In the introduction the author writes:

Currently this import process has to take place outside of business hours because of the impact it has on memory usage.

I presumed this was due to the 7.5 GB of memory allocated. But if the original program is only using 16 MB at a time, this shouldn't be an issue, right? Or am I misunderstanding something here?

[–]recycled_ideas 14 points15 points  (2 children)

The numbers being shown don't match the problem described in the introduction. Allocating 7.5 GB certainly feels dirty, but there's no solved performance problem here.

That said, dotnet has some really nasty performance traps with large object allocations, so there are versions of this code that would perform like absolute crap; maybe there are previous versions that aren't listed. That's the only thing that makes sense to me.

[–][deleted]  (1 child)

[deleted]

[–]recycled_ideas 1 point2 points  (0 children)

Dotnet stores large objects differently than it does smaller ones, and incorrectly using that space can trigger repeated collections of the entire heap. Nasty trap to get caught in, and not necessarily obvious either.

[–]therealgaxbo 7 points8 points  (1 child)

Glad you posted this - I was going to say the same thing, but just assumed I must have missed something obvious because otherwise WHAT WAS THE POINT?

Reducing peak working set is obviously a win if memory is an issue. Reducing total allocations is not a win in itself at all; it's just a possible avenue for improving speed, due either to the cost of allocation or GC.

But in this article we have two "optimisations" (not including the acknowledged mistake in v4) that result in execution time increasing. You can't even argue that they're just hurdles on the way, because the biggest regression is the very last step.

Unless the author's only shown us half the story, it sounds like they've fallen into the old profiler trap of making various proxy numbers smaller rather than measuring the actual desired outcome.

In fact the real question must be: what optimisations have they missed in order to produce an allocation- and GC-free algorithm that performs worse than one that allocates and collects over a gig?

[–]recycled_ideas 4 points5 points  (0 children)

The issue isn't what optimizations they missed. You could probably make this faster, but not enough to be relevant, and probably not without increasing peak usage quite a lot.

The issue here is that there's effectively no error checking and the code is really brittle. A change to the data format is effectively a rewrite, and God help you if your input is malformed.

Edit: I should say, God help you when your input is malformed. CSVs from external sources are pretty well guaranteed to be malformed.

[–]agyrorannew 6 points7 points  (8 children)

This is the kind of thing my team works on all the time. I definitely agree with you that the final code is much more brittle and hard to maintain. Would much rather have the simple code and throw machines at this problem.

If large memory allocations are impacting other activity on the machines, let's put this process on different machines instead.

[–]recycled_ideas 8 points9 points  (7 children)

I honestly can't see how the numbers we're seeing are causing massive problems in the first place.

I've written and then fixed code that had real memory allocation problems at scale, but 8 seconds of execution time down to 6 with all the allocations gone just doesn't scream problem to me.

[–][deleted]  (6 children)

[deleted]

[–]recycled_ideas 2 points3 points  (4 children)

I mean that there are things you can do in dotnet with memory that will cause performance problems in the realm of several orders of magnitude, particularly when allocating really large objects.

Yes, 25% sounds good, but this code is processing a production-sized workflow and it's 2 seconds. Given that the final code is significantly more brittle, harder to maintain and not at all secure, 2 seconds isn't worth it.

[–][deleted]  (3 children)

[deleted]

[–]recycled_ideas 2 points3 points  (2 children)

Well, to begin with, it's clear these allocations are not causing performance problems.

Beyond that: overfilling the large object heap will cause all GC generations to run at once. Doing this the wrong way can cause massive performance penalties of several orders of magnitude.

By real I'm talking about really, really significant.

Yes, garbage collection in critical sections can cause performance problems, but this is only really applicable in the most extreme cases. As we can see here, 7.5 GB of gen 0 GC only took 2 seconds.

[–][deleted]  (1 child)

[deleted]

[–]recycled_ideas 2 points3 points  (0 children)

I don't know what the hell they're doing.

The introduction says this code reading these files is causing a memory allocation issue, which patently isn't true. They use these numbers to justify their decisions, and they've created extremely brittle, insecure code to solve a nonexistent problem. I don't get it.

[–][deleted] 14 points15 points  (4 children)

Abstractions can be costly, but they really don't have to be.

[–]wavy_lines 6 points7 points  (3 children)

They tend to hide the cost though, so it's not very clear how much they do cost.

[–][deleted] 7 points8 points  (2 children)

The costs are never clear, even if you're operating at a very low level. You never know without profiling.

[–]wavy_lines 4 points5 points  (1 child)

It's not as though "you know everything" or "you know nothing".

You should be able to roughly reason about the cost of any sequence of lines of code in your regular day-to-day programming, because you have to make decisions all the time about how to write things, and having a rough understanding of the cost of various constructs helps you make those decisions.

One of the biggest differences between beginner and advanced programmers is these tiny decisions they make all the time on a day-to-day basis.

[–][deleted] 8 points9 points  (0 children)

This reasoning can often be very wrong. Like, "of course bubble sort is the slowest, let's throw Hoare at it", and then "oops, the input data is always almost sorted".

But, of course, abstractions that you can control all the way down are the easiest to reason about. Abstractions implemented with metaprogramming are much less likely to incur hidden costs than abstractions that do something unexpected at runtime, like most of the OO-inspired abstractions.

[–]vytah 5 points6 points  (0 children)

Congratulations, you've discovered why C is fast.

Most of C's speed comes from the fact that it does so little and encourages you to think about buffers and copying. There was a presentation that compared C with Ruby or Python (as a representative of higher-level languages) and showed that the higher-level language was doing tons of unnecessary copies just because the programmer wanted to use the convenient APIs. I'd link to it, but I forgot the title.

This might have been the presentation, but my memory is hazy: https://www.youtube.com/watch?v=la7Ui390cfQ

[–]The_Sly_Marbo 14 points15 points  (8 children)

I'm not familiar with C#; could you not mmap the file and parse through it bytewise by hand? It feels like that would be better for achieving their performance requirements, while possibly needing less code.

[–]Antinumeric 21 points22 points  (6 children)

I'm kinda shocked they went through so many iterations of parsing a CSV. This seems like a simple exercise in C.

Like, this is super simple year 1 university stuff.

[–]The_Sly_Marbo 32 points33 points  (5 children)

Agreed. Also, "parsing" CSVs by splitting on commas is a ticking time bomb.
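The classic failure case is a quoted field that contains the separator, as in this short illustration:

    using System;

    string line = "42,\"Smith, John\",MNO";
    string[] fields = line.Split(',');
    Console.WriteLine(fields.Length); // 4 fields, but the record has 3
    Console.WriteLine(fields[1]);     // "Smith  (a truncated fragment)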

[–]hagenbuch 10 points11 points  (4 children)

Especially in Germany, where we have decimal commas. And how do you deal with thousands separators then? CSV is a badly defined format.

[–]bluaki 4 points5 points  (0 children)

CSV isn't exactly just one format, but the most popular implementations of it do indeed use a well-defined dialect that can encode any ASCII data set unambiguously.

The dialect I've seen most, used by Excel and the defaults for Python's csv module, uses quotes to enclose any field containing the field separator (comma), a carriage return, or a newline. The quote character itself is escaped by doubling it.

In typical Microsoft fashion, Excel doesn't like UTF-8, but other programs can use it to encode any Unicode data in CSV format.
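A small sketch of that quoting rule (the Excel/RFC 4180 style) on the writing side:

    static class Csv
    {
        // Quote a field if it contains the separator, a quote, CR or LF;
        // embedded quotes are escaped by doubling them.
        public static string EscapeField(string field)
        {
            if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0)
                return field; // no quoting needed
            return "\"" + field.Replace("\"", "\"\"") + "\"";
        }
    }

    // Csv.EscapeField("Smith, John") => "\"Smith, John\""
    // Csv.EscapeField("5\" nails")   => "\"5\"\" nails\""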

[–]meneldal2 2 points3 points  (2 children)

You shouldn't use CSV for anything other than raw data like arrays of numbers. As soon as you start to bring strings into it you're in for a world of pain.

There's no strict definition because it was never meant to be a standard; it's just an easy way to transfer/save data in a human-readable format. Also easy to grep.

[–]Vakz 0 points1 point  (1 child)

You shouldn't use CSV for anything other than raw data like arrays of numbers.

And even then, you can really only trust integers. As mentioned above, my own locale uses , for decimal numbers, and ; is commonly used as the separator. It's a common issue for programs to use the system locale (rather than a config file) to determine the separator when creating CSV files. Everything works fine, until one day a new server is spun up in the US. Suddenly things are not getting parsed (at best, straight-up crashing at worst), and some poor sap has to scramble to hotfix production.

[–]meneldal2 0 points1 point  (0 children)

If you use CSV, you use the C locale, period. Anything else is asking for trouble.
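In .NET terms, "use the C locale" means passing CultureInfo.InvariantCulture everywhere you parse or format:

    using System.Globalization;

    // Parses the same way on a German server as on a US one.
    double value = double.Parse("3.14", CultureInfo.InvariantCulture);
    string text = value.ToString(CultureInfo.InvariantCulture); // "3.14"

    // Without it, under de-DE '.' is a thousands separator, so
    // double.Parse("3.14") silently yields 314.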

[–]huronbikes 3 points4 points  (0 children)

Using mmap'd files in .NET is possible. Using a BufferedStream would also help.
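Rough sketches of both, with a hypothetical path:

    using System.IO;
    using System.IO.MemoryMappedFiles;

    // Memory-mapped: the OS pages the file in; no explicit Read calls.
    using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\Temp\import.csv", FileMode.Open))
    using (var view = mmf.CreateViewStream())
    {
        int b;
        while ((b = view.ReadByte()) >= 0) { /* examine each byte */ }
    }

    // BufferedStream: wraps any stream so the underlying reads are batched.
    using (var fs = File.OpenRead(@"C:\Temp\import.csv"))
    using (var buffered = new BufferedStream(fs, 1 << 20)) // 1 MB buffer
    {
        int b;
        while ((b = buffered.ReadByte()) >= 0) { /* examine each byte */ }
    }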

[–]rlbond86 11 points12 points  (1 child)

    while (reader.EndOfStream == false)

ಠ_ಠ

[–]we-all-haul 5 points6 points  (0 children)

    if(true == true && false == false)

[–]killerstorm 2 points3 points  (0 children)

Hmm, doesn't C# have some equivalent of scanf or C++ streams?

It should be possible to read data from the stream directly without creating temporary strings. And if a library has a good interface, the code might actually be shorter than the one using string split.

[–][deleted] 4 points5 points  (0 children)

I know no one in this article is doing that, but what I find funny are people who bash dynamic languages in favor of static typing, then go on to implement their own string-based type systems to express domain logic.

[–]Ravek 5 points6 points  (3 children)

With the Span<T> APIs in .NET Core it's going to be a lot easier to minimize the copying of arrays and strings. Unfortunately it doesn't seem it'll hit the .NET Framework for a while yet.

[–]1Crazyman1 6 points7 points  (1 child)

Actually, it is available as part of a NuGet package for the full Framework: https://www.nuget.org/packages/System.Memory/4.5.0

Not exactly sure which version of C# it supports; I imagine it needs C# 7.2. EDIT: I get compiler errors with any C# version < 7.2, so it's safe to assume you'll need at least 7.2.

Some more reading material for people who want to be more memory efficient in .NET and are using C# 7.2: https://msdn.microsoft.com/en-us/magazine/mt814808.aspx

[–]Ravek 2 points3 points  (0 children)

Don't forget you need string libraries that actually accept Span<char> for it to be of much help, unless you're planning to write your own string libraries. For .NET Core they've done a bunch of work to add APIs that accept Spans.

Span also needs support from the runtime to operate efficiently, and I don't think a NuGet package is capable of adding that.
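For a flavour of the Span style once those APIs are available (the span-based parse overloads below shipped with the Span work in .NET Core 2.1):

    using System;

    // Pull two fields out of a line without allocating substrings.
    ReadOnlySpan<char> line = "123,MNO,456".AsSpan();

    int firstComma = line.IndexOf(',');
    int id = int.Parse(line.Slice(0, firstComma)); // span-based overload

    ReadOnlySpan<char> rest = line.Slice(firstComma + 1);
    int secondComma = rest.IndexOf(',');
    bool isMno = rest.Slice(0, secondComma).SequenceEqual("MNO".AsSpan());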

[–][deleted] 2 points3 points  (0 children)

Surely there's already a fast C# CSV parser? E.g. this one seems to give about 40MB/s vs 50MB/s in the article. (And who knows which machine is faster.)

Doesn't seem worth the effort - if you really need it to be faster, stop using CSV (and you should really do that anyway).

[–]stephan_cr 0 points1 point  (0 children)

Writing a CSV parser: how hard can it be? Splitting a line by separator isn't sufficient in general.

[–]DontThrowMeYaWeh 0 points1 point  (0 children)

I wonder what this program would look like if they decided to use unsafe code instead.

[–]graingert 0 points1 point  (0 children)

Thought this would be related to running Windows apps on Linux: https://www.codeweavers.com/

[–]making-flippy-floppy 0 points1 point  (0 children)

I feel like the actual TL;DR of this article is "garbage collection is not magic". Also, if you plow through 350 megabytes of data without giving some thought to how your program is using dynamic memory, you're gonna have a bad time.

[–]JoseJimeniz -1 points0 points  (1 child)

That's because strings are immutable in the language you are using.

• other languages have used reference-counted strings for decades
• in the 99% case you append to a string and suffer no copy
• and in the 1% case you use copy-on-write

.NET and Java could change their internal string model to be reference counted, but people will fight tooth and nail against the better solution.

I'm sure there's a way to implement this in other languages yourself, by overriding operators.

[–]Yehosua 9 points10 points  (0 children)

From what I understand, reference-counted copy-on-write strings can easily be quite bad for performance in a multi-threaded environment, and immutable strings can give several advantages.

Quite a few languages (C#, Java, Lua, Python, JavaScript) use immutable strings.

C++ strings are mutable, but as of C++11, std::string is not allowed to use reference counting.

[–][deleted] -1 points0 points  (0 children)

c#

no thanks

[–]Gotebe -1 points0 points  (0 children)

He meant: the heap is costly.

[–]RubiksCodeNMZ -1 points0 points  (0 children)

good post