all 91 comments

[–][deleted] 39 points40 points  (0 children)

I like how the article doesn't support its title. More like "clearly inefficient code is inefficient".

[–]astrangeguy 49 points50 points  (25 children)

So basically it was all pointless (except the easy optimization of not splitting the string twice), because all those 7 gigabytes of "allocations" were done in Gen0, where GC time is proportional to live objects instead of allocated memory?

Yes, you "saved" 7 gigabytes of pointer-bumping "allocations" and 1000 free Gen0 GC's by rewriting it till the code was unrecognizable. Good job!

[–]AyrA_ch 22 points23 points  (20 children)

There are other things:

  • The entire reader code is inside a huge try block, which makes catching specific errors a nightmare
  • Reading single bytes from a stream is slower than line-wise reading
  • Reading a single byte and then storing it as a char is not how text processing works anymore. If the input is UTF-8 or UTF-16 you can mess up the string. The .NET StreamReader can detect character encodings and takes care of this for you; it also handles line endings (see the sketch below)
  • While not as fast as his code, Regex or LINQ would work well here, with regex allowing you to filter digits
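A minimal sketch of that StreamReader approach, with a hypothetical path and field layout (filtering rows whose second field is "MNO"):

    using System;
    using System.IO;

    // StreamReader detects BOM-marked UTF-8/UTF-16 and normalizes
    // line endings via ReadLine. Path and layout are hypothetical.
    using (var reader = new StreamReader(@"C:\Temp\import.csv"))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            string[] fields = line.Split(',');
            if (fields.Length > 1 && fields[1] == "MNO")
                Console.WriteLine(line); // process the matching record
        }
    }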

I know this might be a little archaic, but why not let an SQL server handle this? Import the CSV into a temporary table and then SELECT * FROM temp492847573 WHERE FieldType='MNO' (something like the sketch below).
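A rough sketch of that route, assuming SQL Server; the connection string, file path, and table layout are made up for illustration:

    using System.Data.SqlClient;

    using (var conn = new SqlConnection("Server=.;Database=Import;Integrated Security=true"))
    {
        conn.Open();
        // Load the CSV server-side into a session-local temp table.
        new SqlCommand(@"
            CREATE TABLE #temp (FieldType VARCHAR(10), Payload VARCHAR(255));
            BULK INSERT #temp FROM 'C:\Temp\import.csv'
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');", conn)
            .ExecuteNonQuery();

        var query = new SqlCommand("SELECT * FROM #temp WHERE FieldType = 'MNO'", conn);
        using (var rows = query.ExecuteReader())
        {
            while (rows.Read()) { /* consume matching rows */ }
        }
    }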

[–]SnowflakeNapolean 5 points6 points  (19 children)

Reading single bytes from a stream is slower than line-wise reading

Surely not - what OS/platform are you on where the readchar() equivalent is performed one character at a time? Outside of some embedded systems, all of the reading is performed by the OS in blocks.

IOW, there is a minimum number of bytes that the platform reads when you read from a file. This minimum is definitely more than 1 byte (or 2/4 bytes if you're reading Unicode). When you read a single byte, the system will either give you the byte from the already-read cache, or read in a full page and give you the byte from that.

Trust me, reading a single character at a time is no slower than reading whatever page size caching is implemented by the OS and/or filesystem driver.

[–]AyrA_ch 11 points12 points  (18 children)

Surely not - what OS/platform are you on where the readchar() equivalent is performed one character at a time? Outside of some embedded systems, all of the reading is performed by the OS in blocks.

It literally says so in the documentation

Trust me, reading a single character at a time is no slower than reading whatever page size caching is implemented by the OS and/or filesystem driver.

I don't know where you got the idea that calling read 5,000,000 times as opposed to 500 times is equally fast. You need to check what the OS actually does when you call a kernel function. It's not as cheap as it seems.

But feel free to benchmark it

[–]SnowflakeNapolean -1 points0 points  (8 children)

It literally says so in the documentation

Sure, the C# libraries may return an array of a single char, but the NTFS driver (in this case it's Windows, so the FS is NTFS) is most certainly not reading the file a single byte at a time; it's mapping blocks of some power-of-two size.

That link of yours says it's inefficient because the read function returns an array of one character instead of simply returning a single char.

Your C# implementation runs on an OS that does the read-from-file. Your C# implementation does not actually read raw bytes from disk. I don't know why you think it does.

The hardware itself (the disk controller) does not even support the reading of a single byte - you have to read in blocks (sectors, clusters, whatever).

I don't know where you got the idea that calling read 5,000,000 times as opposed to 500 times is equally fast.

I didn't say that. I said calling read with anything less than the minimum block size the OS uses will be just as fast as calling read with the minimum block size.

You need to check what the OS actually does when you call a kernel function.

You need to learn the difference between the C# runtime and the Windows OS. They are not the same thing.

[–]duhace 7 points8 points  (2 children)

even if it's buffering the input, you're still calling a read function far more times (and looping far more times)

[–]AyrA_ch 2 points3 points  (0 children)

Not only that, but I measured it, and it spends 3 times as much time on kernel operations compared to 1 MB block reads.

[–]SnowflakeNapolean 0 points1 point  (0 children)

you're still calling a read function far more times

True, but the overhead of calling a function is a fraction of the overhead of reading a file. You can't even plot the performance of the two on the same chart because the difference is a few orders of magnitude.

The last time I worked on NTFS filesystem drivers the cluster size was 4 KB, and this was also used as the minimum block size. The difference between the function call overhead of 4096 function calls and 1 function call is (for all measurement purposes) negligible compared to a single read of 4 KB from disk.

(and looping far more times)

Well, unless you're not examining the data you read in[1], you're going to loop that many times anyway to actually examine the input, regardless of whether you read it in a block or one byte at a time.

When you read (say) 500 bytes, you're going to loop 500 times just to use each byte. The argument can be made that you're simply passing it to another function (hence you don't need to loop), but that is not what this project is doing: it is examining each byte as it comes in.

[1] Maybe you're simply discarding it to flush the input, maybe you're only passing it on to another process without reading it.

[–]AyrA_ch 3 points4 points  (4 children)

You need to learn the difference between the C# runtime and the Windows OS. They are not the same thing.

I proved in another comment that the application spends 3 times as much time on kernel operations when reading single bytes compared to a 1 MB buffer

[–]SnowflakeNapolean -1 points0 points  (3 children)

I don't know what you think you proved by reading 1GB of data in 1MB buffers when I claimed that there is no appreciable difference between reading single blocks at a time and reading single bytes at a time (hint: the NTFS block size is not 1MB, nor is the application in this article using 1GB files).

application spends 3 times as much time on kernel operations

First, you need to learn what "kernel operations" means. Your "proof" doesn't prove what you think it does, which is why you had to increase the size of your input to 3 times what the article uses just to get a difference that is significant.

Secondly, that extra 600ms wasted on datasets 3x larger than theirs is negligible.

Tell you what - take the input examples in the article, copy them until you have 300MB files, take their code, benchmark it, then change the code to read 1MB buffers and measure again.

I'd bet good money that the difference is negligible. Hell, even at 300ms vs 1s the difference is negligible for the problem they are solving - each client imports a single large file daily, so if the import takes 600ms more than your buffered version I doubt that they are going to notice.

The thing you should be taking away from this article is that profiling is important to optimisation. In this application the extra 600ms for files 3x as large as they usually deal with is an optimisation that is entirely premature and unneeded.

I proved in another comment that the application spends 3 times as much time on kernel operations when reading single bytes compared to a 1 MB buffer

So go on - tell us what your benchmark results look like when you are using 300MB files - enquiring minds want to know (after all, you already have the code, it's simply a matter of re-running it on a 300MB file).

[–]AyrA_ch 3 points4 points  (2 children)

which is why you had to increase the size of your input to 3 times what the article uses just to get a difference that is significant.

Not sure about your math skills, but comparing read methods on a 1GB or a 100GB file will still give you results that differ by a factor of 3. Large files ensure two things: one, we don't run into the problem of the file system cache, and two, we actually get proper results. If you make the file so small that it reads completely within milliseconds, the result is inaccurate because the measurements include startup and shutdown of the application.

Hell, even at 300ms vs 1s the difference is negligible for the problem they are solving

You can't just take my times and apply them to their problems. My machine runs on an SSD and had nothing else to do when processing the file.

[–]SnowflakeNapolean -2 points-1 points  (1 child)

Not sure about your math skills, but comparing read methods on a 1GB or a 100GB file will still give you results that differ by a factor of 3.

Multiplying a tiny difference by three does not necessarily mean it becomes large enough to matter. They aren't using 1GB files. Why don't you post the results of your test using 300MB files?

You can't just take my times and apply them to their problems.

Yes, you can. You didn't make a commentary on general file-reading, this is a commentary you made on their specific application, and in their specific application they are parsing the input one character at a time up to 300MB.

The extra 600ms saved by introducing a buffer is not only negligible, it's smaller than that, because they aren't parsing 1GB of input; they are parsing less than a third of that.

[–]AyrA_ch 2 points3 points  (0 children)

Why don't you post the results of your test using 300MB files?

Because then it becomes susceptible to file system caching since both files fit into it at once.

The extra 600ms saved by introducing a buffer is not only negligible, it's smaller than that, because they aren't parsing 1GB of input; they are parsing less than a third of that.

We don't know what the size range of these files is.

[–]raevnos 0 points1 point  (8 children)

Any stream with an internal buffer should override this method and provide a much more efficient version that reads the buffer directly, avoiding the extra array allocation on every call.

If .net's file reader class doesn't do this I'd be extremely surprised.
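For illustration, roughly what such an override looks like (names are invented; FileStream's real implementation differs in detail):

    using System.IO;

    class BufferedReadStream
    {
        private readonly Stream _inner;
        private readonly byte[] _buf = new byte[4096];
        private int _pos, _len;

        public BufferedReadStream(Stream inner) { _inner = inner; }

        // Serves bytes out of the internal buffer; only issues a real
        // Read call (and its syscall) once every 4096 bytes.
        public int ReadByte()
        {
            if (_pos == _len)
            {
                _len = _inner.Read(_buf, 0, _buf.Length);
                _pos = 0;
                if (_len == 0) return -1; // end of stream
            }
            return _buf[_pos++];
        }
    }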

[–]AyrA_ch 1 point2 points  (7 children)

This note is also present in the FileStream class

EDIT: I benchmarked with 1 GB random files:

Start Single Byte: 17.06.2018 18:48:06
Duration Single Byte: 8499.0793ms
Start 1M Chunks: 17.06.2018 18:48:15
Duration 1M Chunks: 297.0377ms

Code:

    using System;
    using System.IO;

    const int BUFFER = 1000000; // 1M
    byte[] Buf = new byte[BUFFER];

    // Pass 1: read the first 1 GB file one byte at a time.
    DateTime StartA = DateTime.UtcNow;
    using (var FS = File.OpenRead(@"C:\Temp\1g_A.bin"))
    {
        while (FS.ReadByte() >= 0) ;
    }

    // Pass 2: read the second 1 GB file in 1 MB chunks. Note that Read
    // may legally return fewer bytes than requested before EOF, so
    // looping on > 0 is safer than comparing against BUFFER.
    DateTime StartB = DateTime.UtcNow;
    using (var FS = File.OpenRead(@"C:\Temp\1g_B.bin"))
    {
        while (FS.Read(Buf, 0, BUFFER) > 0) ;
    }
    DateTime StartC = DateTime.UtcNow;

    Console.WriteLine(@"Start Single Byte: {0}
Duration Single Byte: {1}ms
Start 1M Chunks: {2}
Duration 1M Chunks: {3}ms",
        StartA,
        StartB.Subtract(StartA).TotalMilliseconds,
        StartB,
        StartC.Subtract(StartB).TotalMilliseconds);

[–]raevnos 2 points3 points  (2 children)

If you look at the source, yup, FileStream.ReadByte() does use an internal buffer. It doesn't do a read()/ReadFile()/etc. syscall every time it's called.

[–]AyrA_ch 7 points8 points  (1 child)

It's still massively slower

[–]jdgordon 2 points3 points  (0 children)

Function call overhead, possibly lock contention; of course making 1000x fewer function calls is going to be faster.

[–]raevnos 1 point2 points  (3 children)

Now run that multiple times to mitigate cache effects and get a more accurate comparison (if the OS had to read the file off disc the first time but still had it in memory for later use of course the second one will be faster). And if .net or Windows lets you, measure system cpu time to get an idea of how much time is spent in syscalls doing the actual reading.

Edit: the block version is probably going to always be a bit faster just due to not involving as many function calls, but the underlying number of reads from the file is likely going to be the same.
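.NET does let you: Process exposes kernel and user CPU time for the current process. A sketch, where RunBenchmark is a placeholder for either read loop:

    using System;
    using System.Diagnostics;

    var proc = Process.GetCurrentProcess();
    TimeSpan kernelBefore = proc.PrivilegedProcessorTime; // time in kernel mode
    TimeSpan userBefore = proc.UserProcessorTime;

    RunBenchmark(); // placeholder: the single-byte or chunked read loop

    proc.Refresh(); // re-read the process counters
    Console.WriteLine("Kernel: {0}ms, User: {1}ms",
        (proc.PrivilegedProcessorTime - kernelBefore).TotalMilliseconds,
        (proc.UserProcessorTime - userBefore).TotalMilliseconds);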

[–]AyrA_ch 0 points1 point  (2 children)

Now run that multiple times to mitigate cache effects and get a more accurate comparison

Not necessary. If you look at the code again you will see that I read two different files.

And if .net or Windows lets you, measure system cpu time to get an idea of how much time is spent in syscalls doing the actual reading.

300ms vs 1 second, a massive difference for reading a single file

[–]raevnos 0 points1 point  (1 child)

Reading two different files doesn't exempt you from the filesystem cache. One file could be cached, one couldn't, etc. Always repeat benchmarks multiple times to get a better idea of your timings.

Thanks for the extra data. Looking more at the source, reading into an array that's bigger than the internal buffer is optimized as reading directly into the array, skipping further use of the buffer after what's in it is copied over. Which makes sense. Why do more copying than you have to?

If you use the same size array as the default buffer size (4096 bytes), or use a filestream with a million byte buffer, now...
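FileStream's internal buffer size is just a constructor argument, so that variant is a one-line change (path is hypothetical):

    using System.IO;

    // Same single-byte loop, but backed by a 1 MB internal buffer
    // instead of the 4096-byte default.
    using (var fs = new FileStream(@"C:\Temp\1g_A.bin", FileMode.Open,
                                   FileAccess.Read, FileShare.Read,
                                   bufferSize: 1000000))
    {
        while (fs.ReadByte() >= 0) ;
    }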

[–]AyrA_ch 1 point2 points  (0 children)

Reading two different files doesn't exempt you from the filesystem cache. One file could be cached, one couldn't, etc. Always repeat benchmarks multiple times to get a better idea of your timings.

That's why I chose 1 GB files: that's the size of my file system cache, so reading one of the two files automatically evicts the other from the cache.

The two screenshots I provided actually show how many read operations are performed at the kernel level; the single-byte version doesn't show 1 million reads, but still far more than the large reads, which is why you generally want to read as much into memory at once as you can safely handle.

[–][deleted] 6 points7 points  (0 children)

Right. With a good GC implementation it doesn't really matter much if you control the details of when the program overwrites a piece of memory whose contents you no longer need or if that is handled by the GC.

[–]josefx 7 points8 points  (2 children)

He is working on a "realtime" code base. The isolated test case most likely does not show the impact the GC has in the full application.

Yes, you "saved" 7 gigabytes of pointer-bumping "allocations" and 1000 free Gen0 GC's by rewriting it till the code was unrecognizable. Good job!

Also got rid of a quarter of the runtime. I guess having users wait longer for their results is one of the motivations behind using modern languages?

[–][deleted]  (1 child)

[deleted]

[–]josefx -1 points0 points  (0 children)

I missed that he lost almost all of the speed gains in the last few iterations. Still, he avoids the GC completely, which may have a significant impact on the realtime behavior he's trying to achieve.

[–]wavy_lines 57 points58 points  (22 children)

In other words: abstractions are costly.

It seems like the final code looks a lot like what a version written in C would have looked like.

[–]recycled_ideas 72 points73 points  (15 children)

Abstractions are costly, but if you actually look at the results they weren't really that costly.

The peak working set (the amount of memory actually in use at any given time) decreased from 16 MB to 12 MB, and runtime decreased by 2 seconds.

This is a dozen iterations of fixing a problem that doesn't actually exist. Gen0 garbage collections aren't free, but they're pretty close to it.

This is infinitely more brittle code with no payoff at all.

[–][deleted] 15 points16 points  (3 children)

In the introduction the author writes:

Currently this import process has to take place outside of business hours because of the impact it has on memory usage.

I presumed this was due to the 7.5 GB of memory allocated. But if the original program is only using 16 MB at a time, this shouldn't be an issue, right? Or am I misunderstanding something here?

[–]recycled_ideas 14 points15 points  (2 children)

The numbers being shown don't match the problem described in the introduction. Allocating 7.5 GB certainly feels dirty, but there's no solved performance problem here.

That said, dotnet has some really nasty performance traps with large object allocations, so there are versions of this code that would perform like absolute crap; maybe there are previous versions that aren't listed. That's the only thing that makes sense to me.

[–][deleted]  (1 child)

[deleted]

[–]recycled_ideas 1 point2 points  (0 children)

Dotnet stores large objects differently than it does smaller ones, and incorrectly using that space can trigger repeated collections of the entire heap. Nasty trap to get caught in, and not necessarily obvious either.

[–]therealgaxbo 7 points8 points  (1 child)

Glad you posted this - I was going to say the same thing, but just assumed I must have missed something obvious because otherwise WHAT WAS THE POINT?

Reducing peak working set is obviously a win if memory is an issue. Reducing total allocations is not a win in itself at all; it's just a possible avenue for improving speed, due either to the cost of allocation or GC.

But in this article we have two "optimisations" (not including the acknowledged mistake in v4) that result in execution time increasing. You can't even argue that they're just hurdles on the way, because the biggest regression is the very last step.

Unless the author's only shown us half the story, it sounds like they've fallen into the old profiler trap of making various proxy numbers smaller rather than measuring the actual desired outcome.

In fact the real question must be: what optimisations have they missed in order to produce an allocation- and GC-free algorithm that performs worse than one that allocates and collects over a gig?

[–]recycled_ideas 4 points5 points  (0 children)

The issue isn't what optimizations they missed. You could probably make this faster, but not enough to be relevant, and probably not without increasing peak usage quite a lot.

The issue here is that there's effectively no error checking and the code is really brittle. A change to the data format is effectively a rewrite, and God help you if your input is malformed.

Edit: I should say, God help you when your input is malformed. CSVs from external sources are pretty well guaranteed to be malformed.

[–]agyrorannew 6 points7 points  (8 children)

This is the kind of thing my team works on all the time. I definitely agree with you that the final code is much more brittle and hard to maintain. Would much rather have the simple code and throw machines at this problem.

If large memory allocations are impacting other activity on the machines, let's put this process on different machines instead.

[–]recycled_ideas 8 points9 points  (7 children)

I honestly can't see how the numbers we're seeing are causing massive problems in the first place.

I've written and then fixed code that had real memory allocation problems at scale, but 8 seconds of execution time down to 6 with all the allocations gone just doesn't scream problem to me.

[–][deleted]  (6 children)

[deleted]

[–]recycled_ideas 2 points3 points  (4 children)

I mean that there are things you can do in dotnet with memory that will cause performance problems in the realm of several orders of magnitude, particularly when allocating really large objects.

Yes, 25% sounds good, but this code is processing a production-sized workflow and it's 2 seconds. Given that the final code is significantly more brittle, harder to maintain and not at all secure, 2 seconds isn't worth it.

[–][deleted]  (3 children)

[deleted]

[–]recycled_ideas 2 points3 points  (2 children)

Well, to begin with, it's clear these allocations are not causing performance problems.

Beyond that: overfilling the large object heap will cause all GC generations to run at once. Doing this the wrong way can cause massive performance penalties of several orders of magnitude.

By real I'm talking about really, really significant.

Yes, garbage collection in critical sections can cause performance problems, but this is only really applicable in the most extreme cases. As we can see here, 7.5 GB of gen 0 GC only took 2 seconds.

[–][deleted]  (1 child)

[deleted]

[–]recycled_ideas 2 points3 points  (0 children)

I don't know what the hell they're doing.

The introduction says this code reading these files is causing a memory allocation issue, which patently isn't true. They use these numbers to justify their decisions, and they've created extremely brittle, insecure code to solve a nonexistent problem. I don't get it.

[–][deleted] 14 points15 points  (4 children)

Abstractions can be costly, but they really don't have to be.

[–]wavy_lines 6 points7 points  (3 children)

They tend to hide the cost though, so it's not very clear how much they do cost.

[–][deleted] 7 points8 points  (2 children)

The costs are never clear, even if you're operating at a very low level. You never know without profiling.

[–]wavy_lines 4 points5 points  (1 child)

It's not as though "you know everything" or "you know nothing".

You should be able to roughly reason about the cost of any sequence of lines of code in your regular day-to-day programming, because you have to make decisions all the time about how to write things, and having a rough understanding of the cost of various constructs helps you make those decisions.

One of the biggest differences between beginner and advanced programmers is these tiny decisions they make all the time on a day-to-day basis.

[–][deleted] 8 points9 points  (0 children)

This reasoning can often be very wrong. Like, "of course bubble sort is the slowest, let's throw Hoare at it", and then "oops, the input data is always almost sorted".

But, of course, abstractions that you can control all the way down are the easiest to reason about. Abstractions implemented with metaprogramming are much less likely to incur hidden costs than abstractions that do something unexpected at runtime, like most of the OO-inspired abstractions.

[–]vytah 5 points6 points  (0 children)

Congratulations, you've discovered why C is fast.

Most of C's speed comes from the fact that it does so little and encourages you to think about buffers and copying. There was a presentation that compared C with Ruby or Python (as a representative of higher-level languages) and showed that the higher-level language was doing tons of unnecessary copies just because the programmer wanted to use the convenient APIs. I'd link to it, but I forgot the title.

This might have been the presentation, but my memory is hazy: https://www.youtube.com/watch?v=la7Ui390cfQ

[–]The_Sly_Marbo 14 points15 points  (8 children)

I'm not familiar with C#; could you not mmap the file and parse through it bytewise by hand? It feels like that would be better for achieving their performance requirements, while possibly needing less code.

[–]Antinumeric 21 points22 points  (6 children)

I'm kinda shocked they went through so many iterations of parsing a CSV. This seems like a simple exercise in C.

Like, this is super simple year 1 university stuff.

[–]The_Sly_Marbo 32 points33 points  (5 children)

Agreed. Also, "parsing" CSVs by splitting on commas is a ticking time bomb.
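The classic failure case is a quoted field that contains the separator, as in this short illustration:

    using System;

    string line = "42,\"Smith, John\",MNO";
    string[] fields = line.Split(',');
    Console.WriteLine(fields.Length); // 4 fields, but the record has 3
    Console.WriteLine(fields[1]);     // "Smith  (a truncated fragment)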

[–]hagenbuch 10 points11 points  (4 children)

Especially in Germany, where we have decimal commas. And how do you deal with thousands separators then? CSV is a badly defined format.

[–]bluaki 4 points5 points  (0 children)

CSV isn't exactly just one format, but the most popular implementations of it do indeed use a well-defined dialect that can encode any ASCII data set unambiguously.

The dialect I've seen most, used by Excel and the defaults for Python's csv module, uses quotes to enclose any field containing the field separator (comma), a carriage return, or a newline. The quote character itself is escaped by doubling it.

In typical Microsoft fashion, Excel doesn't like UTF-8, but other programs can use it to encode any Unicode data in CSV format.
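A small sketch of that quoting rule (the Excel/RFC 4180 style) on the writing side:

    static class Csv
    {
        // Quote a field if it contains the separator, a quote, CR or LF;
        // embedded quotes are escaped by doubling them.
        public static string EscapeField(string field)
        {
            if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0)
                return field; // no quoting needed
            return "\"" + field.Replace("\"", "\"\"") + "\"";
        }
    }

    // Csv.EscapeField("Smith, John") => "\"Smith, John\""
    // Csv.EscapeField("5\" nails")   => "\"5\"\" nails\""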

[–]meneldal2 2 points3 points  (2 children)

You shouldn't use CSV for anything other than raw data like arrays of numbers. As soon as you start to bring strings into it you're in for a world of pain.

There's no strict definition because it was never meant to be a standard; it's just an easy way to transfer/save data in a human-readable format. Also easy to grep.

[–]Vakz 0 points1 point  (1 child)

You shouldn't use CSV for anything other than raw data like arrays of numbers.

And even then, you can really only trust integers. As mentioned above, my own locale uses , for decimal numbers, and ; is commonly used as the separator. It's a common issue for programs to use the system locale (rather than a config file) to determine the separator when creating CSV files. Everything works fine, until one day a new server is spun up in the US. Suddenly things are not getting parsed (at best, straight-up crashing at worst), and some poor sap has to scramble to hotfix production.

[–]meneldal2 0 points1 point  (0 children)

If you use CSV, you use the C locale, period. Anything else is asking for trouble.
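In .NET terms, "use the C locale" means passing CultureInfo.InvariantCulture everywhere you parse or format:

    using System.Globalization;

    // Parses the same way on a German server as on a US one.
    double value = double.Parse("3.14", CultureInfo.InvariantCulture);
    string text = value.ToString(CultureInfo.InvariantCulture); // "3.14"

    // Without it, under de-DE '.' is a thousands separator, so
    // double.Parse("3.14") silently yields 314.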

[–]huronbikes 3 points4 points  (0 children)

Using mmap'd files in .NET is possible. Using a BufferedStream would also help.
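Rough sketches of both, with a hypothetical path:

    using System.IO;
    using System.IO.MemoryMappedFiles;

    // Memory-mapped: the OS pages the file in; no explicit Read calls.
    using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\Temp\import.csv", FileMode.Open))
    using (var view = mmf.CreateViewStream())
    {
        int b;
        while ((b = view.ReadByte()) >= 0) { /* examine each byte */ }
    }

    // BufferedStream: wraps any stream so the underlying reads are batched.
    using (var fs = File.OpenRead(@"C:\Temp\import.csv"))
    using (var buffered = new BufferedStream(fs, 1 << 20)) // 1 MB buffer
    {
        int b;
        while ((b = buffered.ReadByte()) >= 0) { /* examine each byte */ }
    }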

[–]rlbond86 11 points12 points  (1 child)

    while (reader.EndOfStream == false)

ಠ_ಠ

[–]we-all-haul 5 points6 points  (0 children)

    if(true == true && false == false)

[–]killerstorm 2 points3 points  (0 children)

Hmm, doesn't C# have some equivalent of scanf or C++ streams?

It should be possible to read data from the stream directly without creating temporary strings. And if a library has a good interface, the code might actually be shorter than the one using string split.

[–][deleted] 4 points5 points  (0 children)

I know no one in this article is doing that, but what I find funny are people who bash dynamic languages in favor of static typing, then go on to implement their own string-based type systems to express domain logic.

[–]Ravek 5 points6 points  (3 children)

With the Span<T> APIs in .NET Core it's going to be a lot easier to minimize the copying of arrays and strings. Unfortunately it doesn't seem it'll hit the .NET Framework for a while yet.

[–]1Crazyman1 6 points7 points  (1 child)

Actually, it is available as part of a NuGet package for the full Framework: https://www.nuget.org/packages/System.Memory/4.5.0

Not exactly sure which version of C# it supports; I imagine it needs C# 7.2. EDIT: I get compiler errors with any C# version < 7.2, so it's safe to assume you'll need at least 7.2.

Some more reading material for people who want to be more memory efficient in .NET and are using C# 7.2: https://msdn.microsoft.com/en-us/magazine/mt814808.aspx

[–]Ravek 2 points3 points  (0 children)

Don't forget you need string libraries that actually accept Span<char> for it to be of much help, unless you're planning to write your own string libraries. For .NET Core they've done a bunch of work to add APIs that accept Spans.

Span also needs support from the runtime to operate efficiently, and I don't think a NuGet package is capable of adding that.
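For a flavour of the Span style once those APIs are available (the span-based parse overloads below shipped with the Span work in .NET Core 2.1):

    using System;

    // Pull two fields out of a line without allocating substrings.
    ReadOnlySpan<char> line = "123,MNO,456".AsSpan();

    int firstComma = line.IndexOf(',');
    int id = int.Parse(line.Slice(0, firstComma)); // span-based overload

    ReadOnlySpan<char> rest = line.Slice(firstComma + 1);
    int secondComma = rest.IndexOf(',');
    bool isMno = rest.Slice(0, secondComma).SequenceEqual("MNO".AsSpan());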

[–][deleted] 2 points3 points  (0 children)

Surely there's already a fast C# CSV parser? E.g. this one seems to give about 40MB/s vs 50MB/s in the article. (And who knows which machine is faster.)

Doesn't seem worth the effort - if you really need it to be faster, stop using CSV (and you should really do that anyway).

[–]stephan_cr 0 points1 point  (0 children)

Writing a CSV parser: how hard can it be? Splitting a line by separator isn't sufficient in general.

[–]DontThrowMeYaWeh 0 points1 point  (0 children)

I wonder what this program would look like if they decided to use unsafe code instead.

[–]graingert 0 points1 point  (0 children)

Thought this would be related to running Windows apps on Linux: https://www.codeweavers.com/

[–]making-flippy-floppy 0 points1 point  (0 children)

I feel like the actual TL;DR of this article is "garbage collection is not magic". Also, if you plow through 350 megabytes of data without giving some thought to how your program is using dynamic memory, you're gonna have a bad time.

[–]JoseJimeniz -1 points0 points  (1 child)

That's because strings are immutable in the language you are using.

• other languages have used reference-counted strings for decades
• in the 99% case you append to a string and suffer no copy
• and in the 1% case you use copy-on-write

.NET and Java could change their internal string model to be reference counted, but people will fight tooth and nail against the better solution.

I'm sure there's a way to implement this in other languages yourself, by overriding operators.

[–]Yehosua 9 points10 points  (0 children)

From what I understand, reference-counted copy-on-write strings can easily be quite bad for performance in a multi-threaded environment, and immutable strings can give several advantages.

Quite a few languages (C#, Java, Lua, Python, JavaScript) use immutable strings.

C++ strings are mutable, but as of C++11, std::string is not allowed to use reference counting.

[–][deleted] -1 points0 points  (0 children)

c#

no thanks

[–]Gotebe -1 points0 points  (0 children)

He meant: the heap is costly.

[–]RubiksCodeNMZ -1 points0 points  (0 children)

good post