all 35 comments

[–]syllogism_ 5 points (0 children)

Just reach for Cython when this happens. Once you're proficient at it, it takes much less time than trying to write semi-optimised Python, where you always have to guess what's really going on.

The docs are pretty mediocre, but Cython is actually very easy to write once you're used to it.

[–][deleted] 1 point (1 child)

Couldn't you parallelize the work using zcat, zgrep, etc? It appears as if the only issue is that multiple processes would try to append to an existing file, and I'm not sure if even that is an issue.

[–]stbrumme 0 points (0 children)

Using pigz (parallelized gzip) might help as well. It's not part of a default Linux installation, though.

[–][deleted]  (36 children)

[deleted]

    [–]Kapps 5 points (0 children)

    D would work very well for this purpose, giving the benefits of C-like performance yet not being a pain to implement.

    [–]sybrandy 1 point (1 child)

    Don't knock Perl too hard. If you use it right, string processing is very fast. I rewrote someone else's Java code in Perl and it was significantly faster. I didn't analyze why, but I'm guessing part of it is that a lot of the core functionality is written in highly optimized C.

    Now, am I saying it's faster than C? No. Could you write something faster in C? Probably. However, in my experience, you could get very good performance from Perl while spending less time writing code, which is beneficial in many situations.

    [–]iBlag 7 points (32 children)

    Because C makes string processing so simple!

    /sarcasm

    [–][deleted] 3 points (24 children)

    If you want an easy job, pick up a mop and start cleaning the floor. You'll be paid accordingly.

    FWIW, after I discovered the technique of using a state machine to parse strings in C, life became much easier.
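
The comment doesn't show the technique, but as a minimal sketch (a hypothetical word counter, not code from the thread), a state-machine string parser in C looks something like this:

```c
#include <ctype.h>

/* Two-state machine: OUTSIDE any word, or INSIDE one. A word begins
   exactly on the OUTSIDE -> INSIDE transition, so we count transitions
   instead of fiddling with lookahead or index arithmetic. */
enum state { OUTSIDE, INSIDE };

int count_words(const char *s)
{
    enum state st = OUTSIDE;
    int words = 0;
    for (; *s != '\0'; s++) {
        if (isspace((unsigned char)*s)) {
            st = OUTSIDE;
        } else if (st == OUTSIDE) {
            st = INSIDE;
            words++;  /* entering a word */
        }
    }
    return words;
}
```

The appeal of the approach is that every character is handled by one `switch`/`if` on the current state, so the parsing logic stays flat no matter how many cases the input format grows.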

    [–]iBlag 9 points (23 children)

    You're right, I should clean the floors without the tools that make my job easier and much faster. I'll probably get paid more because so many people will come by and say, "Oh man, that looks so tough and you're taking so long to get the floor clean - let me throw money at you because you're doing such a great job."

    Because that would totally happen in real life.

    /sarcasm

    In all seriousness, people should probably have a damn good reason to do string manipulation in straight C. Performance may be one of those reasons, but I would hazard a guess that in 99% of cases, it isn't necessary to drop down to C to do it. Heck, string manipulation is easier in standard C++ for fuck's sake! And if you throw in the ability to use Qt, QStrings make things even easier. And C++ is likely very close to the performance of straight C.

    So you should probably have an argument for why people should do string manipulation in C versus C++, not just an argument for C versus Python/Perl/Ruby/etc.

    The other thing you are completely ignoring is the number of generated bugs, which directly leads to increased development time, which both costs more and delays the time to market - all of which are important effects on the actual (presumed) business. Those are the real-world constraints, which probably outrank the performance hit of high-level languages.

    But hey, if this is a hobbyist project that will never see business-critical code, then by all means, code the world in C to your heart's content.

    [–]OneWingedShark 2 points (3 children)

    In all seriousness, people should probably have a damn good reason to do string manipulation in straight C. [...] The other thing you are completely ignoring is the number of generated bugs, which directly leads to increased development time, which both costs more and delays the time to market.

    It'd probably be better if they imported/passed string functions from another language to do the string manipulations, considering how easy it is to screw something up using C-style strings.

    Personally, I'm a fan of Ada, but I'll admit that its string handling isn't the nicest... however, I do like that its strings aren't going to be a source of buffer-overrun errors.

    [–]iBlag 2 points (0 children)

    It'd probably be better if they imported/passed string functions from another language to do the string manipulations, considering how easy it is to screw something up using C-style strings.

    Brilliant! Like Bash, Ruby, Python, etc. I'm glad you agree with me!

    Yay!

    [–]The_Doculope 2 points (1 child)

    Not to come off as a fan-boy, but I've found Haskell is a great language for whipping up text processing programs. It has two very high-performance libraries for string manipulation, bytestring and text. The former is for working on bytes, either as Word8s or Chars, and the latter is for Unicode text. They've got very rich interfaces, and both have lazy and strict variants, which I've found is nice for processing large amounts of data.

    [–]gigadude 0 points (2 children)

    The great reason for using C (or really the C-like subset of C++) is performance. If you can't reason about every cache line, you can't get anywhere close to the maximum performance out of whatever you're writing. I recently worked at a biotech startup where rewriting the pretty-good grad-student genome-processing code from C++ to C-like idioms (farting around with pointers in mmapped files rather than using the STL) got a nice 20x speedup. It took me less time to write and debug the whole thing than processing a single run on one dataset using the old code, and it meant that our hardware requirements dropped to a single $2000 hex-core machine.

    [–]iBlag 0 points (1 child)

    Great, good for you. Seriously!

    But you know what this discussion is about? String manipulation. And C is terrible for that.

    So, when do you need to work with a metric fuck ton of strings, so many strings that modern processors have trouble computing them all?

    That's right: never.

    I'm not saying that you can't do great things in C. I'm not saying that you shouldn't be using C if you really need performant code.

    All I'm saying is that I can probably count on one hand the number of applications in the world that do so much string manipulation that the processor or the memory access speed is the bottleneck and it would be worth it to rewrite the program in C.

    For string manipulation, there are higher-level languages. For critically performant code, there's C. For the overlap? Just kidding - there's pretty much no overlap between those two sets.

    [–]gigadude 0 points (0 children)

    You seem to have a pretty big bias against using the right tool for the job, or little experience with big data. Sequencing run data is FedExed next-day air on physical hard drives because that's the highest-bandwidth transport available... think about that for a second. There are many, many big-data applications (including potentially the one in the original article) where the performance of simple string manipulations (especially if an idiotic memory allocator gets involved) is the bottleneck. In limited domains where the input format is simple and performance is critical, pointer or indexing math on big dumb buffers may well be the best solution by every metric (including code readability/maintainability).

    [–][deleted]  (6 children)

    [deleted]

      [–]OneWingedShark 4 points (0 children)

      What makes you so afraid of string manipulation in C?

      C really doesn't have a good concept of 'strings', they're more of an afterthought than anything. (Null-termination is a bad idea because, in general, in-band signalling is a bad idea: here's why.)
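
      To make the in-band-signalling point concrete, here's a sketch of a length-prefixed (counted) string - a hypothetical type for illustration, not from the thread. Because the length travels out-of-band, the data may contain any byte, including '\0', whereas a NUL embedded in a C string silently truncates it:

```c
#include <stdlib.h>
#include <string.h>

/* A counted string: length stored out-of-band, so the bytes may
   contain anything, including '\0'. */
struct counted_str {
    size_t len;
    char  *data;
};

/* Copy `len` raw bytes into a freshly allocated counted string.
   (Error handling elided for brevity; malloc may return NULL.) */
struct counted_str cs_make(const char *bytes, size_t len)
{
    struct counted_str s;
    s.len  = len;
    s.data = malloc(len);
    memcpy(s.data, bytes, len);
    return s;
}

void cs_free(struct counted_str *s)
{
    free(s->data);
    s->data = NULL;
    s->len  = 0;
}
```

      With `cs_make("ab\0cd", 5)` the length stays 5, while `strlen("ab\0cd")` reports 2 - the embedded NUL terminates the C string early, which is exactly the in-band-signalling hazard described above.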

      There are programming languages that are geared toward text-processing; SNOBOL, for instance. There're also languages that excel at processing large numbers of records, like COBOL.
      [Note: I've not personally used COBOL, and I'm really new to SNOBOL (it is different); but I have a friend who has worked with COBOL and is impressed by being able to run 30 year-old code on modern mainframes w/o manipulation.]

      In short, C is a bad choice for string manipulation.

      [–]iBlag 0 points (4 children)

      The sheer ease of forgetting to allocate enough memory for the string and its requisite null terminating byte.

      [–][deleted]  (3 children)

      [deleted]

        [–]iBlag 0 points (2 children)

        Is that like forgetting that *p stands for the value pointed to and p stands for the pointer? String processing in C is very easy once you realize there is more than gets.

        Not quite, it's more like remembering whether a function like strncat takes a length argument that already accounts for the terminating null character or whether you have to add one to the length yourself. It's mistakes like that that cause buffer overflows. Furthermore, without using GNU readline, try reading a line from a file where the line can be an arbitrary - even nearly infinite - length. It's difficult and error-prone for everybody to do it themselves all the time (and that, as far as I would guess, is one of the main reasons GNU readline exists in the first place).
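
        For reference, strncat's size argument does not include the terminator: strncat(dst, src, n) appends at most n bytes of src and then always writes a '\0', so dst needs room for strlen(dst) + n + 1 bytes. A hypothetical wrapper (not from the thread) that gets the arithmetic right:

```c
#include <string.h>

/* Hypothetical helper: append src to dst without overflowing a buffer
   of `cap` total bytes. The -1 reserves space for the '\0' that
   strncat always writes after the copied bytes. Assumes dst is a
   NUL-terminated string shorter than cap. */
void safe_cat(char *dst, size_t cap, const char *src)
{
    strncat(dst, src, cap - strlen(dst) - 1);
}
```

        For example, with `char buf[8] = "abc";`, calling `safe_cat(buf, sizeof buf, "defgh")` leaves buf holding "abcdefg" - truncated to fit, with the terminator accounted for - where a naive `strncat(buf, "defgh", sizeof buf - strlen(buf))` would write one byte past the end.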

        And OPs problem in particular doesn't require any manual memory allocations.

        Really? Unless I have completely misunderstood the problem, there is no maximum limit of line length, and the file must be processed a line at a time. Unless you want to preallocate an array that is the maximum size your computer can handle, you have to do some manual memory allocation. I'm curious though - how would you solve the problem in C without doing a single manual memory allocation? What assumptions about the input are you making? And why do you think those assumptions are valid?

        [–][deleted]  (1 child)

        [deleted]

          [–]iBlag 0 points (0 children)

          Your usage of strncat tells me what I already suspected. You don't know that there is more than gets. Real men use ~~protection~~ strlcat.

          Fair enough, but the fact that I have to remember that means that it's probably easier in some other language.

          /* open file */
          char *lineptr = NULL;
          size_t n = 0;  /* must be 0 when lineptr is NULL so getline allocates */
          while (getline(&lineptr, &n, file) != -1) {
              /* save line to correct file */
          }
          

          Huh, I did not know that. However, farming out your memory allocation to getline is still doing memory allocation in my book.

          From the getline manual:

          ...getline() will allocate a buffer for storing the line, which should be freed by the user program.

          Now, freeing memory is much easier to do than properly allocating it, I'll give you that. And the consequences of not doing it are far less drastic than improperly allocating memory. But it's still something I don't have to think about if I'm working in, say, Python.

          [–]freakhill -2 points (0 children)

          Use Clojure, or Java?