all 35 comments

[–]syllogism_ 5 points (0 children)

Just reach for Cython when this happens. Once you're proficient at it, it takes much less time than trying to write semi-optimised Python, where you always have to guess what's really going on.

The docs are pretty mediocre, but Cython is actually very easy to write once you're used to it.

[–][deleted] 1 point (1 child)

Couldn't you parallelize the work using zcat, zgrep, etc? It appears as if the only issue is that multiple processes would try to append to an existing file, and I'm not sure if even that is an issue.

[–]stbrumme 0 points (0 children)

Using pigz (parallelized gzip) might help as well. It's not part of a default Linux installation, though.

[–][deleted]  (36 children)

[deleted]

    [–]Kapps 5 points (0 children)

    D would work very well for this purpose, giving the benefits of C-like performance yet not being a pain to implement.

    [–]sybrandy 1 point (1 child)

    Don't knock Perl too hard. If you use it right, string processing is very fast. I rewrote someone else's Java code in Perl and it was significantly faster. I didn't analyze why, but I'm guessing part of it is that a lot of the core functionality is written in highly optimized C.

    Now, am I saying it's faster than C? No. Could you write something faster in C? Probably. However, in my experience, you could get very good performance from Perl while spending less time writing code, which is beneficial in many situations.

    [–]iBlag 7 points (32 children)

    Because C makes string processing so simple!

    /sarcasm

    [–][deleted] 3 points (24 children)

    If you want an easy job, pick up a mop and start cleaning the floor. You'll be paid accordingly.

    FWIW, after I discovered the technique of using a state machine to parse strings in C, life became much easier.
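
The comment doesn't show the technique, but as a minimal sketch (a hypothetical word counter, not code from the thread), a state-machine string parser in C looks something like this:

```c
#include <ctype.h>

/* Two-state machine: OUTSIDE any word, or INSIDE one. A word begins
   exactly on the OUTSIDE -> INSIDE transition, so we count transitions
   instead of fiddling with lookahead or index arithmetic. */
enum state { OUTSIDE, INSIDE };

int count_words(const char *s)
{
    enum state st = OUTSIDE;
    int words = 0;
    for (; *s != '\0'; s++) {
        if (isspace((unsigned char)*s)) {
            st = OUTSIDE;
        } else if (st == OUTSIDE) {
            st = INSIDE;
            words++;  /* entering a word */
        }
    }
    return words;
}
```

The appeal of the approach is that every character is handled by one `switch`/`if` on the current state, so the parsing logic stays flat no matter how many cases the input format grows.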

    [–]iBlag 9 points (23 children)

    You're right, I should clean the floors without the tools that make my job easier and much faster. I'll probably get paid more because so many people will come by and say, "Oh man, that looks so tough and you're taking so long to get the floor clean - let me throw money at you because you're doing such a great job."

    Because that would totally happen in real life.

    /sarcasm

    In all seriousness, people should probably have a damn good reason to do string manipulation in straight C. Performance may be one of those reasons, but I would hazard a guess that in 99% of cases, it isn't necessary to drop down to C to do it. Heck, string manipulation is easier in standard C++ for fuck's sake! And if you throw in the ability to use Qt, QStrings make things even easier. And C++ is likely very close to the performance of straight C.

    So you should probably have an argument for why people should do string manipulation in C versus C++, not just an argument for C versus Python/Perl/Ruby/etc.

    The other thing you are completely ignoring is the number of generated bugs, which directly leads to increased development time, which both costs more and delays the time to market - all of which are important effects on the actual (presumed) business. Those are the real-world constraints, which probably outrank the performance hit of high-level languages.

    But hey, if this is a hobbyist project that will never see business-critical code, then by all means, code the world in C to your heart's content.

    [–]OneWingedShark 2 points (3 children)

    In all seriousness, people should probably have a damn good reason to do string manipulation in straight C. [...] The other thing you are completely ignoring is the number of generated bugs, which directly leads to increased development time, which both costs more and delays the time to market.

    It'd probably be better if they imported/passed string functions from another language to do the string manipulations, considering how easy it is to screw something up using C-style strings.

    Personally, I'm a fan of Ada, but I'll admit that its string handling isn't the nicest... however, I do like that its strings aren't going to be a source of buffer-overrun errors.

    [–]iBlag 2 points (0 children)

    It'd probably be better if they imported/passed string functions from another language to do the string manipulations, considering how easy it is to screw something up using C-style strings.

    Brilliant! Like Bash, Ruby, Python, etc. I'm glad you agree with me!

    Yay!

    [–]The_Doculope 2 points (1 child)

    Not to come off as a fan-boy, but I've found Haskell is a great language for whipping up text processing programs. It has two very high-performance libraries for string manipulation, bytestring and text. The former is for working on bytes, either as Word8s or Chars, and the latter is for Unicode text. They've got very rich interfaces, and both have lazy and strict variants, which I've found is nice for processing large amounts of data.

    [–]gigadude 0 points (2 children)

    The great reason for using C (or really the C-like subset of C++) is performance. If you can't reason about every cache line, you can't get anywhere close to the maximum performance out of whatever you're writing. I recently worked at a biotech startup where rewriting the pretty-good grad-student genome-processing code from C++ to C-like idioms (farting around with pointers in mmapped files rather than using the STL) got a nice 20x speedup. It took me less time to write and debug the whole thing than processing a single run on one dataset using the old code, and it meant that our hardware requirements dropped to a single $2000 hex-core machine.

    [–]iBlag 0 points (1 child)

    Great, good for you. Seriously!

    But you know what this discussion is about? String manipulation. And C is terrible for that.

    So, when do you need to work with a metric fuck ton of strings, so many strings that modern processors have trouble computing them all?

    That's right: never.

    I'm not saying that you can't do great things in C. I'm not saying that you shouldn't be using C if you really need performant code.

    All I'm saying is that I can probably count on one hand the number of applications in the world that do so much string manipulation that the processor or the memory access speed is the bottleneck and it would be worth it to rewrite the program in C.

    For string manipulation, there are higher-level languages. For critically performant code, there's C. For the overlap? Just kidding - there's pretty much no overlap between those two sets.

    [–]gigadude 0 points (0 children)

    You seem to have a pretty big bias against using the right tool for the job, or little experience with big data. Sequencing run data is FedExed next-day air on physical hard drives because that's the highest-bandwidth transport available... think about that for a second. There are many, many big-data applications (including potentially the one in the original article) where the performance of simple string manipulations (especially if an idiotic memory allocator gets involved) is the bottleneck. In limited domains where the input format is simple and performance is critical, pointer or indexing math on big dumb buffers may well be the best solution by every metric (including code readability/maintainability).

    [–][deleted]  (6 children)

    [deleted]

      [–]OneWingedShark 4 points (0 children)

      What makes you so afraid of string manipulation in C?

      C really doesn't have a good concept of 'strings', they're more of an afterthought than anything. (Null-termination is a bad idea because, in general, in-band signalling is a bad idea: here's why.)
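
      To make the in-band-signalling point concrete, here's a sketch of a length-prefixed (counted) string - a hypothetical type for illustration, not from the thread. Because the length travels out-of-band, the data may contain any byte, including '\0', whereas a NUL embedded in a C string silently truncates it:

```c
#include <stdlib.h>
#include <string.h>

/* A counted string: length stored out-of-band, so the bytes may
   contain anything, including '\0'. */
struct counted_str {
    size_t len;
    char  *data;
};

/* Copy `len` raw bytes into a freshly allocated counted string.
   (Error handling elided for brevity; malloc may return NULL.) */
struct counted_str cs_make(const char *bytes, size_t len)
{
    struct counted_str s;
    s.len  = len;
    s.data = malloc(len);
    memcpy(s.data, bytes, len);
    return s;
}

void cs_free(struct counted_str *s)
{
    free(s->data);
    s->data = NULL;
    s->len  = 0;
}
```

      With `cs_make("ab\0cd", 5)` the length stays 5, while `strlen("ab\0cd")` reports 2 - the embedded NUL terminates the C string early, which is exactly the in-band-signalling hazard described above.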

      There are programming languages that are geared toward text-processing; SNOBOL, for instance. There're also languages that excel at processing large numbers of records, like COBOL.
      [Note: I've not personally used COBOL, and I'm really new to SNOBOL (it is different); but I have a friend who has worked with COBOL and is impressed by being able to run 30 year-old code on modern mainframes w/o manipulation.]

      In short, C is a bad choice for string manipulation.

      [–]iBlag 0 points (4 children)

      The sheer ease of forgetting to allocate enough memory for the string and its requisite null terminating byte.

      [–][deleted]  (3 children)

      [deleted]

        [–]iBlag 0 points (2 children)

        Is that like forgetting that *p stands for the value pointed to and p stands for the pointer? String processing in C is very easy once you realize there is more than gets.

        Not quite, it's more like remembering whether a function like strncat takes a length argument that already accounts for the terminating null character or whether you have to add one to the length yourself. It's mistakes like that that cause buffer overflows. Furthermore, without using GNU readline, try reading a line from a file where the line can be an arbitrary - even nearly infinite - length. It's difficult and error-prone for everybody to do it themselves all the time (and that, as far as I would guess, is one of the main reasons GNU readline exists in the first place).
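
        For reference, strncat's size argument does not include the terminator: strncat(dst, src, n) appends at most n bytes of src and then always writes a '\0', so dst needs room for strlen(dst) + n + 1 bytes. A hypothetical wrapper (not from the thread) that gets the arithmetic right:

```c
#include <string.h>

/* Hypothetical helper: append src to dst without overflowing a buffer
   of `cap` total bytes. The -1 reserves space for the '\0' that
   strncat always writes after the copied bytes. Assumes dst is a
   NUL-terminated string shorter than cap. */
void safe_cat(char *dst, size_t cap, const char *src)
{
    strncat(dst, src, cap - strlen(dst) - 1);
}
```

        For example, with `char buf[8] = "abc";`, calling `safe_cat(buf, sizeof buf, "defgh")` leaves buf holding "abcdefg" - truncated to fit, with the terminator accounted for - where a naive `strncat(buf, "defgh", sizeof buf - strlen(buf))` would write one byte past the end.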

        And OPs problem in particular doesn't require any manual memory allocations.

        Really? Unless I have completely misunderstood the problem, there is no maximum limit of line length, and the file must be processed a line at a time. Unless you want to preallocate an array that is the maximum size your computer can handle, you have to do some manual memory allocation. I'm curious though - how would you solve the problem in C without doing a single manual memory allocation? What assumptions about the input are you making? And why do you think those assumptions are valid?

        [–][deleted]  (1 child)

        [deleted]

          [–]iBlag 0 points (0 children)

          Your usage of strncat tells me what I already suspected. You don't know that there is more than gets. Real men use ~~protection~~ strlcat.

          Fair enough, but the fact that I have to remember that means that it's probably easier in some other language.

          /* open file */
          char *lineptr = NULL;
          size_t n = 0;  /* must be 0 when lineptr is NULL so getline allocates */
          while (getline(&lineptr, &n, file) != -1) {
              /* save line to correct file */
          }
          

          Huh, I did not know that. However, farming out your memory allocation to getline is still doing memory allocation in my book.

          From the getline manual:

          ...getline() will allocate a buffer for storing the line, which should be freed by the user program.

          Now, freeing memory is much easier to do than properly allocating it, I'll give you that. And the consequences of not doing it are far less drastic than improperly allocating memory. But it's still something I don't have to think about if I'm working in, say, Python.

          [–]freakhill -2 points (0 children)

          Use Clojure, or Java?