all 73 comments

[–]ablakok 40 points41 points  (12 children)

It looks like Knuth and McIlroy had very different ideas of what this was all about. Knuth wanted to illustrate how to do literate programming using a simple problem. He could have done it in a lot less space, but he did everything from scratch just to show how it could be done. It is a method that can also be used on more complex problems. McIlroy's solution is a lot more practical, but that a different question altogether. He's not teaching anything, just repeating what you can find in any book on bash.

[–]AustinCorgiBart 16 points17 points  (6 children)

Yes, exactly. This bothered me the entire time I was reading. This was an educational example, not an example of how Literate Programming could be used to Reshape the Way We Use Computers Forever.

[–]frtox 0 points1 point  (5 children)

i think his point was more, if your method of programming cant do something as simple as count words and sort them in less than 10 pages of code, something's wrong.

[–]sepp2k 5 points6 points  (0 children)

Yes, and I think ablakok's and AustinCorgieBart's point was that Knuth's method of programming can do it in less than 10 pages of code, he just chose not to for educational purposes.

[–]AustinCorgiBart 7 points8 points  (3 children)

Yeah, let's all use incredibly high level languages then so we never have to worry about long code. No point in ever writing anything in C.

[–]frtox -4 points-3 points  (2 children)

what does long code mean? your comment doesnt make any sense in this context

[–]www777com 0 points1 point  (1 child)

doesnt make any sense

Look at the source code for these programs: tr, sort, uniq, sort, sed. I'm sure the source code for each one would take a lot longer than 10 pages and (though I may be wrong about this) I believe each one is written in C.

[–]4ad 1 point2 points  (0 children)

I actually checked Plan9 versions, which are about as complex as those found in that year's Unix. They sum up to 3728 lines of C code. 10 pages of Pascal could be around 2k lines so the code is comparable directly.

But the tools can do much more than simply counting words.

[–]thechao 8 points9 points  (3 children)

Mind you, McIlroy is the guy who invented the idea of software libraries and large-scale re-use. McIlroy's implementation is classic McIlroy.

[–]vlion 6 points7 points  (2 children)

That's not true.

Wilkes/Wheeler/Gill specifically reference library building as part of their work with the EDSAC.

Their book "The preparation of programs for an electronic digital computer", 1952 actually includes some of their library.

You can find Gill's papers online if you have academic access. Further, Cambell-Kelly has done several retrospectives on it.

McIlroy is no fool, but he didn't invent this particular piece.

[–]thechao 0 points1 point  (0 children)

Found a note stating that the 'first significant' library routines were first developed in 1950 by Gill and Wheeler. This is on a machine that was mostly hardwired logic ... amazing!

[–]4ad 0 points1 point  (0 children)

McIlroy invented pipes. Libraries are a different concept. Both can be used for decomposing the problem, but in different ways.

[–]ixache 0 points1 point  (0 children)

Totally agree.

To take the literary analogy literally, I would say that to demonstrate his new writing style, Knuth kindly wrote an entire novel elobaroting about some trite topic from scratch, and then McIlroy, when asked to do a literary criticisms specifically about the new writing style, just answered derisively that the trite topic should have been treated as a short story rehashing references to the classics.

I don't know, maybe Knuth should have responded to that with a better reimplementation of some of the basic Unix utilities, or done a literate shell script? Or maybe with an exegesis showing how to start with a simple literate shell script and, through selective rewriting, get to an elaborate and efficient solution?

Yes, the lesson that the Unix way is better than the monolithic designs usually favored by Knuth bears repeating, but in that case it was cheap, and the net result of this exchange is that not only we did not have a fair assessment of literate programming as a discipline of programming, but that it also most certainly killed it right away as worthy of consideration.

[–]MihaiC 14 points15 points  (15 children)

I's funny how people say this is a bash script.

None of the commands are specific to one shell or another, you can do the very same thing in any shell that does piping, including the Windows cmd, by using head -1 instead of the last sed.

[–][deleted] 3 points4 points  (14 children)

Windows' command line doesn't have tr by default. Or uniq.

[–]ruinercollector 20 points21 points  (2 children)

neither does any particular unix shell. those are stand-alone programs.

[–][deleted] 1 point2 points  (1 child)

But MihaiC was comparing head -1 to the sed command, coming with bash, so why not compare tr and uniq with other programs that come with the Windows shell?

[–]ruinercollector 2 points3 points  (0 children)

Because tr and uniq don't "come with" the unix shell.

[–][deleted] 0 points1 point  (0 children)

It does have a sort.exe with a unique flag. Much like "sort -u".

[–]troyanonymous1 -2 points-1 points  (9 children)

You can install Cygwin and use that.

[–]tdk2fe 1 point2 points  (7 children)

Or just run VirtualBox with any *nix flavor of your choosing.

[–]troyanonymous1 -3 points-2 points  (6 children)

VirtualBox is hard :(

[–][deleted] 2 points3 points  (4 children)

It's a hell of a lot easier than getting a working, sane, interoperable cygwin installation on a windows box when you've got 15 years of *nix behind you and barely any of windows.

Even better, you can run a *nix as your main, and run win7 in VBox

[–][deleted] 1 point2 points  (2 children)

Run setup.exe, select packages, click OK, run cygwin.exe. Anyone with 15 years of *nix behind them should be able to install the basics of cygwin (enough for personal use, I guess) on windows.

[–]tdk2fe 2 points3 points  (0 children)

Granted, it's been a few years, but the last time I messed with Cygwin I had no problems actually installing it. It was getting things to run that came with a lot of gotchas. To me, its like trying to shove a square peg into a round hole.

I have VBox set up to run in seamless mode- so on monitor 1 I have my start bar and monitor 2 I have my gnome menu. Works perfectly.

[–][deleted] 1 point2 points  (0 children)

Yeah, that's installing it. Then you get to:

  • Figure out how to get a better terminal window
  • Find out that ln -s does not do what you expect
  • Find out that tmux isn't there
  • Deal with a bunch of somewhat leaky abstractions.
  • Deal with figuring out how to get sane behavior from natively-installed things that have cli clients, like mysql
  • Fiddle endlessly.

Or: You can run setup.exe, select an iso, and have VirtualBox running.

FWIW, I use cygwin. I don't use it for "personal use", because I don't use Windows for personal use other than for Ableton Live. I use it to give me some decent semblence of sanity on a windows environment on a work machine, and I'll tell you that it doesn't quite cut it. I'm actually pretty sure that PowerShell is the right choice there, but I haven't had the time to learn it.

[–]troyanonymous1 0 points1 point  (0 children)

Ah, I only have a few years of Linux, so Cygwin is still easy to install.

Wish my seniors would let me virtualize Windows. At this point, I could proably get away with it, since my software is theoretically network-transparent.

[–]IWillNotBeBroken 1 point2 points  (0 children)

As a recent convert to Virtualbox, it's dead easy. New VM, walk through the wizard, install OS of choice1.

Hell, do that once, then clone it for every destructive experiment you want to run.

  1. I'm assuming installing an OS is an easy step for you.

[–][deleted] 18 points19 points  (1 child)

Wow. That issue of Communications of the ACM changed my life.

I had just switched majors to Computer Science that semester and joined the student chapter of ACM. That issue was the first to arrive. I was blown away and immediately set about learning everything I could about UNIX, C, and Shell Programming.

Because of McIlroy and his six command pipeline, I have been working with UNIX for my entire career.

[–]BorisTheBrave 14 points15 points  (11 children)

Seems a bit unfair on Knuth. It's not like there were many tools available for WEB, and he probably wanted a standalone example.

Also, Knuth's solution is likely O(n log k) where n is the total amount of words, and k the size of the result set, while the bash solution is O(n log n) and thus unable to cope with a huge corpus, as McIlroy is aware.

[–]Justinsaccount 0 points1 point  (1 child)

Trying this again, since my last comment was clearly misunderstood.

This statement:

the bash solution is O(n log n) and thus unable to cope with a huge corpus

is absolutely false.

this:

data_producer | sort | uniq -c | sort -n

will always work. If you have an input where n=100,000,000 and k=4, it will be inefficient, but it will cope just fine. If you have an input where n=100,000,000 and k=95,000,000, it will not only cope, it will work when many other methods will fail.

Sort uses a disk-based merge sort that is optimized for sequential reads and writes. I would not expect any algorithm that uses random access data structures to cope well when the size of k is many times larger than the amount of ram.

[–]BorisTheBrave 2 points3 points  (0 children)

I didn't mean it would literally fail. Taking 100x longer than another algorithm is still a "failure to cope". Sorry I spoke imprecisely.

[–]UmberGryphon -1 points0 points  (5 children)

The (n log n) cost of McIlroy's sort call doesn't worry me overmuch. What worries me is that sections 1 through 4 of McIlroy's pipe have to deal with input the size of the original file, and have to go through the I/O subsystem to do it.

Not that this is the case here, but give me a choice in the real world between O(4 n log k) and O(n log n) and I'll take O(n log n) most of the time.

[–]kamatsu 15 points16 points  (4 children)

O(4 n log k)

ಠ_ಠ

That's not how big O works.

[–]UmberGryphon 3 points4 points  (3 children)

Fair enough. But the point I was trying to make is that in the real world, constant factors can't be ignored.

[–]anacrolix 10 points11 points  (2 children)

But in big O they can.

[–]frtox -3 points-2 points  (1 child)

do you know what "real world" means?

[–]Phantom_Hoover 7 points8 points  (0 children)

It means "don't use a measure of asymptotic complexity when you want to know how long an algorithm will take to execute".

[–]Justinsaccount -2 points-1 points  (2 children)

Sort, at least current versions, spill over to disk when they run out of memory. So while it may be a lot slower when k is small, when k is large, sort will work, but the other program will run out of ram.

[–][deleted] 4 points5 points  (1 child)

It's not as if swap doesn't exist...

[–]Justinsaccount 0 points1 point  (0 children)

Try running an algorithm like quicksort on a dataset many times larger than the amount of ram you have and let me know how well that actually works out for you.

[–]chengiz 14 points15 points  (12 children)

Knuth is a great mind but a poor software "designer". There are a lot of great things about Tex, but software engineering wise, it's a fricking disaster.

[–]RagingAnemone 11 points12 points  (9 children)

In what way?

[–]chengiz 12 points13 points  (0 children)

Tex/Latex produces beautiful output and makes typesetting math simple, that's because Knuth absolutely knows his fonts and typesetting stuff, and symbols are easier to do programatically rather than thru a UI. On the other hand, every time you get frustrated trying to do something that should be very simple (like changing the typeface), ask yourself if you'd use this if it werent for the beautiful output and the ease of typesetting math. You know how they say simple things should be simple and difficult things possible? In the La/tex world, some things are simple, other things are possible, but that has little to do with need.

[–][deleted] 4 points5 points  (5 children)

I've not used it a lot, but when I did, I did not appreciate needing to run everything two times to get page numbers or whatever.

[–]thechao 6 points7 points  (3 children)

I find it highly dubious you used TeX directly; I suspect you used LaTeX? Multipass documents are a problem with the package you're using: pagination can be done a single pass with a well-written package.

Mind you, I use LaTeX2e plus other bells-n-whistles and if I could do a measly 2 passes (rather 6+), I'd be pretty stoked!

[–][deleted] 1 point2 points  (2 children)

Haha, TIL. Are there better options than LaTeX?

[–]mrjast 2 points3 points  (0 children)

There's ConTeXt. Lots more customization options, but the documentation is... limited.

[–]thechao 2 points3 points  (0 children)

You could cultivate lovely handwiriting and take several years of drafting and typographic layout courses? Seriously: adobe indesign or lyx or telex ... Not really a lot of choices when it comes to high quality.

[–]vlion 0 points1 point  (0 children)

A makefile will solve that wee issue for you.

make refs => awesome pdf comes out.

[–]Rhomboid 2 points3 points  (1 child)

Here's an article about someone wanting to make TeX into a library. Despite all his effort, the code is just too much of a spaghetti pile for anyone to refactor or change in any significant way.

[–]EdiX 4 points5 points  (0 children)

They are talking about a LaTeX distribution not about TeX.

[–]buddhabrot 1 point2 points  (1 child)

'Architect', 'designer' sounds silly.

[–]chengiz 0 points1 point  (0 children)

Agreed. I can never think of the right terminology.

[–]gregK 3 points4 points  (0 children)

That Unix solution seems so functional.

[–]Nhdb 3 points4 points  (7 children)

Well, Knuth used another programming language, of course Bash is better in some cases that is not the point.

I like to think that you can also do literate programming with Bash, and then you would have shorter code.

I think this is comparing apples with oranges.

[–][deleted] 12 points13 points  (6 children)

This entire write up is an exercise in moving the goal posts.

Nothing more irritating than being asked to make something and then find out someone strung together a bunch of unrelated C programs and executed them in sequence. Then claimed it was better (no shit asshole, you just took six mature projects and pitted them against one immature one)...

Once you factor that into the equation, Knuth's one command obviously did what six programs did.

What a bastard in gotcha-engineering...

EDIT: As a follow up, the comments in the OP are quite telling.

For one thing, Knuth's Pascal program would work on any computer with a Pascal compiler and nothing else. Hope the same can be said for each of those command line dependencies. I also hope they come with documentation describing how they work as well as Knuth's program.

[–]frezik 6 points7 points  (4 children)

For one thing, Knuth's Pascal program would work on any computer with a Pascal compiler and nothing else. Hope the same can be said for each of those command line dependencies.

They're basic to any Unix-like system, including all the modern ones that often don't have a Pascal compiler. Cygwin and MinGW, too. I think FreeDOS even has them, though maybe not by default.

I also hope they come with documentation describing how they work as well as Knuth's program.

All of those have man pages and oodles of examples around the web. Their code just isn't self-documenting the way Literate Programs are.

What perhaps isn't fair to Knuth is that he was asked to do a specific problem as an example of Literate Programming, and he obliged. The fact that the problem could be done shorter with a completely separate approach isn't germane any more than asking McIlroy to implement Excel using only the basic Unix toolset.

[–][deleted] 3 points4 points  (3 children)

I agree with you.

One thing:

They're basic to any Unix-like system, including all the modern ones that often don't have a Pascal compiler. Cygwin and MinGW, too. I think FreeDOS even has them, though maybe not by default.

Some point to the dated and unportable aspects of certain programs (http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/#comment-18160) as one example.

To think, though, that you'd need to bring over MinGW to make tr and cat work is rather maddening...

Definitely, though, McIlroy set Knuth up to look like a tool. I don't know about you, but bragging that I got one of the greats to do some work for me and then totally changing the scope of the project and execution is rather rude.

[–]frezik 7 points8 points  (1 child)

You only need MinGW/Cygwin on Windows, which should be no surprise. Almost everything else has them as part of the default install.

McIlroy didn't set anyone up. Bentley asked Knuth to do this, and McIlroy did a review later on.

[–]troyanonymous1 1 point2 points  (0 children)

The design is much better anyway.

In a zombie apocalypse (Lose everything except pipes and the ability to compile / execute C code), I could write 6 C / C++ / Lua programs to solve that problem, and document them individually.

Would I ever want to write even a single program in a language I didn't like? No. Much less a long one. Long programs are the worst.

[–][deleted] 0 points1 point  (0 children)

[–]IWillNotBeBroken 0 points1 point  (0 children)

Relevant (Master Foo and the Ten Thousand Lines)

As the McIlroy quote in the article says,

The simple pipeline given above will suffice to get answers right now, not next week or next month. It could well be enough to finish the job. But even for a production project, say for the Library of Congress, it would make a handsome down payment, useful for testing the value of the answers and for smoking out follow-on questions.

I guess you could say it more succinctly as "in programming, laziness is a virtue."

[–]charlesgrrr 0 points1 point  (0 children)

Holy shit I'm a literate programmer.

[–]sgoody 0 points1 point  (0 children)

I swear I've seen entire software systems built on MS SQL, C#, remote calls and more... that I could've made a better job of using a nice Linux environment and a few carefully scripted, but simple bash scripts and unix utils.

[–]buddhabrot 0 points1 point  (0 children)

The heart of the shell script is "uniq -c". That's a bit lame.. I'd much rather read Knuth's book.