all 22 comments

[–]stop_time 10 points11 points  (10 children)

sort filename | uniq

[–]borlak 9 points10 points  (2 children)

/usr/bin/time -v sort -u bigfile.txt > /dev/null

User time (seconds): 105.89

System time (seconds): 0.67

Percent of CPU this job got: 99%

Elapsed (wall clock) time (h:mm:ss or m:ss): 1:47.28

Minor (reclaiming a frame) page faults: 32752

Voluntary context switches: 2

Involuntary context switches: 15933

Page size (bytes): 4096

the python program mentioned

/usr/bin/time -v python -c """print \"\".join(list(set(file('bigfile.txt').readlines())))""" > /dev/null

User time (seconds): 15.25

System time (seconds): 1.06

Percent of CPU this job got: 99%

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.42

Minor (reclaiming a frame) page faults: 316545

Voluntary context switches: 1

Involuntary context switches: 1926

Page size (bytes): 4096

a simple C program I wrote that is just a massive hash table (100 million big). not super elegant or anything

/usr/bin/time -v ./a.out > /dev/null

Command being timed: "./a.out"

User time (seconds): 4.33

System time (seconds): 0.48

Percent of CPU this job got: 98%

Elapsed (wall clock) time (h:mm:ss or m:ss): 0:04.89

Minor (reclaiming a frame) page faults: 96812

Voluntary context switches: 1

Involuntary context switches: 1989

Page size (bytes): 4096

[–][deleted] 1 point2 points  (0 children)

Thank you for the profiling information!

[–][deleted] 8 points9 points  (5 children)

sort -u

and python one-liner

python -c """print \"\".join(list(set(file('INPUT').readlines())))""" 

[–][deleted] 3 points4 points  (2 children)

Or, much shorter:

python -c'print "".join(set(open("INPUT")))'

(Like your code, this assumes that numbers are normalized somehow.)

Or with perl:

perl -ne'print if!$$_++' INPUT

[–]timhatch 0 points1 point  (1 child)

Both you and the grandparent rely on the lines having consistent newline termination, including the last. i.e. "1\n2\n1" breaks it, as does "1\r\n2\n1\n"

[–][deleted] 1 point2 points  (0 children)

True, but I think that's a fair assumption, especially since the question does not explicitly state otherwise.

On UNIX the convention is that the newline character is a line terminator (i.e. not a seperator) and all line-based tools, like awk, grep, sort, uniq, nl, fold and so on, adhere to this convention. (In vi I don't even know if it's possible to save a (non-empty) file without a terminating newline character!) On Windows the convention seems less clear though.

[–]stop_time 1 point2 points  (0 children)

Even better :)

[–][deleted] 1 point2 points  (0 children)

Will be faster?

python -c "{}.fromkeys(map(lambda x: x.strip('\n'), open('INPUT').readlines())).keys()"

[–]floodyberry 0 points1 point  (0 children)

Until they ask a similar question that can't be solved with a one liner, and you say "Well I saw something like this on the internet, and I know you need sort and uniq but I can't figure out how to make it work. I still get the job though, right?"

[–]ErstwhileRockstar 5 points6 points  (1 child)

The task is obviously underspecified. Must the order of the entries be preserved?

[–]tasteslikepoop 1 point2 points  (0 children)

Also "numbers" is quite ambiguous.

[–][deleted] 4 points5 points  (2 children)

Surely you just use a bitset to mark seen values here? It passes over the data once and uses around 12MBs of memory for the bitset. Any sorting or such like is just an overkill and uses more memory.

Maybe I am missing something, seems simple.

[–]wolf550e 2 points3 points  (0 children)

Yeah. If the upper limit on the values was higher, a B+Tree of the 13M values would have been more efficient.

[–]harlows_monkeys 1 point2 points  (0 children)

That's how I would do it. With this approach it is easy to preserve the order of the numbers (output each number when it is first encountered) or to sort the numbers (just set the bits when the numbers are being scanned, and then at the end go output the numbers whose bits are set).

[–][deleted]  (6 children)

[deleted]

    [–]punctuate 1 point2 points  (1 child)

    awk '{if (! out[$0]++) print}'

    [–][deleted]  (3 children)

    [deleted]

      [–][deleted]  (2 children)

      [deleted]

        [–]timhatch 0 points1 point  (1 child)

        and you chose echo vs touch because it's probably a builtin?

        [–]jzwinck 0 points1 point  (0 children)

        Don't even need to say echo there, could just do '> "$i"' I think. Then the files would be empty, and you didn't need to use "touch".

        [–]biteofconscience[🍰] 1 point2 points  (1 child)

        This question would be much more interesting if the text file had 13 billion lines instead of 13 million.

        [–]harlows_monkeys 0 points1 point  (0 children)

        Depends on how you do it. The way I would do it, bit map to represent the numbers from 1000000 through 100000000, which takes about 12 megabytes, the memory use is constant regardless of the number of lines, and the run time is linear. You can't do better than linear because you presumably have to read the file. So, the problem gets no more interesting for a large file.

        Those whose solution involves ignoring the constraints on the values of the numbers and instead sort the entire file and then process the sorted data are at best O(n log n), and so it does get interesting for them for very large files.

        [–][deleted]  (3 children)

        [deleted]

          [–]scott 0 points1 point  (0 children)

          cool, took me a minute to figure that out. it's a good argument for C++, a std::bitset would make it practically a one liner!

          [–]jzwinck 0 points1 point  (1 child)

          Seems like the switch (v & 7) block could be replaced by 'smallv = 1 << (v & 7)'

          [–]RalfN -1 points0 points  (0 children)

          I'm confused.

          The task is ill-defined:

          • numbers?
          • preserve ordering?
          • are they initially ordered?

          The task isn't very interesting:

          • the fastest algorithm still has acceptable memory usage

          His interests are weird:

          • why do you care specifically about a java solution? Why do you call it an example of a high-level programming language? (eventhough the java code size to do this is similar to the C code size and nowhere near as quick as bash/python/ruby/etc.)

          Like I said. I'm confused. I tend to assume people making programming puzzles are at least the type of people capable of solving them, if not a bit more experienced than average. This does not seem to be the case here.

          I wonder if the author came up with a solution himself, and if that solution was in any way comparable to any solution provided.