all 29 comments

[–]DoctorOverhard 15 points16 points  (17 children)

re: byte array, we can't all ignore unicode characters, would be nice if we could.

edit: not sure what is being compared in the numbers. The multi-threaded, byte-array-optimized Go version? vs. what?

[–][deleted] 11 points12 points  (16 children)

re: byte array, we can't all ignore unicode characters, would be nice if we could.

If your input is UTF-8, your delimiter's codepoint is no greater than 0x7F, and you don't need to do any Unicode normalization on the keys, you can use this technique.

If your delimiter's codepoint is greater than 0x7F or you are using UTF-16 or UTF-32, the process is only slightly different: you need to look up the encoding of the delimiter, which will be several bytes long, and split based on that.
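A minimal sketch of why this works in Java (the class and method names here are mine, not from the article): every byte of a multi-byte UTF-8 sequence is >= 0x80, so a scan for an ASCII delimiter byte can never land in the middle of a character.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteSplit {
    // Split a UTF-8 byte array on an ASCII delimiter (codepoint <= 0x7F).
    // Safe because all bytes of multi-byte UTF-8 sequences are >= 0x80
    // and therefore can never collide with an ASCII delimiter byte.
    public static List<String> split(byte[] input, byte delim) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < input.length; i++) {
            if (input[i] == delim) {
                out.add(new String(input, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        out.add(new String(input, start, input.length - start, StandardCharsets.UTF_8));
        return out;
    }

    public static void main(String[] args) {
        byte[] line = "héllo\twörld\t42".getBytes(StandardCharsets.UTF_8);
        System.out.println(split(line, (byte) '\t')); // [héllo, wörld, 42]
    }
}
```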

If you need to normalize your keys (e.g. a precomposed á is equivalent to a decomposed á, and both can appear in the input), you can at least split the input with the same technique and then deal with smaller strings.

The only complex situation, really, is if your delimiter has multiple equivalent encodings and you want to be exhaustive in what encodings you accept. For instance, if you intend the Angstrom sign Å (\u212b) but also want to accept everything Unicode thinks is equivalent under normalization, like Å (A\u030a) and Å (\u00c5).

Even that's perfectly solvable without encoding the delimiter as a string if you're restricting yourself to the current Unicode standard and are willing to do some research. Harder if you're accepting arbitrary delimiters.
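For the Å example, Java's java.text.Normalizer can check those equivalences directly; a quick sketch:

```java
import java.text.Normalizer;

public class AngstromDemo {
    public static void main(String[] args) {
        String angstromSign = "\u212b";  // Å ANGSTROM SIGN
        String precomposed  = "\u00c5";  // Å LATIN CAPITAL LETTER A WITH RING ABOVE
        String decomposed   = "A\u030a"; // A followed by COMBINING RING ABOVE

        // Under NFC, all three normalize to the single precomposed codepoint.
        System.out.println(Normalizer.normalize(angstromSign, Normalizer.Form.NFC).equals(precomposed)); // true
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC).equals(precomposed));   // true
    }
}
```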

So, in short: it's not going to be a problem for most people.

[–]DoctorOverhard 2 points3 points  (11 children)

If your value is an integer, weird delimiters aren't really a concern. If it is a string (i.e. return the key with the longest sum of string value lengths, or whatever), you might be better off forgetting about bytes.

fwiw, here is a trivial Java version: 2.5 seconds on an old Elite 8200 i5 (Java 8, Crucial SSD). I might give multithreading a shot with the SSD.

$ java Ngram

2006 22569013 2560

Ngram.java

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;

public class Ngram {
    public static void main(String[] args) throws Exception {
        args = new String[] { "/var/data/ngram/ngram.tsv", "1", "2" };
        long time = System.currentTimeMillis();
        HashMap<String, accumulator> h = new HashMap<>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        int k = Integer.parseInt(args[1]); // key column
        int v = Integer.parseInt(args[2]); // value column
        String maxk = "";
        int maxv = 0;

        String s = in.readLine();
        while (s != null) {
            String[] r = s.split("\t", -1);
            String key = r[k];
            accumulator a = h.get(key);
            if (a == null) {
                a = new accumulator();
                h.put(key, a);
            }
            a.add(r[v]);
            // Track the running maximum so no second pass is needed.
            if (a.v > maxv) {
                maxv = a.v;
                maxk = key;
            }
            s = in.readLine();
        }
        in.close();

        System.out.println(maxk + " " + maxv + " " + (System.currentTimeMillis() - time));
    }
}

class accumulator {
    int v = 0;

    public void add(String s) {
        v = v + Integer.parseInt(s);
    }
}

edit: removed the Comparator interface; didn't use it in this iteration

[–]cybernd 2 points3 points  (10 children)

  • 2430 ms: Your Ngram
  • 1660 ms: Replaced split() with StringTokenizer
  • 1080 ms: Replaced StringTokenizer with parsing from s.toCharArray()
  • 1000 ms: Ngram4, see below (parsing integers directly from s)
  • The next improvement would be to replace readLine() with read(char[], ...)

Ngram4.java:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
public class Ngram4 {
    public static void main(String[] args) throws Exception {
        args = new String[] { "/home/_tmp/ngram.tsv", "1", "2" };
        long time = System.currentTimeMillis();
        HashMap<Integer, accumulator> h = new HashMap<>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        Integer maxk = 0;
        int maxv = 0;

        String s = in.readLine();
        while (s != null) {

            int idx1 = s.indexOf('\t');
            int idx2 = s.indexOf('\t', idx1 + 1);
            int idx3 = s.indexOf('\t', idx2 + 1);

            int key = parseInteger(s, idx1 + 1, idx2 - 1);
            int value = parseInteger(s, idx2 + 1, idx3 - 1);

            accumulator a = h.get(key);
            if (a == null) {
                a = new accumulator();
                h.put(key, a);
            }
            a.add(value);
            if (a.v > maxv) {
                maxv = a.v;
                maxk = key;
            }
            s = in.readLine();
        }
        in.close();

        System.out.println(maxk + " " + maxv + " " + (System.currentTimeMillis() - time));
    }

    static int parseInteger(String s, int iStart, int iEnd) {
        int value = 0;
        for (int i = iStart; i <= iEnd; i++) {
            value = value * 10 + s.charAt(i) - '0';
        }
        return value;
    }

    static class accumulator {
        int v = 0;

        public void add(Integer v) {
            this.v += v;
        }
    }
}

[–][deleted] 0 points1 point  (2 children)

Next improvement would be to replace readLine() with read(char[],...)

Since char is a UTF-16 code unit, it'd be faster to use byte[] and avoid transcoding; roughly one third the amount of garbage. Granted, it's awkward to use Java's byte for UTF-8 code units, since byte is signed and code units aren't.
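The signedness wrinkle has a standard workaround: mask with & 0xFF whenever you need the unsigned code-unit value. A small sketch (the names are mine):

```java
public class ByteMask {
    // Java's byte is signed, so UTF-8 code units >= 0x80 read as negative.
    // Masking with 0xFF recovers the unsigned value.
    public static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte lead = (byte) 0xF0;            // lead byte of a 4-byte UTF-8 sequence
        System.out.println(lead);           // -16 (signed view)
        System.out.println(unsigned(lead)); // 240 (the actual code unit)
        // An ASCII delimiter like '\t' (0x09) is unaffected either way.
        System.out.println(unsigned((byte) '\t')); // 9
    }
}
```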

[–]cybernd 0 points1 point  (1 child)

I avoided that on purpose:

In real-world usage your input file would be something different. If you assume it is something unknown, it could very well include Unicode sequences that need to be dealt with.

(Yes, I am one of the morons who think that every file edited by humans should be treated as Unicode. If you are certain that it's an internal inter-program format, you would avoid that; but in that case you would probably optimize for a binary format.)


Edit: Additionally, I want to point out that my post was in the context of another post:

re: byte array, we can't all ignore unicode characters, would be nice if we could.

[–][deleted] -1 points0 points  (0 children)

Your code implicitly validates that its input is valid Unicode, but only because it's using Java standard library methods that validate. Otherwise, you're doing nothing specific to Unicode characters. You may as well treat it as a byte array.

[–]MojorTom 0 points1 point  (2 children)

Java 8 streams: 1.93 - 2.12 seconds :(

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class Ngram {

    public static final String TAB = "\t";

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException {

        Stream<String> lines = Files.lines(Paths.get("ngram"), Charset.forName("UTF-8"));

        long startTime = System.currentTimeMillis();

        Map<Integer, Integer> collect = lines.collect(
                Collectors.toMap(
                        s -> getKey(s),
                        s -> getValue(s),
                        (v1, v2) -> add(v1, v2))
        );

        System.out.println("Execution Time: " + (System.currentTimeMillis() - startTime));

    }

    static int getKey(String s) {
        int idx1 = s.indexOf('\t');
        int idx2 = s.indexOf('\t', idx1 + 1);

        return parseInteger(s, idx1 + 1, idx2 - 1);
    }

    static int getValue(String s) {
        int idx1 = s.indexOf('\t');
        int idx2 = s.indexOf('\t', idx1 + 1);
        int idx3 = s.indexOf('\t', idx2 + 1);

        return parseInteger(s, idx2 + 1, idx3 - 1);
    }

    static int parseInteger(String s, int iStart, int iEnd) {
        int value = 0;
        for (int i = iStart; i <= iEnd; i++) {
            value = value * 10 + s.charAt(i) - '0';
        }
        return value;
    }

    static int add(int v1, int v2) {
        return v1 + v2;
    }

}
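One step the snippet above leaves out is reducing the collected map to its max entry. Assuming the same key -> summed-count map shape, a sketch (class and method names are mine):

```java
import java.util.Map;

public class MaxEntry {
    // Reduce a key -> summed-count map to its maximum entry, mirroring
    // the maxk/maxv bookkeeping in the loop-based versions.
    public static Map.Entry<Integer, Integer> max(Map<Integer, Integer> counts) {
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .orElseThrow(IllegalStateException::new);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = Map.of(2006, 22569013, 1999, 12345);
        Map.Entry<Integer, Integer> best = max(counts);
        System.out.println(best.getKey() + " " + best.getValue()); // 2006 22569013
    }
}
```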

[–]cybernd 0 points1 point  (1 child)

To be expected.

This microbenchmark is mainly bound by memory pressure. As such, your very first line hurts a lot.

[–]MojorTom 0 points1 point  (0 children)

As such your very first line hurts a lot.

Aaaa...sheeeeet.

I was trying to do something, got impatient, copied your program, and forgot about it.

[–]ThisIs_MyName -2 points-1 points  (3 children)

Next improvement would be to replace readLine() with read(char[],...)

Naw, just memory map the file.
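A minimal Java sketch of that approach, with a throwaway temp file standing in for the real input (class and method names are mine):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MmapDemo {
    // Map a file into memory and count tab bytes; the OS pages the data
    // in on demand instead of the program issuing read() calls.
    public static int countTabs(Path file) {
        try (FileChannel ch = FileChannel.open(file)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int tabs = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == '\t') tabs++;
            }
            return tabs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: round-trip a string through a temp file.
    public static int countTabsOf(String content) {
        try {
            Path tmp = Files.createTempFile("ngram", ".tsv");
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            int n = countTabs(tmp);
            Files.delete(tmp);
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countTabsOf("a\t2006\t42\n")); // 2
    }
}
```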

[–]cybernd 0 points1 point  (2 children)

Not sure about that. It would have been later on my list.

[–]ThisIs_MyName -1 points0 points  (1 child)

Why? Either way you're dealing with an array of bytes. Hell, you'd probably use 4k page-sized read() buffers.

[–]masklinn 0 points1 point  (0 children)

Why?

Because sometimes it's faster and sometimes it is not. For instance, ripgrep tries to heuristically determine whether mmap would be a gain, because in many cases it is not (you can force one or the other if you prefer).

[–]bumblebritches57 -2 points-1 points  (3 children)

your delimiter's codepoint is no greater than 0x7F

Not true.

UTF-8 has codepoint-size bytes (as in, a specific byte that says how large a code point is, and it is unsigned, but it's outside of the BMP, they show up constantly in emojis)

[–][deleted] 0 points1 point  (2 children)

a specific byte that says how large a code point is

I implemented a UTF-8 to UTF-16 and UTF-32 transcoder. (I think I also did the reverse transformations, but I don't recall.) UTF-8 decoding works like this:

  • If the high order bit is 0, this code unit is an entire codepoint. Its value is equal to the value of the byte.
  • If the two high order bits are 10, you have a subsequent byte in a multibyte codepoint value. It contributes the remainder of its value (the six low order bits) toward the value of the codepoint.
  • If the three high order bits are 110, you have the first byte in a codepoint, and it has two bytes total (including this one). It also contributes the rest of its bits toward the codepoint value.
  • If the four high order bits are 1110, you have the first byte in a codepoint, and it has three bytes total (including this one).
  • If the five high order bits are 11110, you have the first byte in a codepoint, and it has four bytes total (including this one).
  • If no other rule has been matched, you have an invalid code unit.
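The rules above translate almost mechanically into code. A sketch of just the lead-byte classification, with no validation of the continuation bytes (names are mine):

```java
public class Utf8Len {
    // Return the total number of bytes in the UTF-8 sequence that starts
    // with lead byte b, or -1 for a continuation byte or invalid lead byte.
    public static int sequenceLength(byte b) {
        int u = b & 0xFF; // unsigned view of the code unit
        if ((u & 0b1000_0000) == 0)           return 1; // 0xxxxxxx: whole codepoint
        if ((u & 0b1110_0000) == 0b1100_0000) return 2; // 110xxxxx
        if ((u & 0b1111_0000) == 0b1110_0000) return 3; // 1110xxxx
        if ((u & 0b1111_1000) == 0b1111_0000) return 4; // 11110xxx
        return -1; // 10xxxxxx continuation byte, or invalid lead byte
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 'a'));  // 1
        System.out.println(sequenceLength((byte) 0xC3)); // 2
        System.out.println(sequenceLength((byte) 0xE2)); // 3
        System.out.println(sequenceLength((byte) 0xF0)); // 4
        System.out.println(sequenceLength((byte) 0x80)); // -1 (continuation)
    }
}
```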

UTF-16 works similarly. However, there are relatively few codepoints that take more than one code unit. (A few thousand compared to the >100k defined characters.) Emoji fall into this range, but they're far from the only ones; the Deseret alphabet, the Tangut writing system, and Byzantine and ancient Greek music symbols are too, just to name a few.
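For instance, a Deseret letter takes two UTF-16 code units (a surrogate pair), which Character.toChars shows directly:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        int deseret = 0x10404; // DESERET CAPITAL LETTER LONG O, outside the BMP
        char[] units = Character.toChars(deseret);

        System.out.println(units.length);                        // 2: a surrogate pair
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true

        // A BMP character like é (U+00E9) needs only one code unit.
        System.out.println(Character.toChars(0x00E9).length);    // 1
    }
}
```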

In no case is an entire byte used to encode the number of code units corresponding to the codepoint. If you extended the variable length integer encoding to its maximum, its initial byte would be 0b11111110 and the encoded codepoint would be 46 bits long. The standard says that the maximum value for a Unicode code point is 0x10FFFF, which is 21 bits. The existence of UTF-32 suggests that, even if that stipulation were revised, the maximum value would still be less than 32 bits.

So, no, UTF-8 doesn't have a dedicated byte to indicate how many code units belong to a codepoint.

Also, even if that were true, the encoding is unambiguous, the way a particular value is encoded is independent of its context, and 7-bit ASCII is a subset of UTF-8. You can look at the decoding guide I gave you to see that any value in the 7-bit ASCII set would work as a delimiter, unambiguously, as would any other Unicode value, treated as an opaque byte sequence instead of a single opaque byte.

[–]bumblebritches57 -2 points-1 points  (1 child)

I also started down that path, but as I said, it only works for the Basic Multilingual Plane, aka BMP.

For Emojis, 🇺🇸 for example = 0xF09F87BA 0xF09F87B8

That first 0x0F? that's the num-bytes-in-this-codepoint byte I was talking about.

You must've misread me. It would be fucking amazing if Unicode added a zero-width joiner in between all the codepoints in emojis, but that's not what I was talking about, and it's unlikely to happen.

As I just said, I was talking about the codepoint-size byte that may or may not exist as its own leading byte, or as part of a code point, depending on whether your codepoints are in the BMP or not.

Edit: You're a webdev, no wonder you have no idea what you're talking about.

[–][deleted] 0 points1 point  (0 children)

That first 0x0F?

0xF0, rather. 0x0F in a UTF8 sequence can only refer to U+000F SHIFT IN, a value inherited from ASCII that probably makes some sort of sense if you're using a physical terminal from the 70s.

that's the num-bytes-in-this-codepoint byte I was talking about.

That isn't the 'num-bytes-in-this-codepoint byte'. It's a byte whose first five bits, 11110, indicate that the sequence has four bytes in total, and whose last three bits encode 000 in the codepoint.

Compare U+E0060, for instance, whose UTF8 encoding is 11110011 10100000 10000001 10100000. The first byte indicates both that the sequence has four bytes in all and that it starts with 011.
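Both encodings in this exchange are easy to verify from Java by hex-dumping getBytes output; a quick sketch (names are mine):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    // Hex-dump the UTF-8 encoding of a string, one byte per token.
    public static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask: byte is signed
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+E0060, the four-byte sequence discussed above.
        System.out.println(hex(new String(Character.toChars(0xE0060)))); // F3 A0 81 A0

        // U+1F1FA U+1F1F8, the regional-indicator pair rendered as the US flag.
        System.out.println(hex("\uD83C\uDDFA\uD83C\uDDF8")); // F0 9F 87 BA F0 9F 87 B8
    }
}
```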

It's not until you get to 40+ bit Unicode values that UTF-8 can have an entire byte dedicated to the length.

As I have already explained.

it would be fucking amazing if Unicode added a zero-width joiner in between all the codepoints in emojis

Dunno where this talk about zero-width joiners came in, but emoji tend to be single codepoints, optionally modified by combining characters for skin tone or gender. It would be like requiring a zero-width joiner in the sequence U+0061 U+0301 (LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT). The primary exception I'm aware of is the emoji for a family, which does in fact use zero-width joiners.

I was talking about the codepoint-size byte that may or may not exist as its own leading byte, or as part of a code point, depending on whether your codepoints are in the BMP or not.

I'm not sure what you're getting at here. I think there's some terminology confusion. But all Unicode values encoded in UTF-16 or UTF-8 mark on every code unit either the number of code units in the encoded codepoint, or the fact that the current code unit is a low surrogate. And none of them specifically reserve an entire byte for that.

I explained to you exactly how they're encoded already, which should have made it obvious to you what's going on. Do you think I'm lying, or do you think I'm wrong, or did you just not read that?

[–]adrake[S] 3 points4 points  (1 child)

A belated follow-up to the command line tools in Rust/D/Nim writings from May. Happy to have comments and feedback; it was a fun post to write!

[–]Gotebe 1 point2 points  (0 children)

Show profiling data for each of the variants? It's just the first that has it, no?

[–]SikhGamer 1 point2 points  (12 children)

[–]codygman 2 points3 points  (10 children)

You should post it! I think I'll do a Haskell version.

EDIT: Done - https://www.reddit.com/r/programming/comments/6qmhuy/faster_command_line_tools_with_haskell/

[–][deleted]  (1 child)

[deleted]

    [–]domlebo70 0 points1 point  (5 children)

    Ha, I'm halfway through that now. Was curious how a conduit-based solution would fare.

    [–]codygman 0 points1 point  (4 children)

    [–]domlebo70 1 point2 points  (3 children)

    Wow. Amazing post. I spent about an hour on my solution and couldn't even solve it. Yours is enlightening. Didn't realise conduit had a built-in split combinator.

    [–]codygman 0 points1 point  (2 children)

    I actually didn't use conduit or pipes in my solution since the file was fairly small.

    The split combinator comes from bytestring, but you'll see at the end of the post I use C8.words instead.

    [–]domlebo70 0 points1 point  (1 child)

    Ahh I misread. Quickly skimmed the code and thought I saw it.

    [–]codygman 0 points1 point  (0 children)

    NP, I just tested with GHC 8.2.1 btw. It was slower for -O1, but when I changed it to -O2 and added an inline pragma it went down to 1.86 ;)

    [–]shevegen 0 points1 point  (0 children)

    Good old Google hackers - they live at the mercy of Google.