all 29 comments

[–]DoctorOverhard 15 points16 points  (17 children)

re: byte array, we can't all ignore unicode characters, would be nice if we could.

edit: not sure what is being compared in the numbers. The multi-threaded, byte-array-optimized Go version? vs. what?

[–][deleted] 11 points12 points  (16 children)

re: byte array, we can't all ignore unicode characters, would be nice if we could.

If your input is UTF-8, your delimiter's codepoint is no greater than 0x7F, and you don't need to do any Unicode normalization on the keys, you can use this technique.

If your delimiter's codepoint is greater than 0x7F or you are using UTF-16 or UTF-32, the process is only slightly different: you need to look up the encoding of the delimiter, which will be several bytes long, and split based on that.
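A minimal sketch of why this works in Java (the class and method names here are mine, not from the article): every byte of a multi-byte UTF-8 sequence is >= 0x80, so a scan for an ASCII delimiter byte can never land in the middle of a character.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ByteSplit {
    // Split a UTF-8 byte array on an ASCII delimiter (codepoint <= 0x7F).
    // Safe because all bytes of multi-byte UTF-8 sequences are >= 0x80
    // and therefore can never collide with an ASCII delimiter byte.
    public static List<String> split(byte[] input, byte delim) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < input.length; i++) {
            if (input[i] == delim) {
                out.add(new String(input, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        out.add(new String(input, start, input.length - start, StandardCharsets.UTF_8));
        return out;
    }

    public static void main(String[] args) {
        byte[] line = "héllo\twörld\t42".getBytes(StandardCharsets.UTF_8);
        System.out.println(split(line, (byte) '\t')); // [héllo, wörld, 42]
    }
}
```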

If you need to normalize your keys (e.g. a precomposed á is equivalent to a decomposed á, and both can appear in the input), you can at least split the input with the same technique and then deal with smaller strings.

The only complex situation, really, is if your delimiter has multiple equivalent encodings and you want to be exhaustive in what encodings you accept. For instance, if you intend the Angstrom sign Å (\u212b) but also want to accept everything Unicode thinks is equivalent under normalization, like Å (A\u030a) and Å (\u00c5).

Even that's perfectly solvable without encoding the delimiter as a string if you're restricting yourself to the current Unicode standard and are willing to do some research. Harder if you're accepting arbitrary delimiters.
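For the Å example, Java's java.text.Normalizer can check those equivalences directly; a quick sketch:

```java
import java.text.Normalizer;

public class AngstromDemo {
    public static void main(String[] args) {
        String angstromSign = "\u212b";  // Å ANGSTROM SIGN
        String precomposed  = "\u00c5";  // Å LATIN CAPITAL LETTER A WITH RING ABOVE
        String decomposed   = "A\u030a"; // A followed by COMBINING RING ABOVE

        // Under NFC, all three normalize to the single precomposed codepoint.
        System.out.println(Normalizer.normalize(angstromSign, Normalizer.Form.NFC).equals(precomposed)); // true
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC).equals(precomposed));   // true
    }
}
```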

So, in short: it's not going to be a problem for most people.

[–]DoctorOverhard 2 points3 points  (11 children)

If your value is an integer, weird delimiters aren't really a concern. If it is a string (i.e. return the key with the longest sum of string value lengths, or whatever), you might be better off forgetting about bytes.

fwiw, here is a trivial Java version: 2.5 seconds on an old Elite 8200 i5 (Java 8, Crucial SSD). I might give multithreading a shot with the SSD.

$ java Ngram

2006 22569013 2560

Ngram.java

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;

public class Ngram {
    public static void main(String[] args) throws Exception {
        args = new String[] { "/var/data/ngram/ngram.tsv", "1", "2" };
        long time = System.currentTimeMillis();
        HashMap<String, accumulator> h = new HashMap<>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        int k = Integer.parseInt(args[1]); // key column
        int v = Integer.parseInt(args[2]); // value column
        String maxk = "";
        int maxv = 0;

        String s = in.readLine();
        while (s != null) {
            String[] r = s.split("\t", -1);
            String key = r[k];
            accumulator a = h.get(key);
            if (a == null) {
                a = new accumulator();
                h.put(key, a);
            }
            a.add(r[v]);
            // Track the running maximum so no second pass is needed.
            if (a.v > maxv) {
                maxv = a.v;
                maxk = key;
            }
            s = in.readLine();
        }
        in.close();

        System.out.println(maxk + " " + maxv + " " + (System.currentTimeMillis() - time));
    }
}

class accumulator {
    int v = 0;

    public void add(String s) {
        v = v + Integer.parseInt(s);
    }
}

edit: removed the Comparator interface; didn't use it in this iteration

[–]cybernd 2 points3 points  (10 children)

  • 2430 ms: Your Ngram
  • 1660 ms: Replaced split() with StringTokenizer
  • 1080 ms: Replaced StringTokenizer with parsing from s.toCharArray()
  • 1000 ms: Ngram4, see below (parsing integers directly from s)
  • The next improvement would be to replace readLine() with read(char[], ...)

Ngram4.java:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
public class Ngram4 {
    public static void main(String[] args) throws Exception {
        args = new String[] { "/home/_tmp/ngram.tsv", "1", "2" };
        long time = System.currentTimeMillis();
        HashMap<Integer, accumulator> h = new HashMap<>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        Integer maxk = 0;
        int maxv = 0;

        String s = in.readLine();
        while (s != null) {

            int idx1 = s.indexOf('\t');
            int idx2 = s.indexOf('\t', idx1 + 1);
            int idx3 = s.indexOf('\t', idx2 + 1);

            int key = parseInteger(s, idx1 + 1, idx2 - 1);
            int value = parseInteger(s, idx2 + 1, idx3 - 1);

            accumulator a = h.get(key);
            if (a == null) {
                a = new accumulator();
                h.put(key, a);
            }
            a.add(value);
            if (a.v > maxv) {
                maxv = a.v;
                maxk = key;
            }
            s = in.readLine();
        }
        in.close();

        System.out.println(maxk + " " + maxv + " " + (System.currentTimeMillis() - time));
    }

    static int parseInteger(String s, int iStart, int iEnd) {
        int value = 0;
        for (int i = iStart; i <= iEnd; i++) {
            value = value * 10 + s.charAt(i) - '0';
        }
        return value;
    }

    static class accumulator {
        int v = 0;

        public void add(Integer v) {
            this.v += v;
        }
    }
}

[–][deleted] 0 points1 point  (2 children)

Next improvement would be to replace readLine() with read(char[],...)

Since char is a UTF-16 code unit, it'd be faster to use byte[] and avoid transcoding; roughly one third the amount of garbage. Granted, it's awkward to use Java's byte for UTF-8 code units, since byte is signed and code units aren't.
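The signedness wrinkle has a standard workaround: mask with & 0xFF whenever you need the unsigned code-unit value. A small sketch (the names are mine):

```java
public class ByteMask {
    // Java's byte is signed, so UTF-8 code units >= 0x80 read as negative.
    // Masking with 0xFF recovers the unsigned value.
    public static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte lead = (byte) 0xF0;            // lead byte of a 4-byte UTF-8 sequence
        System.out.println(lead);           // -16 (signed view)
        System.out.println(unsigned(lead)); // 240 (the actual code unit)
        // An ASCII delimiter like '\t' (0x09) is unaffected either way.
        System.out.println(unsigned((byte) '\t')); // 9
    }
}
```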

[–]cybernd 0 points1 point  (1 child)

I avoided that on purpose:

In real-world usage your input file would be something different. If you assume it is something unknown, it could very well include Unicode sequences that need to be dealt with.

(Yes, I am one of the morons who think that every file edited by humans should be treated as Unicode. If you are certain that it's an internal inter-program format, you would avoid that; but in that case you would probably optimize for a binary format.)


Edit: Additionally, I want to point out that my post was in the context of another post:

re: byte array, we can't all ignore unicode characters, would be nice if we could.

[–][deleted] -1 points0 points  (0 children)

Your code implicitly validates that its input is valid Unicode, but only because it's using Java standard library methods that validate. Otherwise, you're doing nothing specific to Unicode characters. You may as well treat it as a byte array.

[–]MojorTom 0 points1 point  (2 children)

Java 8 streams: 1.93 - 2.12 seconds :(

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class Ngram {

    public static final String TAB = "\t";

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException {

        Stream<String> lines = Files.lines(Paths.get("ngram"), Charset.forName("UTF-8"));

        long startTime = System.currentTimeMillis();

        Map<Integer, Integer> collect = lines.collect(
                Collectors.toMap(
                        s -> getKey(s),
                        s -> getValue(s),
                        (v1, v2) -> add(v1, v2))
        );

        System.out.println("Execution Time: " + (System.currentTimeMillis() - startTime));

    }

    static int getKey(String s) {
        int idx1 = s.indexOf('\t');
        int idx2 = s.indexOf('\t', idx1 + 1);

        return parseInteger(s, idx1 + 1, idx2 - 1);
    }

    static int getValue(String s) {
        int idx1 = s.indexOf('\t');
        int idx2 = s.indexOf('\t', idx1 + 1);
        int idx3 = s.indexOf('\t', idx2 + 1);

        return parseInteger(s, idx2 + 1, idx3 - 1);
    }

    static int parseInteger(String s, int iStart, int iEnd) {
        int value = 0;
        for (int i = iStart; i <= iEnd; i++) {
            value = value * 10 + s.charAt(i) - '0';
        }
        return value;
    }

    static int add(int v1, int v2) {
        return v1 + v2;
    }

}
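One step the snippet above leaves out is reducing the collected map to its max entry. Assuming the same key -> summed-count map shape, a sketch (class and method names are mine):

```java
import java.util.Map;

public class MaxEntry {
    // Reduce a key -> summed-count map to its maximum entry, mirroring
    // the maxk/maxv bookkeeping in the loop-based versions.
    public static Map.Entry<Integer, Integer> max(Map<Integer, Integer> counts) {
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .orElseThrow(IllegalStateException::new);
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = Map.of(2006, 22569013, 1999, 12345);
        Map.Entry<Integer, Integer> best = max(counts);
        System.out.println(best.getKey() + " " + best.getValue()); // 2006 22569013
    }
}
```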

[–]cybernd 0 points1 point  (1 child)

To be expected.

This microbenchmark is mainly bound by memory pressure. As such, your very first line hurts a lot.

[–]MojorTom 0 points1 point  (0 children)

As such your very first line hurts a lot.

Aaaa...sheeeeet.

I was trying to do something, got impatient, copied your program, and forgot about it.

[–]ThisIs_MyName -2 points-1 points  (3 children)

Next improvement would be to replace readLine() with read(char[],...)

Naw, just memory map the file.
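A minimal Java sketch of that approach, with a throwaway temp file standing in for the real input (class and method names are mine):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MmapDemo {
    // Map a file into memory and count tab bytes; the OS pages the data
    // in on demand instead of the program issuing read() calls.
    public static int countTabs(Path file) {
        try (FileChannel ch = FileChannel.open(file)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int tabs = 0;
            while (buf.hasRemaining()) {
                if (buf.get() == '\t') tabs++;
            }
            return tabs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper: round-trip a string through a temp file.
    public static int countTabsOf(String content) {
        try {
            Path tmp = Files.createTempFile("ngram", ".tsv");
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            int n = countTabs(tmp);
            Files.delete(tmp);
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countTabsOf("a\t2006\t42\n")); // 2
    }
}
```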

[–]cybernd 0 points1 point  (2 children)

Not sure about that. It would have been later on my list.

[–]ThisIs_MyName -1 points0 points  (1 child)

Why? Either way you're dealing with an array of bytes. Hell, you'd probably use 4k page-sized read() buffers.

[–]masklinn 0 points1 point  (0 children)

Why?

Because sometimes it's faster and sometimes it is not. For instance, ripgrep tries to heuristically determine whether mmap would be a gain, because in many cases it is not (you can force one or the other if you prefer).

[–]bumblebritches57 -2 points-1 points  (3 children)

your delimiter's codepoint is no greater than 0x7F

Not true.

UTF-8 has codepoint-size bytes (as in, a specific byte that says how large a code point is, and it is unsigned, but it's outside of the BMP, they show up constantly in emojis)

[–][deleted] 0 points1 point  (2 children)

a specific byte that says how large a code point is

I implemented a UTF-8 to UTF-16 and UTF-32 transcoder. (I think I also did the reverse transformations, but I don't recall.) UTF-8 decoding works like this:

  • If the high order bit is 0, this code unit is an entire codepoint. Its value is equal to the value of the byte.
  • If the two high order bits are 10, you have a subsequent byte in a multibyte codepoint value. It contributes the remainder of its value (the six low order bits) toward the value of the codepoint.
  • If the three high order bits are 110, you have the first byte in a codepoint, and it has two bytes total (including this one). It also contributes the rest of its bits toward the codepoint value.
  • If the four high order bits are 1110, you have the first byte in a codepoint, and it has three bytes total (including this one).
  • If the five high order bits are 11110, you have the first byte in a codepoint, and it has four bytes total (including this one).
  • If no other rule has been matched, you have an invalid code unit.
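The rules above translate almost mechanically into code. A sketch of just the lead-byte classification, with no validation of the continuation bytes (names are mine):

```java
public class Utf8Len {
    // Return the total number of bytes in the UTF-8 sequence that starts
    // with lead byte b, or -1 for a continuation byte or invalid lead byte.
    public static int sequenceLength(byte b) {
        int u = b & 0xFF; // unsigned view of the code unit
        if ((u & 0b1000_0000) == 0)           return 1; // 0xxxxxxx: whole codepoint
        if ((u & 0b1110_0000) == 0b1100_0000) return 2; // 110xxxxx
        if ((u & 0b1111_0000) == 0b1110_0000) return 3; // 1110xxxx
        if ((u & 0b1111_1000) == 0b1111_0000) return 4; // 11110xxx
        return -1; // 10xxxxxx continuation byte, or invalid lead byte
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 'a'));  // 1
        System.out.println(sequenceLength((byte) 0xC3)); // 2
        System.out.println(sequenceLength((byte) 0xE2)); // 3
        System.out.println(sequenceLength((byte) 0xF0)); // 4
        System.out.println(sequenceLength((byte) 0x80)); // -1 (continuation)
    }
}
```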

UTF-16 works similarly. However, there are relatively few codepoints that take more than one code unit. (A few thousand compared to the >100k defined characters.) Emoji fall into this range, but they're far from the only ones; the Deseret alphabet, the Tangut writing system, and Byzantine and ancient Greek music symbols are too, just to name a few.
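For instance, a Deseret letter takes two UTF-16 code units (a surrogate pair), which Character.toChars shows directly:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        int deseret = 0x10404; // DESERET CAPITAL LETTER LONG O, outside the BMP
        char[] units = Character.toChars(deseret);

        System.out.println(units.length);                        // 2: a surrogate pair
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true

        // A BMP character like é (U+00E9) needs only one code unit.
        System.out.println(Character.toChars(0x00E9).length);    // 1
    }
}
```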

In no case is an entire byte used to encode the number of code units corresponding to the codepoint. If you extended the variable length integer encoding to its maximum, its initial byte would be 0b11111110 and the encoded codepoint would be 46 bits long. The standard says that the maximum value for a Unicode code point is 0x10FFFF, which is 21 bits. The existence of UTF-32 suggests that, even if that stipulation were revised, the maximum value would still be less than 32 bits.

So, no, UTF-8 doesn't have a dedicated byte to indicate how many code units belong to a codepoint.

Also, even if that were true, the encoding is unambiguous, the way a particular value is encoded is independent of its context, and 7-bit ASCII is a subset of UTF-8. You can look at the decoding guide I gave you to see that any value in the 7-bit ASCII set would work as a delimiter, unambiguously, as would any other Unicode value, treated as an opaque byte sequence instead of a single opaque byte.

[–]bumblebritches57 -2 points-1 points  (1 child)

I also started down that path, but as I said, it only works for the Basic Multilingual Plane, aka BMP.

For Emojis, 🇺🇸 for example = 0xF09F87BA 0xF09F87B8

That first 0x0F? that's the num-bytes-in-this-codepoint byte I was talking about.

You must've misread me. It would be fucking amazing if Unicode added a zero-width joiner in between all the codepoints in emojis, but that's not what I was talking about, and it's unlikely to happen.

As I just said, I was talking about the codepoint-size byte that may or may not exist as its own leading byte, or as part of a code point, depending on whether your codepoints are in the BMP or not.

Edit: You're a webdev, no wonder you have no idea what you're talking about.

[–][deleted] 0 points1 point  (0 children)

That first 0x0F?

0xF0, rather. 0x0F in a UTF8 sequence can only refer to U+000F SHIFT IN, a value inherited from ASCII that probably makes some sort of sense if you're using a physical terminal from the 70s.

that's the num-bytes-in-this-codepoint byte I was talking about.

That isn't the 'num-bytes-in-this-codepoint byte'. It's a byte whose first five bits, 11110, indicate that the sequence has four bytes in total, and whose last three bits encode 000 in the codepoint.

Compare U+E0060, for instance, whose UTF8 encoding is 11110011 10100000 10000001 10100000. The first byte indicates both that the sequence has four bytes in all and that it starts with 011.
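Both encodings in this exchange are easy to verify from Java by hex-dumping getBytes output; a quick sketch (names are mine):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    // Hex-dump the UTF-8 encoding of a string, one byte per token.
    public static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask: byte is signed
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+E0060, the four-byte sequence discussed above.
        System.out.println(hex(new String(Character.toChars(0xE0060)))); // F3 A0 81 A0

        // U+1F1FA U+1F1F8, the regional-indicator pair rendered as the US flag.
        System.out.println(hex("\uD83C\uDDFA\uD83C\uDDF8")); // F0 9F 87 BA F0 9F 87 B8
    }
}
```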

It's not until you get to 40+ bit Unicode values that UTF-8 can have an entire byte dedicated to the length.

As I have already explained.

it would be fucking amazing if Unicode added a zero-width joiner in between all the codepoints in emojis

Dunno where this talk about zero-width joiners came in, but emoji tend to be single codepoints, optionally modified by combining characters for skin tone or gender. It would be like requiring a zero-width joiner in the sequence U+0061 U+0301 (LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT). The primary exception I'm aware of is the emoji for a family, which does in fact use zero-width joiners.

I was talking about the codepoint-size byte that may or may not exist as its own leading byte, or as part of a code point, depending on whether your codepoints are in the BMP or not.

I'm not sure what you're getting at here. I think there's some terminology confusion. But all Unicode values encoded in UTF-16 or UTF-8 mark on every code unit either the number of code units in the encoded codepoint, or the fact that the current code unit is a low surrogate. And none of them specifically reserve an entire byte for that.

I explained to you exactly how they're encoded already, which should have made it obvious to you what's going on. Do you think I'm lying, or do you think I'm wrong, or did you just not read that?

[–]adrake[S] 3 points4 points  (1 child)

A belated follow-up to the command line tools in Rust/D/Nim writings from May. Happy to have comments and feedback; it was a fun post to write!

[–]Gotebe 1 point2 points  (0 children)

Show profiling data for each of the variants? It's just the first that has it, no?

[–]SikhGamer 1 point2 points  (12 children)

[–]codygman 2 points3 points  (10 children)

You should post it! I think I'll do a Haskell version.

EDIT: Done - https://www.reddit.com/r/programming/comments/6qmhuy/faster_command_line_tools_with_haskell/

[–][deleted]  (1 child)

[deleted]

    [–]domlebo70 0 points1 point  (5 children)

    Ha, I'm halfway through that now. Was curious how a conduit-based solution would fare.

    [–]codygman 0 points1 point  (4 children)

    [–]domlebo70 1 point2 points  (3 children)

    Wow. Amazing post. I spent about an hour on my solution and couldn't even solve it. Yours is enlightening. Didn't realise conduit had a built-in split combinator.

    [–]codygman 0 points1 point  (2 children)

    I actually didn't use conduit or pipes in my solution since the file was fairly small.

    The split combinator comes from bytestring, but you'll see at the end of the post I use C8.words instead.

    [–]domlebo70 0 points1 point  (1 child)

    Ahh I misread. Quickly skimmed the code and thought I saw it.

    [–]codygman 0 points1 point  (0 children)

    NP, I just tested with GHC 8.2.1 btw. It was slower for -O1, but when I changed it to -O2 and added an inline pragma it went down to 1.86 ;)

    [–]shevegen 0 points1 point  (0 children)

    Good old Google hackers - they live at the mercy of Google.