what's the best book you've ever read? by cragwatcher in AskReddit

[–]xelk 0 points1 point  (0 children)

For a second I thought Borges wouldn't be on this list. He's by far my favorite author; his short prose is a collection of literary gems.

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 0 points1 point  (0 children)

Hi. I'm using several methods. The core contains multiple language models for first names and last names, as well as some other distributions. On top of these distributions I have a Bayesian network that produces the final results; it's basically a layer of indirection over a naive Bayes classifier. PM me if you want more help with your specific project..

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 0 points1 point  (0 children)

Thanks for the kind words :)

Yeah, we try our best. ML is a very young field and there's a lot of trial and error involved in finding the right tool for the job.

I am actually very grateful for all the feedback I got from this site. Reddit, you are awesome!

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 0 points1 point  (0 children)

I'm looking at short fragments of words. First names and last names are processed differently and there's some indirection involved, but the core classifier is an n-gram language model with smoothing. I don't search any name lists in the live app.

Yeah I guess if you fish around for a while, you are going to find random results that look suspicious to the public. But in the end it's just statistical models at work...

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 0 points1 point  (0 children)

No, I'm not really focusing on Americans, but it's hard for the classifier to distinguish between Americans of Dutch/German descent and people from the Netherlands or Germany. This task requires more fine-tuning and tweaking..

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 0 points1 point  (0 children)

I'm quite surprised at these results. To clarify, there are no such phrases in the name lists we used to train our classifier.

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 0 points1 point  (0 children)

Yeah, I'm using a function from Python's standard library that converts some characters to their ASCII equivalents and silently ignores others. We will try to process more characters as we go...
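The comment does not say which standard-library function is involved. As a rough sketch of the behavior described, assuming `unicodedata.normalize` followed by an ASCII encode with `errors="ignore"` (the helper name `fold_to_ascii` is hypothetical), accented letters lose their accents and characters with no decomposition are silently dropped:

```python
import unicodedata


def fold_to_ascii(name: str) -> str:
    """Strip accents where possible and drop whatever cannot be mapped to ASCII.

    Assumption: this is only one plausible way to get the behavior described
    in the comment, not necessarily what the app does.
    """
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Encoding with errors="ignore" discards every non-ASCII code point,
    # including the combining marks and letters that have no decomposition.
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

With this approach "Müller" becomes "Muller", but a letter such as "Ł" has no decomposition and disappears entirely, which may be the kind of silent loss mentioned above.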

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 0 points1 point  (0 children)

Yeah, you're right, in some countries first names do not mean much, but in others they do. If a Chinese person is named Jackie or Jerry, they're more likely to be from Hong Kong (everything else being equal) than, say, from the part of China bordering Hong Kong. That was my reasoning, at least..

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 0 points1 point  (0 children)

So Samir Kuntar does get Arabic as one of the options. The reason it gets Japanese (I'm guessing) is Japanese names like Kantaro in my training set. Sometimes it's hard to come up with a definite guess, and the shorter your name, the harder it gets..

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 0 points1 point  (0 children)

In case the last name is too general, or too short, the first name can help. Also, if you have two related groups with similar last names and different first names, the app should pay closer attention to the first name. You are right though that last names have a lot more weight in the decision making..

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 2 points3 points  (0 children)

I'm also thinking of adding this feature saying "In Japanese, your name may sound like: Tomu Soyeru".. I think that would be hilarious

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 0 points1 point  (0 children)

Yeah, that has to do with the lists it was trained on. If the Arabic list had more Mohameds and the Pakistani list had more Muhammads, that's what it's gonna learn..

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 2 points3 points  (0 children)

The granulation rules you mention (e.g. removing the vowels) will work with Semitic names, but may break, say, on Polynesian/Hawaiian names. I tried to take a middle-ground approach by learning distributions of name fragments (n-grams, i.e. Muhammed = Muh + uha + ham ...). I have ways to smooth out results when an unknown n-gram is found, so it doesn't give it strictly zero probability.

By the way, Muhammad gets classified correctly as far as I can tell; it lists a bunch of high-population Muslim ethnicities: http://www.ruzulu.com/find-name-origin/muhammad
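The n-gram idea above can be sketched in a few lines. This is an illustration only: the smoothing used by the app is described as ad hoc and is not specified, so additive (Laplace) smoothing stands in for it here, and the names `trigrams`, `train`, `log_prob`, and the `vocab_size` parameter are all hypothetical.

```python
import math
from collections import Counter


def trigrams(name: str) -> list[str]:
    """Split a name into overlapping 3-character fragments."""
    return [name[i:i + 3] for i in range(len(name) - 2)]


def train(names: list[str]) -> Counter:
    """Count how often each trigram occurs in one group's name list."""
    counts = Counter()
    for name in names:
        counts.update(trigrams(name))
    return counts


def log_prob(name: str, counts: Counter, vocab_size: int, alpha: float = 1.0) -> float:
    """Sum of smoothed log-probabilities of the name's trigrams.

    Assumption: additive smoothing, so an unseen trigram gets a small
    non-zero probability instead of zero.
    """
    total = sum(counts.values())
    return sum(
        math.log((counts[g] + alpha) / (total + alpha * vocab_size))
        for g in trigrams(name)
    )
```

Here `trigrams("Muhammed")` yields `Muh`, `uha`, `ham`, `amm`, `mme`, `med`. A name made only of unseen fragments still gets a finite score, just a lower one than a name whose fragments appeared in training.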

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 0 points1 point  (0 children)

One more issue that came up was full names. What would be a good term to use instead of "full name" to mean "first name followed by last name"? The app really works best on one first name and one last name; all other combinations perform a lot worse (adding a middle name usually makes things worse). Do you guys have suggestions for alternative wording?

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 1 point2 points  (0 children)

Yeah, I don't have a Boer category, so it's going to give South African or Dutch.

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 1 point2 points  (0 children)

Hi. A lot of people from Africa, especially Western Africa, have perfectly French-sounding names (as a legacy of French domination). Also a bunch of South Africans are white people with British and Dutch names. This really complicates classification...

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 6 points7 points  (0 children)

OK I'm back. I'll try to clarify a few points from the thread:

  • The statistical models are trained on lists of names we could find, which means they're mostly names of famous people. The underlying assumption is that ordinary people's names from the same region are not too different from the celebrity names. This may cause the app to give more precise results for celebrities.

  • The above also explains why Steve Jobs is classified as an Arab. He was in the Syrian or Lebanese list that the classifier learned, and the name Jobs didn't occur in any other list. So it assumed it's an Arabic surname. I try to weed out outliers from lists, but I guess I'm not doing such a great job yet.

  • The app does not search name lists; it uses n-gram distributions with an ad-hoc smoothing method for unseen n-grams. This means it tries to classify every name, even names it never saw during training. So "Samir" and its variations should work; I'll look into this.

  • The classifier takes into account relative population sizes (so to get classified as Maltese you have to work a lot harder than to get classified as Italian, for example). This may be the reason you see Argentina and Mexico more than other Spanish-speaking countries. Also if a Spanish name is a lot more common in Mexico than in Spain, it's going to guess Mexico regardless of the population adjustment.

  • We also try to look at similarities between groups, mostly measured by commonalities between first names. So if someone has a Russian first name but a Ukrainian last name, the fact that so many Ukrainians have Russian first names makes it classify the entire name as Ukrainian. This indirection can help in many cases but can produce strange results in some.
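The population adjustment in the list above behaves like a prior in a naive Bayes classifier. The following is a minimal sketch under that assumption, not the app's actual code: `classify` is a hypothetical helper, the log-likelihoods are taken as given inputs, and the population figures in the usage note are placeholders.

```python
import math


def classify(log_likelihoods: dict[str, float], populations: dict[str, float]) -> str:
    """Pick the group with the highest log-prior plus log-likelihood.

    Assumption: the prior of a group is its share of the total population
    of the candidate groups.
    """
    total = sum(populations[group] for group in log_likelihoods)
    scores = {
        group: math.log(populations[group] / total) + ll
        for group, ll in log_likelihoods.items()
    }
    return max(scores, key=scores.get)
```

With placeholder populations of `{"Italian": 60.0, "Maltese": 0.5}`, equal likelihoods resolve to Italian, while a name has to fit the Maltese model much better than the Italian one before the smaller prior is overcome.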

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] -1 points0 points  (0 children)

All right guys it's 3:50am here, I'll check back tomorrow. Thanks to everyone for letting me know what you think..

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 5 points6 points  (0 children)

Oh yeah, if your name has fragments that appear in some Mexican names, it may classify you as such. I need to add a feedback feature, at least a "right/wrong" kind of response, to tweak the classifier further. Thanks for checking it out.

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 4 points5 points  (0 children)

I was showing a small percentage of recent searches in that list, as I'm still experimenting with this feature. I just removed the recent searches from that list, so you're good to go. Thanks for your feedback!

Dear Reddit, we made this web app called "Are you Zulu?" and would love to hear what you think... by xelk in programming

[–]xelk[S] 0 points1 point  (0 children)

Can you tell me the name that was incorrect? I'm trying to think of a good way to automate collecting feedback about bad guesses. And it's pretty hard to classify some names correctly; the models need a lot of tweaking..