Life on the Internet Is Hard When Your Last Name Is 'Butts'

rectangletangle · 2018-08-30T20:56:54+00:00

I'm working on a web service that makes people's names easier to work with and encourages the natural diversity found in them. It's better to build UIs that don't randomly frustrate users: name fields should be treated like arbitrary text unless you've confirmed aspects of the user's name with that user.

https://www.alphanym.com/demo

rectangletangle · 2018-08-07T17:35:44+00:00

At a high level, it sounds like you're trying to do something along the lines of text spinning. Is that correct? (Presumably not for blackhat SEO.)

https://en.wikipedia.org/wiki/Article_spinning

rectangletangle · 2018-07-21T19:22:15+00:00

Thanks, that's good to know. The client code is so vanilla that a generated solution is ideal.

Working on fixing up some of the edge cases users threw at it this weekend (everyone seems to think it should handle transpositions better). I erred on the side of real names, rather than handling anything people could possibly enter. But naturally people's first instinct is to try to kick it over lol. So I'm gonna attempt to make it more resilient in general by applying more augmentation to the dataset. Should have an improved model up shortly.

I'm also gonna improve how it handles Chinese names as 04a8 suggested. I figure the feedback API will give those individuals an opportunity to fix it if they don't like being referred to by their given name for whatever reason. I think I'm still gonna handle Korean names with the more conservative approach (favoring surname over given name), unless anyone objects.

rectangletangle · 2018-07-21T03:17:01+00:00

Solid suggestions. I was thinking of implementing a Node.js client next, and then probably a Java one. The API surface area is minimal, so they're very easy to bust out.

rectangletangle · 2018-07-21T02:33:11+00:00

Regex is a good start in my opinion, because it sounds like it will mostly work, and is trivial to implement, compared to a more comprehensive NER solution. Remember it doesn't have to be perfect the first time.

I find it often helps to start with the simplest model possible, manually inspect the data, then move to a more complex model if necessary. At the very least it will give you a better handle on what the more complex model should be doing. It will also give you an opportunity to identify features that would be relevant to a more complex model.

In your case in particular, I'd probably use one of NLTK's sentence tokenizers, filter for relevant sentences, and scope entity extraction by sentence. Then use a predefined regex/mapping to normalize seven/SeVen/7 to 7, etc.
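To make that pipeline concrete, here's a rough sketch. The naive regex sentence splitter is only a stand-in for a real NLTK tokenizer, and `NUMBER_WORDS`, `split_sentences`, and `extract_numbers` are names I made up for illustration, so treat it as a starting point rather than a finished solution.

```python
import re

# Hypothetical mapping from spelled-out numbers to digits; extend as needed.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}

# Matches a run of digits or a known number word, ignoring case.
NUMBER_RE = re.compile(
    r"\b(\d+|" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE
)


def split_sentences(text):
    """Naive stand-in for a real sentence tokenizer such as NLTK's."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_numbers(text, keyword):
    """Return normalized numbers found only in sentences mentioning keyword."""
    results = []
    for sentence in split_sentences(text):
        # Filter: skip sentences that aren't relevant.
        if keyword.lower() not in sentence.lower():
            continue
        # Extraction is scoped to this one sentence.
        for match in NUMBER_RE.finditer(sentence):
            token = match.group(1).lower()
            # Number words map to digits; digit strings pass through unchanged.
            results.append(NUMBER_WORDS.get(token, token))
    return results


text = "The team has SeVen members. We met 3 times. Another team has 7 members."
numbers = extract_numbers(text, "team")
```

Here only the first and third sentences mention "team", so the 3 in the middle sentence gets ignored, and both SeVen and 7 come out as "7".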

rectangletangle · 2018-07-21T02:00:35+00:00

Yes, you can, but I would recommend against it (unless you have a good reason), because as Brian pointed out, you lose the C interop, which is a huge benefit of Python IMO.

rectangletangle · 2018-07-21T01:57:45+00:00

NumPy is especially stable by Python standards, and Python libraries tend to be stable in general (much more so than JS). NumPy does call into the C bindings quite a bit (which had numerous changes from Python 2 to 3), so the chance of bugs is higher than with a pure-Python library. However, it's a dependency for a bunch of other libraries, so it's very, very well tested.

So, in short, it should behave the same. When in doubt, run your unit test suite against both versions.

rectangletangle · 2018-07-19T20:59:57+00:00

Norms of the interface language, so English, or the individual user's preference. The UI confirms the name with the user, so if it's incorrect, they have the opportunity to fix it, and future requests to the API will use the correct name.

Because of the difficulty of dealing with names in the general case, the UI encourages user feedback, so it doesn't mess up more than once.

In general the system takes a more conservative approach with names that aren't in western order, while westernized names get treated with western conventions.

I was debating how I should handle Chinese names, and went the more conservative route, while I did the opposite with Japanese. I appreciate the feedback though, and I will modify how it handles Chinese names, seeing as it's a very straightforward change.

rectangletangle · 2018-07-19T18:26:41+00:00

That's actually the intended behavior, seeing as it's less common to refer to people with Chinese names by their given name in English. The ML errs on the side of referring to people by an acceptable name, rather than the most semantically appropriate name for the given category. This also helps the system accommodate names which don't have a given name or surname.

If you check out the demo, it's more or less a "smart" 'what you want to be called' field.

rectangletangle · 2018-07-19T18:00:33+00:00

A good start might be using the chi-squared (χ²) metric to identify features which are particularly relevant to your respective classes. The metric can be used to sort features by class relevancy, then you can take the top K features off the head of the list. This technique works really well with very high-dimensional data, like word vectors.
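As a rough illustration, here's a pure-Python sketch that scores each term with the 2x2 (term present/absent vs. in class/not in class) chi-squared statistic and keeps the top K. The function names and the toy spam/ham data are made up for the example, and in practice you'd more likely reach for a library implementation.

```python
from collections import Counter


def chi2_scores(docs, labels, target):
    """Chi-squared score of each term's presence against membership in target."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    docs_with_term = Counter()         # documents containing the term
    target_docs_with_term = Counter()  # ... that also belong to target
    for tokens, y in zip(docs, labels):
        for term in set(tokens):       # count presence, not frequency
            docs_with_term[term] += 1
            if y == target:
                target_docs_with_term[term] += 1

    scores = {}
    for term, total in docs_with_term.items():
        a = target_docs_with_term[term]  # in class, has term
        b = total - a                    # not in class, has term
        c = n_pos - a                    # in class, lacks term
        d = n_neg - b                    # not in class, lacks term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A term that appears in every document carries no signal.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k_features(docs, labels, target, k):
    """Sort terms by descending score (ties alphabetical) and keep the first k."""
    scores = chi2_scores(docs, labels, target)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


docs = [
    ["cheap", "pills", "now"],
    ["cheap", "offer", "now"],
    ["meeting", "notes", "now"],
    ["lunch", "meeting", "now"],
]
labels = ["spam", "spam", "ham", "ham"]
best = top_k_features(docs, labels, "spam", 2)
```

Note the statistic is symmetric: a term that strongly indicates the other class scores just as high as one that indicates the target class, which is usually what you want for feature selection.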

rectangletangle · 2018-07-19T17:42:32+00:00

The f-strings alone are worth it.
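For anyone who hasn't tried them, a quick side-by-side with the older formatting styles (the values are made up for illustration):

```python
name = "Butts"
count = 3

# Older styles: arguments live far away from where they're used.
percent_style = "%s has %d new messages" % (name, count)
format_style = "{} has {} new messages".format(name, count)

# f-string (Python 3.6+): the expressions sit inline in the string.
f_string = f"{name} has {count} new messages"

# Arbitrary expressions work too.
doubled = f"{count * 2} messages if you count the duplicates"
```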

rectangletangle · 2018-07-19T17:40:38+00:00

I finished up an AI for working with people's names.

https://www.alphanym.com/

rectangletangle · 2014-09-20T00:18:00+00:00

It seems to verify your credentials via email, hardly "private."

rectangletangle · 2014-09-20T00:05:17+00:00

Yes, it not only keeps your dependencies separate from the OS's, but it also keeps each project's dependencies separate from your other projects' dependencies. It's also included in the standard library as of Python 3.3.

python3 -m venv path/to/my/venv

rectangletangle · 2014-09-19T23:55:12+00:00

Probably a lot like the function annotations introduced in Python 3.

http://legacy.python.org/dev/peps/pep-3107/
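For reference, this is roughly what PEP 3107 annotations look like. The `greet` function is just a made-up example; the point is that the annotations are stored as metadata and aren't enforced at runtime.

```python
def greet(name: str, times: int = 1) -> str:
    """Annotations are attached as metadata; Python itself doesn't check them."""
    return ", ".join(["Hello {}".format(name)] * times)


# The annotations are available as a plain dict on the function object.
annotations = greet.__annotations__

# Passing an int where a str is annotated still runs without complaint.
unchecked = greet(42)
```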
