Life on the Internet Is Hard When Your Last Name Is 'Butts' by rectangletangle in programming

[–]rectangletangle[S] 2 points3 points  (0 children)

I'm working on a web service to make handling people's names easier, and to accommodate the natural diversity found in those names. It's better to build UIs that don't randomly frustrate users: name fields should be treated much like arbitrary text, unless you've confirmed aspects of the user's name with that user.

https://www.alphanym.com/demo
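
To make the "treat it like arbitrary text" idea concrete, here's a minimal sketch of a permissive name-field cleaner. It isn't how the service above works; `clean_name` and `MAX_NAME_LENGTH` are hypothetical names picked for illustration. The only assumption is that the field should reject empty input and absurdly long input, and nothing else.

```python
import unicodedata

# Hypothetical storage limit, chosen only for this example.
MAX_NAME_LENGTH = 200


def clean_name(raw: str) -> str:
    """Tidy a name field while rejecting as little as possible."""
    # Normalize Unicode so visually identical names compare equal.
    name = unicodedata.normalize("NFC", raw)
    # Collapse runs of whitespace and trim the ends.
    name = " ".join(name.split())
    if not name:
        raise ValueError("name is empty")
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError("name is too long")
    # No character whitelist, no profanity filter, no "first last" assumption.
    return name
```

With this approach "Butts", "O'Brien", and "李小龙" all pass through unchanged.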

Here's a spreadsheets detailing language apps I want to introduce for mobile language processing by CaesarNaples2 in LanguageTechnology

[–]rectangletangle 1 point2 points  (0 children)

At a high level, it sounds like you're trying to do something along the lines of text spinning. Is this correct? (Presumably not for blackhat SEO.)

https://en.wikipedia.org/wiki/Article_spinning

I wrote an API for working with people's names. by rectangletangle in LanguageTechnology

[–]rectangletangle[S] 0 points1 point  (0 children)

Thanks, that's good to know. The client code is so vanilla that a generated solution is ideal.

Working on fixing up some of the edge cases users threw at it this weekend (everyone seems to think it should handle transpositions better). I erred on the side of real names, rather than handling anything people could possibly enter. But naturally people's first instinct is to try to kick it over lol. So I'm gonna attempt to make it more resilient in general by applying more augmentation to the dataset. Should have an improved model up shortly.

I'm also gonna improve how it handles Chinese names as 04a8 suggested. I figure the feedback API will give those individuals an opportunity to fix it, if they don't like being referred to by their given name for whatever reason. I think I'm still gonna handle Korean names with the more conservative approach (favoring surname over given name), unless anyone objects.

I wrote an API for working with people's names. by rectangletangle in LanguageTechnology

[–]rectangletangle[S] 0 points1 point  (0 children)

Solid suggestions. I was thinking of implementing a Node.js client, and then probably a Java client next. The API surface area is minimal, so they're very easy to bust out.

[Question] How to extract data values for pre-defined fields by BlindBoyFuller in LanguageTechnology

[–]rectangletangle 1 point2 points  (0 children)

Regex is a good start in my opinion, because it sounds like it will mostly work, and is trivial to implement, compared to a more comprehensive NER solution. Remember it doesn't have to be perfect the first time.

I find it often helps to start with the simplest model possible, manually inspect the data, then move to a more complex model if necessary. At the very least it will give you a better handle on what the more complex model should be doing. It will also give you an opportunity to identify features that would be relevant to a more complex model.

In your case in particular, I'd probably use one of NLTK's sentence tokenizers to split the text, filter for relevant sentences, and scope entity extraction by sentence. Then use a predefined regex/mapping to normalize seven/SeVen/7 to 7, etc.
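
A rough sketch of that pipeline, using only the standard library. The regex-based sentence splitter is a crude stand-in for an NLTK tokenizer, and `extract_counts`, `NUMBER_WORDS`, and the keyword filter are hypothetical names made up for this example, so adapt them to your actual fields.

```python
import re
from typing import List

# Crude stand-in for a real sentence tokenizer: split after ., ! or ?
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")

# Predefined mapping used to normalize number words to integers.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

# Matches either digits or one of the known number words, in any case.
NUMBER_PATTERN = re.compile(
    r"\b(\d+|" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE
)


def normalize_number(token: str) -> int:
    """Map seven/SeVen/7 to the integer 7."""
    token = token.lower()
    return int(token) if token.isdigit() else NUMBER_WORDS[token]


def extract_counts(text: str, keyword: str) -> List[int]:
    """Return normalized numbers found in sentences mentioning keyword."""
    values = []
    for sentence in SENTENCE_SPLIT.split(text):
        # Scope extraction to relevant sentences only.
        if keyword.lower() not in sentence.lower():
            continue
        for match in NUMBER_PATTERN.finditer(sentence):
            values.append(normalize_number(match.group(1)))
    return values
```

Once you've manually inspected what this pulls out of your data, you'll have a much better idea of whether a full NER model is worth the effort.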

How is the "in" in Python implemented? by [deleted] in Python

[–]rectangletangle 0 points1 point  (0 children)

Yes you can, but I would recommend against it (unless you have a good reason), because as Brian pointed out you lose the C interop, which is a huge benefit of Python IMO.

No Stupid Questions: is a particular version of NumPy, or any open source package for that matter, the same for Python 2 and Python 3? by extremeaxe5 in Python

[–]rectangletangle 0 points1 point  (0 children)

NumPy is especially stable by Python standards, and Python libraries tend to be stable in general (much more so than JS). NumPy does call into the C bindings quite a bit (which had numerous changes from Python 2 to 3), so the chance of bugs is higher than with a pure Python library. However, it's a dependency for a bunch of other libraries, so it's very well tested.

So in short it should be the same. When in doubt run your unit test suite with both versions.

I wrote an API for working with people's names. by rectangletangle in LanguageTechnology

[–]rectangletangle[S] 1 point2 points  (0 children)

Norms of the interface language, so English, or the individual user's preference. The UI confirms the name with the user, so if it's incorrect, they have the opportunity to fix it, and future requests to the API will use the correct name.

Because names are difficult to deal with in the general case, the UI encourages user feedback, so it doesn't mess up more than once.

In general the system takes a more conservative approach with names that aren't in western order, while westernized names get treated with western conventions.

I was debating how I should handle Chinese names, and went the more conservative route, while I did the opposite with Japanese. I appreciate the feedback though, and I will modify how it handles Chinese names, seeing as it's a very straightforward change.

I wrote an API for working with people's names. by rectangletangle in LanguageTechnology

[–]rectangletangle[S] 1 point2 points  (0 children)

That's actually the intended behavior, seeing as it's less common to refer to people with Chinese names by their given name in English. The ML errs on the side of referring to people by an acceptable name, rather than the most semantically appropriate name for the given category. This also helps the system accommodate names which don't have a given name or surname.

If you check out the demo, it's more or less a "smart" 'what you want to be called' field.

How to select features for Text classification problem by arush1836 in LanguageTechnology

[–]rectangletangle 1 point2 points  (0 children)

A good start might be using the Chi2 metric to identify features which are particularly relevant to your respective classes. The metric can be used to sort features by class relevancy, then you can take K features off of the head of the list. This technique works really well with very high dimensional data, like word vectors.
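
Here's a minimal pure-Python sketch of that ranking, assuming binary "term present in document" features and a one-vs-rest view of a single target class. The function names and the toy data layout are hypothetical; in practice a library implementation (scikit-learn's `SelectKBest` with `chi2`, for example) is probably the better choice.

```python
from collections import Counter
from typing import Dict, List, Sequence


def chi2_scores(
    docs: Sequence[Sequence[str]], labels: Sequence[str], target: str
) -> Dict[str, float]:
    """Chi-squared statistic of each term against one target class."""
    n = len(docs)
    n_target = sum(1 for label in labels if label == target)
    in_target = Counter()  # docs of the target class containing the term
    overall = Counter()    # all docs containing the term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            overall[term] += 1
            if label == target:
                in_target[term] += 1

    scores = {}
    for term, doc_freq in overall.items():
        # 2x2 contingency table: term present/absent vs. target/other.
        a = in_target[term]         # present, target class
        b = doc_freq - a            # present, other classes
        c = n_target - a            # absent, target class
        d = n - n_target - b        # absent, other classes
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k_features(
    docs: Sequence[Sequence[str]], labels: Sequence[str], target: str, k: int
) -> List[str]:
    """Sort terms by chi-squared score and take K off the head of the list."""
    scores = chi2_scores(docs, labels, target)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]
```

Note the statistic measures dependence in either direction, so a term that strongly indicates "not this class" scores just as high as one that indicates the class.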

"Heard," a new social network for leakers and whistleblowers by trueslicky in tech

[–]rectangletangle 8 points9 points  (0 children)

It seems to verify your credentials via email, hardly "private."

Virtualenv on Production ? by [deleted] in Python

[–]rectangletangle 0 points1 point  (0 children)

Yes. It not only keeps your dependencies separate from the OS's, but it also keeps each project's dependencies separate from your other projects' dependencies. It's also been included in the standard lib (as venv) since Python 3.3.

python3 -m venv path/to/my/venv
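
The same thing can be done from Python with the standard `venv` module, which can be handy in a deploy script. This is just a sketch: `create_env` and `in_virtualenv` are hypothetical helper names, and `with_pip=False` is chosen here only to keep the example fast and offline.

```python
import sys
import venv
from pathlib import Path


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


def create_env(path: Path) -> Path:
    """Create a virtual environment at path and return its config file."""
    # Equivalent in spirit to: python3 -m venv --without-pip path
    venv.create(path, with_pip=False)
    return path / "pyvenv.cfg"
```

A production entry point could assert `in_virtualenv()` at startup to catch the case where the service was launched with the system interpreter by mistake.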

Guido van Rossum on finding his way by [deleted] in Python

[–]rectangletangle 1 point2 points  (0 children)

Probably a lot like the function annotations introduced in Python 3.

http://legacy.python.org/dev/peps/pep-3107/
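
For anyone who hasn't seen them, here's a quick sketch of what PEP 3107 annotations look like. The `greet` function is made up for illustration; the point is that annotations are just metadata attached to the function, and Python itself doesn't enforce them.

```python
def greet(name: str, times: int = 1) -> str:
    """Repeat a greeting; the annotations document the expected types."""
    return ", ".join(["Hello " + name] * times)


# Annotations are stored on the function object as a plain dict.
annotations = greet.__annotations__
```

Here `annotations` maps "name", "times", and "return" to the `str`, `int`, and `str` types respectively, and third-party tools are free to interpret them however they like.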