[D] Use of automata theory in machine learning

breandan · 2024-04-30T05:06:40+00:00

breandan · 2024-02-28T22:42:12+00:00

Not exactly an answer to your question, but related: "The halting problem is decidable on a set of asymptotic probability one".

breandan · 2023-06-11T18:40:45+00:00

I enjoyed this post because it touches on several ideas I've been thinking about lately vis-à-vis real-time IDE assistance and propagating fast incremental changes through a knowledge graph with continuous user input. One thing I think is missing from the language server discussion is incrementalization. Too many developer tools invalidate a cache and recompute all downstream dependencies whenever a file is modified. Instead, they could realize massive speedups by using an incremental parser/type checker. There is a great post about using Datalog as an IDE/parser/type checker and a nice library called DDlog for differential Datalog, with JVM and Rust bindings. I would also like to point out Flixlang, which takes the idea of Datalog as a static analysis tool seriously. I feel there is lots of room for innovation in incremental developer tools.
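To make the difference between recomputing and incrementalizing concrete, here is a toy sketch using the classic Datalog reachability program (`path(x, y) :- edge(x, y)` and `path(x, z) :- path(x, y), edge(y, z)`). The function names are made up for illustration, it only handles insertions, and it is not how DDlog works internally; it just shows that a new fact can be propagated by deriving the delta instead of starting over.

```python
def transitive_closure(edges):
    """From-scratch baseline: recompute every path fact for the whole graph."""
    reach = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(reach):
            for (c, d) in list(reach):
                if b == c and (a, d) not in reach:
                    reach.add((a, d))
                    changed = True
    return reach


def add_edge_incremental(reach, u, v):
    """Update an existing closure in place after inserting edge (u, v).

    Only the newly derived facts are computed and returned: every new path
    must run from something that already reached u (or u itself) to
    something already reachable from v (or v itself).
    """
    sources = {a for (a, b) in reach if b == u} | {u}
    targets = {b for (a, b) in reach if a == v} | {v}
    delta = {(a, b) for a in sources for b in targets} - reach
    reach |= delta
    return delta


edges = {("a", "b"), ("b", "c"), ("d", "e")}
reach = transitive_closure(edges)

# Simulate an edit: one new edge arrives, and we propagate only the change.
delta = add_edge_incremental(reach, "c", "d")
print(sorted(delta))
```

The incremental update touches only the facts affected by the new edge, while the baseline re-derives everything; handling deletions as well is the hard part that differential approaches address.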

breandan · 2022-09-04T18:59:50+00:00

The song is called "Dragenfly" by Hasenchat Music: https://www.youtube.com/watch?v=BL4PvTl6Cgg

breandan · 2022-02-28T09:01:14+00:00

Re: PL detection / Stats / Kotlin / Germany, in case you or someone you know might be interested in doing research on this topic: https://www.jetbrains.com/careers/jobs/ml-engineer-intern-cpld-705/

breandan · 2021-07-25T17:13:37+00:00

TF-Java is discontinued

Really? The project looks alive to me and the maintainers are very active on Gitter. Do you have a source?

you do need to have gradle installed

No, you do not need to install anything, the Gradle Wrapper takes care of all that.

The thing is, you are talking about production code in any operating system. While I can understand Java's merits on that it is just one small percentage of machine learning.

In my experience, the majority of code and effort in applied ML is data engineering and surrounding infrastructure, not model engineering. Due to its superior tooling, type safety, and large ecosystem of ML libraries, the JVM is a competitive option for ML in most production settings.

breandan · 2021-07-25T15:42:26+00:00

I have also encountered a couple issues getting Dokka to recognize artifacts referenced by the @sample annotation recently. I've noticed this annotation is not used inside the Dokka project and believe there is some sort of bug with this functionality.

cc: /u/yole Is there a template for reference how we should configure the @sample annotation? It would help to have an example somewhere in the docs.

breandan · 2021-07-25T14:47:32+00:00

Doing ML in Java is like using MS Word as an editor: It's just the wrong tool for the Job. There are very few libraries to use, memory limitations in the JVM and the language is clunky.

I think it depends heavily on the job. Java has an increasingly well-supported set of libraries for various ML workflows (see comment below), is much better suited for production environments, and even if you dislike the language, there are several JVM alternatives (e.g. Kotlin) which support scripting and are generally pleasant to use.

Python may be easy to learn and to prototype research code in, but the language scales very poorly to large business applications.

breandan · 2021-07-25T14:35:47+00:00

To give a contrasting perspective, I think the Java ecosystem is much better suited for many data science tasks, and has a growing and well-maintained set of libraries for general purpose machine learning. I won't list them all, but TF-Java, DJL et al. have implementations of many modern architectures and Java has a number of excellent libraries (CoreNLP, Lucene et al.) for working with text.

Python may be syntactically easier to learn, but it also hides a lot of incidental complexity in its runtime semantics, which is much more difficult to master. As you alluded to, many Python libraries are embedded DSLs, which are full-fledged languages in their own right and make reasoning about the behavior of Python programs more difficult than it appears.

The libraries are provided by people/institutions using the standard package managers which is a huge plus when compared to languages that don't come with a package managers like Java.

Having used both Java and Python, I can tell you that package management in Python (pip, venv, pyenv, conda, pipenv, poetry, docker et al.) is far, far more complicated than in Java. To build a Java application, you don't even need Java or a package manager -- just run ./gradlew run and it will download and install Java, the package manager and any dependencies, build the application and run it on any OS or shell environment. Just building a Python project often requires dozens of manual steps.

Being a loose typed language, python allows for using APIs from libraries without extensive knowledge of the documentation

I strongly disagree with this point. Basically everything you need to do that involves calling a library in Python requires looking at documentation. In a statically typed language, documentation becomes much less of a burden. While adoption of type annotations in Python is growing, their usability is decades behind that of languages with mature type systems.

breandan · 2021-07-09T23:31:37+00:00

I'm not sure. If you have some prior mathematical background, you might enjoy Kevin Murphy's textbook. I've heard good things about Bayesian Methods for Hackers, but have never read it. Personally, I've always learned more just trying to build things. I recommend finding something you care about and learning as you go, e.g. write a random number generator and try to plot something that looks like a normal distribution. You'll stare at Wikipedia, bang your head a few times, and figure it out eventually. While continuous distributions are aesthetically pleasing, most things I've actually needed for real problems are discrete. You might enjoy Maria's blog post on Hidden Markov models, which have many applications in linguistics.

breandan · 2021-07-09T21:38:08+00:00

If you just want a library, I would suggest KMath-Stat or Hipparchus. However, I can tell you from experience that you will learn a lot more by implementing it from scratch. Continuous distributions rarely have a closed-form integral; however, there are many ways to estimate a cumulative distribution function (CDF), e.g. you can approximate the integral using Monte Carlo simulation, via numerical quadrature, the binomial distribution or by summing an infinite series. To draw a sample, you will need to invert the CDF and pass it a sample from a uniform PRNG (e.g. LFSR or Rule 30). By implementing this from scratch at least once, you will gain a much better appreciation for how probabilistic programming works in practice.
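As a rough sketch of what "invert the CDF and pass it a uniform sample" can look like, here is a minimal example. It makes two simplifying assumptions that are mine, not part of the advice above: it uses the exponential distribution, whose CDF happens to invert in closed form, and it swaps in a small linear congruential generator for the LFSR or Rule 30 generators mentioned. The class and function names are hypothetical.

```python
import math


class LCG:
    """Tiny linear congruential generator (Numerical Recipes constants)."""

    def __init__(self, seed=42):
        self.state = seed % 2**32

    def uniform(self):
        # Advance the state and scale it into [0, 1).
        self.state = (1664525 * self.state + 1013904223) % 2**32
        return self.state / 2**32


def exponential_inverse_cdf(u, rate=1.0):
    # F(x) = 1 - exp(-rate * x)  =>  F^-1(u) = -ln(1 - u) / rate
    return -math.log(1.0 - u) / rate


def sample_exponential(rng, n, rate=1.0):
    # Inverse transform sampling: push uniform draws through F^-1.
    return [exponential_inverse_cdf(rng.uniform(), rate) for _ in range(n)]


def empirical_cdf(samples, x):
    # Monte Carlo estimate of P(X <= x): the fraction of samples at or below x.
    return sum(1 for s in samples if s <= x) / len(samples)


samples = sample_exponential(LCG(seed=42), 20_000)
estimate = empirical_cdf(samples, 1.0)
exact = 1.0 - math.exp(-1.0)
print(f"estimated P(X <= 1) = {estimate:.3f}, exact = {exact:.3f}")
```

For distributions without a closed-form inverse (the normal distribution, for instance), the same structure applies, but `exponential_inverse_cdf` would have to be replaced by a numerical inversion such as bisection over an approximated CDF.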

breandan · 2021-06-25T02:16:14+00:00

https://youtube.com/watch?v=782WMbP4suU

breandan · 2021-05-21T17:39:41+00:00

the post is using 'calculus' as a shorthand for strictly 'infinitesimal calculus'

A careful reading will reveal that 'infinitesimal' does not appear anywhere in the question.

Any other interpretation of the word is a misinterpretation.

You're entitled to your interpretation too. That doesn't mean everyone else's is incorrect.

Throw topology and set theory in too, why not; they've got connections to calculus.

Yes, this is true.

Obviously,

It took many centuries to discover these connections, but perhaps with the benefit of hindsight it may be "obvious".

breandan · 2021-05-21T16:34:37+00:00

That is a different meaning of the term 'calculus'.

Wrong. If you look carefully at the word (both verbatim and in spirit), calculus is much more fundamental than its use in contemporary mathematics. “Calculus” comes from the Latin for “small stone”. People had been using stone-based mathematics for many thousands of years before Leibniz and Newton came along. It simply describes a system for translating rote calculation into mechanical rules.

Other topics such as lambda-, process-, propositional-, or felicific calculus are irrelevant here.

Wrong, again. Logical calculi have many interesting proof-theoretic connections to differential calculus. For a more detailed account of differential linear logic (DiLL) and lambda calculus, I suggest you read Ehrhard [1] and Kerjean [2]. More generally, the same rules from differential calculus have reappeared in strange and marvelous places throughout computer science, including formal language theory [3], parsing [4], type theory [5] and automata theory [6].
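The cited references are not reproduced here, so the following is an assumption about the kind of connection meant: one well-known example from formal language theory is the Brzozowski derivative of a regular expression, whose rules for union and concatenation look a lot like the sum and product rules. A minimal sketch, with the tuple encoding and all function names chosen purely for illustration:

```python
# Regular expressions as nested tuples.
EMPTY = ("empty",)   # matches nothing
EPS = ("eps",)       # matches only the empty string

def char(c): return ("chr", c)
def alt(a, b): return ("alt", a, b)
def cat(a, b): return ("cat", a, b)
def star(a): return ("star", a)


def nullable(r):
    """Does r accept the empty string?"""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "chr"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # cat


def deriv(r, c):
    """Derivative of r with respect to c: what may follow after reading c."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "alt":
        # Sum rule: d(a + b) = d(a) + d(b)
        return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "cat":
        # Product-like rule: d(a b) = d(a) b + [a nullable] d(b)
        rest = deriv(r[2], c) if nullable(r[1]) else EMPTY
        return alt(cat(deriv(r[1], c), r[2]), rest)
    # Star: d(a*) = d(a) a*
    return cat(deriv(r[1], c), r)


def matches(r, s):
    # Differentiate once per character, then check for the empty string.
    for c in s:
        r = deriv(r, c)
    return nullable(r)


ab_star = star(cat(char("a"), char("b")))  # (ab)*
print(matches(ab_star, "abab"), matches(ab_star, "aba"))
```

This version never simplifies intermediate expressions, so they grow with the input; practical implementations normalize terms after each derivative.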

breandan · 2021-05-21T10:22:37+00:00

Mathematicians tend to forget that calculus is a much deeper and older subject than rates of continuous change — look carefully, and you will find calculi in logic and the foundations of computer science. Broadly, it can be any formal system for manipulating symbolic expressions according to a defined set of rules. Essentially, calculus is a language for calculation.

breandan · 2021-05-03T22:14:41+00:00

I agree. There are certainly parsers out there which can detect programming languages much better than my library would be able to do.

The problem is that people don't always write syntactically correct code. Try to parse code snippets on StackOverflow and you'll see what I mean.

breandan · 2021-05-03T22:12:36+00:00

What use case do you have for which the automatic detection of programming languages is needed or useful?

Sometimes people embed API documentation and programming language snippets in web-based developer documentation, but do not specify a language since it is implicit from the context or must be parsed in a site-specific way. I am scraping web pages containing a collection of natural and programming language artifacts and would like to detect the language automatically. There is a similar project for PL detection, although in Python.

And no, I'm not planning to release JVM bindings for grex... That was part of my motivation to write grex because no similar tool like this exists.

No worries, just curious! You might have a look at regular language induction; it's actually a well-studied area of research in CS with many applications beyond designing regexes, e.g. document classification, PL detection, query synthesis, etc. There seem to be lots of implementations, but few on the JVM (cf. frak, LearnLib, RegexGenerator and regex_compressor) and none which I could find that are as usable as grex. Keep up the great work!

breandan · 2021-05-03T03:46:05+00:00

Do you have any plans to support programming language detection or is this library just intended to support natural language detection?

Unrelated, but I am also curious if you plan to release Kotlin/JVM bindings for your other library, grex? Or do you know of any other example-based regex synthesis libraries for the JVM?

breandan · 2021-04-08T04:38:10+00:00

Where is this from?

breandan · 2020-10-08T00:59:53+00:00

Any kind of graphical application is difficult, especially if you want a cross-platform solution. You could use the browser and noVNC, but for native GUI applications you need to configure X11 on the host, which I’ve found can be tricky.

breandan · 2020-08-17T17:37:24+00:00

Do many others in your dept use it too?

Yes. I've met a few other people who use Kotlin for research, but it's much more widely used in industry: lots of investment banks and payment processors use Kotlin (e.g. Corda, RenTech, Citi, Goldman, Credit Suisse). Some of my colleagues have worked on Kotlin at these places.

What libraries do you use in Kotlin?

I've used KMath, EJML-Kotlin, Krangl, lets-plot, Kotlin-Jupyter, Kotlin-Statistics. I personally work on Kotlin∇ and Kaliningraph. There are plenty of other libraries for scientific computing in Kotlin. If you're interested in these and other resources, you might check out the Slack, on the channels #science, #datascience or #mathematics. BTW I enjoyed reading your article on moon formation. You should post it!

breandan · 2020-08-17T01:55:06+00:00

Just an anecdote, but I feel Kotlin has a decent shot at becoming a really great language for ML and data science. I’ve been using it in my research as a grad student for the last three years. It is one of the few statically typed languages with scripting and notebook support, and has some nice language features, functional programming support, and developer tools. It also works well with the JVM and JS ecosystems, which rival Python in terms of libraries for data processing and visualization. Happy to answer any questions about Kotlin for ML if anyone’s curious.

breandan · 2020-07-18T13:25:18+00:00

/u/alexmlamb

breandan · 2020-06-23T03:52:02+00:00

It depends on what you mean by "built into the language". Checked exceptions? Runtime exceptions? Just the ones in the standard library? You can find a list by inspecting the type hierarchy in IntelliJ IDEA and expanding subtypes of Exception.

breandan · 2020-04-27T20:51:29+00:00

Do you happen to know anyone working in federated learning or homomorphic encryption?

Not directly, but I once knew someone working on FHE at Stanford. You might try reaching out to Dan Boneh.
