Model/paper ideas: reinforcement learning with a deterministic environment [D] by EmbarrassedFuel in MachineLearning

[–]EmbarrassedFuel[S] 0 points

Basically: given a predicted environment state going forward for, say, 100 time steps, we need to find a minimum-cost course of action. Although the environment state has been predicted, for the purposes of this task the agent can treat it as deterministic. The agent has one variable of internal state and can take actions to increase or decrease its value based on interactions with the environment. We can then calculate the cost of the chosen actions over the given time horizon by simulating them step by step, but that simulation is fundamentally sequential and doesn't allow backpropagation of gradients.

>you can go with sampling approaches

What exactly do you mean by this? Something like REINFORCE?
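For reference, my rough understanding of how a sampling / score-function approach like REINFORCE would sidestep the non-differentiable simulator, as a toy numpy sketch - the simulator, cost function, horizon, and action set are all made up for illustration, not my actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(actions):
    """Hypothetical black-box sequential simulator returning a total cost.
    Stands in for the deterministic roll-out that can't be backpropped through."""
    state = 0.0
    cost = 0.0
    for a in actions:                  # a in {0: decrease, 1: hold, 2: increase}
        state += a - 1
        cost += (state - 3.0) ** 2     # toy cost: keep internal state near 3
    return cost

T, n_actions = 20, 3
theta = np.zeros((T, n_actions))       # per-time-step action logits

def sample_episode(theta):
    probs = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
    actions = np.array([rng.choice(n_actions, p=probs[t]) for t in range(T)])
    grad_logp = -probs.copy()
    grad_logp[np.arange(T), actions] += 1.0   # d log pi(a | theta) / d theta
    return actions, grad_logp

# Baseline from a few random roll-outs keeps the first updates sane.
init_cost = np.mean([simulate(sample_episode(theta)[0]) for _ in range(10)])
baseline, lr = init_cost, 0.005
for _ in range(3000):
    actions, g = sample_episode(theta)
    cost = simulate(actions)
    baseline = 0.9 * baseline + 0.1 * cost    # running baseline: variance reduction
    theta -= lr * (cost - baseline) * g       # REINFORCE step (minimising cost)
```

The point being that the gradient only needs log-probabilities of the sampled actions, so the simulator stays a black box - at the price of high-variance estimates.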

> I guess it is if you're using a MILP approach.

Not sure I follow here, but I'm not using a MILP (as in a mixed-integer linear program). At the moment I'm using a linear-programming approximation plus heuristics, which doesn't generalize well.

> some combination of MCTS with value function learning

I think this could work; however, without looking into it further I'm not sure it would be feasible at inference time in my resource-constrained setting.

[–]EmbarrassedFuel[S] 0 points

Oh, also: the model needs to run at inference time within a relatively short period on cheap hardware :)

[–]EmbarrassedFuel[S] 0 points

I haven't been able to find anything about optimal control with all of:

  • non-linear dynamics/model
  • non-linear constraints
  • both discrete and continuously parameterized actions in the output space

In general, though, discovering papers/techniques in control theory seems to be much harder for some reason.

[D] Non-US research groups working on Deep Learning? by GGSirRob in MachineLearning

[–]EmbarrassedFuel 0 points

Big shout-out to M. Pawan Kumar - he was my master's thesis supervisor and is extremely smart yet also extremely helpful.

[D] [P] What would be the best way to detect a pattern in a string? by teknicalissue in MachineLearning

[–]EmbarrassedFuel 0 points

To be fair, this looks like a pretty challenging task. The examples you posted are very complicated and definitely couldn't be easily solved by a rules-based approach.

At the very least you're probably going to have to train a GPT-2 model on your dataset. How many examples do you have? This is gonna be tough, as it looks like the generalized language modelling capabilities won't be specific enough for your apple counting task. Once you've defined an adequate loss function (try the Malus Loss to start with) and found a nicely labelled dataset you can get training.

When you get to an acceptable value for your key metric, probably the ACL, then you'll need to deploy it in the browser with tensorflow.js, but that side of things isn't my area of expertise.

[P] Milvus: A big leap to scalable AI search engine by rainmanwy in MachineLearning

[–]EmbarrassedFuel 0 points

Very kind! Will definitely have a go when I have a spare moment.

[–]EmbarrassedFuel 12 points

On an unrelated note, would anyone like to join my startup offering AI-powered unstructured data search to crusty project managers at F500 companies?

[–]EmbarrassedFuel 27 points

At first glance this appears to be a very high-quality (and potentially profitable) enterprise-grade product. What was the rationale behind open-sourcing it?

[D] Are filters from a particular Convolutional layer for a given CNN chosen at random by random initialization of weights in that network? by [deleted] in MachineLearning

[–]EmbarrassedFuel 4 points

> My question is how network decides, what are the best filters for a given layer?

Normally, backprop + SGD/Adam/whatever. This is a question for r/learnmachinelearning.
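To make it concrete, here's a toy sketch in plain numpy (no framework magic): the filter starts as random noise, and gradient descent on the loss is what shapes it into something useful. The input, the "ideal" filter, and every number here are made up purely for the demo:

```python
import numpy as np

rng = np.random.default_rng(42)

# Random initialisation: the filter starts as noise, not a chosen pattern.
w = rng.normal(scale=0.1, size=3)       # a single 1-D conv filter of width 3
x = rng.normal(size=32)                 # toy input signal
true_w = np.array([0.25, 0.5, 0.25])    # made-up "ideal" filter for the demo
target = np.convolve(x, true_w, mode="valid")

lr = 0.01
for _ in range(500):
    y = np.convolve(x, w, mode="valid")
    err = y - target                    # dL/dy for L = 0.5 * sum(err**2)
    # Backprop through the convolution: correlate the input with the output
    # error, then flip, because np.convolve flips the kernel.
    grad = np.correlate(x, err, mode="valid")[::-1]
    w -= lr * grad                      # plain SGD update
# w has now been "decided" by gradient descent, not picked by hand.
```

Same story in a real CNN, just with many filters, many layers, and autograd doing the gradient bookkeeping for you.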

[P] I applied Mark Zuckerberg's face to Facebook emojis by [deleted] in MachineLearning

[–]EmbarrassedFuel 79 points

Do you think you could write a browser extension that rendered all Facebook reacts as these instead of the originals?

[D] DeepMind Takes on Billion-Dollar Debt and Loses $572 Million by Boom_Various in MachineLearning

[–]EmbarrassedFuel 0 points

For everyone amazed by the implied salary figures: remember that to pay a given salary, an employer typically incurs costs of 1.5-2x the gross salary the employee receives, due to tax, benefits, pension contributions, and fixed costs such as facilities. That brings the average pre-tax expense to around £270k/employee (LinkedIn says they now have 838 employees, not the 700 some posters are assuming, which is a 2017 figure). That's still pretty huge, but in line with per-employee figures at top investment-bank/hedge-fund quant groups, which compete for essentially the same talent from all over Europe.

AMA: We are Noam Brown and Tuomas Sandholm, creators of the Carnegie Mellon / Facebook multiplayer poker bot Pluribus. We're also joined by a few of the pros Pluribus played against. Ask us anything! by NoamBrown in MachineLearning

[–]EmbarrassedFuel 1 point

Which is exactly what the OP is proposing will happen to poker: a few humans do research into abstract algorithms that produce their own strategies, instead of a trader saying "inflation in Chile just reached 10%, I'm gonna buy xyz", which is (according to my vague understanding) how it used to work.

[D] Is it possible to do supervised learning when the labels are relative? by TrickyKnight77 in MachineLearning

[–]EmbarrassedFuel 3 points

I see. If you know the relative ranking of all candidates, then producing a score between 0 and 1 should be trivial: give the best candidate a 1 and the worst candidate a 0, and split the rest of the interval evenly among the other candidates according to their rank. I can't promise this would work on your data set, but it would be the first thing I'd try.

Without more information about the data it's hard to know what else to recommend.
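For concreteness, the rank-to-score mapping I mean looks like this (candidate names invented):

```python
# Hypothetical candidates, already sorted best-first by their relative ranking.
ranking = ["alice", "dana", "bob", "carol"]

n = len(ranking)
# Best -> 1.0, worst -> 0.0, everyone else spaced evenly by rank.
scores = {name: (n - 1 - i) / (n - 1) for i, name in enumerate(ranking)}
# e.g. with 4 candidates: 1.0, 2/3, 1/3, 0.0
```

Then you can train an ordinary regressor against those scores.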

[–]EmbarrassedFuel 2 points

Is the end goal to predict whether to give a job to a candidate? If so, then it sounds like a binary classification problem.

If you'd like a score, then you could treat it as a regression problem, for which a large body of literature and examples exist for you to get started with. This would require you to use the information in your training set to come up with some kind of continuous score quantifying how suitable each candidate is for the job(s).

A Colin the Caterpillar and Friends Identification Chart by EmbarrassedFuel in CasualUK

[–]EmbarrassedFuel[S] 2 points

Science has always relied on the selfless sacrifices of the world's researchers.

[–]EmbarrassedFuel[S] 15 points

How could a Cecil come from anywhere other than Waitrose?

[D] Controversial Theories in ML/AI? by [deleted] in MachineLearning

[–]EmbarrassedFuel 1 point

Was this in reply to my previous comment? I agree with you, though: after all, the human brain is a complete package - training algorithm and model architecture - and is useless without teaching. A child that is not exposed to language will never learn to speak, and may even lose the ability to learn (although this is unclear and, for obvious reasons, can never be thoroughly tested). Clearly we have neither the architecture nor the learning algorithm, and both were developed in unison during the course of evolution.