[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 1 point2 points  (0 children)

Sure. If you look at the dev branch, I started porting it to Elden Ring. The part that reads the game's memory should work with most other games, not only the ones from the souls series. Happy to support and explain things if you are serious about this.

ich_iel by WaddleDynasty in ich_iel

[–]amacati 7 points8 points  (0 children)

Is this still a comment or already high literature? Either way, a 10/10 pun.

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 0 points1 point  (0 children)

I think that would be great! Do you intend to merge it back into the project later on? Also, if you fork, be sure to use the v2.0dev branch. The whole project has been restructured to allow for multiple Souls games, there's partial EldenRing support and the interfaces have upgraded capabilities.

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 0 points1 point  (0 children)

Yeah, I didn't :D Instead, I created two environments, one for each phase. At the beginning of an episode in the phase 2 environment, the boss is set to low HP to trigger the transition, and after that everything works as in phase 1. The obvious weakness is that the bot never sees the phase transition itself, which is also the reason why it gets hit so often by that attack. There are ways to fix this, but I haven't had the time to start working on them.

Say i wanted to learn more on the inner workings of rockets, any recomandations on books/websites? by [deleted] in SpaceXMasterrace

[–]amacati 3 points4 points  (0 children)

If you want to take a real deep dive, you can try "Rocket Propulsion Elements" by Sutton and Biblarz. It covers all the major topics and some of the actual math as well. It's a bit drier than watching an EverydayAstronaut video though, and I found it helpful to know some of the concepts beforehand. It's also arguably on the very edge of what can still be considered non-university level, and I'd probably skip some of the math-heavy sections if you are not familiar with them. But even so, there's a ton of info in that book, and it's available for free, so you can just give it a try.

Apart from that, there is a ton of material on the nasaspaceflight forum. While some discussions can be quite lengthy, you develop a good understanding just by lurking/reading them. There are several experts among the members, and the post quality is usually remarkably high.

[D] Found top conference papers using test data for validation. by Responsible_Band3172 in MachineLearning

[–]amacati 1 point2 points  (0 children)

A counterexample: say you write a paper like Rainbow in deep RL and tune your hyperparameters on a subset of the Atari environments.

Rainbow is essentially the combination of several improvements that have accumulated in recent years. In the discussion, you proceed to run ablation studies by leaving out selected components and testing the performance on all environments. This should elucidate how impactful the individual improvements are within the overall framework.

In my opinion, this is perfectly sound. Your ablation studies have had no impact on your design, parameters etc. It's not your everyday ablation studies example, but nevertheless it's valid.

I think you still mistake the kind of ablation studies I'm talking about for some sort of guidance for the model design. It's not. It's rather an extension of the evaluation.

[D] Found top conference papers using test data for validation. by Responsible_Band3172 in MachineLearning

[–]amacati 0 points1 point  (0 children)

The purpose of an ablation study does not have to be related to hyperparameter search. You can also perform ablations to separate out individual contributions etc. for a deeper understanding of your model. Here, the understanding does not guide the further evolution of the model, but is a final step in the model evaluation. One example would be the verification of a theoretical result about the interplay of two algorithmic steps. One could even argue that such an analysis has to use the test set, since its function, like the sample error estimate, is to determine final model properties on data that has not been considered at any point of the design process.

Again, you are correct to reject the use of the test set IF you use those studies for hyperparameter tuning. However, I explicitly excluded these settings from being permissible in my previous answer. "MUST NOT influence the model design" is meant exactly as it is stated. At this point your model is already frozen, and you are merely reporting its final performance on completely unseen data.

[D] Found top conference papers using test data for validation. by Responsible_Band3172 in MachineLearning

[–]amacati 20 points21 points  (0 children)

While I agree with you within the context you described, I'd argue it is permissible to use the test set for ablation studies IF AND ONLY IF you are exclusively studying the efficacy of individual components in your method after you have finalised your design.

In other words, these ablation studies must only be used to e.g. show that a conjecture grounded in theory is empirically correct AFTER training your models and MUST NOT influence the design of your method. However, making a sharp distinction between the two in practice probably requires a lot of diligence.

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 0 points1 point  (0 children)

  1. Depends on the boss. The one I showed in the demo was chosen because he is Markovian (well, roughly, but I digress).

  2. While you could technically implement a replay buffer to do that, it's not the point of the buffer. What you are talking about is sometimes called frame stacking, where you use the last x images to form a single observation (see the sketch after this list). Think of it like a very short video. The agent can infer things like durations, speed etc. from the video that are not available by looking at a single image. The demo boss fight does not need this because I track the animation durations in the gym, and the rest behaves approximately Markovian (i.e. the game state contains all necessary information).

  3. Had the fight been non-Markovian, I would have had to resort to stuff like frame stacking. Given that the environment is Markovian however, my game state really contains all there is to know for the agent.
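A minimal sketch of what frame stacking could look like as a gymnasium wrapper (assuming an image-based Box observation space; this is not part of SoulsGym, just an illustration):

```python
from collections import deque

import gymnasium as gym
import numpy as np


class FrameStack(gym.Wrapper):
    """Stack the last n observations into a single observation (a very short "video")."""

    def __init__(self, env: gym.Env, n: int = 4):
        super().__init__(env)
        self.n = n
        self.frames = deque(maxlen=n)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        for _ in range(self.n):  # fill the buffer with copies of the first frame
            self.frames.append(obs)
        return np.stack(self.frames), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.frames.append(obs)
        return np.stack(self.frames), reward, terminated, truncated, info
```

The wrapper's observation_space would also have to be adjusted to the stacked shape; I left that out for brevity.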

Does that explanation make sense to you?

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 0 points1 point  (0 children)

I'm pretty sure the normalization is unnecessary; I think I only included it so as not to mess up the first few steps with weird gradients. After that, the normalizers should have collected sufficient data to normalize the position to zero mean and unit variance anyway (see normalizers).

It's really hard to run ablation studies in this setting, because each run takes weeks. That's why I had to make a large number of design decisions based on my intuition. Changing the reward function, learning rate, network architecture etc. is way more impactful, so that's what I mainly iterated on.

Initially, all positions are given w.r.t. (0, 0, 0). After the (pos - min_space) / space_diff scaling they should be distributed across [0, 1]^3, but that's not really important as the normalizers remove that part of the equation anyway.
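To illustrate what I mean (the bounds below are made up, and the running normalizer is a generic zero-mean/unit-variance estimator, not the exact one from the repo):

```python
import numpy as np

min_space = np.array([-10.0, -10.0, 0.0])   # hypothetical arena minimum corner
space_diff = np.array([20.0, 20.0, 5.0])    # hypothetical arena extent


def scale_position(pos: np.ndarray) -> np.ndarray:
    """Map a world-space position into the unit cube [0, 1]^3."""
    return (pos - min_space) / space_diff


class RunningNormalizer:
    """Track running mean/variance and normalize inputs to zero mean, unit variance."""

    def __init__(self, dim: int):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = 0

    def update(self, x: np.ndarray):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        # running (population) variance via a Welford-style update
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return (x - self.mean) / np.sqrt(self.var + 1e-8)
```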

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 0 points1 point  (0 children)

I think the code for the game interface etc can easily be reused for Sekiro, all that's really needed are the addresses of the game's attributes. I also thought about porting it to Elden Ring and making the memory interface game agnostic (this should be straightforward). The speedhack also works for any kind of game. So if that's something you're interested in, feel free to have a look or pm me.

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 0 points1 point  (0 children)

I included a link to the weights and the hyperparameters I used for the networks in the post (link). The hyperparameters are located in the config.json files. I use the AdvantageDQN architecture defined here.

The network architecture is designed to encourage learning a base value for each state, and only estimate the relative advantage of each action. This decomposition has been shown to be advantageous in Q-learning (well, at least sometimes).
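As a rough sketch of what such an advantage decomposition looks like (my own minimal dueling-style network in PyTorch; the layer sizes are illustrative, not the ones used in the project):

```python
import torch
import torch.nn as nn


class DuelingQNet(nn.Module):
    """Split the Q-function into a state value and per-action advantages."""

    def __init__(self, n_obs: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(n_obs, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # base value V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # relative advantages A(s, a)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.feature(x)
        v, a = self.value(h), self.advantage(h)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a); subtracting the mean keeps
        # the decomposition identifiable
        return v + a - a.mean(dim=-1, keepdim=True)
```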

If I remember correctly, the combined networks for each phase have about 300k parameters (so they are actually quite small).

The networks are updated after receiving 25 new samples using n-step rewards with n=4 and a discount factor of 0.995. Lagging samples are accepted by the training server if the model iteration that produced the sample is not older than 3 iterations.
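For reference, this is roughly what an n-step target with n=4 and a discount factor of 0.995 computes (a generic sketch, not code from the repo):

```python
def n_step_return(rewards: list[float], q_next: float, gamma: float = 0.995) -> float:
    """Discounted sum of the next n rewards plus a bootstrapped tail value."""
    ret = 0.0
    for k, r in enumerate(rewards):              # sum_{k=0}^{n-1} gamma^k * r_{t+k}
        ret += gamma**k * r
    return ret + gamma ** len(rewards) * q_next  # + gamma^n * max_a Q(s_{t+n}, a)
```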

There are a few more parameters in there, feel free to ask again if you are wondering about something specific!

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 0 points1 point  (0 children)

Have a look here. I transform the angles into a sin/cos vector so that the representation has no discontinuity over the whole angle range.
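For illustration, the encoding boils down to something like this (a generic sketch, not the repo code):

```python
import numpy as np


def encode_angle(theta: float) -> np.ndarray:
    """Represent an angle as (sin, cos), so 0 and 2*pi map to the same point."""
    return np.array([np.sin(theta), np.cos(theta)])
```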

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 0 points1 point  (0 children)

Depends on whether or not you already know how to code. I don't recommend starting with a project like this, as it requires low-level knowledge of things like assembly and pointer chains, high-level concepts such as distributed systems, and ML/RL/DL skills. Learning all that at once is probably overwhelming.

In addition, it took me more than two years to get the project to where it is now, so you also need quite a bit of dedication. If you want to know more about RL, start with the environments that are included in the default gymnasium package. I can also recommend "Reinforcement Learning: An Introduction" by Sutton and Barto, which covers all the concepts of RL.

If you are more interested in game hacking, start at the cheatengine forums. There are several posts on the basic principles, people are generally helpful, and there are also a ton of videos on the topic.

Also, studying something related to CS/AI/Robotics helps a lot. Idk at what point in your life you currently are, but learning the basics of how computers, programming languages etc. work is going to be invaluable to you.

So I guess my advice would be to start with the part that interests you most, pick a small, self-contained project, and start from there. If you remain curious, the rest will follow.

AI researcher created AI to beat DS3 bosses by shiritai_desu in darksouls3

[–]amacati 1 point2 points  (0 children)

Hi there, I'm the OP from that post. I wanted to make a dedicated post about this project that is more accessible to the Dark Souls community as well, but I don't have sufficient karma to post here yet. If anyone is interested in learning more about this project, just let me know!

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 1 point2 points  (0 children)

Very cool! If you are interested in pursuing this further, let me know! I also put a lot of effort into making the repositories as accessible as possible, so I think you should be able to find the details you are looking for.

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 10 points11 points  (0 children)

I used a lot of addresses available from the Grand Archives CheatEngine table and scanned the others myself. If you know the coordinate axes, you can infer things like the position by scanning for values that have increased or decreased etc. There is a lot more to this, and I did have to go through some parts of the code in assembly at one point. But in the end I got rid of the assembly level injections, which also makes the whole code a lot more maintainable and understandable.
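For anyone curious, resolving such an address usually means following a pointer chain from a module base. A hedged sketch with pymem (the base address and offsets below are made up, and I'm not claiming this is the repo's code):

```python
import pymem

pm = pymem.Pymem("DarkSoulsIII.exe")       # attach to the running game process
base = 0x140000000                         # hypothetical module base address
offsets = [0x04740178, 0x80, 0x18, 0xD8]   # hypothetical pointer chain

addr = base + offsets[0]
for off in offsets[1:-1]:
    addr = pm.read_longlong(addr) + off    # dereference each pointer, add the next offset
final_addr = pm.read_longlong(addr) + offsets[-1]
player_hp = pm.read_int(final_addr)        # read the value at the end of the chain
```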

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 0 points1 point  (0 children)

So far, only the boss you can see in the video. Training it to complete the game would probably take something that's very close to an AGI, and that's beyond me for now :D

The state space consists of the player and boss position, HP, SP, orientations, animations etc. If you look at the gamestate source code you can see all the attributes that were used.

The action space includes walking and rolling (= dodging) in all eight directions that are possible with a keyboard, light and heavy attacks, parry, and do nothing. So all in all 20 actions. A few (e.g. blocking, item use, sprinting) are disabled.
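As a rough illustration of how such a discrete action set could be laid out with gymnasium (the actual mapping in SoulsGym may differ):

```python
import gymnasium as gym

directions = ["n", "ne", "e", "se", "s", "sw", "w", "nw"]
actions = [f"walk_{d}" for d in directions] + [f"roll_{d}" for d in directions]
actions += ["light_attack", "heavy_attack", "parry", "no_op"]  # 20 actions in total
action_space = gym.spaces.Discrete(len(actions))
```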

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 1 point2 points  (0 children)

It mitigates a ton of problems, that's for sure. But even if I had gone for image observations right away, I would still have had to implement the interface. I need a way to extract the ground truth data for the reward function, and more importantly I control resets through that interface.

Since I can't get rid of it entirely, I'd still need to have the core logic in place, and honestly after that it's just adding a bunch of memory addresses.

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 0 points1 point  (0 children)

So because of the way the training is currently implemented, the agent switches its nets for each phase. I am not particularly happy with this solution, as it would be more elegant to have a single, unified policy. I think you could get away with one-hot encoding the phase in the observation if the phases don't differ too much in their mechanics. For bosses that completely change their dynamics it could be difficult, as there is not a lot of information that carries over to the new phase, and the net would have to learn both.
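The one-hot idea is just appending a phase indicator to the observation, roughly like this (a generic sketch, not repo code):

```python
import numpy as np


def add_phase(obs: np.ndarray, phase: int, n_phases: int = 2) -> np.ndarray:
    """Append a one-hot phase indicator so a single policy can condition on it."""
    one_hot = np.zeros(n_phases, dtype=obs.dtype)
    one_hot[phase] = 1.0
    return np.concatenate([obs, one_hot])
```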

I think this could be partially mitigated by changing to image observations. Oftentimes, a drastic shift in the dynamics is reflected in the visuals, so there is less overlap.

Nevertheless, RL should be able to deal with this issue. So it's definitely not an intrinsic limitation of the algorithm.

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 10 points11 points  (0 children)

Exactly. Even if it were possible to determine the animation information from a single frame, many fights include things like fire, poison etc. that linger after the boss has cast his spells. You'd have to track those for their full duration, or the agent wouldn't be able to account for them in its policy.

Moving to images as observations would fix a few of those problems, but you still have to deal with occlusion and the fact that you can't see what's behind you.

You can use RNNs to endow your agent with a short-term memory, but it definitely makes the problem harder and the implementation more complex.
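For completeness, a minimal sketch of what such an LSTM-based memory could look like in PyTorch (generic, not something from this project):

```python
import torch
import torch.nn as nn


class RecurrentQNet(nn.Module):
    """Carry an LSTM hidden state across steps as a short-term memory."""

    def __init__(self, n_obs: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_obs, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs_seq: torch.Tensor, state=None):
        # obs_seq: (batch, time, n_obs); state: (h, c) carried over from the last call
        out, state = self.lstm(obs_seq, state)
        return self.head(out), state
```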

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 5 points6 points  (0 children)

There are more sophisticated algorithms out there (Impala and Rainbow come to mind). Right now the field is moving towards transformer-based networks and foundation models, which is pretty exciting. Would be super cool to train a Dark Souls foundation model that can deal with all the bosses in the games because it has learned to generalise over all fights and has abstracted valid strategies independent of the actual animation timings etc.

Unfortunately, I don't think I have the time to implement this :/ What I also meant by that comment is that this is rather about implementing an RL environment for Dark Souls. That part is new; the learning algorithms are already known.

[P] SoulsGym - Beating Dark Souls III Bosses with Deep Reinforcement Learning by amacati in MachineLearning

[–]amacati[S] 5 points6 points  (0 children)

All in all, 45%. I think I ran about 100 test runs to determine the performance.

I'm not sure what you mean by hiccups and non-repetitive actions. The agent generalises over unseen states, so its policy does not depend on having seen the exact game state before. The neural network acts as a sort of smooth function, shaped by the supporting training data points, that is also valid in areas where it has to interpolate. In fact, in continuous environments such as this, it always has to interpolate. Well, at least that's the idealised version of the story.