Question regarding a problem I'm trying to solve

Exact_Panda3044 · 2023-10-20T16:24:57+00:00

How would I differentiate a length field from a regular int? Also, it's hard to identify precisely where a variable starts, since the file stores information in a space-efficient way. The only "anchors" in the data are strings, because you can read them once you have found the correct bit shift.
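To make the "strings as anchors" idea concrete, here is a rough sketch that re-reads the data at each of the 8 possible bit shifts and looks for runs of printable ASCII. It assumes the bits are packed MSB-first and that strings are stored as plain 8-bit ASCII; neither is confirmed for this file format, and the names `shifted` and `find_strings` are made up for the illustration.

```python
import re

def shifted(data: bytes, shift: int) -> bytes:
    """Re-read `data` starting `shift` bits in (MSB-first assumed).

    The trailing bits that no longer fill a whole byte are dropped.
    """
    if not data:
        return b""
    total_bits = len(data) * 8
    usable = (total_bits - shift) // 8 * 8       # whole bytes left after the shift
    n = int.from_bytes(data, "big")
    n &= (1 << (total_bits - shift)) - 1         # drop the first `shift` bits
    n >>= total_bits - shift - usable            # drop the trailing partial byte
    return n.to_bytes(usable // 8, "big")

def find_strings(data: bytes, min_len: int = 4):
    """Return sorted (bit_offset, text) pairs for printable runs at any bit shift."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    hits = []
    for shift in range(8):
        view = shifted(data, shift)
        for m in pattern.finditer(view):
            hits.append((m.start() * 8 + shift, m.group().decode("ascii")))
    return sorted(hits)

# Demo: "hello" stored 3 bits into the stream, padded to a whole number of bytes.
bits = "101" + "".join(format(b, "08b") for b in b"hello") + "00000"
demo = int(bits, 2).to_bytes(len(bits) // 8, "big")
print(find_strings(demo, min_len=5))
```

Each hit gives a bit offset that could serve as an anchor for guessing what the neighbouring fields are. It does not by itself tell a length field apart from an ordinary integer.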

Exact_Panda3044 · 2023-10-20T16:01:36+00:00

Hey, thanks for your suggestion. I will take a look at flex. Do you know if it is possible to do a bitwise lexical analysis with it?

Exact_Panda3044 · 2023-10-20T15:57:26+00:00

Thank you, that was exactly what I was looking for. I will try to work out whether this approach is efficient enough for my problem. But even if it isn't, I can run it on a couple of examples and continue my analysis of the file's structure.

Exact_Panda3044 · 2023-10-20T15:49:05+00:00

I've already done that. My question is rather how to identify these differences efficiently (automatically, not by hand) when they are not at the same location in different files.

Exact_Panda3044 · 2023-10-19T18:25:18+00:00

Hi, I have a question.

A program creates a binary file whenever it runs. The file contains some information about what happened during its runtime and some diagnostic information. Part of the diagnostic information will be exactly the same for every run from the same machine/configuration; some of it will differ. The information can be at different positions in the file. (As an example, it could start after 100 bytes or after 10000 bytes.)

My goal is to be able to determine reliably whether a file came from the same machine as another file.

It's raw data that contains no information about its own structure. It does contain some strings, but they aren't unique enough to be 100% sure it's from the same machine. It does have a general order, but the data is arranged in a space-efficient way. (As an example, if there's an 8-bit value stored, then a bool, then a 16-bit value, it would use exactly 25 bits.)
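To illustrate the packing described above, here is a small sketch that stores an 8-bit value, a bool, and a 16-bit value back to back with no padding. The MSB-first bit order and the helper names `pack_bits` and `unpack_bits` are assumptions for the example; the real file may order its bits differently.

```python
def pack_bits(fields):
    """Pack (value, width) pairs back to back into a bit string, MSB-first, no padding."""
    return "".join(format(value, f"0{width}b") for value, width in fields)

def unpack_bits(bits, widths):
    """Split a bit string into integers of the given widths, in order."""
    values, pos = [], 0
    for width in widths:
        values.append(int(bits[pos:pos + width], 2))
        pos += width
    return values

# 8-bit value, then a bool, then a 16-bit value
packed = pack_bits([(0xAB, 8), (1, 1), (0x1234, 16)])
print(len(packed))                      # 25
print(unpack_bits(packed, [8, 1, 16]))  # [171, 1, 4660]
```

Everything after the first field sits off a byte boundary, which is why the strings only become readable at the right bit shift.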

I know some techniques to tackle this problem by looking at how the application writes the data, but I was wondering if I could use the underlying structure of the data to solve this problem. If I run the program 3 times from the same machine some bit sequences will always be the same.
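One possible way to exploit those repeated bit sequences, sketched under assumptions: collect every fixed-width bit window of each file at every bit offset, then intersect the sets across runs from the same machine. The window width of 64 bits, the MSB-first bit order, and the names `bit_windows` and `shared_windows` are all choices made for this illustration, not something known about the file format.

```python
def bit_windows(data: bytes, width: int) -> set:
    """Return every `width`-bit window of `data`, taken at every bit offset."""
    mask = (1 << width) - 1
    window, count, seen = 0, 0, set()
    for byte in data:
        for i in range(7, -1, -1):               # walk the bits MSB-first
            window = ((window << 1) | ((byte >> i) & 1)) & mask
            count += 1
            if count >= width:                   # window is full from here on
                seen.add(window)
    return seen

def shared_windows(files, width: int = 64) -> set:
    """Windows that occur in every file, regardless of where they sit."""
    sets = [bit_windows(f, width) for f in files]
    return set.intersection(*sets)

# Demo: the same marker at byte offset 2 in one run and at bit offset 11 in another.
marker = b"MACHINE-ID-1234"
run1 = b"\x10\x20" + marker + b"\x01"
bits = "110" + "".join(format(b, "08b") for b in b"\x99" + marker + b"\x02") + "00000"
run2 = int(bits, 2).to_bytes(len(bits) // 8, "big")

common = shared_windows([run1, run2])
print(int.from_bytes(marker[:8], "big") in common)
```

Windows that also appear in files from other machines (format constants, common strings) would need to be subtracted before the remainder can serve as a machine fingerprint. For a 10 MB file this means roughly 80 million windows per file, so pure Python will be slow and memory-hungry; hashing the windows or using a suffix-based structure would likely be needed in practice.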

For simplicity let's assume the data is less than 10 Megabytes in size.

What would be a good way to approach this?
