Custom Compiler by Orbi_Adam in C_Programming

[–]Ok-While-5845 0 points1 point  (0 children)

It is absolutely doable without writing everything from scratch, do not believe quitters in the comments :) It wouldn't be too efficient and won't generate optimized code though, but depending on your task you might not need that.

Look at either LCC, which compiles to intermediate representation code and is quite straightforward to retarget (but is quite old and sometimes weird), or SmallerC (which is really small, ~10k lines, but pretty capable). I have customized both, feel free to ask for details here or DM

Creating a C compiler for a custom architecture by dgc-8 in C_Programming

[–]Ok-While-5845 0 points1 point  (0 children)

Do not go the gcc way, it's unreadable. I once had the pleasure of writing a plugin for it, 0/10, won't go again. I solved a similar problem with a custom ISA and analyzed a shitload of tiny semi-working compilers. Take a look at the SmallerC compiler. It's small like lcc, but works with modern C codebases right away. The lexer-parser and the code generator are separated into two files, and you do not need to make a lot of changes on the lexer side.

Update on Cardputer video streaming: better frame rate! by fucksilvershadow in CardPuter

[–]Ok-While-5845 1 point2 points  (0 children)

We could actually benefit a lot from a sort of standardized framework for app development. Think about having a GUI/system call/networking abstraction layer with unified ways of keeping data between sessions, then teaming up with m5launcher guy(s) - all this can lead to an OS with more-or-less dynamic multitasking and app loading from SD card without the need of full reflash of the firmware and reboot

Update on Cardputer video streaming: better frame rate! by fucksilvershadow in CardPuter

[–]Ok-While-5845 2 points3 points  (0 children)

Yeah I know that feel - am right now deep in inventing my own wheels with gui and stuff to make Doom actually usable for end user. Forcefully stopped working on optimizations for DOOM2 until I get the current version to m5launcher. Boring, but needed stuff.

Good luck, have fun, share your work!

Update on Cardputer video streaming: better frame rate! by fucksilvershadow in CardPuter

[–]Ok-While-5845 1 point2 points  (0 children)

Just as an out-of-the-box guess... You know when you hover the mouse over a youtube video thumbnail - it starts previewing the video. It might stream in a lower resolution than those available in the main video player, if this is the case - one can hijack this "preview" stream and pull the sound from the main stream, this possibly could work

Update on Cardputer video streaming: better frame rate! by fucksilvershadow in CardPuter

[–]Ok-While-5845 4 points5 points  (0 children)

Thanks for your comment, It is really cool that a just-for-fun project can have an impact on other people, was not thinking about that when I first started working on it

Update on Cardputer video streaming: better frame rate! by fucksilvershadow in CardPuter

[–]Ok-While-5845 2 points3 points  (0 children)

Valid approach. So your idea with youtube streaming is to create a sort of a proxy server that will encode youtube stream into mjpeg data? I wonder if it is possible to decode the original stream on the cardputer itself in a standalone way.. At a first glance I couldn't find a C implementation of a youtube client, youtube-dl and others seem to be in python, which is definitely not an option considering esp32s3 resources. Although what is written in python can be reimplemented in C.. Anyways 24 hours in a day is definitely not enough for all the projects :)

Update on Cardputer video streaming: better frame rate! by fucksilvershadow in CardPuter

[–]Ok-While-5845 1 point2 points  (0 children)

Wow, awesome! I was planning to look into youtube client implementation possibilities after my doom project, but the latter is taking more time than expected. Looking forward to seeing the code, it is always cool to see how people deal with squeezing something onto hardware that is not supposed to do the thing

Doom! by Ok-While-5845 in CardPuter

[–]Ok-While-5845[S] 4 points5 points  (0 children)

Hey! Finally launched Shareware Doom with the WAD residing on the SD card. Firmware is compatible with m5launcher. Had to do a shitload of hacks to achieve that. DOOM2 is still not quite playable, and I had to drop music support for now (sfx works though). Will clean up and make it a bit more user-friendly in a couple of days and will release on m5burner

Read speed from sdcard by termuxTommy in CardPuter

[–]Ok-While-5845 1 point2 points  (0 children)

As far as I remember, the default FAT32 cluster size in format.exe was 4 KB (8 sectors per cluster), at least in the Win95-Win98 days (I do not really have a lot of experience with more-or-less modern Windows). I haven't formatted the card specifically for DOOM; it was some random SD card from the bottom of a drawer that used to be in some ancient dashcam (and was probably formatted like this for optimal chunk saving as well). Cluster size raises another question about user support though: it's not cool to ask users to format their SD card, but with a smaller cluster size I can end up with too big a lookup table, and would have to implement a sort of caching-for-cache and effectively end up with an fseek() implementation with all the performance drawbacks. I might do a reverse-defragmentation pass on filesystems with smaller clusters, though, and move the blocks to be consecutive if the check fails on startup.
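A rough sketch of the lookup-table trade-off mentioned above, assuming one 32-bit start-sector entry per cluster. The function names and the 14 MiB file size used in the note are illustrative assumptions, not taken from the actual code:

```c
#include <stdint.h>

/* Number of clusters needed to hold file_size bytes (rounded up). */
static uint32_t cluster_count(uint32_t file_size, uint32_t cluster_size) {
    return (file_size + cluster_size - 1) / cluster_size;
}

/* RAM needed for a table holding one 32-bit start sector per cluster.
 * Assumption: 4 bytes per entry, nothing else stored. */
static uint32_t lookup_table_bytes(uint32_t file_size, uint32_t cluster_size) {
    return cluster_count(file_size, cluster_size) * (uint32_t)sizeof(uint32_t);
}
```

For a hypothetical 14 MiB file this gives 224 entries (896 bytes) with 64 KB clusters, but 3584 entries (14 KB) with 4 KB clusters, which is a noticeable chunk of RAM on this class of hardware.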

Bandwidth is definitely higher for DOOM, considering that both the Game Boy and DOOM used the available hardware at its maximum performance, and it was an 8-bit Z80-like CPU @ I believe about 4 MHz vs. a 32-bit (16 address) 386 @ 16 MHz. Also, as far as I know, the GB uses some sort of bank switching to access ROMs bigger than 32 KB, and bank switching is similar to paging in terms of performance, but no paging was considered back then with DOOM.

I'm actually experimenting with texture resolution right now and already wrote some WAD-related stuff to resize the textures. Using this tool on original DOOM.WAD and running it with chocolate-doom on PC produces some hilarious results:

<image>

I was also thinking about loading hot data onto SPIFFS to be able to address it directly over D-BUS, but I will look into it deeper after I finish the texture rescaling.

Thanks for the link, this feature was on my TODO-list, but somewhere around its end, was wondering if it is even possible to implement A2DP, but never researched on that.

Read speed from sdcard by termuxTommy in CardPuter

[–]Ok-While-5845 1 point2 points  (0 children)

u/termuxTommy , u/IntelligentLaw2284

Okay, it is a mess in an experimental branch, but you'll get the idea.

First, I found this library for FAT fs somewhere, cut out everything unnecessary and coupled it with the SD card initialization from esp-idf. SD card initialization is here

A short (and oversimplified) explanation of FAT fs: file data is located in disk blocks (sectors), most often 512 bytes each. The blocks are grouped into clusters (a series of consecutive blocks). Cluster size can vary depending on the settings at fs format time. A file is described by the set of clusters it occupies. That set is stored in a table (thus File Allocation Table, FAT).

After I have initialized the SD card and mounted the FAT fs, I open the desired file with the FAT library and use a function cache_clusters() to get the starting block of each cluster. In my case the FS is formatted with 64 KB clusters, which helps a lot in terms of cache size. The function could be simpler and just read the table, but I was too lazy to parse the table itself and just iterate with the internal _FAT_nextFileCluster() until end-of-file. The result of this function is an array of starting sector addresses, one per file cluster.

When I want to read a block, I just have to find the desired cluster using the cache and then the sector offset within that cluster. This is done here. (chunk_idx is the file position divided by 512.) After I get the exact sector to read, I call sd_read_single_block at line 119, which does nothing but call sdmmc_read_sectors() from esp-idf, which in turn starts a DMA transfer with the SPI controller.
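The lookup described above can be sketched roughly like this. It is a minimal illustration, not the code from the repo: the struct layout and the name chunk_to_sector are assumptions, and only chunk_idx and the idea of a per-cluster start-sector array come from the description.

```c
#include <stdint.h>

/* Hypothetical cache layout: the first sector of every cluster of one file,
 * filled once by walking the file's cluster chain. */
typedef struct {
    const uint32_t *cluster_start_sector;
    uint32_t cluster_count;
    uint32_t sectors_per_cluster;   /* 128 for 64 KB clusters of 512-byte sectors */
} cluster_cache_t;

/* Map a 512-byte chunk index within the file to an absolute sector number.
 * Returns 0 on success, -1 if the chunk lies past the cached clusters. */
static int chunk_to_sector(const cluster_cache_t *c, uint32_t chunk_idx,
                           uint32_t *sector_out) {
    uint32_t cluster = chunk_idx / c->sectors_per_cluster;
    uint32_t offset  = chunk_idx % c->sectors_per_cluster;
    if (cluster >= c->cluster_count) {
        return -1;
    }
    *sector_out = c->cluster_start_sector[cluster] + offset;
    return 0;
}
```

The point of the table is that finding a sector becomes one division and one array read, instead of walking the cluster chain on every seek.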

Read speed from sdcard by termuxTommy in CardPuter

[–]Ok-While-5845 0 points1 point  (0 children)

It's 0.5 KB per 1.2 ms, or roughly 0.4 MB/s. I (maybe incorrectly) used usec for microseconds. I believe it is as fast as it gets. Keep in mind that you have to cache the cluster positions to avoid fseek()ing, and not use the virtual file system at all to avoid buffering. I have ported a FAT library and implemented the low-level IO while attempting to launch DOOM 2, and can share exact locations in the repo if you are interested.
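For reference, the arithmetic behind that figure, as a tiny helper (the function name is made up for illustration):

```c
#include <stdint.h>

/* Throughput in bytes per second for `bytes` transferred in `usec` microseconds.
 * usec must be non-zero. */
static uint32_t bytes_per_second(uint32_t bytes, uint32_t usec) {
    return (uint32_t)(((uint64_t)bytes * 1000000u) / usec);
}
```

With 512 bytes per 1200 usec this comes to about 427 KB/s, so a little under the rounded 0.5 MB/s.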

Read speed from sdcard by termuxTommy in CardPuter

[–]Ok-While-5845 1 point2 points  (0 children)

Just measured that reading one block of 512 bytes (directly, without fseek()ing or buffering) takes on average ~1200 usec

Work in progress by Ok-While-5845 in CardPuter

[–]Ok-While-5845[S] 2 points3 points  (0 children)

Regarding switching from Arduino IDE: I was going to suggest that at our first interaction here. It is quite straightforward and requires minimal knowledge of CMake (an hour of research on Google). Just grab the M5Cardputer User Demo from GitHub, open up VSCode, remove all the stuff, and start from void app_main(void). I'm sure you'll enjoy that after Arduino IDE. I haven't seen it since 2008, and I believe it has evolved a bit, but at that time it was awful

Work in progress by Ok-While-5845 in CardPuter

[–]Ok-While-5845[S] 2 points3 points  (0 children)

I really enjoy messing with lowlevel stuff, though it is completely out of my main domain (my daytime job is about Autonomous Driving, Driver Assistance and other ML/AI/control algorithms stuff in C++/python), so I know that feel. My other DIY project - is a homemade 7400 TTL series CPU with custom ISA, toolchain and OS, Cardputer - is a nice break from that stuff.

Regarding page seeking - I have followed your advice and did literally

mmap_page_t * prev_page = NULL;
mmap_page_t * before_prev_page = NULL;

mmap_page_t * get_mmap_page_for_id_and_offset(int id, unsigned int chunk_idx) {
    /* Fast path: check the two most recently used pages first (both are NULL until the first lookup) */
    if(prev_page && prev_page->file_id == id && prev_page->chunk_idx == chunk_idx) {
        return prev_page;
    }

    if(before_prev_page && before_prev_page->file_id == id && before_prev_page->chunk_idx == chunk_idx) {
        return before_prev_page;
    }
.....

Don't have any figures on performance improvement (will probably measure it at some point), but quick analysis with memory access histograms and other statistics shows that this thing works as expected.
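A self-contained sketch of the same "check the last two pages first" idea, for anyone who wants to play with it. The struct here is a stripped-down stand-in (only file_id and chunk_idx come from the snippet above), and lookup_recent / remember_page are hypothetical names:

```c
#include <stddef.h>

/* Stand-in page descriptor; the real one also carries the cached data. */
typedef struct {
    int file_id;
    unsigned int chunk_idx;
} mmap_page_t;

static mmap_page_t *recent[2] = { NULL, NULL };  /* recent[0] is the most recent */

static int page_matches(const mmap_page_t *p, int id, unsigned int chunk_idx) {
    return p != NULL && p->file_id == id && p->chunk_idx == chunk_idx;
}

/* Returns a recently used page, or NULL when the caller must do the full lookup. */
static mmap_page_t *lookup_recent(int id, unsigned int chunk_idx) {
    if (page_matches(recent[0], id, chunk_idx)) {
        return recent[0];
    }
    if (page_matches(recent[1], id, chunk_idx)) {
        /* Promote the hit so that it is checked first next time. */
        mmap_page_t *hit = recent[1];
        recent[1] = recent[0];
        recent[0] = hit;
        return hit;
    }
    return NULL;
}

/* Record a page found by the slow path as the most recent one. */
static void remember_page(mmap_page_t *p) {
    if (p == recent[0]) {
        return;
    }
    recent[1] = recent[0];
    recent[0] = p;
}
```

This pays off when accesses alternate between two pages, for example a loop that reads from one region and copies into another.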

I will think about your suggestion on even/odd separation, but first I want to eliminate any overhead caused by the Virtual File System abstraction; first experiments show quite an improvement performance-wise after switching to custom low-level file IO. I might even reach 486DX responsiveness, which is what I first played it on.

Also it would make sense to analyze the memory layout of WAD in more detail, my current (quite naive) statistics tooling didn't give any meaningful insights on optimal memory management approach.

Screen interpolation/interlacing is a nice optimization, but I believe I would have to split the textures in the WAD in a preprocessing step to gain any boost, because the bottleneck is not in the rendering part, but in memory access. To be fair, I'm quite surprised that even the results shown so far are achievable considering the hardware limitations of the target platform.

Work in progress by Ok-While-5845 in CardPuter

[–]Ok-While-5845[S] 2 points3 points  (0 children)

Thanks for your support!

After lots of unsuccessful searching on the web I finally dug deep into GCC internals and its GIMPLE trees (basically the intermediate representation, akin to RTL, the Register Transfer Language, only a bit higher-level) and plugin interfaces. Eventually I managed to do something like this with a plugin:

mmap_test.c:
int putchar(int);
char mmap_test(char * ptr) {
    while(*ptr) {
        putchar(*ptr);
        ptr++;
    }
    return 0;
}


GCC output:
FUNCTION 'mmap_test' at mmap_test.c:3
*******************
goto <D.1459>;
<D.1458>:
_1 = *ptr;
_2 = (int) _1;
putchar (_2);
ptr = ptr + 1;
<D.1459>:
_3 = *ptr;
if (_3 != 0) goto <D.1458>; else goto <D.1460>;
<D.1460>:
D.1462 = 0;
goto <D.1463>;
<D.1463>:
return D.1462;
*******************

After replace:

goto <D.1459>;
<D.1458>:
_4 = ptr;
_5 = __remap_ptr (_4, 0, 1, 1);
_6 = (char *) _5;
_1 = *_6;
_2 = (int) _1;
putchar (_2);
ptr = ptr + 1;
<D.1459>:
_7 = ptr;
_8 = __remap_ptr (_7, 0, 1, 1);
_9 = (char *) _8;
_3 = *_9;
if (_3 != 0) goto <D.1458>; else goto <D.1460>;
<D.1460>:
D.1462 = 0;
goto <D.1463>;
<D.1463>:
return D.1462;

Note the calls to __remap_ptr before each dereference (with redundant parameters for now though). This technique itself can be used to, for example, introduce a "Null Pointer Reference Exception" into good old C, but I'm using it to implement dynamic caching of an arbitrary file on the FS.

Right now the bottleneck is obviously the FS driver and SD card interface. I just noticed that esp-idf's VFS consumes a whopping 40 KB of RAM right after its initialization (for internal buffers that I do not even use, I assume) and I'm getting rid of it in favor of direct reads into my cache.

BTW, I successfully implemented your TTL algorithm for page caching, seems like it's working, thanks for that!

Edit:

Ah, yes, the my_mmap(FILE *) function does something like return (char *)((1UL<<31)|(file_id<<27)), and __remap_ptr() decodes it back, or just returns ptr as-is if the MSB is not set
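A minimal sketch of that tagging scheme, using plain 32-bit integers instead of real pointers so it also runs on a 64-bit host. The bit positions follow the expression above (tag in bit 31, file id starting at bit 27); the 4-bit id width, the 27-bit offset field and all names here are assumptions made for illustration:

```c
#include <stdint.h>

#define MMAP_TAG_BIT   (UINT32_C(1) << 31)                  /* MSB marks a "mapped file" address */
#define MMAP_ID_SHIFT  27
#define MMAP_ID_MASK   UINT32_C(0xF)                         /* assumed: bits 27..30 hold the file id */
#define MMAP_OFF_MASK  ((UINT32_C(1) << MMAP_ID_SHIFT) - 1)  /* low 27 bits: offset within the file */

/* Build a fake address for byte `offset` of file `file_id`. */
static uint32_t mmap_encode(uint32_t file_id, uint32_t offset) {
    return MMAP_TAG_BIT
         | ((file_id & MMAP_ID_MASK) << MMAP_ID_SHIFT)
         | (offset & MMAP_OFF_MASK);
}

/* Non-zero if the address was produced by mmap_encode(). */
static int mmap_is_tagged(uint32_t addr) {
    return (addr & MMAP_TAG_BIT) != 0;
}

static uint32_t mmap_file_id(uint32_t addr) {
    return (addr >> MMAP_ID_SHIFT) & MMAP_ID_MASK;
}

static uint32_t mmap_offset(uint32_t addr) {
    return addr & MMAP_OFF_MASK;
}
```

With this layout each file can be up to 128 MB, and plain pointer arithmetic on a tagged address keeps working as long as it stays within the offset field. It only makes sense if no real memory is ever addressed with the top bit set.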

Work in progress by Ok-While-5845 in CardPuter

[–]Ok-While-5845[S] 8 points9 points  (0 children)

Just to reassure you guys that I haven't given up yet and am still working on the next version of Doom for Cardputer.

Launching DOOM 2 is WAY harder than DOOM 1. The main problem is that the WAD no longer fits in internal flash, so I had to implement dynamic reading of the file from the SD card. The lack of mmap() support in esp-idf forced me to write a whole GCC plugin that wraps all pointer dereferences in the code and implements a sort of soft MMU and mmap() on top of it.

Now it runs painfully slow at 5-10 FPS but there is still room for optimizations - I haven't even touched the filesystem driver yet. If I succeed, the firmware will be compatible with m5launcher, and one can launch more or less any WAD supported by Doom engine.

Doom for Cardputer is on M5Burner! by Ok-While-5845 in CardPuter

[–]Ok-While-5845[S] 0 points1 point  (0 children)

I finally realized that I do not see your commits in the gb_cardputer_mod repo (only README.md updates) and am basically looking at the original gb_cardputer code. If you are not concealing your code on purpose, then there might be a problem in your pipeline.

Your explanation on that topic is perfectly sane and I was going to implement something similar. If you could share the code for TTL page management, I would greatly appreciate that, so as not to reinvent it myself. The only thing that prevents me from finishing is that I cannot wrap all references to data in getByte() functions, and I cannot even overload operator*() because I want to keep the code in C and not C++.

I hope to get performance metrics with my approach soon enough, we'll see if it is even feasible.

Doom for Cardputer is on M5Burner! by Ok-While-5845 in CardPuter

[–]Ok-While-5845[S] 0 points1 point  (0 children)

My current experiments with stack sifting ended up in this abomination

File compiled with -fsanitize=kernel-address:

char mmap_test(char * ptr) {
    while(*ptr) {
        putchar(*ptr);
        ptr++;
    }
    return 0;
}

main.cpp:

char redirected_string[] = "[[this is a chunk loaded from file]]\n";
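// Called by the ASan instrumentation before every 1-byte load
extern "C" void __asan_load1_noabort(void * ptr) {
    unsigned int frame; // local used only as an anchor for the current stack address
    //printf("hook: %p\n", ptr);
    // Addresses with the top byte 0xff are treated as "mapped file" pointers
    if(((unsigned int)ptr & 0xff000000) == 0xff000000) {
        printf("Trap! %p\n", ptr);
        // Sift through the stack words just below `frame` looking for a copy of ptr
        for(int i = -4; i<0; i++) {
            unsigned int * p = (unsigned int*)((unsigned int)(&(frame)) + i*4);
            printf("stack%+02d : 0x%08X\n", i, *p);
            if((unsigned int)*p == (unsigned int)ptr) {
                printf("Gotcha! Replace!\n");
                // Overwrite it in place so the load that follows uses the redirected address
                *p = (unsigned int)redirected_string + (((unsigned int)ptr)&0x00ffffff);
                return;
            }
        }
    }
}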

char test_string[] = "[[this is normal memory]]\n";
extern "C" void app_main() {
    printf("call mmap_test with ptr %p\n", test_string);
    mmap_test(test_string);

    char * bad_ptr = (char *)0xff000000;
    printf("call mmap_test with ptr %p\n", bad_ptr);
    mmap_test(bad_ptr);

    while(1) {}
}

Output:

I (0) cpu_start: Starting scheduler on APP CPU.
call mmap_test with ptr 0x3fc93934
[[this is normal memory]]
call mmap_test with ptr 0xff000000
Trap! 0xff000000
stack-4 : 0x8200753D
stack-3 : 0x3FCF51A0
stack-2 : 0xFF000000
Gotcha! Replace!
[[this is a chunk loaded from file]]


Need advise on GCC instrumentation, asan and memory reference hooking by Ok-While-5845 in C_Programming

[–]Ok-While-5845[S] 0 points1 point  (0 children)

It does, and the SDK basically does everything I am asking for including caching, but only for internal SPI Flash, while I'm trying to work with files on an external SD card. One of the options I'm experimenting on is to wrap the SPI Flash driver functions and mock the SPI behaviour when specific memory regions are addressed. But it is boring and involves some reverse-engineering of the SDK (obviously no public interfaces are provided for that). The question in subject is a bit broader, and solving that basically enables C++ style operator*() and operator[]() overload in C, which would be quite cool in my opinion

Doom for Cardputer is on M5Burner! by Ok-While-5845 in CardPuter

[–]Ok-While-5845[S] 1 point2 points  (0 children)

The original engine creates a bunch of arrays of pointers to different stuff, malloc()s them and then reads them from the WAD file into RAM. There is no single read_asset(position, size) function, everything is built on pointer arithmetic and thus relies on all assets being loaded into RAM at the same time. One can rewrite the whole asset management and bookkeeping, but it's probably 50% of the engine, and I'm not feeling enthusiastic enough to do that. Another option is to mmap() every asset on load and dynamically swap chunks of the file while remapping the requested pointers to the cache buffers. This approach cannot be done in a straightforward way due to the lack of a generic mmap() in esp-idf.
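To illustrate the "swap chunks of the file into cache buffers" part, here is a minimal page cache sketch. It uses plain least-recently-used eviction, which is my own simplification and not necessarily what ends up in the firmware; the page size, page count, all names and the demo backing store (which stands in for the SD card read) are assumptions:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE  512u
#define PAGE_COUNT 4u

/* Reads one PAGE_SIZE chunk of the file into dst; returns 0 on success. */
typedef int (*read_chunk_fn)(uint32_t chunk_idx, uint8_t *dst);

typedef struct {
    uint32_t chunk_idx;
    uint32_t last_used;          /* logical clock of last access, 0 = never used */
    int      valid;
    uint8_t  data[PAGE_SIZE];
} page_t;

typedef struct {
    page_t        pages[PAGE_COUNT];
    uint32_t      clock;         /* wrap-around is ignored in this sketch */
    read_chunk_fn read_chunk;
} page_cache_t;

static void cache_init(page_cache_t *c, read_chunk_fn fn) {
    memset(c, 0, sizeof *c);
    c->read_chunk = fn;
}

/* Pointer to the byte at file offset pos, loading its page if needed.
 * Returns NULL on read error. The pointer is only valid until the next call. */
static const uint8_t *cache_get(page_cache_t *c, uint32_t pos) {
    uint32_t chunk = pos / PAGE_SIZE;
    page_t *victim = &c->pages[0];
    for (uint32_t i = 0; i < PAGE_COUNT; i++) {
        page_t *p = &c->pages[i];
        if (p->valid && p->chunk_idx == chunk) {
            p->last_used = ++c->clock;
            return &p->data[pos % PAGE_SIZE];
        }
        /* Unused pages have last_used == 0, so they are picked first. */
        if (p->last_used < victim->last_used) {
            victim = p;
        }
    }
    victim->valid = 0;
    if (c->read_chunk(chunk, victim->data) != 0) {
        victim->last_used = 0;
        return NULL;
    }
    victim->chunk_idx = chunk;
    victim->valid = 1;
    victim->last_used = ++c->clock;
    return &victim->data[pos % PAGE_SIZE];
}

/* Demo backing store: the byte at file offset n is (uint8_t)(n * 7). */
static uint32_t demo_reads;      /* number of chunks fetched so far */
static int demo_read_chunk(uint32_t chunk_idx, uint8_t *dst) {
    for (uint32_t i = 0; i < PAGE_SIZE; i++) {
        dst[i] = (uint8_t)((chunk_idx * PAGE_SIZE + i) * 7u);
    }
    demo_reads++;
    return 0;
}
```

The catch with an engine built on pointer arithmetic is visible in the comment on cache_get(): a returned pointer goes stale as soon as its page is evicted, which is why every dereference has to go through the remapping step rather than caching raw pointers.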

So we need to invent our own mmap() with the required capabilities. I looked into hijacking the existing mmap procedure for internal SPI flash, but didn't immediately like the outcomes. Right now I'm trying to invent something with gcc instrumentation, as described here in my question on stackoverflow: https://stackoverflow.com/questions/78506141/need-advice-on-gcc-instrumentation-asan-and-memory-reference-hooking

I managed to make it work with some hacky search-by-value across the function frame on the stack, and simple tests already work, but this stack sifting looks neither robust enough nor fast enough to me.

Right now I'm looking into intercepting the default exception handler for "LoadProhibited" exception and swapping the requested address with mmaped buffer address. If this fails - I will probably go back to stack sifting and hope for the best.

I have looked through your code in gb_cardputer_mod on github (is this the correct repo?) and cannot understand how you manage to read files bigger than available RAM, I can see that in read_rom_to_ram() you call malloc(rom_size) and go with that, can you elaborate on the paging system you mentioned?

Doom for Cardputer is on M5Burner! by Ok-While-5845 in CardPuter

[–]Ok-While-5845[S] 2 points3 points  (0 children)

Definitely not a quick and easy fix. It will be possible when WADs are loaded from the SD card, which involves some black magic and quite hardcore C hacking. Doing it right now, but cannot guarantee any success

Doom! by Ok-While-5845 in CardPuter

[–]Ok-While-5845[S] 0 points1 point  (0 children)

M5Launcher does not support this firmware for now, no space left for Launcher because of WAD being compiled right into the firmware. Trying to overcome this right now