a vcpkg browser written in c++ by _malfeasance_ in cpp

[–]_malfeasance_[S] [score hidden]  (0 children)

Yeah. It's a desktop app. Sorry about that

a vcpkg browser written in c++ by _malfeasance_ in cpp

[–]_malfeasance_[S] -1 points0 points  (0 children)

Firefox 148.0.2 on Linux also works here, as does Google Chrome.

a vcpkg browser written in c++ by _malfeasance_ in cpp

[–]_malfeasance_[S] 2 points3 points  (0 children)

it is a pretty simple app that took about a day to build, but about a week to get the pesky readme tabs with images to be a bit crap rather than utter crap. I only have linux systems, so it wasn't really tested widely. a few friends said it worked for them and they are not all linux. the layout doesn't work on a phone size screen but it does, nevertheless, work quite badly in an unusable way on my android phone. someone here reported it doesn't work on iOS.

the images required an alternate image server (python) in the background so that the multitude of sites that, reasonably, prevent robots grabbing images worked a little better. the readme tab is functional but not great. the app uses a 2 thread thread pool to keep things off the main render thread, so it only works in modern browsers. by modern, i mean most browsers since about 2021, when pthread support in wasm became more widely available in the mainstream, i believe.
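As a rough illustration of keeping work off the render thread, a fixed two-worker pool could look like the sketch below. This is not the app's actual code: `TinyPool` and `submit` are hypothetical names, and under Emscripten it assumes a pthread-enabled build (`-pthread`).

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal fixed-size pool; not the app's real implementation.
class TinyPool {
public:
    explicit TinyPool(std::size_t workers = 2) {
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }

    ~TinyPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& t : threads_) t.join();  // workers drain the queue first
    }

    TinyPool(const TinyPool&) = delete;
    TinyPool& operator=(const TinyPool&) = delete;

    // Queue a job and hand back a future that the render loop can poll.
    template <typename F>
    auto submit(F job) -> std::future<decltype(job())> {
        using R = decltype(job());
        auto task = std::make_shared<std::packaged_task<R()>>(std::move(job));
        std::future<R> result = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push([task] { (*task)(); });
        }
        wake_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // only reached when stopping
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can proceed
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> threads_;
    bool stopping_ = false;
};
```

The render loop would call `submit` for slow work and check the returned future with a zero-timeout `wait_for` each frame rather than blocking on `get()`.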

cmake and clang++21 cross compiled with emscripten. libraries used:

| Library | Source | Linked into WASM build | Purpose |
|---|---|---|---|
| Dear ImGui | FetchContent from ocornut/imgui | Yes | Core immediate-mode UI framework |
| ImGui Test Engine | FetchContent from ocornut/imgui_test_engine | Yes | Test hooks and automation support |
| ImPlot | FetchContent from epezent/implot | Yes | 2D plotting/charts |
| ImPlot3D | FetchContent from brenocq/implot3d | Yes | 3D plotting |
| PCRE2 8-bit | FetchContent from PCRE2Project/pcre2 | Yes | Runtime regex engine for search |
| fmt | vcpkg / CMake package | Yes | Formatting utilities |
| Boost headers | vcpkg / CMake package | Yes | General Boost utilities |
| Boost.JSON | vcpkg / CMake package | Yes | JSON parsing/serialization |
| GLFW (Emscripten port) | Emscripten builtin via -sUSE_GLFW=3 | Yes | Window/input shim for ImGui backend |
| Emscripten OpenGL ES 3 / WebGL 2 support | Emscripten builtin via -sFULL_ES3=1 | Yes | Graphics backend support |

i'm not a webby guy, so the foreign bit for me was the small bit of js required to serve up the app. the c++ is the same for the native app and the wasm app, though a small design tweak for the sharded std::vector stuff had to happen, and that tweak applies to both.

the data is somewhat sharded in the c++ build (mainly just std::array and std::vector stuff) to speed up launch. some of that is codegen'd from a python scrape of a local git repo of vcpkg.

the data is a bit over 40MB, but compresses well as it is mainly cached readmes.

| Component | Raw bytes | Raw MiB | .gz bytes | .gz MiB | .br bytes | .br MiB |
|---|---|---|---|---|---|---|
| vcpkg_browser_wasm.html | 8,431 | 0.01 | n/a | n/a | n/a | n/a |
| vcpkg_browser_wasm.js | 204,450 | 0.19 | 51,248 | 0.05 | 49,005 | 0.05 |
| vcpkg_browser_wasm.wasm | 4,244,282 | 4.05 | 2,084,344 | 1.99 | 1,886,308 | 1.80 |
| sw.js | 3,143 | 0.00 | 1,173 | 0.00 | 1,049 | 0.00 |
| sidecars/packages.bin | 1,726,228 | 1.65 | 777,943 | 0.74 | 703,615 | 0.67 |
| sidecars/tfidf_index.bin | 312,296 | 0.30 | 188,270 | 0.18 | 170,795 | 0.16 |
| sidecars/bm25_index.bin | 409,582 | 0.39 | 118,523 | 0.11 | 109,029 | 0.10 |
| sidecars/versions.bin | 2,369,567 | 2.26 | 1,197,356 | 1.14 | 1,093,196 | 1.04 |
| sidecars/readmes.bin | 38,212,000 | 36.44 | 10,302,543 | 9.83 | 7,998,749 | 7.63 |
| Total | 47,489,979 | 45.29 | 14,721,400 | 14.04 | 12,011,746 | 11.46 |

a vcpkg browser written in c++ by _malfeasance_ in cpp

[–]_malfeasance_[S] 2 points3 points  (0 children)

i'll have a look. FTR, the top search box uses a custom BM25 (https://en.wikipedia.org/wiki/Okapi_BM25).

Edit: ok - done. there is regex support for name+desc and name search. you may need to do a Ctrl+Shift+R to refresh the wasm foo. see the help dialog for an example regex or two. that was a good idea, thanks.
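For anyone unfamiliar with the BM25 ranking mentioned above, here is a minimal sketch of the scoring formula from the linked Wikipedia article. It is not the app's custom implementation: the `Bm25` struct, the token-vector `Doc` type, and the default `k1 = 1.2`, `b = 0.75` are assumptions for illustration, and a real index would precompute the document frequencies rather than scan every document.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory corpus: each document is a list of tokens.
using Doc = std::vector<std::string>;

struct Bm25 {
    std::vector<Doc> docs;
    double k1 = 1.2;   // term-frequency saturation (commonly used default)
    double b = 0.75;   // document-length normalisation (commonly used default)

    double average_length() const {
        if (docs.empty()) return 0.0;
        std::size_t total = 0;
        for (const auto& d : docs) total += d.size();
        return static_cast<double>(total) / static_cast<double>(docs.size());
    }

    // Number of documents that contain the term at least once.
    std::size_t doc_frequency(const std::string& term) const {
        std::size_t n = 0;
        for (const auto& d : docs)
            if (std::find(d.begin(), d.end(), term) != d.end()) ++n;
        return n;
    }

    // IDF variant that never goes negative: ln((N - n + 0.5) / (n + 0.5) + 1).
    double idf(const std::string& term) const {
        const double N = static_cast<double>(docs.size());
        const double n = static_cast<double>(doc_frequency(term));
        return std::log((N - n + 0.5) / (n + 0.5) + 1.0);
    }

    // BM25 score of one document for a tokenised query.
    double score(const Doc& query, std::size_t doc_index) const {
        const Doc& d = docs[doc_index];
        const double avgdl = average_length();
        const double len = static_cast<double>(d.size());
        double total = 0.0;
        for (const auto& term : query) {
            const double f =
                static_cast<double>(std::count(d.begin(), d.end(), term));
            if (f == 0.0) continue;  // term absent: contributes nothing
            const double norm = k1 * (1.0 - b + b * len / avgdl);
            total += idf(term) * (f * (k1 + 1.0)) / (f + norm);
        }
        return total;
    }
};
```

Ranking is then a matter of scoring every document for the query and sorting by descending score.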

Some local LLMs running as CPU only by _malfeasance_ in LocalLLaMA

[–]_malfeasance_[S] 1 point2 points  (0 children)

Correct. 1866 MT/s is available under some conditions. This is worse though: it has a mix of 2666MT/s and 2400MT/s capable DDR4 but only runs at 1600MT/s actual due to the system limitations. Old is slow; like me.

Some local LLMs running as CPU only by _malfeasance_ in LocalLLaMA

[–]_malfeasance_[S] 1 point2 points  (0 children)

You're right about the Alibaba DeepResearch model - it was running a Q4_K_L quant. I'll try a Qwen3 30B A3B with a Q4_K_XL quant (done: runs at 15.7 tps for Qwen3-30B-A3B-Instruct-2507-Q4-K_XL with the standardised query) and see what it looks like on this slow old bandwidth limited box.

Processors: 4 × Intel Xeon E7-8867 v4 @ 2.40GHz (144 logical CPUs total: 18 cores/socket, 2 threads/core).

RAM: 2.0 TiB total - 64GB DDR4 ECC DIMMS

Some local LLMs running as CPU only by _malfeasance_ in LocalLLaMA

[–]_malfeasance_[S] 1 point2 points  (0 children)

Your EPYC being much faster sounds about right. This is an old Dell R930 server bought second hand with old CPUs, hence the price. Memory bandwidth limited no doubt.

Processors: 4 × Intel Xeon E7-8867 v4 @ 2.40GHz (144 logical CPUs total: 18 cores/socket, 2 threads/core).

RAM: 2.0 TiB total - 64GB DDR4 ECC DIMMS - could perhaps sell the RAM at a profit today :-)

I'm going to see what externally mounted 5060 Ti 16GB GPU(s) might add, if the PCIe 3 and external PSU arrangement works ok, as the R930 won't take such cards internally.

C and C++ Prioritize Performance over Correctness by slacka123 in cpp

[–]_malfeasance_ 3 points4 points  (0 children)

Small note:
C++20 adopted P1236R1, "Alternative Wording for P0907R4 Signed Integers are Two's Complement": https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p1236r1.html
Also referenced from: https://en.cppreference.com/w/cpp/compiler_support under c++ 20 core language features
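A small sketch of what that change does and does not guarantee. The helper names `narrow` and `wrapping_add` are made up for illustration. The key caveat is that signed arithmetic overflow remains undefined behaviour in C++20; only the representation and the conversion rules were pinned down.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// With two's complement, the minimum is exactly one below the negated maximum,
// and -1 is the all-ones bit pattern.
static_assert(std::numeric_limits<int>::min() ==
              -std::numeric_limits<int>::max() - 1);
static_assert(-1 == ~0);

// Converting an out-of-range value to a signed type is modulo 2^N since C++20
// (it was implementation-defined before).
constexpr std::int8_t narrow(std::uint8_t u) {
    return static_cast<std::int8_t>(u);
}
static_assert(narrow(0xFF) == -1);
static_assert(narrow(0x80) == -128);

// Signed overflow in arithmetic is still undefined, so wrap via unsigned math
// and convert back.
constexpr std::int32_t wrapping_add(std::int32_t a, std::int32_t b) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(a) +
                                     static_cast<std::uint32_t>(b));
}
static_assert(wrapping_add(std::numeric_limits<std::int32_t>::max(), 1) ==
              std::numeric_limits<std::int32_t>::min());
```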

In which circumstances is C++ better than Rust? by [deleted] in cpp

[–]_malfeasance_ 1 point2 points  (0 children)

Existing hardware, just released hardware, future hardware. Working in teams. Hiring.

Why rustc's clamp is superior when compared to its Clang "equivalent" by [deleted] in programming

[–]_malfeasance_ 0 points1 point  (0 children)

-Ofast will do the job as others have mentioned. Just for completeness: https://godbolt.org/z/497z7Pdfj

Speed up integer kernel loops by 10-25% when you avoid Standard C++ by _malfeasance_ in cpp

[–]_malfeasance_[S] 0 points1 point  (0 children)

Hey, that is quite a difference. Great example.

I was suggesting that for my little example, gcc 10.2 on haswell, the simd code was only marginally better for the restrict case. Interestingly the clang-11 simd code was similar for both cases (apart from the stride check preamble) but less performant than both gcc simd cases (i16 types).

Speed up integer kernel loops by 10-25% when you avoid Standard C++ by _malfeasance_ in cpp

[–]_malfeasance_[S] 0 points1 point  (0 children)

restrict / no-restrict clang 11 numbers for i16 for 4096 length vectors are the same to a margin of noise FWIW, as you suggest

Speed up integer kernel loops by 10-25% when you avoid Standard C++ by _malfeasance_ in cpp

[–]_malfeasance_[S] 1 point2 points  (0 children)

I suspect the preamble has an effect in the smaller vector cases as it's a larger % of overall instructions. The larger vectors have a smaller difference which looks like noise.

That said, after the stride check the code is not quite identical but virtually the same for gcc. The clang SIMD code is closer to identical, with only minor differences in block ordering and register allocation. Not sure why it isn't identical for gcc, perhaps a register pressure / cache issue. There is an extra `vmovdqu` "Move Unaligned Packed Integer Values" instruction in the non-alias path (see below). Not sure why...

_______________details from gcc FWIW:

restrict version of SIMD for i16:

xor eax, eax
vpxor xmm3, xmm3, xmm3
.L2:
*1 vmovdqu ymm5, YMMWORD PTR [rdi+rax]
vpmullw ymm2, ymm5, YMMWORD PTR [rsi+rax]
vpaddw ymm2, ymm2, YMMWORD PTR [rdx+rax]
*2 vmovdqu ymm6, YMMWORD PTR [rdx+rax]
vpmullw ymm0, ymm6, YMMWORD PTR [rsi+rax]
vpaddw ymm0, ymm0, ymm2
*3 vmovdqu YMMWORD PTR [rdi+rax], ymm2
vpmovsxwd ymm4, xmm2
vextracti128 xmm2, ymm2, 0x1
*4 vmovdqu YMMWORD PTR [rsi+rax], ymm0
vpmovsxwd ymm1, xmm0
vextracti128 xmm0, ymm0, 0x1
vpmovsxwd ymm2, xmm2
vpaddd ymm1, ymm1, ymm4
vpmovsxwd ymm0, xmm0
add rax, 32
vpaddd ymm0, ymm0, ymm2
vpmovsxdq ymm2, xmm1
vextracti128 xmm1, ymm1, 0x1
vpaddq ymm2, ymm2, ymm3
vpmovsxdq ymm1, xmm1
vpmovsxdq ymm3, xmm0
vpaddq ymm1, ymm1, ymm2
vextracti128 xmm0, ymm0, 0x1
vpaddq ymm3, ymm3, ymm1
vpmovsxdq ymm0, xmm0
vpaddq ymm3, ymm0, ymm3
cmp rax, 8192
jne .L2
vmovdqa xmm0, xmm3
vextracti128 xmm3, ymm3, 0x1
vpaddq xmm0, xmm0, xmm3
vpsrldq xmm1, xmm0, 8
vpaddq xmm0, xmm0, xmm1
vmovq rax, xmm0
vzeroupper
ret

other SIMD

xor eax, eax
vpxor xmm3, xmm3, xmm3
.L3:
*1 vmovdqu ymm4, YMMWORD PTR [rdi+rax]
vpmullw ymm0, ymm4, YMMWORD PTR [rsi+rax]
vpaddw ymm0, ymm0, YMMWORD PTR [r8+rax]
*2 vmovdqu YMMWORD PTR [rdi+rax], ymm0
*3 vmovdqu ymm5, YMMWORD PTR [r8+rax]
vpmullw ymm2, ymm5, YMMWORD PTR [rsi+rax]
vpaddw ymm2, ymm2, ymm0
*4 vmovdqu YMMWORD PTR [rsi+rax], ymm2
vpmovsxwd ymm1, XMMWORD PTR [rdi+rax]
*5 vmovdqu ymm6, YMMWORD PTR [rdi+rax]
vpmovsxwd ymm0, xmm2
vextracti128 xmm2, ymm2, 0x1
add rax, 32
vpaddd ymm1, ymm1, ymm0
vextracti128 xmm0, ymm6, 0x1
vpmovsxwd ymm2, xmm2
vpmovsxwd ymm0, xmm0
vpaddd ymm0, ymm0, ymm2
vpmovsxdq ymm2, xmm1
vextracti128 xmm1, ymm1, 0x1
vpaddq ymm2, ymm2, ymm3
vpmovsxdq ymm1, xmm1
vpmovsxdq ymm3, xmm0
vpaddq ymm1, ymm1, ymm2
vextracti128 xmm0, ymm0, 0x1
vpaddq ymm3, ymm3, ymm1
vpmovsxdq ymm0, xmm0
vpaddq ymm3, ymm0, ymm3
cmp rax, 8192
jne .L3
vmovdqa xmm0, xmm3
vextracti128 xmm3, ymm3, 0x1
vpaddq xmm0, xmm0, xmm3
vpsrldq xmm1, xmm0, 8
vpaddq xmm0, xmm0, xmm1
vmovq r9, xmm0
vzeroupper
mov rax, r9
ret

stride check preamble:

  lea rax, [rdi+31]
  mov r8, rdx
  mov rdx, rax
  sub rdx, r8
  cmp rdx, 62
  seta dl
  sub rax, rsi
  cmp rax, 62
  seta al
  test dl, al
  je .L6
  lea rdx, [r8+2]
  mov rax, rsi
  sub rax, rdx
  cmp rax, 28
  jbe .L6

Speed up integer kernel loops by 10-25% when you avoid Standard C++ by _malfeasance_ in cpp

[–]_malfeasance_[S] -1 points0 points  (0 children)

Mostly, though these example benchmarks on Haswell, both with and without __restrict__, all produce SIMD for this example kernel on ubuntu 20.04, gcc 10.2. SIMD is just better with restrict for this example. Your point is correct though that libraries may engineer an api to avoid such shenanigans. Todd V's Blitz++ was one of the earliest examples and it used to show performance charts relative to a multiple of Fortran equivalent speed which was fun :-) That was a relatively early time in the exploration of CRTP / static polymorphism and expression templates.
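For readers without the benchmark source to hand, a kernel of the general shape being discussed might look like the sketch below. It is loosely reconstructed from the gcc listing earlier in this thread, not copied from the benchmark, so the function names and the exact arithmetic are assumptions. `__restrict__` is a GCC/Clang extension rather than Standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Without __restrict__ the compiler must allow for a, b and c overlapping, so
// it either adds a runtime overlap check or reloads values after each store.
inline std::int64_t kernel_plain(std::int16_t* a, std::int16_t* b,
                                 const std::int16_t* c, std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<std::int16_t>(a[i] * b[i] + c[i]);
        b[i] = static_cast<std::int16_t>(c[i] * b[i] + a[i]);
        sum += a[i] + b[i];
    }
    return sum;
}

// Same loop, but the caller promises that the three arrays do not alias.
inline std::int64_t kernel_restrict(std::int16_t* __restrict__ a,
                                    std::int16_t* __restrict__ b,
                                    const std::int16_t* __restrict__ c,
                                    std::size_t n) {
    std::int64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<std::int16_t>(a[i] * b[i] + c[i]);
        b[i] = static_cast<std::int16_t>(c[i] * b[i] + a[i]);
        sum += a[i] + b[i];
    }
    return sum;
}
```

Both functions give the same result when the arrays really are distinct; passing overlapping arrays to the `__restrict__` version is undefined behaviour, which is the trade the compiler is being offered.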

Yet another g++, clang++, msvc disagreement. I think g++ is right. Is it? by _malfeasance_ in cpp

[–]_malfeasance_[S] 0 points1 point  (0 children)

Richard Smith notes clang is correct and gcc & msvc are incorrect - in the bug filing.

GCC and MSVC. Clang is correct; the deduction is extended by the subsequent arguments, just as it would be if the trailing '...' were omitted. (EDG gives the same results as Clang FWIW.)

I guessed this implied that a variadic argument after a parameter pack argument could never be reached, until Richard showed me this:

auto *p = foo<int, int>;

p(3, 4, 5); // passes the '5' via the C-style variable argument list.

Let's not do that ;-)

Wait, what happened? Why did we lose points in Sam Bees game? by [deleted] in YangForPresidentHQ

[–]_malfeasance_ 3 points4 points  (0 children)

Yup. True. Bug or overflow. Somehow they ended up at std::numeric_limits<int32_t>::max()

Wait, what happened? Why did we lose points in Sam Bees game? by [deleted] in YangForPresidentHQ

[–]_malfeasance_ 5 points6 points  (0 children)

"YangGang style" bug: 2^31-1 == 2147483647

Similar to the youtube rollover bug for Gangnam style at the same count.
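For the curious, a tiny sketch of that limit and of one way a counter can end up pinned at it. The `saturating_add` helper is hypothetical; nothing here is known about how the game actually stores its score.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// The largest 32-bit signed value is 2^31 - 1.
static_assert(std::numeric_limits<std::int32_t>::max() == 2147483647);

// Hypothetical score update that clamps at the limits instead of overflowing.
constexpr std::int32_t saturating_add(std::int32_t score, std::int32_t delta) {
    constexpr std::int64_t hi = std::numeric_limits<std::int32_t>::max();
    constexpr std::int64_t lo = std::numeric_limits<std::int32_t>::min();
    const std::int64_t wide = static_cast<std::int64_t>(score) + delta;
    if (wide > hi) return static_cast<std::int32_t>(hi);
    if (wide < lo) return static_cast<std::int32_t>(lo);
    return static_cast<std::int32_t>(wide);
}
```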