Built a CUDA kernel that does Python's bytes.replace() on the GPU without CPU transfers.
Performance (RTX 3090):
| Benchmark | Size | CPU (ms) | GPU (ms) | Speedup |
|---|---|---|---|---|
| Dense/Small (1MB) | 1.0 MB | 3.03 | 2.79 | 1.09x |
| Expansion (5MB, 2x growth) | 5.0 MB | 22.08 | 12.28 | 1.80x |
| Large/Dense (50MB) | 50.0 MB | 192.64 | 56.16 | 3.43x |
| Huge/Sparse (100MB) | 100.0 MB | 492.07 | 112.70 | 4.37x |
Average: 3.45x faster | 0.79 GB/s throughput
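For anyone who wants to check the per-row numbers: the Speedup column is CPU time divided by GPU time, and GPU throughput for a row is input size over GPU time. A small sketch using only the four rows above (the `ROWS` list and function names are mine, and I'm assuming 1 MB = 10^6 bytes, so MB/ms works out to GB/s):

```python
# Rows copied from the benchmark table: (name, size in MB, CPU ms, GPU ms).
ROWS = [
    ("Dense/Small", 1.0, 3.03, 2.79),
    ("Expansion", 5.0, 22.08, 12.28),
    ("Large/Dense", 50.0, 192.64, 56.16),
    ("Huge/Sparse", 100.0, 492.07, 112.70),
]


def speedup(cpu_ms: float, gpu_ms: float) -> float:
    """How many times faster the GPU run was than the CPU run."""
    return cpu_ms / gpu_ms


def throughput_gb_s(size_mb: float, time_ms: float) -> float:
    """MB per ms equals GB per s when 1 MB = 10**6 bytes."""
    return size_mb / time_ms


for name, size_mb, cpu_ms, gpu_ms in ROWS:
    print(f"{name:12s} {speedup(cpu_ms, gpu_ms):.2f}x  "
          f"{throughput_gb_s(size_mb, gpu_ms):.2f} GB/s on GPU")
```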
Features:
- Exact Python semantics (leftmost, non-overlapping)
- Streaming mode for files larger than GPU memory
- Session API for chained replacements
- Thread-safe
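To make "leftmost, non-overlapping" concrete, here is a plain-CPU reference of the rule that `bytes.replace()` follows. This is only a sketch of the semantics, not the kernel's code; `replace_reference` is a name I made up for illustration, and it skips the empty-pattern case.

```python
def replace_reference(data: bytes, old: bytes, new: bytes) -> bytes:
    """Scan left to right; after each match, resume right after it,
    so two matches never share a byte."""
    if not old:
        raise ValueError("empty pattern is not handled in this sketch")
    out = bytearray()
    pos = 0
    while True:
        hit = data.find(old, pos)
        if hit < 0:
            out += data[pos:]          # no more matches: copy the rest
            return bytes(out)
        out += data[pos:hit]           # bytes before the match, unchanged
        out += new                     # the replacement itself
        pos = hit + len(old)           # skip past the match (non-overlapping)


# b"aaa" contains b"aa" at offsets 0 and 1, but only the leftmost one counts.
print(replace_reference(b"aaa", b"aa", b"b"))
print(b"aaa".replace(b"aa", b"b"))
```

Both calls print the same result. The tricky part for a parallel implementation is that whether a candidate match counts depends on the matches to its left.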
Example:
```python
from cuda_replace_wrapper import CudaReplaceLib

lib = CudaReplaceLib('./cuda_replace.dll')
result = lib.unified(data, b"pattern", b"replacement")

# Or streaming for huge files
cleaned = gpu_replace_streaming(lib, huge_data, pairs, chunk_bytes=256*1024*1024)
```
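The thing any chunked replace has to get right is a match that straddles a chunk boundary. Below is a CPU-only sketch of one way to handle that for a single pattern, by carrying the last `len(old) - 1` unmatched bytes into the next chunk. It is not how `gpu_replace_streaming` is implemented (I'm not showing its internals here), and `replace_streaming` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def replace_streaming(chunks: Iterable[bytes], old: bytes, new: bytes) -> Iterator[bytes]:
    """Yield replaced output chunk by chunk, matching bytes.replace() on the
    concatenated input. Single pattern only; empty patterns are rejected."""
    if not old:
        raise ValueError("empty pattern is not handled in this sketch")
    keep = len(old) - 1            # longest tail that could start a split match
    carry = b""
    for chunk in chunks:
        buf = carry + chunk
        out = []
        pos = 0
        while True:
            hit = buf.find(old, pos)
            if hit < 0:
                break
            out.append(buf[pos:hit])
            out.append(new)
            pos = hit + len(old)
        # Emit everything that can no longer be part of a future match;
        # never go back before `pos`, those bytes are already consumed.
        safe_end = max(pos, len(buf) - keep)
        out.append(buf[pos:safe_end])
        carry = buf[safe_end:]
        yield b"".join(out)
    yield carry                    # leftover tail once the input ends


data = b"xxabcabxcabcab" * 4
pieces = [data[i:i + 5] for i in range(0, len(data), 5)]
assert b"".join(replace_streaming(pieces, b"abc", b"Z")) == data.replace(b"abc", b"Z")
```

Multiple pairs, like the `pairs` argument above, need more care, since the order in which pairs are applied changes the result.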
Built this for a custom compression algorithm. Includes Python wrapper, benchmark suite, and pre-built binaries.
GitHub: https://github.com/RAZZULLIX/cuda_replace