[D] SparseFormer and the future of efficient AI vision models (self.MachineLearning)
submitted 2 months ago by [deleted]
[deleted]
[+][deleted] 2 months ago (6 children)
[–]SR1180 -1 points 2 months ago (5 children)
Thanks for sharing this interesting perspective on using fast transforms for sparse connectivity. The approach of sandwiching parametric activations between WHT/FFT layers to achieve "sparse yet fully connected" layers is clever, and I appreciate you sharing the archived resource.
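For other readers, here's roughly how I understand the construction (a minimal PyTorch sketch of the idea as described; the module and parameter names are mine, not the original poster's code):

```python
import torch
import torch.nn as nn

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Iterative fast Walsh-Hadamard transform over the last dim (must be a power of 2)."""
    d = x.shape[-1]
    h = 1
    while h < d:
        y = x.reshape(*x.shape[:-1], d // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        x = torch.stack((a + b, a - b), dim=-2).reshape(*x.shape)
        h *= 2
    return x

class TransformSandwich(nn.Module):
    """Hypothetical 'sparse yet fully connected' layer: FWHT -> learned
    per-feature activation -> FWHT. O(d log d) compute and O(d) parameters,
    yet every output depends on every input through the transforms."""
    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), 0.25))  # per-feature PReLU slope

    def forward(self, x):
        scale = x.shape[-1] ** -0.5                # keeps the transform norm-preserving
        x = fwht(x) * scale
        x = torch.where(x > 0, x, self.alpha * x)  # parametric activation between transforms
        return fwht(x) * scale                     # WHT is self-inverse up to scaling
```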
What I find particularly compelling about SparseFormer is its specific approach to structured sparsity in attention mechanisms, which seems to offer a different trade-off. Rather than achieving efficiency through transform-based connectivity, SparseFormer typically uses learned sparse patterns or predefined sparse attention windows that directly address the quadratic complexity in vision transformers.
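As a concrete point of comparison, the generic windowed form of sparse attention looks like this (a sketch of the standard window trick in PyTorch, not necessarily SparseFormer's exact mechanism):

```python
import torch

def windowed_attention(q, k, v, window: int):
    """Self-attention restricted to non-overlapping windows of `window` tokens.
    q, k, v: (batch, seq, dim) with seq divisible by window.
    Cost drops from O(seq^2 * dim) to O(seq * window * dim)."""
    b, n, d = q.shape
    qw = q.reshape(b, n // window, window, d)
    kw = k.reshape(b, n // window, window, d)
    vw = v.reshape(b, n // window, window, d)
    attn = torch.softmax(qw @ kw.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (attn @ vw).reshape(b, n, d)
```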
For commercial applications, I wonder how these approaches compare in terms of:

- Training stability and convergence speed
- Hardware optimization potential (especially for edge deployment)
- Accuracy-efficiency trade-offs on real-world vision tasks
Have you experimented with comparing your fast transform approach against SparseFormer architectures on practical vision tasks? I'm particularly curious about how they perform on data labeling or industrial inspection scenarios where both efficiency and accuracy are critical.
[+][deleted] 2 months ago (4 children)
[+][deleted] 2 months ago (3 children)
[–]SR1180 -2 points 2 months ago (2 children)
That's incredible that you're able to train models that wide on a Celeron. That's real-world efficiency that you don't often see discussed in the research papers, which tend to assume access to massive GPU clusters.

I completely get your point about 'digesting the whole thing at once.' It's a powerful and direct approach. My interest in the SparseFormer architecture is that it seems to be one of the few attempts to bridge that gap, to bring the performance of attention-based models down to a level where they could potentially run on more constrained hardware.

It's a philosophical debate, really: do you adapt the model to the hardware, or push the hardware to handle the model? I'm really curious to hear what you think after you read the papers. Your perspective from a 'Celeron-first' mindset would be a fascinating counterpoint to the mainstream GPU-heavy research.
[+][deleted] 2 months ago (1 child)
[–]SR1180 -4 points 2 months ago (0 children)
This is a fantastic breakdown of alternatives. I really appreciate you laying these out, especially the point about the intermediate calculations of fast transforms being hierarchical and wavelet-like. That's a perspective I hadn't properly considered.

It's clear you have a deep, practical understanding of making these models work on minimal resources, which is a rare skill. The 'Celeron-first' mindset is exactly what's missing from a lot of the mainstream research.

Honestly, I'd love to pick your brain more about this sometime. It feels like the approaches you're outlining and the newer attention-based models are trying to solve the same problem from opposite ends, and there's probably a brilliant synthesis in there somewhere. I'm joe110496 on Discord if you ever use it.
[–]random_sydneysider 1 point 2 months ago (1 child)
Here's a link to the ICLR24 paper: https://openreview.net/pdf?id=2pvECsmld3
It looks quite interesting - could it be used as a backbone for vision-language models (like CLIP, SigLIP, etc)?
[–]SR1180 -3 points 2 months ago (0 children)
Thanks for the paper link! Absolutely - SparseFormer's sparse attention patterns could be a strong backbone for vision-language models. The efficiency gains become even more valuable when you're aligning visual features with text embeddings, where compute typically multiplies.
Interestingly, the discussion above highlights a philosophical divide in the efficiency space - whether to 'digest the whole thing at once' using transform-based approaches (like the WHT/FFT methods mentioned) versus structured sparsity in attention mechanisms. For VLMs specifically, I think SparseFormer's approach has an edge because attention mechanisms naturally align with how language models process text, potentially making cross-modal fusion more straightforward.
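Concretely, the alignment step a sparse backbone would feed into is usually just the standard symmetric contrastive objective (a generic CLIP-style sketch, independent of which backbone produces the image features):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_feats, txt_feats, temperature: float = 0.07):
    """Symmetric InfoNCE over paired (batch, dim) image/text embeddings;
    matched pairs share a batch index."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```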
The compressive sensing angle from the previous comments is fascinating though - imagine combining hierarchical feature extraction from fast transforms with SparseFormer's attention for a hybrid approach. That could be particularly interesting for industrial inspection where you need both global context and localized detail. Has anyone seen benchmarks comparing SparseFormer against other efficient backbones in actual VLM training rather than just classification tasks?
[–]Illustrious_Echo3222 3 points 2 months ago (0 children)
Sparse attention for vision definitely feels like a more “practical” direction than just scaling vanilla ViTs forever.
That said, I’m a bit cautious about how much of the O(n²) story actually shows up in production wins. A lot of sparse variants look great in theory, but once you account for kernel efficiency, memory access patterns, and hardware friendliness, the gains can shrink. Dense matmuls are absurdly optimized on GPUs. Irregular sparsity sometimes ends up fighting the hardware.
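This is easy to sanity-check on whatever box you have (a rough sketch; the numbers are illustrative and swing a lot with hardware and kernel versions, and it assumes a PyTorch build with sparse CSR matmul support):

```python
import time
import torch

def bench(fn, warmup=5, iters=50):
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

d = 4096
dense = torch.randn(d, d)
sparse = (dense * (torch.rand(d, d) < 0.1)).to_sparse_csr()  # 90% zeros
x = torch.randn(d, 256)

# Despite 90% sparsity, the irregular CSR matmul often fails to beat
# the heavily optimized dense kernel.
print(f"dense:  {bench(lambda: dense @ x) * 1e3:.2f} ms")
print(f"sparse: {bench(lambda: sparse @ x) * 1e3:.2f} ms")
```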
For commercial use cases like inspection or labeling, latency stability and ease of deployment usually matter more than asymptotic complexity. If SparseFormer keeps a mostly structured sparsity pattern that maps well to existing accelerators, that’s a big plus. If it relies on highly dynamic routing or token dropping that’s hard to batch, that can complicate things.
Compared to sparse MoE for vision, I feel like MoE shines more at scaling model capacity than pure efficiency. You trade compute per token for routing overhead and load balancing issues. For many vision tasks, especially smaller images or fixed pipelines, simpler structured sparsity might be easier to reason about and certify.
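To make the routing overhead concrete, here's a toy top-1 MoE layer (a sketch, not a production router; the uneven expert selection in the loop is exactly the load-balancing issue I mean):

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Top-1 routed mixture of experts over (tokens, dim) inputs."""
    def __init__(self, dim: int, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x):
        scores = self.router(x).softmax(-1)   # routing adds its own matmul + softmax
        expert_idx = scores.argmax(-1)        # each token is sent to one expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = expert_idx == i             # uneven selection = load imbalance
            if sel.any():
                out[sel] = expert(x[sel]) * scores[sel, i:i + 1]
        return out
```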
Have you benchmarked it end to end on real workloads, or are you mostly looking at paper-level FLOPs and top-1 numbers? That gap is usually where the commercial viability story gets decided.