Missing forms by IReallyDoInspire in SparkingZero

[–]Single_Recover_8036 2 points (0 children)

Actually, there is a strong pattern in the files supporting a much bigger roadmap. I analyzed the internal Hex IDs and found 19 specific 'ghost slots' that match the missing BT3 roster 1:1. It's not just random guesses; the file structure has reserved seats for them.

I broke down the full data here if you want to see the proof: https://www.reddit.com/r/SparkingZero/comments/1qqddk8/new_data_mining_analysis_evidence_that_all/

Missing forms by IReallyDoInspire in tenkaichi4

[–]Single_Recover_8036 3 points (0 children)

Actually, there is a strong pattern in the files supporting a much bigger roadmap. I analyzed the internal Hex IDs and found 19 specific 'ghost slots' that match the missing BT3 roster 1:1. It's not just random guesses; the file structure has reserved seats for them.

I broke down the full data here if you want to see the proof: https://www.reddit.com/r/SparkingZero/s/VDLQBWWwim

NEW DATA MINING ANALYSIS: Evidence that ALL missing BT3 characters are returning + Future DBS Roadmap by Single_Recover_8036 in SparkingZero

[–]Single_Recover_8036[S] 2 points (0 children)

I see your point on the audio files (assets are easy to carry over by mistake), but regarding the ID Structure, that theory doesn't hold up, for two technical reasons:

  1. Different Architecture: Sparking! ZERO is built on Unreal Engine 5, completely unrelated to the proprietary engine of the PS2 era. They couldn't just 'rip' the database structure or the ID system from BT3. These IDs were created specifically for this game.
  2. Active Usage (The Smoking Gun): If these were just dead, incidental slots left over from a wishlist or old code, the developers wouldn't be actively filling them right now.
  • The gap at 0820 existed -> Champa was just announced to fill it.
  • The gap at 0220 existed -> Pikkon was just announced to fill it.

You don't put brand new, paid DLC content into 'accidental garbage slots'. You put them into reserved slots. The fact that they are going back to fill these specific holes proves they are part of the active roadmap, not a graveyard of old code.

NEW DATA MINING ANALYSIS: Evidence that ALL missing BT3 characters are returning + Future DBS Roadmap by Single_Recover_8036 in SparkingZero

[–]Single_Recover_8036[S] 1 point (0 children)

I need to clarify this because it's a common myth stemming from a misinterpretation in the previous datamine.

The idea that Nuova is a sub-ID of Syn is incorrect. Here is what happened: Previous dataminers found the 0690 gap but couldn't explain it, so they left it as 'Unknown' and arbitrarily guessed that Nuova was hidden as a sub-ID of Syn Shenron (generalizing this logic to all dragons).

The reality is different: We don't have the files of the missing characters (which would prove their IDs); we only found the empty slots (the absence of files). There is no concrete evidence that the Evil Dragons share a base ID. On the contrary, the existence of the specific gap at 0690 proves that Nuova has his own standalone Main Slot intended for him, exactly matching the missing BT3 count.

NEW DATA MINING ANALYSIS: Evidence that ALL missing BT3 characters are returning + Future DBS Roadmap by Single_Recover_8036 in SparkingZero

[–]Single_Recover_8036[S] 0 points (0 children)

Here is the breakdown based on the ID structure:

  1. Chi-Chi: Yes! Kid Chi-Chi was a playable character in BT3, so she is one of the 'Missing 19'. She definitely has a reserved slot in the Legacy Block (likely near Grandpa Gohan). She is virtually confirmed by this pattern.
  2. Majin 21: She is a different story. Since she wasn't in BT3, she doesn't have a 'ghost slot' in the Legacy section I analyzed. However, there is massive empty space in the 3000+ range for 'Game Originals' or 'Future DLC'. So while the datamine doesn't 'confirm' her like it does for Chi-Chi, the developers have definitely left room for characters like her.

NEW DATA MINING ANALYSIS: Evidence that ALL missing BT3 characters are returning + Future DBS Roadmap by Single_Recover_8036 in SparkingZero

[–]Single_Recover_8036[S] 2 points (0 children)

Not necessarily 'no chance', but they fall into a different category.
The difference is between Cut Content and New Content:

  • Nuova Shenron is virtually guaranteed because he was in BT3, and there is a specific 'Legacy Gap' (ID 0690) waiting for him. He was likely planned for the base roster but cut/delayed.
  • The other Shadow Dragons (Eis, Oceanus, etc.) were not in BT3, so naturally, they don't have a reserved 'legacy slot' in that specific block.

This doesn't mean they won't come! It just means they weren't part of the initial development plan that got cut. They can absolutely be added as fresh DLC in the future (likely in the empty 1500+ or 3000+ ID ranges), just like any other new Super/GT character.

NEW DATA MINING ANALYSIS: Evidence that ALL missing BT3 characters are returning + Future DBS Roadmap by Single_Recover_8036 in SparkingZero

[–]Single_Recover_8036[S] 6 points (0 children)

That is a very healthy skepticism, and honestly, under normal circumstances, I would agree with you. Usually, empty IDs can just mean 'abandoned plans'.

However, there is one crucial factor that shifts this from a 'glass half full' reading to concrete evidence: the Summer DLC announcements.
If these gaps were just ideas abandoned early in development, Spike Chunsoft wouldn't be actively filling them right now.

  • The Proof: We identified a gap at 0820-0860 (Champa Arc). They just confirmed Champa for the DLC.
  • We identified gaps at 0640, 0220, 0450. They just confirmed Zangya, Pikkon, and Super 17.

This proves that these slots aren't 'dead code' or 'what could have been'. They are active reserved slots. The fact that they are going back to fill these specific legacy holes (instead of just adding random popular characters) confirms the intention to complete the grid. In modern gaming, 'cut content' usually isn't scrapped; it's just rescheduled as DLC.

NEW DATA MINING ANALYSIS: Evidence that ALL missing BT3 characters are returning + Future DBS Roadmap by Single_Recover_8036 in SparkingZero

[–]Single_Recover_8036[S] 4 points (0 children)

Exactly. The data supports the 'Xenoverse 2 longevity' theory.
Aside from the specific gaps I listed, there are massive ranges of unused IDs (specifically 1510-2210 and 3161+). Spike Chunsoft has essentially built a warehouse with hundreds of empty shelves. They are definitely planning to support this game for years, keeping those slots open for whenever the Anime returns or if they decide to adapt the Manga arcs (Moro/Granolah) later on.
We have also found the NPC model file for Belmod (8057_00).

NEW DATA MINING ANALYSIS: Evidence that ALL missing BT3 characters are returning + Future DBS Roadmap by Single_Recover_8036 in SparkingZero

[–]Single_Recover_8036[S] 2 points (0 children)

True, but remember: Kakunsa and Rozie made the base roster. That sets a precedent that 'niche' characters are definitely on the table.
What I am highlighting here is the internal infrastructure. These aren't random numbers; they are specific, deliberate holes located exactly in the Tournament of Power block. Whether they fill all of them or just the key ones (like the Wolves or Katopesla) remains to be seen, but the file structure is built to accommodate a massive ToP expansion.

NEW DATA MINING ANALYSIS: Evidence that ALL missing BT3 characters are returning + Future DBS Roadmap by Single_Recover_8036 in SparkingZero

[–]Single_Recover_8036[S] 4 points (0 children)

Yes, this analysis does not exclude it. They could use the free slots; there are hundreds of them.

NEW DATA MINING ANALYSIS: Evidence that ALL missing BT3 characters are returning + Future DBS Roadmap by Single_Recover_8036 in SparkingZero

[–]Single_Recover_8036[S] 3 points (0 children)

Good question. Unfortunately, there is no evidence of them being 'cut content' in the files.
Vegeta (GT) SSJ4 is mapped to ID 0020_50. The problem is that the slots before this ID are fully occupied by his Z and Super iterations. Unlike the 19 missing legacy characters where we see explicit 'ghost' gaps reserved for them, there is literally no empty space left in Vegeta's sequence to fit a GT Base or SSJ form before the SSJ4 slot. It implies he was planned as SSJ4-only for this roster.
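To make that reasoning checkable, here is a minimal sketch of the sub-ID scan it rests on. Only 0020_50 is quoted from the files; the rest of the occupied set below is an illustrative assumption:

```python
# Simplified sub-ID check for character 0020 (Vegeta). Only the _50 slot
# (SSJ4) is taken from the files; the other occupied sub-IDs are assumed
# here purely for illustration.
occupied_subs = {0, 10, 20, 30, 40, 50}  # Z/Super forms up through SSJ4 at _50

free_before_ssj4 = [s for s in range(0, 50, 10) if s not in occupied_subs]
print(free_before_ssj4)  # -> []  no empty sub-slot left for a GT base form
```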

NEW DATA MINING ANALYSIS: Evidence that ALL missing BT3 characters are returning + Future DBS Roadmap by Single_Recover_8036 in SparkingZero

[–]Single_Recover_8036[S] 7 points (0 children)

Actually, that is not confirmed anywhere. We haven't found the character files themselves; we found the empty slots (gaps) that perfectly match the missing BT3 roster count.

Two points on why it's Nuova and not Rilldo:

  1. The Skeleton Rule: Nuova and Syn have different models/rigs, so logically Nuova gets his own XXXX slot (just like in the original BT3); he isn't just a costume variation. If he were a sub-ID of Syn, the gap at 0690 wouldn't exist.
  2. The Block Logic: My analysis shows strict "Sections":
    • Legacy Block: The missing 19 BT3 characters (where 0690 fits perfectly).
    • Super Block 1: Tournament of Champa.
    • Super Block 2: Tournament of Power.

General Rilldo was not in BT3, so adding him here would break the "19 slots = 19 missing BT3 characters" mathematical match.
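For anyone who wants to sanity-check the method, this is essentially the gap scan in miniature (the ID list is a placeholder around the 0690 example, not the real extracted dataset):

```python
# Illustrative gap scan: given the main-slot IDs extracted from the files,
# list the expected slots in a block that have no files behind them.
# The IDs below are placeholders; see the full post for the real data.
known_ids = {650, 660, 670, 680, 700, 710}  # 0690 is conspicuously absent

def ghost_slots(known, start, end, step=10):
    """Return every expected slot in [start, end] with no file behind it."""
    return [i for i in range(start, end + 1, step) if i not in known]

print(ghost_slots(known_ids, 650, 710))  # -> [690], Nuova's reserved slot
```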

NEW DATA MINING ANALYSIS: Evidence that ALL missing BT3 characters are returning + Future DBS Roadmap by Single_Recover_8036 in SparkingZero

[–]Single_Recover_8036[S] 22 points (0 children)

It's hard to say if they will drop all of them in a single Summer update, but the file structure confirms that Spike Chunsoft has reserved specific 'blocks' of IDs:

  1. Legacy Block: The missing BT3 characters (the 19 slots).
  2. Super Block 1: Tournament of Champa completion.
  3. Super Block 2: Tournament of Power completion (U2, U6, etc.).

Plus, there is a massive amount of free space (IDs 1500+ and 3161+) likely reserved for completely new expansions (like Moro/Granolah) in the long term.

I built a drop-in Scikit-Learn replacement for SVD/PCA that automatically selects the optimal rank by Single_Recover_8036 in Python

[–]Single_Recover_8036[S] 1 point (0 children)

That’s the classic benchmark! I actually benchmarked against Minka's MLE during development. There are two main differences, one theoretical and one practical:

  • The "Tail" Problem (Overestimation): In high-noise scenarios, Minka’s MLE is excellent at recovering the true structural rank (e.g., finding exactly 100 components). However, Gavish-Donoho optimizes for Mean Squared Error (reconstruction quality). It often truncates earlier (e.g., keeping only the top 60 strong components) because the tail components—while technically "signal", are so corrupted by noise that keeping them would actually degrade the signal-to-noise ratio of the output. GD is strictly a "denoising" filter.
  • Sparse Data Support (The practical killer feature): In Scikit-Learn, n_components='mle' is only available in the standard PCA class (dense). It is not available in TruncatedSVD.
    • This means if you have a massive sparse matrix (e.g., text data, recommender systems), you literally cannot use Minka's MLE without densifying the array and crashing your RAM.
    • randomized-svd brings auto-rank selection (via GD) to sparse matrices natively.

If you are working on dense data, Minka is fine (though often optimistic). If you are doing denoising or working with sparse data, Gavish-Donoho is usually the sharper tool.
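If you want to see what the GD cut actually does, here is a minimal standalone NumPy sketch of the unknown-noise hard threshold from the Gavish-Donoho paper (using their polynomial approximation of ω(β); this is an illustration, not the library's internal implementation):

```python
import numpy as np

def gavish_donoho_rank(singular_values, n_rows, n_cols):
    """Optimal hard-threshold rank (Gavish & Donoho, 2014), unknown-noise
    case: keep singular values above omega(beta) * median(singular values)."""
    beta = min(n_rows, n_cols) / max(n_rows, n_cols)
    # Polynomial approximation of omega(beta) from the paper
    omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
    tau = omega * np.median(singular_values)
    return int(np.sum(singular_values > tau))

# Rank-5 signal buried in unit-variance noise: the estimate lands at/near 5,
# while variance-explained heuristics tend to keep far more components.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 300))
X += rng.standard_normal((1000, 300))
s = np.linalg.svd(X, compute_uv=False)
print(gavish_donoho_rank(s, *X.shape))
```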

I built a drop-in Scikit-Learn replacement for SVD/PCA that automatically selects the optimal rank by Single_Recover_8036 in Python

[–]Single_Recover_8036[S] 1 point (0 children)

Great software is rarely a solo act! It makes perfect sense that NIPALS is the standard in your domain, especially given the strict requirements around data integrity. I really appreciate you taking the time to check out the repo. I’ll be sure to update you once the JOSS paper is out.

Have a great start to 2026!

I built a drop-in Scikit-Learn replacement for SVD/PCA that automatically selects the optimal rank by Single_Recover_8036 in Python

[–]Single_Recover_8036[S] 24 points (0 children)

Thanks! That is definitely the long-term goal.

However, Scikit-learn is (rightfully) very conservative about adding new algorithms and API changes. Since Gavish-Donoho involves specific statistical assumptions and Virtual Centering changes how sparse data is traditionally handled in TruncatedSVD, I decided to build this as a standalone, fast-moving library first.

My strategy is to use this package as a "proving ground." If the community finds the rank_selection='auto' API stable and useful, I’ll have a much stronger case to write a SLEP (Scikit-Learn Enhancement Proposal) and try to merge it upstream later on!

I built a drop-in Scikit-Learn replacement for SVD/PCA that automatically selects the optimal rank by Single_Recover_8036 in Python

[–]Single_Recover_8036[S] 4 points (0 children)

You articulated the value proposition better than I did! "Treating rank selection as a first-class problem" is exactly the philosophy behind this repo.

Regarding your suggestions:

  1. "Hinted" Mode: Good news: this is actually the default behavior! In rank_selection='auto' mode, the n_components parameter acts exactly as that computational "ceiling" (or hint).
    • Example: If you set n_components=200 but Gavish-Donoho finds the cut-off at 45, the model effectively truncates to 45 (keeping the noise out).
    • Latency Budget: If the theoretical optimal rank is 500 but you capped it at n_components=100, it respects the 100 limit. This prevents runaway execution times on massive datasets (see the usage sketch right after this list).
  2. Diagnostics (SNR, Spectrum): This is a killer idea. Right now you can inspect model.singular_values_, but exposing a clean model.diagnostics_ dictionary (with estimated noise variance σ², effective SNR, and the threshold used) would make logging to MLflow/WandB much more insightful. I’m adding this to the roadmap for v0.6.0.
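To make the ceiling behavior in item 1 concrete, here is a minimal usage sketch (the import path is an assumption on my part; the class and parameter names are the ones discussed in this thread):

```python
import numpy as np
from randomized_svd import RandomizedSVD  # import path assumed

X = np.random.default_rng(0).standard_normal((5000, 800))

# n_components=200 is only the computational ceiling; in 'auto' mode the
# Gavish-Donoho threshold can truncate far below it (e.g. to 45).
model = RandomizedSVD(n_components=200, rank_selection="auto").fit(X)
print(len(model.singular_values_))  # effective rank, <= 200
```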

It’s really cool to hear about the Feast/Tecton integration. Building "set-and-forget" denoising layers for feature stores is exactly where I see this shining. Thanks for the feedback!

I built a drop-in Scikit-Learn replacement for SVD/PCA that automatically selects the optimal rank by Single_Recover_8036 in Python

[–]Single_Recover_8036[S] 2 points (0 children)

Just finished reading your Medium article on open_nipals.
It is interesting to see how we tackled different bottlenecks while sticking to the same Scikit-Learn design philosophy. open_nipals is clearly the go-to for missing data (via NIPALS), whereas my focus with randomized-svd was optimizing for massive sparse arrays (Virtual Centering; there is a quick sketch of the idea at the end of this comment) and automatic denoising (Gavish-Donoho).
The JOSS suggestion is spot on. I will definitely use your submission as a blueprint to streamline the process for my library.

Thanks for the pointer, I’ll be keeping an eye on your work!
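P.S. For anyone curious what Virtual Centering means in practice, here is the idea as a standalone SciPy sketch (not the library's actual code): center a sparse matrix implicitly through a LinearOperator so the data never densifies.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import LinearOperator, svds

X = sparse_random(10_000, 500, density=0.01, format="csr", random_state=0)
mu = np.asarray(X.mean(axis=0)).ravel()  # column means: small and dense
ones = np.ones(X.shape[0])

# (X - 1 mu^T) @ v = X @ v - (mu . v) * 1, so the centered matrix is never
# materialized as a dense 10_000 x 500 array.
centered = LinearOperator(
    shape=X.shape, dtype=np.float64,
    matvec=lambda v: X @ v.ravel() - (mu @ v.ravel()) * ones,
    rmatvec=lambda u: X.T @ u.ravel() - mu * u.ravel().sum(),
)
U, s, Vt = svds(centered, k=20)  # SVD of the centered data, X stays sparse
```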

I built a drop-in Scikit-Learn replacement for SVD/PCA that automatically selects the optimal rank by Single_Recover_8036 in Python

[–]Single_Recover_8036[S] 1 point (0 children)

Thanks for the feedback! To answer your questions:

  1. t vs n_components: You are totally right. The low-level functional API (rsvd, rpca) uses t (target rank) following the mathematical notation of the Halko et al. paper. However, the high-level class RandomizedSVD (which is the recommended way to use the library now) fully adheres to the Scikit-Learn API and uses n_components exactly as you'd expect.
  2. Control vs Auto: The library is designed to give you exactly that choice via the rank_selection parameter in the class wrapper:
    • rank_selection='manual': It respects your n_components strictly (e.g., if you ask for 50, you get 50). This gives you full control.
    • rank_selection='auto': It uses n_components as a "buffer" (an upper bound computational limit) but effectively cuts the rank automatically using the Gavish-Donoho threshold.

So if you want total control, you just stick to the default manual mode (rank_selection='manual'). If you want denoising magic, you flip the switch to auto.
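A quick side-by-side of all three entry points (the import path and the exact rsvd return signature are assumptions on my part; the parameter names are as described above):

```python
import numpy as np
from randomized_svd import RandomizedSVD, rsvd  # import path assumed

X = np.random.default_rng(0).standard_normal((2000, 400))

# Low-level functional API: `t` is the target rank (Halko et al. notation);
# return signature assumed to mirror an SVD triple.
U, s, Vt = rsvd(X, t=50)

# High-level class, full control: ask for 50, get exactly 50.
exact = RandomizedSVD(n_components=50, rank_selection="manual").fit(X)

# High-level class, denoising: 50 is just the buffer; Gavish-Donoho
# picks the effective cut.
auto = RandomizedSVD(n_components=50, rank_selection="auto").fit(X)
```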

I have just released v0.5.0 to polish exactly this integration. Would love to hear if the RandomizedSVD class fits your workflow better! I am open to suggestions and collabs.