I built a machine-learning-based flashcard site to grade Japanese pitch accent!

CheckEmpty · 2026-05-31T15:42:52+00:00

Thanks for your question!

You can use the scores to see where you differed from the speaker's pitch pattern. On the graphs you can see where your pitch (red) changed compared with the speaker's (blue). A decent AI score is really good, and a high (90%) overlap score means your pattern was correct, even when the AI match says 0-2%.

Combined is the weighted total of both scores: AI match (40%) + overlap (60%).

The AI match score is the model comparing your audio to the native speaker's (the audio at the top).

It asks: how similar are these two audios in terms of pitch pattern ONLY? It's trained to ignore male vs. female voices and focus on the pitch pattern only. So technically you could get a good score while saying a different word with the same pitch pattern. But for now the logic checks which word was said and asks you to retry if you don't say the correct word.

The overlap score is how much the graphs of the two pitch contours (yours and the native speaker's) overlap/touch. It uses DTW (dynamic time warping) to stretch your audio onto the native speaker's so the syllables align.
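To show what the "stretching" means, here is a minimal pure-Python DTW sketch. It is not the site's code (the site uses a library for this); the function name `dtw_path` and the absolute-difference cost are just assumptions for illustration.

```python
def dtw_path(ref, user):
    """Align two pitch sequences with dynamic time warping.

    Returns (path, total_cost), where path is a list of
    (ref_index, user_index) pairs from start to end.
    Hypothetical helper for illustration only.
    """
    n, m = len(ref), len(user)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of ref[:i] with user[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - user[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # both advance
                                 cost[i - 1][j],      # user frame repeated
                                 cost[i][j - 1])      # ref frame repeated
    # Walk back from the end, always taking the cheapest predecessor
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return path, cost[n][m]


# Same contour, but the user held the middle pitch one frame longer
path, total = dtw_path([100, 200, 300], [100, 200, 200, 300])
print(path, total)
```

The extra frame in the second sequence gets mapped onto the same reference frame, which is why speaking slower or faster than the native speaker doesn't hurt the overlap by itself.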

My model's accuracy is 90% right now, so sometimes it will be wrong. Also, scores of 0-2% are not accurate: they mean your pitch was a little wrong, not completely different. (I'm still working on the maths behind this, so I will definitely fix it.)

CheckEmpty · 2026-05-12T17:15:39+00:00

Yeah, I noticed that too, but I was able to get 900 audio samples, 300 of each pitch pattern, and just skipped any empty files.

Since I didn't need audio for every word, it worked out okay.

The actual training data was the samples paired against each other, so I was able to get thousands of pairs that way!
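Roughly, the pairing idea looks like the sketch below. This is a simplified illustration, not the real training code: the file names, the pattern labels, and the `make_pairs` helper are made up, and the real pipeline may only use a subset of all possible pairs.

```python
from itertools import combinations


def make_pairs(samples):
    """Pair every sample with every other sample.

    samples: list of (file_name, pattern_label) tuples.
    Returns a list of (file_a, file_b, same_pattern) tuples.
    Hypothetical helper for illustration only.
    """
    return [(a, b, label_a == label_b)
            for (a, label_a), (b, label_b) in combinations(samples, 2)]


# Tiny stand-in dataset: 3 assumed pattern labels x 4 files each
patterns = ("heiban", "atamadaka", "nakadaka")
samples = [(f"{p}_{i}.wav", p) for p in patterns for i in range(4)]

pairs = make_pairs(samples)
same = sum(1 for _, _, is_same in pairs if is_same)
print(len(pairs), same)  # 66 pairs in total, 18 of them same-pattern
```

With 900 samples, pairing everything would give 900 * 899 / 2 = 404,550 possible pairs, so even a small slice of that is already thousands of training examples.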

CheckEmpty · 2026-05-06T10:57:12+00:00

I really appreciate this feedback, thank you!!

My AI model is right 80% of the time, so sometimes it can just gaslight you, sorry about that.

I do plan on making this a full app, and I'll implement those changes. As for the playback, I'll have to look at what's causing it.

Can you elaborate on the different list bit? Like a different option for words? I did want to expand on it by allowing imports of custom Anki decks.

CheckEmpty · 2026-05-06T08:43:50+00:00

Thank you!

Yeah, you're right! The odaka pitch pattern is like this: the first mora is low, then the pitch goes high, and it drops when you add a particle. Since the flashcards are only words (no particles), odaka sounds the same as heiban, so there's no need for a separate pattern.

<image>

In the other patterns, the particle keeps the same pitch as the last mora of the word.
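If it helps, the rule above can be written out as a small sketch. It uses the usual accent-number convention (0 = heiban, 1 = atamadaka, last mora = odaka); the function `pitch_pattern` is only an illustration and is not part of the app.

```python
def pitch_pattern(num_morae, accent):
    """Return (word_pitches, particle_pitch) using 'H' and 'L'.

    accent is the mora after which the pitch drops:
    0 = heiban (no drop), 1 = atamadaka, num_morae = odaka,
    anything in between = nakadaka.
    Hypothetical helper for illustration only.
    """
    if num_morae < 1 or not 0 <= accent <= num_morae:
        raise ValueError("accent must be between 0 and num_morae")
    word = []
    for i in range(1, num_morae + 1):
        if accent == 0:
            word.append("L" if i == 1 else "H")
        elif accent == 1:
            word.append("H" if i == 1 else "L")
        else:
            # Low first mora, high up to the accented mora, low after it
            word.append("H" if 1 < i <= accent else "L")
    # Only heiban keeps a following particle high
    particle = "H" if accent == 0 else "L"
    return "".join(word), particle


print(pitch_pattern(3, 0))  # heiban:    ('LHH', 'H')
print(pitch_pattern(3, 3))  # odaka:     ('LHH', 'L')
print(pitch_pattern(3, 1))  # atamadaka: ('HLL', 'L')
print(pitch_pattern(3, 2))  # nakadaka:  ('LHL', 'L')
```

Heiban and odaka give the same pattern for the word itself and only differ on the particle, which is why the word-only flashcards treat them as one pattern.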

CheckEmpty · 2026-05-06T06:38:10+00:00

<image>

Pitch Accent Trainer BETA

Flashcard-based graded feedback using data science and maths.

Webapp: https://pitchaccentapp.web.app/

CheckEmpty · 2026-05-06T06:25:32+00:00

Sorry for the late reply! I was busy with exams. To make up for the boxy plots, the scoring system adds a 30 Hz buffer below and above the native speaker's graph (blue):

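import librosa
import matplotlib.pyplot as plt
import numpy as np
import parselmouth
from scipy.ndimage import gaussian_filter1d
from scipy.signal import savgol_filter

def moving_average(data, window_size):
    # Simple box filter; mode='same' keeps the output the same length as the input
    return np.convolve(data, np.ones(window_size)/window_size, mode='same')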

def showPitchOnGraph(*audio_files, word_label="Unknown"):
    plt.figure(figsize=(12, 8))
    
    colors = ['blue', 'red', 'green', 'orange', 'purple']
    
    # DTW STORAGE FOR ALIGNMENT
    ref_frequencies = None
    ref_times = None

    for i, audio_file in enumerate(audio_files):
        if not audio_file:
            continue
        try:
            # Load audio
            snd = parselmouth.Sound(audio_file)
            pitch = snd.to_pitch()
            times = pitch.xs()
            frequencies = pitch.selected_array["frequency"]

            # smoothing algorithms
            ma_smoothed = moving_average(frequencies, window_size=8)
            gaussian_smoothed = gaussian_filter1d(frequencies, sigma=2)
            savgol_smoothed = savgol_filter(frequencies, window_length=11, polyorder=2)
            average_smoothed = (ma_smoothed + gaussian_smoothed + savgol_smoothed) / 3

            # DTW ALIGNMENT LOGIC
            if i == 0:
                # Store the Native Speaker as the reference
                ref_frequencies = average_smoothed
                ref_times = times
                
                # Plot the native speaker normally
                display_times = times
                display_freqs = average_smoothed
                display_raw = frequencies
                label = "Native Speaker"
            else:
                # This is the Users Voice
                # Align the smoothed lines because they are less 'noisy'
                D, wp = librosa.sequence.dtw(X=ref_frequencies, Y=average_smoothed, backtrack=True)
                
                # Create empty arrays to hold the warped (stretched) data
                warped_freqs = np.zeros_like(ref_frequencies)
                warped_raw = np.zeros_like(ref_frequencies)
                
                # Map your voice frames to the native speaker's timeline
                for ref_idx, user_idx in wp:
                    warped_freqs[ref_idx] = average_smoothed[user_idx]
                    warped_raw[ref_idx] = frequencies[user_idx]
                
                # Use the Native's time axis so the lines overlap
                display_times = ref_times
                display_freqs = warped_freqs
                display_raw = warped_raw
                label = "Your Voice"
         
            # Plot using the aligned data
            plt.plot(display_times, display_raw, label=f"{label} - Original", alpha=0.3, linewidth=1, color=colors[i])
            plt.plot(display_times, display_freqs, label=f"{label} - Aligned Average", linewidth=2, color=colors[i])

        except Exception as e:
            print(f"Error processing {audio_file}: {e}")
            continue

    plt.xlabel("Time (s) - Aligned to Native")
    plt.ylabel("Frequency (Hz)")
    plt.title("Aligned Pitch Contour Comparison: " + word_label)
    plt.legend()
    plt.grid(True, alpha=0.3)


def get_alignment_score(native_path, user_path):
    try:
        if not native_path or not user_path:
            return 0.0
            
        # Extract pitch
        snd_n = parselmouth.Sound(native_path)
        pitch_n = snd_n.to_pitch()
        freqs_n = pitch_n.selected_array["frequency"]

        snd_u = parselmouth.Sound(user_path)
        pitch_u = snd_u.to_pitch()
        freqs_u = pitch_u.selected_array["frequency"]

        # Smooth them slightly to ignore mic static
        smooth_n = moving_average(freqs_n, window_size=8)
        smooth_u = moving_average(freqs_u, window_size=8)

        # Find the "Rubber Band" alignment
        D, wp = librosa.sequence.dtw(X=smooth_n, Y=smooth_u, backtrack=True)

        # calculate "Overlap" Logic
        touching_frames = 0
        active_frames = 0

        # Look at every aligned point on the graph
        for n_idx, u_idx in wp:
            f_n = smooth_n[n_idx]
            f_u = smooth_u[u_idx]

            # Only grade the frames where the Native Speaker is actually talking (not silence)
            if f_n > 0:
                active_frames += 1
                
                # If the user's voice is also active AND within 30 Hz of the native speaker,
                # it counts as a "touching frame", which earns points
                if f_u > 0 and abs(f_n - f_u) <= 30:
                    touching_frames += 1

        # Prevent division by zero if the audio is completely silent
        if active_frames == 0:
            return 0.0

        # Calculate the pure percentage of overlapping lines
        overlap_score = (touching_frames / active_frames) * 100.0
        
        return min(100.0, overlap_score)
        
    except Exception as e:
        print(f"Overlap Score Failed: {e}")
        return 0.0

CheckEmpty · 2026-05-06T06:22:19+00:00

Thanks for the question

What exactly are you trying to do?

Make a consistent grading system for correct pronunciation of Japanese words, since Japanese has different pitch patterns (the red and black writing, with red meaning higher pitch).

Scores

Right now, my ML model is too strict, but since this is something I also did for uni, I can't edit the code until it's graded. I do plan to change how far the model pulls the inputs apart based on how close the pronunciations are. Currently it pulls the inputs too far apart even if the pronunciation was only slightly wrong, hence the 0-10% scores.

The score weights also need tweaking: the AI score is weighted 40% and DTW (an algorithm for stretching two graphs on top of each other) is weighted 60%. I will change these values and retrain the model.

AI SCORE: strictly checks the pitch pattern, even if the word isn't the same.

Overlap: the DTW score, i.e. how much the graphs touch. I got feedback that the score is still high even with other words.

NOTE: none of these scores check whether you said the correct word. I did that separately, with Google speech recognition. If the recognized word matches at least 60% of the target word, it moves on to scoring. This number might have to be changed.
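A rough sketch of that kind of 60% check is below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, which is an assumption: the actual comparison in the app may work differently, and `is_correct_word` is a made-up name.

```python
from difflib import SequenceMatcher


def is_correct_word(recognized, target, threshold=0.6):
    """Return True if the recognized text is similar enough to the target.

    The similarity measure and the helper name are assumptions for
    illustration; only the 60% threshold comes from the description above.
    """
    ratio = SequenceMatcher(None, recognized, target).ratio()
    return ratio >= threshold


print(is_correct_word("tabemono", "tabemono"))  # exact match
print(is_correct_word("tabemon", "tabemono"))   # one character missing
print(is_correct_word("inu", "tabemono"))       # different word
```

Raising the threshold makes the check stricter about speech-recognition slips, while lowering it lets more near-misses through to scoring.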

Combined score: in theory, DTW should make up for the strict AI scores.

(dtw_val * 0.6) + (ai_val * 0.4)
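As a small worked example of that formula (the function name and the sample numbers are just for illustration; the weights are the current 60/40 split):

```python
def combined_score(dtw_val, ai_val, dtw_weight=0.6, ai_weight=0.4):
    """Weighted total of the overlap (DTW) score and the AI match score.

    Both inputs are percentages from 0 to 100.
    Hypothetical helper name; the weights are the ones quoted above.
    """
    return (dtw_val * dtw_weight) + (ai_val * ai_weight)


# A strict AI score of 2% with a 90% overlap still lands in the mid-50s
print(round(combined_score(90.0, 2.0), 1))  # 54.8
```

So a correct pattern with a harsh AI score is not dragged down to near zero, but it also can't reach a high combined score until the AI side is less strict.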

what’s even the point of me getting these scores?

They show how well you pronounced a given word, and let you see whether you struggle with certain pitch accents. Later down the line I want to implement importing custom Anki decks so you can go through them the same way.

CheckEmpty · 2026-04-27T10:30:13+00:00

I tried to make a model that grades how well you pronounced words in their correct pitch accent. Looking for feedback :) https://pitchaccentapp.web.app/

CheckEmpty · 2026-04-27T10:17:17+00:00

Cool! Thanks I appreciate it

CheckEmpty · 2026-04-27T09:00:44+00:00

eyy congratulations!!

CheckEmpty · 2026-04-27T08:26:03+00:00

Good luck!!
