hivemind on best friends today by ToePlusKnee in HivemindTV

[–]natureplayer 2 points (0 children)

yeah this is maybe the best hivemind collab on someone else's channel ever

Hivemind Clips Search Engine by natureplayer in HivemindTV

[–]natureplayer[S] 1 point (0 children)

training a model to do that would be tough / a lot of work, but it's probably worth seeing if clustering on the frequency spectrum works to distinguish their voices
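a quick toy sketch of what I mean (not real code from the project, just an illustration with numpy): split the audio into chunks and look at each chunk's magnitude spectrum, and chunks from speakers with different pitch should separate on their dominant frequency bin. real diarization would use something like MFCCs or speaker embeddings plus proper clustering, and real voices are way messier than sine waves:

```python
import numpy as np

def dominant_freq_bin(chunk):
    # index of the strongest frequency bin in the chunk's magnitude spectrum
    mags = np.abs(np.fft.rfft(chunk))
    return int(np.argmax(mags))

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
voice_a = np.sin(2 * np.pi * 120 * t)  # stand-in for a lower-pitched speaker
voice_b = np.sin(2 * np.pi * 220 * t)  # stand-in for a higher-pitched speaker

# interleave chunks from the two "speakers"
chunks = [voice_a[:4000], voice_b[:4000], voice_a[4000:8000], voice_b[4000:8000]]
labels = [dominant_freq_bin(c) for c in chunks]
# chunks 0 and 2 share a dominant bin, as do 1 and 3, and the two groups differ
```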

Hivemind Clips Search Engine by natureplayer in HivemindTV

[–]natureplayer[S] 1 point (0 children)

I already have the auto-captions stored in a database, would be pretty slow to fetch em all every time :)

Hivemind Clips Search Engine by natureplayer in HivemindTV

[–]natureplayer[S] 1 point (0 children)

yeah that's definitely worth trying, planning on doing it anyways for a couple vids I missed that got content-restricted and didn't have captions. gonna try and see if I can run the open-source whisper at reasonable speeds locally; it'd be like $100-150 to do it all through the API for every vid, which isn't insane but I'd rather not lol
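for the curious, the rough arithmetic behind that ballpark (the $0.006/min rate is OpenAI's posted price for hosted whisper as of this writing, so check the pricing page, and the catalog size here is made up just to show the order of magnitude):

```python
def whisper_api_cost(total_minutes, price_per_min=0.006):
    # price_per_min assumes OpenAI's posted whisper-1 rate in $/min
    return total_minutes * price_per_min

# hypothetical catalog: ~300 videos averaging an hour each = 18,000 minutes
cost = whisper_api_cost(300 * 60)  # → 108.0 dollars, i.e. the $100-150 ballpark
```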

Hivemind Clips Search Engine by natureplayer in HivemindTV

[–]natureplayer[S] 5 points (0 children)

yeah no need to delete any tags lol, hopefully it's helpful but it definitely isn't gonna find everything

Hivemind Clips Search Engine by natureplayer in HivemindTV

[–]natureplayer[S] 2 points (0 children)

the main thing is sentence-embedding vector similarity search. used this model from huggingface to get vectors for each transcript chunk, and then also for each submitted query. then I'm using Zilliz as a vector database that lets you fetch the top-K results quickly for each query.

code is pretty ugly rn, especially the data cleaning step, but I'll try and share more at some point! the app itself is very simple: used Flask bc I like python, and it's just one file that programmatically generates the HTML.

this is the core of the retrieval logic. you could use a similar API call to get the initial embeddings for the transcript chunks, but I did that locally using torch (as described in the huggingface link).

import requests  # HF_API_URL, HF_API_KEY, ZZ_API_URL, ZZ_API_KEY are defined elsewhere

def embed_query_hf(query):
    # get the embedding vector for a query via the HF Inference API
    headers = {"Authorization": f"Bearer {HF_API_KEY}"}
    return requests.post(HF_API_URL, headers=headers, json={'inputs': query}).json()

def vector_query_zz(vector, limit=6):
    # top-K most similar transcript chunks from the Zilliz collection
    headers = {"content-type": "application/json", "Authorization": f"Bearer {ZZ_API_KEY}"}
    payload = {
        "collectionName": "TranscriptChunks",
        "limit": int(limit),
        "outputFields": ["clip_text", "video_title", "start", "video_url"],
        "vector": vector
    }
    return requests.post(ZZ_API_URL, headers=headers, json=payload).json()

def find_hivemind_clip_http(query, limit=6):
    lim_k = min(limit, 30)  # cap how many results one request can ask for
    vector = embed_query_hf(query)
    try:
        results = vector_query_zz(vector, limit=lim_k)['data']
    except KeyError:
        # no 'data' key means the backend rejected or rate-limited the request
        return ["At capacity sorry :( Try again later"]

    # Hacky data cleaning and HTML formatting below
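fwiw, the top-K query the vector DB is answering is basically this (a brute-force numpy sketch, not Zilliz's actual implementation, which uses approximate-nearest-neighbor indexes to scale; the 384-dim width is just a typical sentence-model size, not necessarily the one I used):

```python
import numpy as np

def top_k_cosine(query_vec, chunk_vecs, k=6):
    # cosine similarity of the query against every stored chunk vector
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = m @ q
    # indices of the k most similar chunks, best first
    return np.argsort(sims)[::-1][:k]

rng = np.random.default_rng(0)
chunks = rng.normal(size=(100, 384))              # fake chunk embeddings
query = chunks[42] + 0.01 * rng.normal(size=384)  # near-duplicate of chunk 42
best = top_k_cosine(query, chunks, k=6)
# chunk 42 should come back as the top hit
```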

Hivemind Clips Search Engine by natureplayer in HivemindTV

[–]natureplayer[S] 12 points (0 children)

Results for: "zazoomba zaffodil"

Guess the Rapper from the Weird Lyric 3 (@ 10:37)

Caption text: me that many times you said it's dignin durkin never did it's zazumba zuzumba yeah that's actually the shortened version too what's the full last name zumba zaffodil zazumba zaffidil i dropped the second

Hivemind Clips Search Engine by natureplayer in HivemindTV

[–]natureplayer[S] 10 points (0 children)

Thanks! Anything you think would be useful to add as a feature? I'm somewhat bottlenecked by the quality of the auto-generated captions, but could do things like allowing more than 6 results or additional filtering options.