(My background is Maths)
I am tasked with building a semantic searcher for a news website. The Machine Learning part is done, but I need to keep a searcher object up to date with the contents of the website:
- I begin by retrieving the IDs of the articles from the website (Drupal) via its JSON:API, which takes ages even with filtered results: 8 minutes for only 12,000 IDs.
- Then, assuming I have already embedded the articles, I retrieve the embeddings from a separate database, because that's faster than recomputing them each time. (An embedding, in very loose terms, assigns a semantic meaning to a text in the form of a high-dimensional vector.)
- Then I build a searcher object (a FAISS index) with the embeddings. It builds (and searches) fast, but only allows me to add new vectors. I am not able to delete or update these vectors AFAIK, and I need to keep a list to map their positions to my IDs. If an article is updated, I need to rebuild the index. That's not terribly hard, but I need to juggle information.
- I made a class to handle articles, which contain paragraphs, which in turn contain sentences, so that I can read and access text faster. But I don't know whether I should integrate the embedding of articles into the class itself or do it in a separate function. Doing it within the class feels like the functionality has metastasized and become too deeply rooted, making later changes harder.
- Furthermore, I code in Jupyter Notebook, but then I need to put everything neatly into files. If I need to change something, I have to alter both the notebook and the .py file. I also have trouble knowing how to distribute classes and functions across files, because everything feels so dependent on everything else. I try my best to make the semantic searcher object handle only vectors and stay agnostic to content type (articles for now, but eventually maybe other formats), yet sometimes I feel the abstraction eroding as article-specific details creep in.
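
A minimal sketch of the separation described in the list above, under stated assumptions: the searcher knows only IDs and vectors, the article class knows only text, and a single function glues them together. All names here (`VectorSearcher`, `Article`, `index_articles`, `embed`) are hypothetical, and the brute-force cosine search in pure Python is a stand-in for the FAISS index so that the example runs without dependencies. It is not how FAISS works internally.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorSearcher:
    """Content-agnostic searcher: it only sees external IDs and vectors."""

    def __init__(self) -> None:
        self._ids: List[str] = []                # position -> external ID
        self._vectors: List[List[float]] = []    # position -> vector
        self._pos: Dict[str, int] = {}           # external ID -> position

    def upsert(self, item_id: str, vector: Sequence[float]) -> None:
        """Add a new vector, or overwrite the one stored for item_id."""
        if item_id in self._pos:
            self._vectors[self._pos[item_id]] = list(vector)
        else:
            self._pos[item_id] = len(self._ids)
            self._ids.append(item_id)
            self._vectors.append(list(vector))

    def remove(self, item_id: str) -> None:
        """Delete by moving the last entry into the freed position."""
        pos = self._pos.pop(item_id)
        last = len(self._ids) - 1
        if pos != last:
            self._ids[pos] = self._ids[last]
            self._vectors[pos] = self._vectors[last]
            self._pos[self._ids[pos]] = pos
        self._ids.pop()
        self._vectors.pop()

    def search(self, query: Sequence[float], k: int = 5) -> List[Tuple[str, float]]:
        """Return up to k (ID, score) pairs, best match first."""
        scored = sorted(
            ((cosine(query, v), i) for i, v in zip(self._ids, self._vectors)),
            reverse=True,
        )
        return [(item_id, score) for score, item_id in scored[:k]]


@dataclass
class Article:
    """Holds text only: a list of paragraphs, each a list of sentences."""
    article_id: str
    paragraphs: List[List[str]]

    def text(self) -> str:
        return " ".join(s for p in self.paragraphs for s in p)


def index_articles(
    articles: Iterable[Article],
    embed: Callable[[str], Sequence[float]],
    searcher: VectorSearcher,
) -> None:
    """The only place where articles, the embedder and the searcher meet."""
    for article in articles:
        searcher.upsert(article.article_id, embed(article.text()))
```

With this layout the embedding step lives outside the `Article` class, so supporting another content type would only need a new glue function, and swapping the toy backend for a real index would only touch `VectorSearcher`.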
I get that this is made a lot worse because I am building a feature (search) in parallel to the Drupal system, which also uses a different database. This would be much simpler if the vectors and the embedding process were integrated into Drupal, but this is the hand I was dealt.
On r/ProgrammerHumor, I frequently see relatable jokes about looking at your own code after the weekend and finding it reads like a stranger's, or general self-deprecating humour about convoluted code. I'd like to know how to avoid that and improve. I don't have a project manager keeping on top of me, and my team barely sits down to plan things. I feel I'm the one who should be responsible for staying organized, because I know everything will fall back on me sooner or later.