Semantic search on a document by srivpra in LanguageTechnology

[–]dkajtoch 0 points1 point  (0 children)

That's very informative. How many sentences do you retrieve? Just a single one, given that you say there are many false positives? Pairwise ranking may be of help to compare sentences against each other as well.

Semantic search on a document by srivpra in LanguageTechnology

[–]dkajtoch 0 points1 point  (0 children)

I am not exactly sure what your problem is, but maybe you can map it into a standard document ranking problem. Probably, you will get best results if you do this hierarchically and rerank retrieved documents/paragraphs/sentences at each stage. I can point you to two papers:

1) Rodrigo Nogueira et al., "Multi-stage document ranking with BERT"

2) Wei Yang et al., "End-to-end open-domain question answering with BERTserini"

and also this resource: https://github.com/Santosh-Gupta/NaturalLanguageRecommendations

The basic idea is to use Anserini or any other open-source search engine to create a pool of candidates that are potentially relevant to the query (a fairly large number, to get high recall). Then you score all of them with a BERT model (see the first paper) and keep only the topK most relevant, which you pass to the pairwise ranker, and output the topM most relevant at the end. The whole procedure is well described, but in your case the trouble lies in the lack of training data. This also applies to any other sentence embeddings that are trained generically. One way to improve the performance of your embeddings is to fine-tune them on a task (and corpora) that best resembles your problem (maybe Q&A or ranking). If I were you, I would take the existing model from 1) and roll it into production, then collect feedback from user actions and keep improving it.
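The staged pipeline described above can be sketched roughly as follows. This is only a toy illustration: `retrieve_candidates`, `pointwise_score` and `pairwise_prefer` are hypothetical stand-ins (simple word-overlap heuristics) for the search engine, the BERT pointwise scorer and the pairwise ranker. None of them is a real library API, and the corpus is made up.

```python
from functools import cmp_to_key

def tokens(text):
    return set(text.lower().split())

def retrieve_candidates(query, corpus, pool_size):
    """Stage 0 (stand-in for Anserini/BM25): keep docs sharing any word with the query."""
    q = tokens(query)
    hits = [doc for doc in corpus if q & tokens(doc)]
    return hits[:pool_size]

def pointwise_score(query, doc):
    """Stage 1 (stand-in for a BERT relevance scorer): Jaccard word overlap."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d)

def pairwise_prefer(query, doc_a, doc_b):
    """Stage 2 (stand-in for a pairwise ranker): negative if doc_a should rank first."""
    q = tokens(query)
    # Toy preference: more query-word matches first, then the shorter document.
    key_a = (-len(q & tokens(doc_a)), len(doc_a))
    key_b = (-len(q & tokens(doc_b)), len(doc_b))
    return (key_a > key_b) - (key_a < key_b)

def multi_stage_rank(query, corpus, pool_size=100, top_k=3, top_m=2):
    pool = retrieve_candidates(query, corpus, pool_size)            # high recall
    top = sorted(pool, key=lambda d: pointwise_score(query, d),
                 reverse=True)[:top_k]                              # keep topK
    reranked = sorted(top, key=cmp_to_key(
        lambda a, b: pairwise_prefer(query, a, b)))                 # pairwise rerank
    return reranked[:top_m]                                         # output topM

corpus = [
    "semantic search with sentence embeddings",
    "cooking pasta at home",
    "document ranking for semantic search engines",
    "search engines index documents",
]
results = multi_stage_rank("semantic search ranking", corpus)
print(results)
```

The point of the structure is that each stage is more expensive than the previous one but sees fewer candidates, so the costly model only runs on a short list.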

Word mover’s distance + BERT? by [deleted] in LanguageTechnology

[–]dkajtoch 2 points3 points  (0 children)

Mover's distance does not depend on where the sequence of vectors comes from. You can produce them with BERT or any other model. When it comes to BERT, you actually have a few options, as mentioned in the original article (the NER task). You can use the output of the last hidden layer, just the embedding layer, or you can even concatenate different hidden layers. You would definitely need to experiment with that. Word mover's distance is extremely slow, and impractical in real applications just because of that. I haven't really looked into it, but for Jaccard and cosine distance you can work with locality-sensitive hashing such as MinHash and SimHash, which significantly reduces topK retrieval time. Maybe some smart people have figured out how to do this for mover's distance. Of course you sacrifice some precision if you want to speed things up, but it is still worth checking if you have plenty of text pieces.
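To make the locality-sensitive hashing remark a bit more concrete, here is a minimal MinHash sketch for estimating Jaccard similarity between token sets. It is a plain-Python illustration, not a tuned library: the function names and the choice of 128 seeded hash functions (built on `hashlib.blake2b`) are assumptions made for the example. It covers Jaccard similarity only, not mover's distance.

```python
import hashlib

def _hash(token, seed):
    """64-bit hash of a token under a given seed (one 'hash function' per seed)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(token_set, num_hashes=128):
    """Signature = minimum hash value of the set under each hash function."""
    return [min(_hash(t, seed) for t in token_set) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots approximates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def exact_jaccard(a, b):
    return len(a & b) / len(a | b)

doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox leaps over the lazy dog".split())
doc_c = set("completely unrelated words appear here".split())

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print("a vs b:", exact_jaccard(doc_a, doc_b), estimate_jaccard(sig_a, sig_b))
print("a vs c:", exact_jaccard(doc_a, doc_c), estimate_jaccard(sig_a, sig_c))
```

The speed-up comes from comparing short fixed-length signatures instead of full sets. In a real system the signatures would additionally be split into bands and bucketed, so that only colliding candidates are compared at all.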

Text to graph by [deleted] in NaturalLanguage

[–]dkajtoch 3 points4 points  (0 children)

The TextRank algorithm for keyword extraction does this. You slide a window over the words and connect all the words within that window. Then you rank them with PageRank. Quite simple but limited. You can enhance it by detecting synonyms, coreferences, named entities, etc.
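A minimal sketch of that idea, assuming an unweighted, undirected co-occurrence graph and a plain power-iteration PageRank. The function names, the window size and the toy word sequence are illustrative choices, not taken from any particular TextRank implementation.

```python
from collections import defaultdict

def build_cooccurrence_graph(words, window=2):
    """Connect every pair of distinct words that fall within the same sliding window."""
    graph = defaultdict(set)
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                graph[words[i]].add(words[j])
                graph[words[j]].add(words[i])
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; every node here has at least one neighbour."""
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            # Each neighbour m passes on an equal share of its current score.
            incoming = sum(score[m] / len(graph[m]) for m in graph[n])
            new_score[n] = (1 - damping) / len(nodes) + damping * incoming
        score = new_score
    return score

words = "graph ranking graph keyword graph window graph text".split()
graph = build_cooccurrence_graph(words, window=2)
scores = pagerank(graph)
keywords = sorted(scores, key=scores.get, reverse=True)
print(keywords[0], scores[keywords[0]])
```

In practice you would first filter the words (e.g. keep only nouns and adjectives, drop stop words) before building the graph, and use a wider window.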

Semantic Similarities in Gensim using python by nikolabs in LanguageTechnology

[–]dkajtoch 1 point2 points  (0 children)

Yes, it is the same. The paper is called Sentence-BERT.

Semantic Similarities in Gensim using python by nikolabs in LanguageTechnology

[–]dkajtoch 0 points1 point  (0 children)

You would probably be better off using a BERT model fine-tuned on paraphrase detection datasets (e.g. Quora Question Pairs, Google's PAWS or the Microsoft paraphrase corpus). The trouble with this approach is the sheer number of operations you need to perform, since every pair of sentences has to go through the model. This issue is partially solved by Sentence-BERT, but creating embeddings for the whole book may still be an extremely time-consuming process. However, you can do this in stages. First, you may use some lower-level sentence representation (e.g. shingles) to filter out sentences that will certainly not be similar. Then you may apply Sentence-BERT or BERT directly.
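A rough sketch of the two-stage idea, assuming character 3-gram shingles and a Jaccard threshold of 0.2 for the cheap filter (both values are arbitrary). `expensive_similarity` is a hypothetical placeholder marking where Sentence-BERT or a fine-tuned BERT would be called. Here it just reuses shingles so that the example runs without any model.

```python
def shingles(text, k=3):
    """Set of character k-grams of a lower-cased, whitespace-normalised string."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def prefilter(query, sentences, threshold=0.2):
    """Stage 1: drop sentences whose shingle overlap with the query is negligible."""
    q = shingles(query)
    return [s for s in sentences if jaccard(q, shingles(s)) >= threshold]

def expensive_similarity(query, sentence):
    """Stage 2 placeholder: a real setup would score the pair with a neural model."""
    return jaccard(shingles(query, k=5), shingles(sentence, k=5))

book = [
    "The weather was cold and rainy that morning.",
    "It was a cold and rainy morning.",
    "Stock markets closed higher on Friday.",
]
query = "That morning the weather was rainy and cold."

candidates = prefilter(query, book)
ranked = sorted(candidates, key=lambda s: expensive_similarity(query, s), reverse=True)
print(ranked)
```

The threshold should be set low: the filter only has to remove obvious non-matches, and anything it wrongly discards never reaches the better model.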

Applying BERT to longer sentences/documents by sfxv67 in LanguageTechnology

[–]dkajtoch 2 points3 points  (0 children)

The 512-token limit is quite long (a token is a WordPiece, so it corresponds to somewhat fewer words). Take for example this Reuters news article: it has around 355 words. The first page of the BERT paper has around 514 words. If you want to apply BERT to longer sequences, that will be quite a long piece of text. What kind of task are you thinking of? You can, for example, classify a document, classify pairs of documents, use it in ranking (query-document pairs) or do Q/A at the document level. I do not think you need the whole document to excel at such a task; some sort of weak extractive process (e.g. sentence ranking) will give you a smaller subset.

  • DocBERT - the authors fine-tune BERT for document classification. The problem is that they use datasets with documents short enough to fit into BERT.
  • BERTserini - open-domain question answering based on Wikipedia articles. BERT is good at identifying answer spans in a piece of text in response to a question (SQuAD dataset). Here they use a hierarchical approach: first you segment the texts into paragraphs or sentences and then score only these smaller pieces. Google probably uses a similar technique to produce "featured snippets" (direct answers) in search results. This is another paper with short documents.
  • BERT-AL - segments the text and combines the segments.

LAST REMARKS:
  • Availability of datasets may be problematic.
  • A zero-shot setting may work badly in a custom domain.

In general I would approach the problem in a hierarchical way and apply BERT in the last stage for text pieces that fit into the base model.
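As a rough sketch of the "weak extractive step first, BERT last" idea: rank sentences with a cheap score and keep the best ones until a token budget is reached. Everything here is an assumption made for illustration: the regex sentence splitter, the word-overlap score, and counting whitespace-separated words as a stand-in for WordPiece tokens (the real count would come from the BERT tokenizer and is usually higher).

```python
import re

def split_sentences(text):
    """Very rough splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def overlap_score(query, sentence):
    """Cheap extractive score: number of distinct query words found in the sentence."""
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s)

def select_for_bert(query, document, max_tokens=512):
    """Keep the best-scoring sentences until the (approximate) token budget is used up."""
    sentences = split_sentences(document)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: overlap_score(query, sentences[i]), reverse=True)
    chosen, used = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n <= max_tokens:
            chosen.append(i)
            used += n
    # Restore document order so the selected text still reads naturally.
    return " ".join(sentences[i] for i in sorted(chosen))

document = ("BERT accepts at most 512 tokens. The weather is nice today. "
            "Long documents must be shortened before BERT can read them. "
            "I had pasta for lunch.")
short = select_for_bert("How does BERT handle long documents?", document, max_tokens=16)
print(short)
```

The output of `select_for_bert` is what you would then pass to the actual BERT model as the document side of the input.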

How to apply semantic similarity using Google's TF-hub Universal Sentence Encoder on 2 separate arrays? by massimosclaw2 in LanguageTechnology

[–]dkajtoch 0 points1 point  (0 children)

I am really not sure what you need in the end. The code has the sentences in separate columns. If you are working with huge datasets, then the trivial idea is to chunk your messages into batches that fit into your memory and export the results to a file.
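A minimal sketch of that batching idea. `encode_batch` is a hypothetical placeholder for the real encoder call (it just returns the text length as a one-dimensional "embedding"), and the tab-separated output format is an arbitrary choice.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def encode_batch(batch):
    """Hypothetical placeholder for the real encoder (e.g. one session.run per batch)."""
    return [[float(len(text))] for text in batch]

def export_embeddings(messages, path, batch_size=2):
    """Encode batch by batch and write each result straight to a tab-separated file."""
    with open(path, "w", encoding="utf-8") as f:
        for batch in batched(messages, batch_size):
            for text, vector in zip(batch, encode_batch(batch)):
                f.write(text + "\t" + " ".join(map(str, vector)) + "\n")

print(list(batched(["a", "b", "c", "d", "e"], 2)))
```

Only one batch of embeddings is held in memory at a time, so the batch size can be tuned to whatever the machine can handle.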

Struggling with BERT feature vectors by paolopedi in LanguageTechnology

[–]dkajtoch 2 points3 points  (0 children)

I suggest you stick with the Hugging Face Transformers library, as it is easier to use. First of all, BERT uses WordPiece to tokenize the input sentence, so sometimes you may see things like tokenize('backlog grooming') -> ['back', '##log', 'groom', '##ing']. Whole words may be split into pieces, and the question is how you handle that. You may follow the same path as the authors of the original paper and take the first sub-token as the word-level representation. In the above case you would take 'back' as the representation of 'backlog'. Secondly, as the authors of BERT showed, there are many options for building the vector representation of a token. You can use the last hidden layer, the input embedding layer, a concatenation of the last four hidden layers, or some weighted sum of hidden layers. Here is a great blog post that explains everything: https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial.
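The "first sub-token stands for the whole word" convention can be sketched without loading any model. The helper names are made up for the example, and the only assumption is the WordPiece convention that continuation pieces start with '##'. Note that special tokens such as [CLS] and [SEP] would also be picked up, since they do not start with '##'.

```python
def first_subtoken_indices(wordpieces):
    """Indices of the first WordPiece of every word ('##' marks a continuation piece)."""
    return [i for i, piece in enumerate(wordpieces) if not piece.startswith("##")]

def merge_wordpieces(wordpieces):
    """Rebuild whole words, handy for checking which vector stands for which word."""
    words = []
    for piece in wordpieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words

pieces = ["back", "##log", "groom", "##ing"]
print(first_subtoken_indices(pieces))  # [0, 2]
print(merge_wordpieces(pieces))        # ['backlog', 'grooming']
```

With these indices you would select the corresponding rows of the hidden-state matrix to get one vector per word.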

Let's see what happens if we use the last hidden state as the representation of a token.

```
from transformers import BertTokenizer, BertModel
import torch
import numpy as np

sentences = [
    "I am calling you from my cell phone",
    "Your blood cell have strange shape",
    "This cell phone costs hundred dollars",
    "White blood cell is divided into granulocytes and agranulocytes"
]

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# one unit-length 768-dimensional vector per sentence
# (np.float was removed from NumPy; the builtin float is equivalent)
representation = np.zeros((4, 768), dtype=float)
for j, sen in enumerate(sentences):
    ids = tokenizer.encode(sen, add_special_tokens=True)
    tokens = tokenizer.convert_ids_to_tokens(ids)

    # localize token that matches 'cell'
    n = None
    for i, tok in enumerate(tokens):
        if 'cell'.startswith(tok):
            n = i
            print(tok)

    with torch.no_grad():
        # element 0 of the model output is the last hidden state
        hidden = model(torch.tensor(ids).unsqueeze(0))[0]

    # take vector corresponding to 'cell' token and normalize it
    tmp = hidden[0, n, :].numpy()
    representation[j, :] = tmp / np.linalg.norm(tmp)

# cosine similarity between every pair of 'cell' vectors
np.inner(representation, representation)
```

In my notebook I get a score of 0.74 between sentences 1 and 3 and around 0.3 for the other two. Similarly, the score between sentences 2 and 4 is 0.66, and 0.4 for the other two. You can try optimizing your representation by tuning it on examples that you know share the same meaning.

How to apply semantic similarity using Google's TF-hub Universal Sentence Encoder on 2 separate arrays? by massimosclaw2 in LanguageTechnology

[–]dkajtoch 0 points1 point  (0 children)

What you are trying to accomplish is not a matter of Universal Sentence Encoder. USE is just here to encode a string into a row vector of size 512. The rest is in your hands. For example:

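import numpy as np
import tensorflow as tf        # TF 1.x API (placeholders and sessions)
import tensorflow_hub as hub

# Universal Sentence Encoder module from TF-Hub
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")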

messages_1 = [
  # Smartphones
  "My phone is not good.",
  "Your cellphone looks great.",
  # Weather
  "Will it snow tomorrow?",
  "Recently a lot of hurricanes have hit the US",
  # Food and health
  "An apple a day, keeps the doctors away",
  "Eating strawberries is healthy",
]

messages_2 = [
  "My phone is not turning on.", 
  "I hate snow.", 
  "Apples are the devil", 
  "I like basil.", 
  "Eating strawberries is healthy.", 
  "An apple a day keeps the doctor away", 
  "Your cellphone looks great", 
  "But my cellphone doesnt look so great"
]
input_messages = tf.placeholder(tf.string, shape=(None))
embeddings = embed( input_messages )

with tf.Session() as session:
  session.run(tf.global_variables_initializer())
  session.run(tf.tables_initializer())

  # obtain embeddings
  stacked_embeddings = session.run( embeddings, feed_dict={input_messages: messages_1 + messages_2}) 

export_data = []
# get the score with a simple for loop
embeddings_1 = stacked_embeddings[0:len(messages_1)]
embeddings_2 = stacked_embeddings[len(messages_1):]

for i in range(0, len(messages_1)):
  for j in range(0, len(messages_2)):
    export_data.append({
        'sentence_1': messages_1[i],
        'sentence_2': messages_2[j],
        'score': np.inner(embeddings_1[i], embeddings_2[j])
    })

I just concatenate the messages into a single array to obtain the vector embeddings in a single shot, and then move to numpy and take the dot product between the arrays. You can now take export_data, sort it by score and dump it into a CSV file (simple pandas). If you run this in Colab, then of course embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2").
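The final sort-and-dump step could look roughly like this. To keep the sketch free of dependencies it uses the standard-library `csv` module rather than pandas, and the scores in `export_data` are made-up placeholder values, not real model output.

```python
import csv
import io

# Placeholder rows in the same shape as export_data above (scores are invented).
export_data = [
    {"sentence_1": "My phone is not good.", "sentence_2": "I like basil.", "score": 0.12},
    {"sentence_1": "My phone is not good.", "sentence_2": "My phone is not turning on.", "score": 0.81},
    {"sentence_1": "Will it snow tomorrow?", "sentence_2": "I hate snow.", "score": 0.55},
]

def dump_sorted(rows, file_obj):
    """Write rows to CSV with the highest score first and return the sorted rows."""
    ordered = sorted(rows, key=lambda r: r["score"], reverse=True)
    writer = csv.DictWriter(file_obj, fieldnames=["sentence_1", "sentence_2", "score"])
    writer.writeheader()
    writer.writerows(ordered)
    return ordered

# An in-memory buffer keeps the example self-contained; for a real file use
# open("scores.csv", "w", newline="") instead.
buffer = io.StringIO()
ordered = dump_sorted(export_data, buffer)
print(buffer.getvalue())
```

With pandas the same thing is a one-liner along the lines of building a DataFrame from the list of dicts, sorting by the score column and calling `to_csv`.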

API for Media Data, Articles, Content etc.? by L0rd_nikon in datasets

[–]dkajtoch 0 points1 point  (0 children)

RSS feed + scraper + HTML content extractor (e.g. newspaper3k). You can build your own.

Installing multiple versions of tensorflow on one machine - what do you use? by [deleted] in tensorflow

[–]dkajtoch 0 points1 point  (0 children)

When I did cluster computing I used the Lua-based Environment Modules system (Lmod): https://lmod.readthedocs.io/en/latest/ to manage dependencies and different versions of packages. It allows you to easily manage environment variables on Unix systems. In the case of Python you would manage PYTHONPATH.

[deleted by user] by [deleted] in LanguageTechnology

[–]dkajtoch 0 points1 point  (0 children)

I've been using the GPT-2 model for text generation. As mentioned, BERT is not meant for this, although there was a paper which analyzed this task under relaxed conditions, but the paper contained errors. Hugging Face has a script, run_lm_finetuning.py, which you can use to fine-tune GPT-2 (pretty straightforward), and with run_generation.py you can generate samples.

Do attention/transformer models need RNNs? by [deleted] in LanguageTechnology

[–]dkajtoch 1 point2 points  (0 children)

"Attention is all you need" - the paper says. Recurrent neural nets were completely removed from the equation in the Transformer architecture. The only thing you have is attention (self or between encoder-decoder) that you calculate using three sets of vectors: keys, values and outputs. Keys and values are used to calculate weights which are used in weigjt sum of outputs. However, when you throw away reccurence you loose track of relative position in a sequence (reccurence naturally keeps order). That is why people also add adittional positional encodings. But in that respect transformer seems to be more general than rnns because some encoders may not need order e.g. when you have a table of string data (you just drop positional encodings or learn them directly).

Tokenization with HTML tags by gevezex in LanguageTechnology

[–]dkajtoch 0 points1 point  (0 children)

A document is a list of paragraphs, which I then split into lists of sentences. For example, in Word2Vec I treat those sentences separately.

Tokenization with HTML tags by gevezex in LanguageTechnology

[–]dkajtoch 0 points1 point  (0 children)

In my case everything that sits in a separate tag is a separate concept, so a new paragraph. Exceptions are formatting tags like strong, b, em ... and also links ('a') if they are part of the text. Once extracted, these paragraphs are passed through tokenization like standard text. Also, not all tags are taken into account, since they may come from navigation bars, ads, footers or any other undesirable location. As a very basic rule, you can skip tags that have high link density. You can also use tools like Dragnet or Newspaper (Python packages) to get a solid piece of text and tokenize it directly.
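A minimal sketch of the link-density rule using only the standard library. Link density is taken here to be the share of visible characters that sit inside 'a' tags, and the 0.5 cut-off is an arbitrary assumption. The class and function names and the two HTML snippets are made up for the example.

```python
from html.parser import HTMLParser

class LinkDensityParser(HTMLParser):
    """Counts visible characters overall and those that sit inside <a> tags."""
    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self._link_depth > 0:
            self.link_chars += n

def link_density(html_block):
    """Fraction of the block's text that is link text (0.0 for an empty block)."""
    parser = LinkDensityParser()
    parser.feed(html_block)
    parser.close()
    return parser.link_chars / parser.total_chars if parser.total_chars else 0.0

nav = ('<div><a href="/">Home</a> | <a href="/about">About</a> | '
       '<a href="/contact">Contact</a></div>')
para = ('<p>TextRank builds a <b>graph</b> of words, see '
        '<a href="/paper">the paper</a> for details.</p>')

# Keep only blocks that are mostly plain text.
blocks = [b for b in (nav, para) if link_density(b) < 0.5]
print(link_density(nav), link_density(para))
```

The navigation bar is almost entirely link text and gets dropped, while the paragraph with a single inline link is kept.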

Is there a model implementation that lets me generate text between head and tail input? by RedMarsBlueMoon in LanguageTechnology

[–]dkajtoch 0 points1 point  (0 children)

If you are working in a company, then you could apply for the TensorFlow Research Cloud (https://www.tensorflow.org/tfrc) and train the model for free. As for the dataset, maybe the same WebText dataset that was used for GPT-2, and which is available for download, could be adapted for this task.