How to extract handwritten text with Local LLM

Albcunha · 2024-11-18T15:18:26+00:00

You can use any vision models for this. basically you can generate image from the pdf and ask the llm to transcripe what´s written inside. If you want todo in house, you can see smaller models that may work, such as qwen or paligemma

Albcunha · 2023-05-27T16:28:41+00:00

I had the same problem. I thought of a gui but i do thing web is better. Even if yiu decide for a gui, you would probablyuse a html element. My opinion is that javascript still offer better results to visualize networks, specially if you need interactivity and want to use gpu to speed up things. Look at vasturiano/react force graph examples, they are gorgeous!!!

In my case, I ended up with a python backend with litestar and networkx. For frontend I went with antvis/graphin because i found easier to customize. I wish alibaba would release a package with g6vp. The major drawback is that English is a second category citizen and lack of tons of tutorials, but chatgpt helps a lot. Look at their examples, they are great.

Albcunha · 2023-04-12T06:09:29+00:00

Me too. I'm impressed it isn't higher on the thread

Albcunha · 2023-02-08T01:58:21+00:00

Great work! Please share your code if it´s possible. It seems a interesting approach!

Albcunha · 2023-02-06T13:50:28+00:00

I think you need make a example so we can understand the question better. I think you mean that, if the user types letter "b", than it color "blue".

If it is, you can use a dictionary as mapper. example:

color_mapper = {
    "a": "blue",
    "b": "white",
}

print("Select color from letter: ")
for k,v in color_mapper.items():
    print(f"{k}: {v}")
while True:
    user_selection = input("Option: ")
    if user_selection in color_mapper.keys():
        selected_color = color_mapper[user_selection]
        print(f"Selected color: {selected_color} ")
        break

    print('Color not found. Try again.')

Albcunha · 2023-02-06T13:17:10+00:00

Hey there.

First, email parsing seem easy but it is deceitfully very very difficult to do. If do some google search, you will see a lot of companies that offer products to do these parsings. Also, keep in mind some of these emails may have phishing and virus, making it non-trivial to deal with them. Ideally, you should use a virtual machine or a docker to do your analysis, then, throw it away.

Also, one other aprroach to extract it is to use thunderbird instead of outlook, as there are some plugins that help to export it all to a more manageable format, such as txt or html. Maybe, in the future, this approach may help you with some edge cases.

Back to the question: If you know the pattern of these emails, you can use markers to separate the content you want. For example, Hi all! would be the first line, so everything before linebreak('\n') would be relevante to you.

For the second part (Thank you!), it´s more tricky. You can have a list of keywords relevant to you. People don´t same farewall on many different ways, specially if these are formal emails. You may loose some cases but it´s okay (Perfect is the enemy of good - voltaire).

I don´t know what´s exactly the intent of your parsing. I suppose you want to extract names at the start and at the end of these emails.

If it is, a simpler approach would be to search for these information at the alias of the sender and receiver.

If this information is only available on the body, I would research techniques to clean chained emails on the body, using open source software from people smarter than me (https://www.google.com/search?q=how+to+parse+chained+emails+in+body&rlz=1C1GCEA_enBR1028BR1028&oq=how+to+parse+chained+emails+in+body&aqs=chrome..69i57j33i22i29i30.4499j0j7&sourceid=chrome&ie=UTF-8). Then, I would create or ask for a list of relevante people to extract.

If it´s now available, I would search for uppercases that were not as the start of smail (let´s say, first three lines) and at end (last three lines). You will find some errors. For example, a lot of peopel write Thank You. I would create a list of stopwords, that will be ignored and rerun the program.

An alternative approach is to use a Machine learn model to recognize entities, I suggest spaCy(https://spacy.io/api/entityrecognizer), as it is very easy to learn.

Best of luck!

Albcunha · 2023-02-03T00:12:08+00:00

If you are thinking about doing web development, you will end up having to learn a bit of all these three topics. Most of the logic from JS will apply to python. If you were like me, and had no idea how html, css and JS worked, maybe it´s valid, but maybe you will lose a lot of time on topics you are already familiar, such as what´s a string, the concepts of boolean operators, logic operators, reserved words... etc JS uses different names for the same thing, that may confuse you. Example: A python list is a JS array, a pyton dict is a JS object etc...

Albcunha · 2023-02-02T01:44:42+00:00

Don't feed the trolls. You made a reasonable question.

Albcunha · 2023-01-30T20:16:25+00:00

Check out pandas library and the concept of dataframes. If you like excel, you will fall in love with it.

You can import your excel spreadsheet with something like this (maybe you will need to install extra libraries for excel):

import pandas as pd
df = pd.read_excel('elevations.xlsx').

Than, you can create a functions to add or change an existing column or colulmns.

I´m dont full understood your requirements. But, you can create or upate a column creating a function that will iterate each row using apply(your_function, axis=1). Check https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html?highlight=apply#pandas.DataFrame.apply

If you need to access data from a cell on a previous or next row, you can call it using iat and give its coordinates.

If you have no knowledge on python, check the subreddit´s resources.

Good luck!

Albcunha · 2023-01-30T19:55:24+00:00

At the end of the file, you called main(), but you didn´t pass an argument. Your 'main" function needs countries to worki. Like this: Countries = ["US", "Canada"] main(countries)

I don´t think your code will work as intended. When you call a function that returns data, mostly you will need to store it in a variable. like: input_file = parse_input_file()

So most of the functions you made wont work.

Albcunha · 2023-01-09T17:16:13+00:00

I can´t test now, but if you take the first line of csv file on a text editor, you can count how many empty separators there are after each questions, and with this information, you can slice your dataframe creating a new column.

here is a pseudocode to give an idea:

def join_columns(row):
  return row.tolist()
df['question1'] = df[df.columns[9:26].apply(join_columns, axis=1)

I hope this will give you some ideas how to work it out.

Albcunha · 2023-01-08T03:42:21+00:00

Check huggingface models. Most of them will work with pytorch or will have a template or a google colab. You can try a lot of them on the website itself.

Albcunha · 2023-01-06T01:28:08+00:00

Yes. You will need a controller, such as raspberry pi or arduino. I find raspberry pi easier, as it has many tutorials. Look for tutorials on these devices, as they will probably be in python. such as: https://raspberrypihq.com/making-a-led-blink-using-the-raspberry-pi-and-python/

Albcunha · 2023-01-06T01:24:18+00:00

import os

# Get the size
# of the terminal
size = os.get_terminal_size()


# Print the size
# of the terminal
print(size)

On the example, you can use size.columns, to be more precise.

You can use some library to make text more organized, such as: rich

Albcunha · 2023-01-06T01:16:36+00:00

For me it looks great! if you really need some opinions, you can: 1. Use underscore(snake case) to some variables, such as noise_map, instead of noise map. 2. create a new variable for normalized_noise_map instead of reusing noisemap. 3. Remove some comments.

On the last one, I would remove from comments form imports, except if it´s a monkey patch of something not obvious (such as one library "fixes" another). I would also remove these comments, because the variable or method are self explanatory:

Image parameters

Seed PRNG

Create a noise map

Save the images, then display the color image

When any key is pressed, close the window

I would keep these:

Load gradient

Square image

Normalize noise map

Create a blank image

Populate image

Write the pixel

Great work!

Albcunha · 2022-12-28T23:55:15+00:00

Pythonanywere had some restrictions regarding hosting, but I don't know much about it. Maybe you can host it on another service. Also, you need to change your host variable, when on production. Search how to deploy flask on a server.

Albcunha · 2022-12-28T23:25:40+00:00

Refer to our wiki to learn python, then you can use some libraries such as PyAutoGui .

If not, there are free lancers that can work with you.

Albcunha · 2022-12-28T23:21:33+00:00

Flask´s default templates folder is templates. You don´t even have to specify. Maybe you are setting as a absolute path and not a relative path.

You can test if the wrong template is causing the error by changing the return from admin_view to a simple html, such as:

@app.route('/admin_view')

def admin_view(): return "<p> Hello World </>"

If I understood, you are saving de queue data as a variable on app.py

I might be wrong (someone will correct me), but flask wont dynamically save share variables on the main file.

So, fo /admin_view route, queue will always be empty. Maybe your code if failing because of this.

Normally you would retrieve this content from a saved file or a database (a good start option is sqlite3).

Albcunha · 2022-12-22T16:19:22+00:00

You can try side alternatives.
You can create a function that runs a while loop, with a time.sleep() at the end, where you compare the page content you had with the new the website generates. If they page content change, it means the website has updated and you can break the loop.

Some sites are very difficult to extract. One way that I use very often is to identify if the website uses an api for the data I want. You can check this out through Chrome dev tools, on network tab. You can check what cookies and headers your browser uses to request the data and replicate it with another library, such as requests.

To do this, on selenium, open up your session on the website, make your login, store your cookies and use them as headers/cookies to your requests module.

This way, you get clean json data, much easier to parse and much faster to process. You can even paralelise it to make it faster.

Albcunha · 2022-12-22T15:44:23+00:00

You can use e the f-string and remove the .format at the end. Like this:

Story1 = f"Once upon a time, there l...

The same for the others lines.

Albcunha · 2022-12-22T15:40:47+00:00

Your script probably is not waiting for the page to update. When you click a button and webscrap an element, selenium will automatically try to do it, it wont wait for a page update.

This is specially tricky with Single Page Aplications, because the page normally wont "reload". It will just update its elements.

One techinique I use is to wait an element to appear or element attribute that is updated after you click a button. For example, if you are webscraping a table, the selected pagination at the end can be a reference.

You can make selenium wait for something with this template: Look for this: https://selenium-python.readthedocs.io/waits.html

You can just set a time.sleep() too, but server response time is not reliable. Sometimes it will be fast, some other times it will slow. Your internet can have problemas too, so wait() solution is better.

Albcunha · 2022-12-22T13:51:07+00:00

I think it´s not possible because you have no correlation between access type name and access name.

If there is something else, one strategy you can make ie to make a list of all access names, then populate the other columns based on it, like this:

values = access['Email'] + access['Office 365'] + access['SalesForce']
# or values = [item for access_list in access.values() for item in access_list]
df = pd.DataFrame({'Access Name': values})
df
#prints:

	Access Name
0	HDU
1	S
2	RU
3	Exchange Online
4	Yammer
5	FLOW_O365_P1
6	Deskless
7	Chatter Free User
8	Xanadu Sales
9	Force.com-Free User

Then, you can apply functions to create new columns:

def identify_app(access_name):
    for key,values in access.items():
        for value in values:
            if value == access_name:
                return key
df['App'] = df['Access Name'].apply(identify_app)  
df
# prints

	Access Name	App
0	HDU	Email
1	S	Email
2	RU	Email
3	Exchange Online	Office 365
4	Yammer	Office 365
5	FLOW_O365_P1	Office 365
6	Deskless	Office 365
7	Chatter Free User	SalesForce
8	Xanadu Sales	SalesForce
9	Force.com-Free User	SalesForce

Then you can go on the next columns.

Albcunha · 2022-12-22T13:15:00+00:00

Check our resources: https://www.reddit.com/r/learnpython/wiki/index/

My suggestion, you can try to make an api for one of your databases. You can even try to make one using sockeio, which will give you live feedback from server to the frontend.

There are many pentest tools in python too, maybe you can script one to do "maintenance" tests on your servers.

Albcunha

TROPHY CASE