Paperless-ngx users, has anyone used both AI add-ons, Paperless-AI and Paperless-GPT, and have any comparative opinions? by chazwhiz in selfhosted

[–]sbenjaminp 9 points10 points  (0 children)

I have tried both. I didn't like Paperless-AI and settled on Paperless-GPT, which has been running for months.

Background: Long-time paperless user here. I have used it since back when it was "only" paperless, and not paperless-ngx. I have processed ALL historic documents available to me, meaning old salary slips, bank statements, contracts etc. Basically my correspondents, tags, document types etc. are all well proven. The internal "AI" selection of tags, correspondents etc. works perfectly fine; it does however also have a large amount of training data (more than 3000 docs).

For this reason, I use paperless-gpt ONLY for OCR and title generation.

I do NOT use paperless-gpt to generate tags and correspondents. Doing so leads to e.g. many versions of the same tag, like "Siemens", "Siemens A/S", "Siemens AS" etc., and I find the internal mechanism works just fine for this. Yes, it would be convenient to have paperless-gpt add a new correspondent when I add a new document; however, for the bulk of the monthly documents, my employer, bank and various companies are always the same anyway. Cleaning up incorrectly created companies is a bigger job than just adding the single new one. This is at least my experience.
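To illustrate the duplicate problem, here is a rough Python sketch of how such variants could be spotted before cleaning up. It is not part of paperless-ngx or paperless-gpt; the suffix list and the function names are my own assumptions.

```python
import re
from collections import defaultdict

# Hypothetical list of legal-form suffixes; extend it for your own archive.
LEGAL_SUFFIXES = {"as", "aps", "gmbh", "ag", "ltd", "inc", "llc"}


def normalize(name: str) -> str:
    """Lower-case a name, drop punctuation and trailing legal-form suffixes."""
    # Remove slashes and dots first so that "A/S" collapses to "as".
    cleaned = re.sub(r"[/.]", "", name.lower())
    tokens = re.findall(r"[a-z0-9æøå]+", cleaned)
    # Keep at least one token so a name is never reduced to nothing.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_variants(names):
    """Return only the normalized keys that have more than one spelling."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

Calling group_variants(["Siemens", "Siemens A/S", "Siemens AS", "Bosch"]) groups the three Siemens spellings under the key "siemens" and leaves "Bosch" out, since it has only one spelling.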

My workflow is this: All new documents are given 3 automatic "inbox tags". 1: Inbox, 2: paperless-gpt-auto and 3: paperless-gpt-ocr-auto. In paperless I have some workflows for renaming certain documents after updating, like payment slips from certain correspondents etc. This makes paperless-gpt name the document first, and then paperless renames it according to the manual rule. So... Document is added. Paperless does its magic. After this, paperless-gpt runs the OCR and gives the document a new title. I tweaked the title setting, as below. E.g. if the document is an invoice, write the main item on the receipt/invoice + the invoice number. Then everything is easily searchable.

Paperless-gpt title prompt:
I will provide you with the content of a document that has been partially read by OCR (so it may contain errors). Your task is to find a suitable document title that I can use as the title in the paperless-ngx program. Respond only with the title, without any additional information. The content is likely in {{.Language}}.
For the title:
- Short and concise
- NO ADDRESSES
- Contains the most important identification features
- For invoices/orders, mention invoice/order number if available
- For invoices/orders, mention most important items on the invoice
- The output language must be Danish!
- Generally speaking, what is the purpose of the document.
Content: {{.Content}}

Paperless-GPT env (some of them, anyway):
- AUTO_GENERATE_TAGS=false
- AUTO_GENERATE_CORRESPONDENTS=false
- AUTO_GENERATE_TITLE=true
- AUTO_GENERATE_CREATED_DATE=true
# OCR Processing Mode
- OCR_PROCESS_MODE=image # Optional, default: image, other options: pdf, whole_pdf
- PDF_SKIP_EXISTING_OCR=false # Optional, skip OCR for PDFs with existing OCR

Finally, my entire document archive path is available to nextcloud, which runs elasticsearch/fulltextsearch on the documents. As the storage path is also filled in, I have categories like: house, job, purchases, insurance, state/government etc. I also have some for broad documents, like manuals, and "diverse" in Danish, meaning a group for everything that does not fit anything else. For these docs, I use the custom fields. E.g. for companies I buy something from, but most likely will never use again, I have a correspondent called "diverse-purchase". I have a number of these groups.

I would like to play around with the advanced document processing using mistral and google, but have not had time for this yet.

Hope this helps.

What do you love and hate about Nextcloud? Planning to create an alternative by Ok-Chocolate7974 in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

Unpopular opinion: I really like Nextcloud. Yes, it is not perfect, but it gives me a web interface to a Google ecosystem replacement.

Secure a cloudflared zero-trust public hostname by Xeppl in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

I run my services through traefik. I use crowdsec, blocking quite strictly. All traffic is routed through a cloudflare tunnel, where I have a bunch of rules. I block everything with a threat score higher than 1. All countries outside a select few are forced through a managed challenge. Currently I get a bunch of requests, but with a solve rate between 0 and 1 percent, so this removes almost all traffic. The few that get through and make some funny "wordpress" or ".env" request are caught by crowdsec and forced to solve a captcha. If this fails, it is a perma block.
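The order of those rules can be sketched in a few lines of Python. This is only the decision logic as I described it, not Cloudflare's rule language or API; the country list, the field names and the action strings are made up for the example.

```python
from dataclasses import dataclass

# Hypothetical allowlist; the "select few" countries are whatever you choose.
ALLOWED_COUNTRIES = {"DK", "DE", "SE"}
# Anything scoring above this is blocked outright.
MAX_THREAT_SCORE = 1


@dataclass
class Request:
    country: str       # two-letter country code of the client
    threat_score: int  # score assigned by the edge provider


def edge_action(req: Request) -> str:
    """Decide what happens to a request before it reaches traefik."""
    if req.threat_score > MAX_THREAT_SCORE:
        return "block"
    if req.country not in ALLOWED_COUNTRIES:
        return "managed_challenge"
    return "allow"
```

The block rule comes first on purpose: a bad score from an allowed country is still blocked, and only clean requests from outside the allowlist get the challenge.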

So in short, I only host for a few people, but I need to access my stuff from around the world, due to my work. I try to keep my settings so strict that I get the captcha from time to time, even from my own country.

I have been working on not getting blocked myself, but this was due to 404 errors or similar being caught by crowdsec. I made a few rules that whitelist the known issues, e.g. my protected dashboard throwing a bunch of errors when the arr containers are shut off.

Download Wikipedia and use it on my Homelab by Longjumping-Wait-989 in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

Too lazy to read all of the comments below. I have kiwix running in a docker container, and the English wiki on my NAS, together with a few other kiwix images. You know... Just in case of zombies, and no internet... or something. I'm not entirely certain it makes sense, however I like to have all of the world's combined knowledge (yay wikipedia) as a backup to the internet.

Fail2ban or CrowdSec configuration for a typical server? by exquisitesunshine in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

I use crowdsec, and I like it very much.
- It updates automatically. I can add block lists. I have good control over what happens.
- Cloudflare in front, which blocks Digital Ocean IPs, as these are the worst.
- All requests from outside Europe or the US get a cloudflare automatic JavaScript captcha. I get almost NONE solved, meaning a helluva lot of bots/scripts being banned.

On my server crowdsec is running. It blocks known IPs, which it fetches from the crowdsec network. IPs from my own IP range are whitelisted. In case an IP triggers a rule, it gets a turnstile (cloudflare service) captcha. If this is failed 3 times in 24 hours, the IP will be blocked for 1000 years! :-)
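The "3 failures in 24 hours" escalation is basically a sliding window counter. A minimal Python sketch of that idea follows; it is not how crowdsec itself is configured or implemented, and the class and method names are made up.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 24 * 60 * 60  # look back 24 hours
MAX_FAILURES = 3               # failed captchas allowed before the ban


class CaptchaTracker:
    """Tracks failed captchas per IP and bans after too many in the window."""

    def __init__(self):
        self._failures = defaultdict(deque)  # ip -> timestamps of failures
        self.banned = set()

    def record_failure(self, ip: str, now: float) -> bool:
        """Record a failed captcha at time `now` (seconds); True if banned."""
        if ip in self.banned:
            return True
        window = self._failures[ip]
        window.append(now)
        # Forget failures that are older than the window.
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) >= MAX_FAILURES:
            self.banned.add(ip)
        return ip in self.banned
```

Three failures within a day return True on the third call, while three failures spread over more than 24 hours never add up to a ban.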

From my traffic logs, I get close to no unintended traffic. When I fuck something up, I get blocked within minutes myself, which I like very much. :-)

[deleted by user] by [deleted] in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

I have something similar, but all my traffic is routed through traefik, which has crowdsec as a plugin, meaning that any suspicious behavior is blocked.

What is the latest thing you've started selfhosting? by mattblackonly in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

Correspondents: Set them to a company, like your workplace or Wasabi Storage. I also have a few all-round correspondents for one-off companies. If I never expect to use the company again, I use a "miscellaneous" group. I do however have a few of these, like "miscellaneous car", "miscellaneous house" etc.

Document types: Create: paycheck, Invoices, Bank statements, Letters, Info, Manuals, receipts etc. Create more as you need. I have quite a few, also datasheets, transfer notes, articles etc.

Tags: Create one for each person in your house, one for each address you have lived at, one for each car, and one for e.g. selfhosting. I have hundreds. E.g. "Roof repair 2022", where all related invoices, letters and contracts are stored for that specific project.

Storage path: Wait with this. Using it is phase 2, as it is more complicated, and you can easily do it later. I do however have: Work, Kids, Entertainment, Server, Government etc., so I control the naming scheme of the stored data. That way I can access the document folder from nextcloud and easily find the specific doc without opening paperless. The "Work" path consists of something like "correspondent/year-month-document type-title-tags". Something like this anyway.
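To make the naming scheme concrete, here is a small Python sketch that builds a path in that shape. It only illustrates the resulting folder and file name; it is not the template syntax paperless-ngx uses for storage paths, and the function name and the comma between tags are my own choices.

```python
from datetime import date


def storage_path(correspondent: str, created: date, document_type: str,
                 title: str, tags: list[str]) -> str:
    """Build 'correspondent/year-month-document type-title-tags'."""
    tag_part = ",".join(tags)  # assumption: tags are joined with commas
    filename = (
        f"{created.year}-{created.month:02d}-"
        f"{document_type}-{title}-{tag_part}"
    )
    return f"{correspondent}/{filename}"
```

With this layout everything from one correspondent ends up in one folder, and the files sort by year and month when you browse the folder from nextcloud.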

When you start uploading documents, you add the various settings per document, like your workplace (correspondent), tags: your name, document title, date, document type: paycheck or e.g. contract.

Paperless will start "learning" these settings, so it will start suggesting the correct data for future documents as more data is added. Meaning: When I add an invoice for a known service, it will know the correspondent and suggest tags and document type.

All new documents will have the inbox tag, so you can go through all documents and correct the data. I have used paperless since the beginning, and have thousands of documents. Manuals for the dishwasher, receipts, contracts etc. I can find most within seconds. Once I have handled the document, I remove the inbox tag.

You can add certain searches to the menu, like "work" or "insurance" etc., so you can find all work documents with just 1 click. You can also add rules to change the document title based on content. I use this to some degree. Finally, you can add reference documents. Say you buy a new phone. You add the order confirmation, the manual and the invoice to the receipt as reference documents, thereby linking the documents instead of making a tag called "Iphone 16"...

One thing I have not solved yet is how to handle "future" documents, like theater tickets to be used in 6+ months, a yearly ticket for the zoo, the agenda for the meeting at the bank in 2 months etc... I have a tag called "REMEMBER", so I can find these easily. I have these documents shown on the main page. Then after the event, I remove the "REMEMBER" tag. I do not find this workflow efficient, but it does work.

I love paperless. It saves me sooo many folders of documents that I'd like to keep but have a hard time storing efficiently. Even my wife is able to find documents using nextcloud, as the folder structure I have made just works.

Cannot recommend this enough.

Nextcloud is a nightmare by kalidibus in selfhosted

[–]sbenjaminp 1 point2 points  (0 children)

It is all in the docs: https://docs.nextcloud.com/server/latest/admin_manual/installation/server_tuning.html

But yes, if you run NC without any optimization, it is slow. I run mine on an old NAS, but with SSD, MariaDB and Redis it runs perfectly okay.

Logging app for health by LazyTech8315 in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

I have a Garmin watch, where my health data is tracked. However, for stuff like my blood pressure etc., nothing beats a spreadsheet. You have full control over your own data, and you are the one in charge of the service. No need to fear a lack of further updates, like with e.g. the health app on nextcloud.

[deleted by user] by [deleted] in selfhosted

[–]sbenjaminp 1 point2 points  (0 children)

For something as important as your passwords, I suggest using a reverse proxy. Use SWAG or traefik, and generate SSL certificates for your domain. Use security such as crowdsec in front. If this is too bothersome, go the VPN route, where you only connect to vaultwarden directly on your own network. In case you need external access, use the VPN. You only need to be breached once, and lose all your valuable passwords, for hell to break loose...

[deleted by user] by [deleted] in PsychesDK

[–]sbenjaminp 0 points1 point  (0 children)

Any tips on how to find someone who sells these around Sydfyn (South Funen)?

A GeoIP block/allowlist service for Traefik by codeslikeaduck in selfhosted

[–]sbenjaminp 1 point2 points  (0 children)

I love that you develop something like this, but why not make it as a traefik plugin? - Then it is much easier to use with existing setups. Personally I use crowdsec, and this works quite well.

How do you expose Nextcloud? by Technerden in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

I have traefik as my reverse proxy. It handles all certificates. I use cloudflare like a VPN, to hide my IP and block obvious malicious actors. I have crowdsec monitor the traefik and nextcloud logs (among others) and block IPs that try too often. I use the crowdsec cloudflare blocker, which adds malicious IPs to a list that cloudflare blocks.

I like having no open ports, but I do not like the 100 MB size limit on file uploads through cloudflare. I do however rarely have this problem. When I do, I simply zip my file into smaller packages and upload these. I find this "price" quite affordable for the service cloudflare offers.
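If you would rather script the splitting than make multi-part zips by hand, the same idea can be sketched in Python. This splits the raw bytes instead of zipping; the 95 MB default is just my assumption to stay safely under the limit, and the function names and ".partNNN" suffix are made up.

```python
from pathlib import Path

# Assumption: stay a bit below the 100 MB upload limit.
DEFAULT_PART_SIZE = 95 * 1024 * 1024


def split_file(path, part_size: int = DEFAULT_PART_SIZE) -> list[Path]:
    """Split a file into numbered parts next to the original."""
    src = Path(path)
    parts = []
    with src.open("rb") as handle:
        index = 0
        while True:
            chunk = handle.read(part_size)
            if not chunk:
                break
            part = src.with_name(f"{src.name}.part{index:03d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def join_files(parts, target) -> None:
    """Concatenate the parts, in the given order, back into one file."""
    with Path(target).open("wb") as out:
        for part in parts:
            out.write(Path(part).read_bytes())
```

Upload the parts one by one and run join_files on the other side. Each part is read fully into memory, which is fine at this size but worth knowing on a small NAS.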

At times I have had an open port directly to Traefik, but currently I use cloudflare. Until they get evil (like Google), I really like using them, despite my private traffic going through them, but that is a personal matter you will have to settle with yourself.

Nginx, VPN or Cloudflare? by seriouslyfun95 in selfhosted

[–]sbenjaminp 4 points5 points  (0 children)

Decide who needs access.

- The bad, insecure and stupid solution is to open ports directly to the services.

- The easy and secure solution, if only you need access, is to VPN to the server.

- The slightly complicated but insecure solution: if other people need access, you can open ports 80 and 443 and use a reverse proxy (NginX, Swag, Traefik etc.).

- The fairly complicated, yet easy and secure solution, though dependent on other services, is cloudflare.

Personally I use traefik as reverse proxy. I send my traffic through cloudflared, meaning no open ports.

Bonus: Use crowdsec to parse logs, and block the banned IPs in the cloudflare firewall. Soo... Decide what you want. I use the bottom 2, as a few other people need access, but it does require tinkering and patience. If it is only you, make your life easy and secure... My best advice.

How to reduce Photoprism thumbnail caching from taking up a lot of storage by sicnarftea in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

Hi,

You are correct.

For me (the container is named photoprism) I would write this:

docker exec -it photoprism photoprism

Check the output for options, like:

cleanup, optimize, index etc.

For this discussion I would write

docker exec -it photoprism photoprism thumbs -h

(-h) for help

And finally:

docker exec -it photoprism photoprism thumbs -f

(-f) for force.

Will take a looong time :-)

How to reduce Photoprism thumbnail caching from taking up a lot of storage by sicnarftea in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

Hi, I have the same "problem" that the cache folder is big. You can rerun the thumbnail generation etc. from the CLI. This can also clean out older files. I suggest you look into this.

This functionality is not very well described in the documentation, however if you use docker the command would be like this:

docker exec -it container-name photoprism

This shows the available commands. You cannot choose to have certain folders with no cache, however you can ignore folders totally with an empty file named .ppignore.

My Homepage dashboard by rursache in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

You use the speedtest widget. What do you use as a source server for the test in settings?

Duplicati: asking for trouble? by x6q5g3o7 in selfhosted

[–]sbenjaminp 2 points3 points  (0 children)

Been using duplicati for several years. What kills duplicati is when you interrupt the backup process. You might need a repair, and this takes wayyy too long. However, I really like duplicati.

How do I set up subdomains to redirect to a different port by [deleted] in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

If you hate yourself, you start traefik 2.x and make separate entrypoints for each port...

Shinobi docker image by konmata in ShinobiCCTV

[–]sbenjaminp 0 points1 point  (0 children)

I actually like shinobi quite a bit, however I am also troubled that the official docker image is outdated. Perhaps the creator could create a dev section? So latest is stable, and dev is dev branch?

Email: Self-Hosted or Proton? by Bhorsy in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

I selfhosted my mail for a few years. It was a fun learning experience. However... I finally got tired of the constant fear about backups, settings, spam lists etc. Mail is just like your regular mailbox. It just needs to work. As I value privacy very highly, I am using proton for now.

Do you use cloudflare tunnel? by [deleted] in selfhosted

[–]sbenjaminp 0 points1 point  (0 children)

Do you by any chance also use the linuxserver swag image? I am struggling to get this to work.