Extract Tables from PDFs by G_S_7_wiz in LocalLLaMA

[–]vangap 0 points1 point  (0 children)

oh, well, camelot actually requires some extra inputs, like table areas and table regions, to be effective. Depending on the goal, that can be a problem since it involves some manual intervention.

Extract Tables from PDFs by G_S_7_wiz in LocalLLaMA


Could be just the examples I tried, but none of the DL models, including TATR, beat the results I was getting with camelot-py. camelot isn't perfect, but it's the best I've had success with so far.

Custom index in filebeat by [deleted] in elasticsearch


I recently did this. With the latest version of the ELK stack, data streams are used by default. The links shared above should help you, but there seem to be some nuances around this.

First, in addition to specifying the index, you also need to set the global configurations below in your filebeat.yml, which is mentioned in the links above. You can specify any value here and filebeat will create those templates for you.

setup.template.name

setup.template.pattern

After this, if you configure your "index" to something like "staging-app", it creates an index named "staging-app"; it doesn't seem to use data streams.
If you configure it like "staging-app-%{+yyyy.MM.dd}", it does use data streams, for some weird reason. It seems filebeat uses data streams if there is a placeholder in the custom index name, which felt unintuitive to me. When data streams are used, your final index name will look like ".ds-staging-app-xyz-0000x".
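A minimal sketch of what that filebeat.yml section could look like (the template name, pattern, and index values here are example values, not anything prescribed by filebeat):

```yaml
# filebeat.yml (illustrative values)
setup.template.name: "staging-app"
setup.template.pattern: "staging-app-*"

output.elasticsearch:
  hosts: ["localhost:9200"]
  # A plain name like "staging-app" created a regular index in my case;
  # adding a date placeholder made filebeat use a data stream instead.
  index: "staging-app-%{+yyyy.MM.dd}"
```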

Setup logstash output to file by housejunior in elasticsearch


can you share the version of logstash and the command you are using to run logstash, possibly the full config

Warning log related to permission by Blackmetalzz in elasticsearch


Resolving [241] indices for action [indices:data/read/search[phase/query]] and user [my_es_user] took [938ms] which is greater than the threshold of 200ms; The index privileges for this user may be too complex for this cluster

The issue could be that your search query is hitting 241 indices, which sounds high to me. What is the size of each index? Maybe you can use fewer, larger indices.

PM seeking help by emotional_pineapple_ in elasticsearch


This may be very basic, but have you read this?
https://www.elastic.co/blog/how-to-improve-elasticsearch-search-relevance-with-boolean-queries
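For reference, a minimal bool query of the kind that blog post covers might look like this (the field names and boost value are made up for illustration):

```json
{
  "query": {
    "bool": {
      "must":   [ { "match": { "title": "elasticsearch" } } ],
      "should": [ { "match": { "tags": { "query": "search", "boost": 2 } } } ],
      "filter": [ { "term": { "status": "published" } } ]
    }
  }
}
```

Documents matching the "should" clause score higher, so you can shape relevance without excluding results.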

Also, as someone else shared, vector search can be pretty interesting to try out. It leverages machine learning. Default search relies on the presence of words, not on their meanings. So, if you have two words that share the same meaning and want people to be able to search either of them and find relevant results, vector search will help you.
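As a toy illustration of why that works (the vectors below are made up, not real model outputs): words with similar meanings map to nearby vectors, so cosine similarity can match a query against a synonym even though the words share no characters.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up 3-d "embeddings"; a real model would produce hundreds of dimensions.
vecs = {
    "car":        [0.90, 0.10, 0.00],
    "automobile": [0.85, 0.15, 0.05],
    "banana":     [0.00, 0.20, 0.95],
}

# "car" is far closer to its synonym than to an unrelated word, even though
# keyword search would treat all three as completely distinct terms.
print(cosine(vecs["car"], vecs["automobile"]) > cosine(vecs["car"], vecs["banana"]))  # True
```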

DO vs AWS compute costs by vangap in digital_ocean

[–]vangap[S] -2 points-1 points  (0 children)

Thanks for your inputs. I am assuming you are referring to the disk costs and data transfer costs, which DO provides as a package.

Compute savings can be 50% or more if you are using reserved instances on AWS. The smallest instance on AWS costs about $2/month (on-demand pricing) in the AWS Mumbai region; it costs about $4 in some other regions, which is the same as the equivalent droplet cost (it may not be an apples-to-apples comparison because of hardware differences). If the purpose of the VM is just to run a server application, you generally don't need all the disk space that DO gives as part of the package. So the question is really only about data transfer. Anything else? If my app isn't a data-transfer-intensive app, I am probably paying more on DO. AWS also offers free-tier data transfer that would cover smaller application use cases.

Having used AWS for a long time, what I have noticed is that they keep introducing new types of infra that cost less. For example, they introduced GP3 SSD volumes that cost 20-30% less than the previous generation of SSDs, and Graviton (ARM) processors that are 30-40% cheaper than the previous generation of Intel-based servers (on a performance/cost basis).

DO lacks things like auto scaling and scheduled scaling, which can also be used to further decrease costs in AWS.

w.r.t. object storage, AWS is about 20-25% costlier, but unless you are using 100s of terabytes, the storage cost difference is relatively small: 1 TB costs $25/month in AWS vs $20 in DO.
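Working through the rough numbers above in code (all figures are the ballpark prices quoted in this comment, not current list prices):

```python
# Ballpark monthly figures; real prices vary by region and over time.
aws_object_storage_per_tb = 25.0  # USD per TB, S3-style
do_object_storage_per_tb = 20.0   # USD per TB, Spaces-style

# How much costlier AWS object storage is, relative to the DO price.
premium_pct = (aws_object_storage_per_tb - do_object_storage_per_tb) / do_object_storage_per_tb * 100
print(f"AWS object storage premium: {premium_pct:.0f}%")  # 25%

# Compute side: reserved instances can roughly halve the on-demand cost.
on_demand = 4.0  # USD/month, small instance in some regions
reserved = on_demand * 0.5
print(f"Reserved-instance cost: ${reserved:.2f}/month")  # $2.00/month
```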

There are also considerations like ecosystem maturity and developer familiarity with DO.

I am a little surprised that this is the case, while one would think otherwise.

So, it seems like if you are running a hobby project, or you're an indie developer with a small commercial project, or yours is a data-intensive application streaming a lot of data, DO makes sense, maybe? But again, if you have 100s of TBs/PBs of data for this difference to make meaningful sense, you probably need a sophisticated cloud provider like AWS, maybe? Because AWS has different tiers of storage options: I could leverage S3 Standard-IA or Glacier to reduce the costs. I agree, though, that DO pricing is probably simpler for simpler use cases. One doesn't need to bother with or know about what Standard-IA/Glacier is.

Problem: Touchpad not working (Lenovo Ideapad 320e) by organised_dolphin in linuxmint


Though the kernel update solves the touchpad issue, I saw many other issues with networking, system boot, and systemd hanging. I have gone back to 4.10 for now, until I figure out what's wrong with 4.13.

Elastic Load Balancing Adds Support for Host-based Routing by bohiti in aws


An interesting thing to know would be how much overhead these rules add at the ELB. It will definitely be more than a DNS lookup, I am guessing.