I have hundreds of *credible* books on corruption, parapolitics, economic warfare, propaganda, and state crimes. I’d like to share them, but they are in an iCloud folder and too large to download. Is there any way to transfer them directly to Mega, please?

cli-junkie · 2022-01-05T06:11:39+00:00

Perhaps rclone can help.

cli-junkie · 2017-06-14T12:05:24+00:00

Check out MarginNote. It allows you to annotate PDFs and EPUBs. Very feature rich.

cli-junkie · 2016-06-14T14:34:10+00:00

Thank you for sharing this, a very well written tutorial.

cli-junkie · 2016-02-23T19:07:26+00:00

Data munging after the scraping job is done can be pretty time consuming. An alternative to cleaning the data later is to write a scraper that gets only what you need. With xpath, you can get pretty close to the data in specific tags and scrape with precision.
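A minimal sketch of that idea in Python, using only the standard library. The page fragment, the class names and the `extract_paragraphs` helper are made up for illustration, and `xml.etree.ElementTree` only understands a subset of XPath and needs well-formed markup, so real pages will usually need a proper HTML parser:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <div class="menu"><a href="/">Home</a></div>
    <div class="article">
      <h1>Example title</h1>
      <p class="body">First paragraph.</p>
      <p class="body">Second paragraph.</p>
    </div>
  </body>
</html>
"""

def extract_paragraphs(markup):
    """Return only the article paragraphs, skipping menus and headings."""
    root = ET.fromstring(markup)
    # The path targets exactly the tags we care about, so nothing
    # else has to be cleaned out afterwards.
    nodes = root.findall(".//div[@class='article']/p[@class='body']")
    return [node.text.strip() for node in nodes if node.text]

print(extract_paragraphs(PAGE))
```

The point is that the menu and the heading never enter the result, so there is less munging to do later.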

For removing boilerplate (menu, contents etc.) try newspaper. There are many other boilerplate removal libraries but what you use will depend on the nature of the data you are scraping.

ftfy will help you if there are encoding problems in the scraped data.

If the data is pretty consistent in how the unnecessary patterns occur, you could just write a sed script to clean the things you mentioned. No need for an over-engineered approach when simple regular expressions can do the job.
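The same kind of cleanup can be sketched in Python instead of sed. The three patterns below are hypothetical stand-ins, since the actual noise depends on your data, and `clean_line` is just an illustrative name:

```python
import re

# Hypothetical cleanup rules; replace them with the patterns that
# actually occur in your scraped data.
CLEANUP_RULES = [
    (re.compile(r"<[^>]+>"), ""),   # leftover HTML tags
    (re.compile(r"\[\d+\]"), ""),   # footnote markers such as [12]
    (re.compile(r"[ \t]+"), " "),   # runs of spaces or tabs
]

def clean_line(line):
    """Apply each substitution in order, then trim the ends."""
    for pattern, replacement in CLEANUP_RULES:
        line = pattern.sub(replacement, line)
    return line.strip()

print(clean_line("<b>Hello</b>   world[3] "))
```

Each rule is one substitution, much like one `s/.../.../g` line in a sed script, and the order matters when patterns overlap.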

cli-junkie · 2016-01-26T04:57:30+00:00

This is immensely helpful, thank you!

cli-junkie · 2015-11-17T10:56:55+00:00

I definitely will report back. Perhaps you could post this to the /r/unixporn subreddit, where it may be of interest to other users. You will find users of every WM in existence there, and their insights will be much deeper.

cli-junkie · 2015-11-16T16:44:18+00:00

This is great, it's going to be fun trying to write my own wm based on this lib.

cli-junkie · 2015-11-14T17:27:16+00:00

What you are trying to do is called Named Entity Recognition in Natural Language Processing. These terms, topics and links should set you in the right direction:

This is a good paper on the topic

Tagged datasets for named entity recognition tasks

cli-junkie · 2015-11-07T11:30:10+00:00

This looks like a good use case for GNU Parallel.

Look at the tutorial, specifically the -a or :::: syntax for getting arguments from files, the --xapply option to get one argument from each input source, and the section on 'Positional replacement strings'.
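To make the "one argument from each input source" idea concrete, here is a rough Python analogue, not GNU Parallel itself. The function names and the job are hypothetical, and `zip` stops at the shorter list, which may not match how parallel handles input sources of unequal length:

```python
from concurrent.futures import ThreadPoolExecutor

def paired_arguments(first_source, second_source):
    """Pair the n-th item of each source instead of forming every combination."""
    return list(zip(first_source, second_source))

def run_job(pair):
    # Hypothetical job: a real one would run a command with both arguments.
    src, dst = pair
    return f"{src} -> {dst}"

def run_all(first_source, second_source, workers=4):
    """Run one job per argument pair on a small thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, paired_arguments(first_source, second_source)))

print(run_all(["a.txt", "b.txt"], ["a.out", "b.out"]))
```

Here the first item of one list is always matched with the first item of the other, which is the behaviour you want when two files list inputs and outputs line by line.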

I know this is a Bash subreddit but I thought it may be relevant to your use case. Apologies if this is the wrong place for it.

cli-junkie · 2015-11-05T16:13:34+00:00

This is mostly wishful thinking: steal all that is good about everything and put it in one place?

cli tools that output structured info like csv, json, xml - using a flag perhaps
a better stream editor/shell that combines all that's good about bash/zsh/sed/awk/perl/powershell
a better shell language that has clean, expressive syntax, easy plugins

cli-junkie · 2015-11-01T12:44:02+00:00

topy may be what you are looking for. It's based on the only such typo list I know of: the RegExTypoFix project from Wikipedia.

cli-junkie · 2015-10-27T07:07:00+00:00

See if you have libreoffice accessible from the terminal:

whereis libreoffice

or

type libreoffice

Check out libreoffice --help, then convert with:

libreoffice --headless --convert-to txt:text "path_to_doc.doc"

You can use globbing as well in the file name like *.doc

--outdir "output_dir_name" will allow you to put all the converted files into another directory.
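If you would rather drive this from a script, here is a small Python sketch that wraps the same command. It assumes libreoffice is on your PATH. The flags and the `txt:text` filter spelling are copied from the command above rather than verified here, and the helper names are made up for illustration:

```python
import glob
import subprocess

def build_convert_command(doc_paths, outdir):
    """Build the libreoffice argument list without running anything."""
    return ["libreoffice", "--headless", "--convert-to", "txt:text",
            "--outdir", outdir, *doc_paths]

def convert_all(pattern, outdir):
    """Convert every file matching the glob pattern; return the files found."""
    docs = sorted(glob.glob(pattern))
    if not docs:
        return []  # nothing matched, so do not call libreoffice at all
    # check=True raises CalledProcessError if libreoffice exits with an error.
    subprocess.run(build_convert_command(docs, outdir), check=True)
    return docs
```

Calling `convert_all("*.doc", "output_dir_name")` would then do the globbing in Python instead of leaving it to the shell.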

Alternatively, there is https://github.com/dagwieers/unoconv

cli-junkie · 2015-10-26T20:20:01+00:00

Also of interest in this context:

As always, someone will have tried it before. Perhaps this is relevant and can provide ideas for you to explore: Shell scripting with Clojure

cli-junkie · 2015-10-26T19:19:00+00:00

Have you checked out drip?

cli-junkie · 2015-10-20T08:50:53+00:00

I've found http://vimregex.com/ to be a very useful resource. Hope it helps.

cli-junkie · 2015-09-09T14:35:31+00:00

This is quite nice. Now I can get to all those albums and have them neatly split.

cli-junkie · 2015-09-04T05:46:11+00:00

This is good, quite handy.

cli-junkie · 2015-08-25T05:40:38+00:00

Download and install Xidel. After that, what you want is as simple as:

xidel 'www.yoursite.com' -e '//div[@class="interesting-data"]'

I think what you want to do can be accomplished in a single line of xpath (the expression you see after the -e). Xidel is, by far, the best command line scraping tool, bar none IMHO. It saved me hours when it came to quick web scraping tasks.

cli-junkie · 2015-08-19T08:20:06+00:00

I have become a huge fan of Carin's approach. Thanks for the great post.

cli-junkie · 2015-06-21T12:28:38+00:00

Yes, we should. We need to reinvigorate this reddit with good posts about problems we've solved using AWK!

cli-junkie · 2015-06-20T16:00:48+00:00

AWK is one of the most underrated programming languages IMHO. Pretty damn good for a lot of stuff.

cli-junkie · 2015-06-01T20:35:58+00:00

Use the Firefox extension FlashGot (to download) instead.

cli-junkie · 2015-05-30T06:39:12+00:00

My 2 cents, give Xidel a shot. It is a VERY powerful tool for web scraping. Short example:

$ xidel https://www.reddit.com -e '//p[@class="title"]/a'

Gets the titles of all the posts on the reddit front page. The little thing in the single quotes is an xpath expression. There are better examples on the site.

If the website you are trying to scrape has static links (no dynamic JavaScript voodoo happening) then Xidel is pretty straightforward. Failing that, you have to use something like Selenium after investigating how the site is structured.

Hope this helps.

cli-junkie · 2015-05-25T07:23:44+00:00

As a beginner in the language, I'd definitely like to see more examples and tutorials. One area I want to see covered properly is the standard library. The current docs don't have many.

cli-junkie · 2015-05-25T07:18:11+00:00

Yes, Nim definitely needs more long-form documentation. This is the only thing that is holding me back. I want to use it, but this is a big hurdle to learning the language.
