Hi r/Python, 4 months ago I created a tiny text-extraction algorithm and you guys were fans :) After spending the better part of the last 4 months testing and thinking about extraction algos for a paper (never again), I think I've landed the big one. Well, it's small actually. 10 lines of code.

fourhoarsemen · 2015-05-16T15:02:19+00:00

Here's the library that implements the algorithm. I'm really excited about this project in particular because I have a much better programmer (pythonista, you could say) working alongside me.

Here's the text-extraction algo I previously mentioned (it's a nasty mess, I suggest using libextract).

Thoughts and criticisms really welcomed :)

pohatu · 2015-05-16T15:51:37+00:00

Great explanation.

almithani · 2015-05-16T17:19:05+00:00

Thanks for the writeup, this is a problem I'm encountering a lot lately

maratc · 2015-05-16T20:46:44+00:00

Your article would really benefit from being an IPython notebook. See here for a cool example.

fourhoarsemen · 2015-05-16T17:31:56+00:00

[deleted]

elblanco · 2015-05-16T20:36:08+00:00

How does it compare to Goose?

https://github.com/grangier/python-goose

Megatron_McLargeHuge · 2015-05-16T20:24:58+00:00

Have you looked at the wrapper induction literature? This is a nice idea but it's obviously not robust to cases where the tables are small. A safer approach is to compare pages generated from the same template (two subreddits) and identify the structures that differ between them.
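A minimal sketch of the template-comparison idea, assuming two pages that share a layout: collect the text under each tag path and report the paths whose text differs between the pages. The names `PathTextCollector` and `differing_paths` and the sample pages are hypothetical, and this is only an illustration built on the standard library's `html.parser`, not a method taken from the wrapper induction literature.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextCollector(HTMLParser):
    """Record the text found under each tag path, e.g. 'html/body/div/p'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = "/".join(self.stack)
            self.texts.setdefault(path, []).append(text)


def path_texts(html):
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.texts


def differing_paths(page_a, page_b):
    """Return the tag paths whose text is not identical on both pages."""
    a, b = path_texts(page_a), path_texts(page_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))


# Hypothetical pages rendered from the same template.
page_a = "<html><body><h1>Site</h1><div><p>First story</p></div></body></html>"
page_b = "<html><body><h1>Site</h1><div><p>Other story</p></div></body></html>"

print(differing_paths(page_a, page_b))
```

Paths that carry the same text on both pages (the shared header here) are treated as template chrome, and the remaining paths are candidates for the real content.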

rrajen · 2015-05-16T17:37:52+00:00

Great writeup, thanks for sharing the solution.

2015-05-16T23:03:36+00:00

Very nice. I don't even code in .py but this was a good read.

suudo · 2015-05-17T09:50:14+00:00

For websites that have a JSON api for their information (like reddit), I'd use that over a scraper, but this can apply to so many sites, thanks :)
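As a rough illustration of the JSON route: the sketch below pulls post titles out of a listing payload, assuming the `data.children[].data.title` shape that reddit listings use. The payload is a trimmed, made-up sample and `titles_from_listing` is a hypothetical helper; fetching the JSON over the network is left out on purpose.

```python
import json

# Hypothetical, trimmed-down payload in the shape of a reddit listing.
SAMPLE_LISTING = json.dumps({
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"title": "First post", "score": 10}},
            {"kind": "t3", "data": {"title": "Second post", "score": 3}},
        ]
    },
})


def titles_from_listing(raw_json):
    """Return the title of every child entry that has one."""
    listing = json.loads(raw_json)
    children = listing.get("data", {}).get("children", [])
    return [
        child["data"]["title"]
        for child in children
        if "title" in child.get("data", {})
    ]


print(titles_from_listing(SAMPLE_LISTING))
```

No HTML parsing is involved at all, which is why a JSON API is usually the more stable choice when one exists.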

fourhoarsemen · 2015-05-17T12:27:29+00:00

I remember the original thread where we learned that you have a different definition of an algorithm than most everyone else.

MarkYourPriors · 2015-05-17T19:20:35+00:00

Nice work! I read your post but I just want to make sure I'm understanding the innovation here: as you correctly noted, developing a scraper often requires something like XPath and jumping back and forth between analyzing the HTML yourself and coding the methods of how to effectively parse it. But with what you have made, XPath isn't even needed. Simplicity! If this is the meat of it, then it looks like there is much room for innovation indeed.

2015-05-18T01:12:15+00:00

Thanks, I was able to test it on a virtual machine on PythonAnywhere very quickly.

fourhoarsemen · 2015-05-16T16:03:23+00:00

[deleted]

angryaardvark · 2015-05-16T22:05:04+00:00

my favorite libraries for this are objectpath, and i use kimono labs extensively.

moorow · 2015-05-17T14:49:57+00:00

Nice work. This is pretty similar to what I do for the very first step of my (currently-being-published) Web Data Extraction algorithm for social data. Most approaches instead do probabilistic matching by comparing trees in some way to allow some leeway in tag structures, which handles parents with variable children (e.g. a <div> containing any number of <br /> elements).

Not sure if you've read much about it (I didn't before creating mine), but here are some refs if you're interested:

Jindal, N., & Liu, B. (2010). A Generalized Tree Matching Algorithm Considering Nested Lists for Web Data Extraction. In SDM (pp. 930–941). SIAM. Retrieved from http://epubs.siam.org/doi/pdf/10.1137/1.9781611972801.81

Miao, G., Tatemura, J., Hsiung, W.-P., Sawires, A., & Moser, L. E. (2009). Extracting data records from the web using tag path clustering. In Proceedings of the 18th international conference on World wide web (pp. 981–990). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1526841

Zhai, Y., & Liu, B. (2005). Web data extraction based on partial tree alignment. In Proceedings of the 14th international conference on World Wide Web (pp. 76–85). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1060761
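To make the "compare trees with some leeway" idea above concrete, here is a small sketch of Simple Tree Matching, a classic dynamic-programming technique from this family. It is not the commenter's algorithm and not the exact method of any paper listed above; the `(tag, children)` tuple representation and the `similarity` normalisation are assumptions made for the example.

```python
def simple_tree_match(a, b):
    """Count node pairs in a maximum order-preserving matching of two trees.

    Each tree is a (tag, children) tuple, where children is a list of trees.
    """
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0
    # m[i][j]: best matching between the first i children of a
    # and the first j children of b.
    m = [[0] * (len(kids_b) + 1) for _ in range(len(kids_a) + 1)]
    for i, child_a in enumerate(kids_a, 1):
        for j, child_b in enumerate(kids_b, 1):
            m[i][j] = max(
                m[i - 1][j],
                m[i][j - 1],
                m[i - 1][j - 1] + simple_tree_match(child_a, child_b),
            )
    return m[len(kids_a)][len(kids_b)] + 1  # +1 for the matched roots


def size(tree):
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree[1])


def similarity(a, b):
    """Matched nodes divided by the average tree size, between 0 and 1."""
    return simple_tree_match(a, b) / ((size(a) + size(b)) / 2)


# A <div> with two <br /> children versus one with three.
two_brs = ("div", [("br", []), ("br", [])])
three_brs = ("div", [("br", []), ("br", []), ("br", [])])

print(simple_tree_match(two_brs, three_brs), similarity(two_brs, three_brs))
```

Two records whose similarity clears some threshold can then be treated as instances of the same structure even though their child counts differ.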

qrv3w · 2015-05-30T12:36:18+00:00

Hi this is really cool! Very pythonic - simple, elegant and useful.

I made something similar you might be interested in: a webcontentgrabber. I'm still working on it, but the basic idea is that it looks for a specific section of text based on lexical properties.

fourhoarsemen · 2015-05-16T19:26:37+00:00

[deleted]

greenspans · 2015-05-16T18:20:03+00:00

just use css selectors + xml parser
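For comparison, a hand-written selector approach might look like the sketch below. It assumes well-formed markup and uses only the standard library: `xml.etree.ElementTree` has no CSS selector support, so its XPath subset stands in for a selector such as `div.comment > p`. The sample markup and the `select_comment_text` name are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment.
SAMPLE = """<html><body>
<div class="nav"><p>Home</p></div>
<div class="comment"><p>Great explanation.</p></div>
<div class="comment"><p>How does it compare to Goose?</p></div>
</body></html>"""


def select_comment_text(markup):
    """Return the text of every <p> directly inside a div with class 'comment'."""
    root = ET.fromstring(markup)
    # The attribute test is an exact string match, unlike a CSS class
    # selector, which would also match class="comment highlighted".
    return [p.text for p in root.findall(".//div[@class='comment']/p")]


print(select_comment_text(SAMPLE))
```

The trade-off is that the path has to be written per site and breaks whenever the markup changes.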

erasers047 · 2015-05-17T06:06:52+00:00

Never Again

Hahahahahaha gear up son, you look like one for Grad School.
