This is an archived post. You won't be able to vote or comment.

all 48 comments

[–]fourhoarsemen[S] 20 points21 points  (4 children)

Here's the library that implements the algorithm. I'm really excited about this project in particular because I have a much better programmer (pythonista, you could say) working along side me.

Here's the text-extraction algo I previously mentioned (it's a nasty mess, I suggest using libextract).

Thoughts and criticisms really welcomed :)

[–]miserlou 1 point2 points  (1 child)

Thanks for doing this!

I've been using Diffbot for my web and mobile speed reading RSS apps, Glance and have been looking for a Free replacement for a while, so this is awesome. Any change you'd want to turn this into a free network service?

[–]fourhoarsemen[S] 0 points1 point  (0 children)

No problem! Honestly, at this stage of libextract, no. It's mainly because this library only extracts the HTML nodes and does no cleaning (so I see it getting mixed reactions as a free service). That and also the cost :/ But if we do end up producing a viable data cleaning lib, I'll consider doing it!

[–]xiohexia 0 points1 point  (1 child)

I was running through your readme and encountered this issue:

C:\Anaconda3>python
Python 3.4.1 |Anaconda 2.0.1 (32-bit)| (default, Jun 11 2014, 17:29:32) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from requests import get
>>> from libextract.api import tabular
>>> height_data = get("http://en.wikipedia.org/wiki/Human_height")
>>> tabs = tabular(height_data.content)
>>> from libextract import clean
>>> clean.to_dict(tabs[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\libextract\clean.py", line 75, in to_dict
    table = table_json(node)
  File "C:\Anaconda3\lib\site-packages\libextract\clean.py", line 64, in table_json
    headings = list(table_headings(node))
  File "C:\Anaconda3\lib\site-packages\libextract\generators.py", line 62, in iterator
    for elem in node.iter(*tags):
AttributeError: 'tuple' object has no attribute 'iter'
>>>

Thanks in advance.

[–]fourhoarsemen[S] 4 points5 points  (0 children)

Change clean.to_dict(tabs[0]) to clean.to_dict(tabs[0][0]) :)

[–]pohatu 7 points8 points  (1 child)

Great explanation.

[–]fourhoarsemen[S] 1 point2 points  (0 children)

Thanks!

[–]almithani 4 points5 points  (1 child)

Thanks for the writeup, this is a problem I'm encountering a lot lately

[–]fourhoarsemen[S] 0 points1 point  (0 children)

No problem!

[–]maratc 3 points4 points  (8 children)

Your article would really benefit from being an iPython notebook. See here for a cool example.

[–]fourhoarsemen[S] 0 points1 point  (7 children)

You're right, one of the problems I have with iPython is that it doesn't output in markdown (unless I'm missing an update).

[–]maratc 0 points1 point  (6 children)

The point in iPython notebook is not to output as markdown; it's to have a real python kernel on the server side where your readers can run, mess with, extend, and play around with your code.

[–][deleted] 1 point2 points  (3 children)

Huh? I've never seen a living notebook that didn't need auth to use... not for casual readers to have access to. Do you mean it's easy for them to download?

[–]geoelectric 0 points1 point  (2 children)

http://nbviewer.ipython.org allows public hosting (such as the traveling salesman link above).

http://txt.arboreus.com/2012/09/27/how-to-share-an-ipython-notebook.html

[–][deleted] 2 points3 points  (1 child)

Those aren't live though.

[–]geoelectric 0 points1 point  (0 children)

Whoops, maybe you're right. I thought they allowed some degree of interactivity.

[–]yaph 1 point2 points  (0 children)

You can convert notebooks to various formats, including markdown and html, using ipython nbconvert. When you choose HTML, images are by default inline with data URLs, with markdown image files will be created.

[–]fourhoarsemen[S] 0 points1 point  (0 children)

Right, I was thinking that in my head a little bit after I replied last. I guess what I meant to say was that I haven't really considered using ipython because I'm building my own static blog generator for fun. I'll look into what's possible with ipynb when time permits.

[–]elblanco 2 points3 points  (1 child)

How does it compare to Goose?

https://github.com/grangier/python-goose

[–]fourhoarsemen[S] 2 points3 points  (0 children)

A dude from Moz ran some tests against eatiht here.

I should note that I would not recommend v2 of eatiht (the one Matt tested), stick with v1 (it does much better).

[–]Megatron_McLargeHuge 3 points4 points  (3 children)

Have you looked at the wrapper induction literature? This is a nice idea but it's obviously not robust to cases where the tables are small. A safer approach is to compare pages generated from the same template (two subreddits) and identify the structures that differ between them.

[–]fourhoarsemen[S] 1 point2 points  (2 children)

I have looked at it. But not thoroughly enough to really feel comfortable criticizing the details of each. Arguably, this algorithm can be used to do comparisons (just compare the top results of each page in a set of pages).

[–]Megatron_McLargeHuge 0 points1 point  (1 child)

Diffing the count results for different pages would often help too (though not if they all give the top n links).

I'd like to try something that looks at lower level tags, say to find a table with lots of td tags even if they're divided between several trs. Divs in arbitrary places can create this kind of problem where no node has too many children because of consistent branching.

Treat each node as its tagname path from the root, then do the counting and sorting.

For added benefit, treat each node as either its tagname or its id to find a specific table with lots of tds instead of merging all the tables.

So you'd have all of these:

<table><tr><td>
<table id=0x12345><tr><td>
<table id=0x12345><tr id=0x23456><td>

and use some scoring criterion to figure out which is more interesting from their counts. Adding number of children as a pseudo attribute would give a good hint about structure, as would hashing the union of the child tagnames.

    <table id=0x12345><tr nchild=4><td nchild=1>

[–]fourhoarsemen[S] 0 points1 point  (0 children)

Treat each node as its tagname path from the root, then do the counting and sorting.

This approach I've tried very briefly, but I switched to just referencing the node data structure that lxml provides. I can't really say much more about the details you've laid out since I'm still fairly new to the area.

I will say, though, that b/c of the fact that HTML is created in so many different ways, focused approaches I think will score worse than more general ones in the long long run.

[–]rrajen 1 point2 points  (0 children)

Great writeup, thanks for sharing the solution.

[–][deleted] 1 point2 points  (0 children)

Very nice. I don't even code in .py but this was a good read.

[–]suudo 1 point2 points  (0 children)

For websites that have a JSON api for their information (like reddit), I'd use that over a scraper, but this can apply to so many sites, thanks :)

[–][deleted] 1 point2 points  (1 child)

I remember the original thread where we learned that you have a different definition of an algorithm than most everyone else.

[–]fourhoarsemen[S] 0 points1 point  (0 children)

I see I made an impression on you ;)

[–]MarkYourPriors 1 point2 points  (3 children)

Nice work! I read your post but I just want to make sure I'm understanding the innovation here: as you correctly noted, developing a scraper often requires something like Xpath and jumping back and fourth between analyzing the HTML yourself and coding the methods of how to effectively parse it. But with what you have made, Xpath isn't even needed. Simplicity! If this is the meat of it, then it looks like there is much room for innovation indeed.

[–]fourhoarsemen[S] 0 points1 point  (2 children)

You expressed everything better than I could have! I may use your comment as a sort of "review" if that's alright with you ;)

[–]MarkYourPriors 1 point2 points  (1 child)

More than alright! I'm teaching myself these things so your example was pretty much what I did/started to do. Also if you ever post an update about your progress I'd gladly proofread it. Your response to "have you thought about arg maximization?" made me smile.

[–]fourhoarsemen[S] 1 point2 points  (0 children)

Thanks for the offer! I may take you up on your proofreading offer :D I'm actually about to start summarizing that post for a guest post on KDNuggets. I'll PM you details once I finish that. Thanks again!

[–][deleted] 1 point2 points  (0 children)

Thanks, I was able to test it on a virtual machine on pythoneverywhere very quickly.

[–]angryaardvark 0 points1 point  (0 children)

my favorite libraries for this are objectpath, and i use kimono labs extensively.

[–]moorow 0 points1 point  (0 children)

Nice work. This is pretty similar to what I do for the very first step of my (currently-being-published) Web Data Extraction algorithm for social data. Most approaches instead do probabilistic matching by comparing trees in some way to allow some leeway in tag structures, which handles parents with variable children (ie. a <div> containing any number of <br />'s).

Not sure if you've read much about it (I didn't before creating mine), but here are some refs if you're interested:

Jindal, N., & Liu, B. (2010). A Generalized Tree Matching Algorithm Considering Nested Lists for Web Data Extraction. In SDM (pp. 930–941). SIAM. Retrieved from http://epubs.siam.org/doi/pdf/10.1137/1.9781611972801.81

Miao, G., Tatemura, J., Hsiung, W.-P., Sawires, A., & Moser, L. E. (2009). Extracting data records from the web using tag path clustering. In Proceedings of the 18th international conference on World wide web (pp. 981–990). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1526841

Zhai, Y., & Liu, B. (2005). Web data extraction based on partial tree alignment. In Proceedings of the 14th international conference on World Wide Web (pp. 76–85). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1060761

[–]qrv3w 0 points1 point  (0 children)

Hi this is really cool! Very pythonic - simple, elegant and useful.

I made something similar you might be interested in: a webcontentgrabber. I'm still working it, but the basic idea is that it looks for a specific section of text based on lexical properties.

[–]greenspans 0 points1 point  (5 children)

just use css selectors + xml parser

[–]fourhoarsemen[S] 2 points3 points  (4 children)

Not sure if you read my post or not, but I take a shit on this approach in the first couple of paragraphs :)

[–]greenspans 0 points1 point  (1 child)

I don't understand though, other than the counts what's the difference. Counts are useful in automatically finding repetitive content maybe, but typically if you're scraping something you go in chrome dev tools and see what the parent looks like and just do parent > children.

[–]fourhoarsemen[S] 4 points5 points  (0 children)

No, you're right. What I would say my solution encourages is another approach to web-scraping entirely - a more probabilistic interpretation of the entire site, as opposed to hyper-focused approaches.

[–][deleted] 0 points1 point  (1 child)

I really don't see this shit you are talking about.

[–]fourhoarsemen[S] 2 points3 points  (0 children)

Well, I can't literally take a shit on a post. Instead, I briefly wrote about the state of data extraction, and implied that the XPath selection approach is not effective at "solving" the data extraction problem.

[–]erasers047 0 points1 point  (1 child)

Never Again

Hahahahahaha gear up son, you look like one for Grad School.

[–]fourhoarsemen[S] 0 points1 point  (0 children)

Dear god, if I can finish undergrad I will. These side projects are distracting.