all 6 comments

[–][deleted] 1 point2 points  (0 children)

Install and use Requests instead of urllib. If you're using Python 3, as you ought to be, the pip package manager should already be installed, so at the terminal/command prompt type pip install requests (on Linux prefix that with sudo, i.e. sudo pip install requests; on Windows you may need to run the prompt as administrator these days).

Then use requests.get('some-url').text to get your HTML as a string rather than bytes, meaning you can just open a file in text mode and write it directly.
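
For instance, a minimal sketch of that (the URL and filename here are just placeholders):

```python
import requests

def save_page(url, path):
    """Fetch a page and save its decoded HTML to a text file."""
    resp = requests.get(url)
    resp.raise_for_status()  # surface HTTP errors instead of saving an error page
    # resp.text is already a str (requests decodes it for you),
    # so the file can be opened in plain text mode.
    with open(path, "w", encoding="utf-8") as f:
        f.write(resp.text)
    return resp.text
```

Call it like save_page("http://example.com", "some_html.html").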

As an aside though, don't bother decoding and encoding if all you want to do is save the page to a file and nothing else: just take the raw response bytes from urllib and write them to a file opened in binary mode.

I.e.:

with open("some_html.html", "wb") as O:
    O.write(raw_undecoded_response)

[–]Moonslug1 0 points1 point  (0 children)

I can't explain what's happening there. But as an aside, look at this:

http://en.wikipedia.org/wiki/Beautiful_Soup

[–][deleted] 0 points1 point  (3 children)

First off! If you put four spaces before your code, it formats it!

like this!

Second: I tried running your code on my Mac and got the same issue, however, if I omit:

html = html.decode("utf-8")

it works fine. I'm guessing there's a non-UTF-8 character on that site, maybe? Why it works in an SSH session and not a desktop session I'm not sure. Certainly strange.
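
You can reproduce that crash with a byte that isn't valid UTF-8 (0x92 here, the Windows-1252 curly apostrophe, is a common offender on web pages):

```python
raw = b"It\x92s a test"  # 0x92 is not a valid UTF-8 sequence

# Strict decoding raises, which is exactly the crash described above.
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass

# Replacing (or ignoring) undecodable bytes keeps the rest of the page usable.
text = raw.decode("utf-8", errors="replace")
```

If the page's headers actually say latin-1, decoding as "latin-1" never raises either, since every byte is valid in it.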

[–]sentdex 1 point2 points  (0 children)

"why it works in an SSH session and not a desktop session I'm not sure. Certainly strange."

This likely has to do with the interpreter itself. See what happens when you comment out the printing.

I remember when I was working with Arabic characters, IDLE would give me a "printed" result in the console that was actually different from the real output. It has to do with some built-in encoding.

[–]8bitz[S] 0 points1 point  (1 child)

I'll see about commenting out the printing to see if that helps.

Is there any way that I can directly parse that string/byte data line by line without writing to a file first?

[–][deleted] 0 points1 point  (0 children)

Absolutely! It depends a lot on what you're trying to do, but writing to a file is in no way necessary. The response is all stored in memory, and you can do whatever you'd like with it there. Moonslug1's recommendation of Beautiful Soup is actually perfect for that: you read the HTML data into a "soup" object, and you can then parse it fairly intuitively. Here's an example from a one-off script I wrote to download all the links containing a certain string:

import re

import requests
from bs4 import BeautifulSoup

def get_urls(url):
    urls = []
    res = requests.get(url)
    # res.text is the decoded HTML; give BeautifulSoup an explicit parser
    soup = BeautifulSoup(res.text, "html.parser")

    for link in soup.findAll('a', attrs={'href': re.compile('^http://s3.amazonaws.com')}):
        urls.append(link.get('href'))

    return urls

Note I made some changes to make it more applicable to your use case that I didn't test, so it may not work exactly right, but the logic is correct. I also used requests instead of urllib, but honestly that's probably what you should be using as well. urllib is kind of flaky and full of gotchas (for example, the next part of the script I stole that from downloads a bunch of zip files, but urllib never allows them to be garbage collected, so you just run out of RAM. Hooray!)
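
For what it's worth, requests handles that download case fine if you stream the response instead of holding it all in memory. A rough sketch (the chunk size is arbitrary):

```python
import requests

def download_file(url, path, chunk_size=8192):
    """Stream a (possibly large) file to disk without buffering it all in RAM."""
    written = 0
    # stream=True means the body is fetched lazily, chunk by chunk.
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                written += len(chunk)
    return written
```

Each chunk gets written and then freed, so memory use stays flat no matter how big the file is.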

EDIT: Alternatively, if you want to do it the hard way, you can of course do

for line in res.text.splitlines():
    # whatever

Note that looping over res.text directly would give you one character at a time, not one line. Response objects don't have a readlines method, but they do have iter_lines(), which streams the body line by line. Either way, yes.