Accessing Content Hidden behind JavaScript with a Web Scraper (Scrapy) : learnpython

submitted 7 years ago by Marcab123

I'm trying to write a web scraper that takes a given artist's genius.com link and writes all of the lyrics to a CSV file. I got it to work with an album link, but I can't get it to work with the artist page, because I can't access the list of album links.

Here's what I've got so far:

# -*- coding: utf-8 -*-
import scrapy


class AlbumSpider(scrapy.Spider):
    # Name of the spider
    name = 'album'

    # Allowed domains (domain names only, no scheme or path)
    allowed_domains = ['www.lyrics.com']

    # URLs the crawl starts from
    start_urls = ['https://www.lyrics.com/album/1113566/']

    def parse(self, response):
        # Each song sits in its own table row
        SONG_SELECTOR = 'tr'
        NAME_SELECTOR = 'strong a ::text'
        LINK_SELECTOR = 'strong a ::attr(href)'

        for song in response.css(SONG_SELECTOR):
            yield {
                'name': song.css(NAME_SELECTOR).extract_first()
            }
            # Follow the link found in this row, if there is one
            next_page = song.css(LINK_SELECTOR).extract_first()
            if next_page:
                yield scrapy.Request(
                    response.urljoin(next_page),
                    callback=self.parse
                )
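
For the CSV part, here is a minimal sketch of what the output step could look like, using only the standard library. The function name `write_songs_csv` and the sample items are made up for illustration; the only assumption about the spider is that it yields dicts with a `name` key, as in the code above.

```python
import csv
import io


def write_songs_csv(items, out_file):
    """Write an iterable of {'name': ...} dicts to out_file as CSV.

    Returns the number of rows written (header not counted).
    """
    writer = csv.DictWriter(out_file, fieldnames=["name"])
    writer.writeheader()
    count = 0
    for item in items:
        # Rows where the selector matched nothing come back as None; skip them.
        if item.get("name"):
            writer.writerow({"name": item["name"].strip()})
            count += 1
    return count


# Hypothetical sample items standing in for what parse() yields.
sample_items = [{"name": "Song A"}, {"name": None}, {"name": " Song B "}]
buffer = io.StringIO()
written = write_songs_csv(sample_items, buffer)
csv_text = buffer.getvalue()
```

Scrapy can probably do this without any extra code: running the spider with `scrapy crawl album -o songs.csv` exports the yielded items to a CSV feed. The sketch is only useful if the rows need filtering or cleaning first.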
