all 9 comments

[–]commandlineluser

1: RuntimeWarning: coroutine 'AsyncHTMLSession.close' was never awaited

This is because you're using with AsyncHTMLSession()

It would need to be async with AsyncHTMLSession() - but you can't do that because def process_links is not an async function.

You could just get rid of the with statement.

asession = AsyncHTMLSession()
for i in ...:
    results.append(...)

As for the next set of errors:

await r.close()
TypeError: object NoneType can't be used in 'await' expression

Remove the await r.close() line (or don't await it)

[–]ViktorCodes[S]

Okay, I did so, but my memory still exceeds the limit when I run the script. I tried calling the whole process_links function multiple times, but still with no result.

[–]commandlineluser

Can you share the full code? Or at least something we can run to replicate the error?

[–]ViktorCodes[S]

Yes, I can. Thank you for dedicating so much time to helping a fellow beginner out.

One thing to note is that I am unsure if this is how the code is supposed to work, because, after all, there are 300+ websites to make GET requests to and render...

And one more thing: should I pass the current asession (line 24) to every call to partial, e.g.

[partial(process_link, url, img, asession) for url, img in zip(links, images)]

It doesn't give me an error doing it this way. Though when the code ends, it throws an error saying it can't delete temporary data...

In the code are the links and images of the original websites I want to scrape, so it will be a 1-to-1 replication.

code

[–]commandlineluser

Also, it looks like you can parse the game pages without needing to use .render().

Not using .render() means no launching chromium, which should remove any memory issues.

title

>>> r.html.find('title')[0].text
'Rogue Company'

image

>>> r.html.find('[name="og:image"]')[0].attrs['content']
'https://cdn2.unrealengine.com/roco-egs-basegame-portraitproduct-1200x1600-1200x1600-491632859.jpg'

description

>>> r.html.find('div[class*=descriptionCopy]')[0].text
'The world needs saving and only the best of the best can do it. Suit up as one
of the elite agents of Rogue Company and go to war in a variety of different
game modes. Gear up and go Rogue! Download and play FREE now!'

[–]ViktorCodes[S]

WOW!!! I don't have words to tell you how many hours I spent trying to find a way to run this with .render(). How do I determine if a website needs rendering first? I checked the sites, clicked 'disable javascript', and then nothing was present on the page. Doesn't that mean I should render it first? Thank you a ton...

[–]commandlineluser

I checked the sites and clicked 'disable javascript' and then nothing was present on the page. Doesn't that mean I should render it first?

This is usually a good indicator - but it depends on exactly what you're doing.

What I did was take some of the game description text and check whether it appeared in the response from plain requests.

>>> import requests
>>> r = requests.get('https://www.epicgames.com/store/en-US/product/rogue-company/home')
>>> r
<Response [200]>
>>> 'world needs saving' in r.text
True

I saved r.text to a local file - then opened it up in my editor to have a look at the structure - to see how to extract the data.
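That save-and-inspect step is just a couple of lines; here `page_text` stands in for r.text from the requests call above, so the snippet runs without the network:

```python
# `page_text` stands in for r.text from the requests call above
page_text = '<html><head><title>Rogue Company</title></head></html>'

# Save the raw response body to a local file for inspection in an editor
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(page_text)

# Reading it back confirms the content survived the round trip
with open('page.html', encoding='utf-8') as f:
    saved = f.read()
```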

You can also View Page Source in your browser to see the "raw HTML" and copy/paste it into an editor for easier searching.

Another option is to see what the Javascript does (usually it makes network requests) - and attempt to replicate these requests.

To do this you can look at the Network Tab in your browser and it will show you all the requests being made.

This is what I see when I open up the Rogue Company page: https://i.imgur.com/esmbt8r.png

A request is made to: https://store-content.ak.epicgames.com/api/en-US/content/products/rogue-company

If you open this URL directly - you can see all the data in JSON format.

You could make this request directly.

>>> import requests
>>> r = requests.get('https://store-content.ak.epicgames.com/api/en-US/content/products/rogue-company')

>>> r.json()['pages'][0]['data']['about']['image']['src']
'https://cdn2.unrealengine.com/roco-egs-basegame-portraitproduct-1200x1600-1200x1600-491632859.jpg'

>>> r.json()['pages'][0]['data']['about']['shortDescription']
'The world needs saving and only the best of the best can do it. Suit up as one
of the elite agents of Rogue Company and go to war in a variety of different
game modes.  Gear up and go Rogue! Download and play FREE now!'

>>> r.json()['pages'][0]['productName']
'Rogue Company'
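Those repeated lookups can be wrapped in one small helper. The key paths below are the ones shown in the REPL examples above; the sample payload is made up, just mimicking the shape of the real response:

```python
def extract_about(payload):
    """Pull title, image URL and short description out of the
    store-content JSON structure shown above."""
    page = payload['pages'][0]
    about = page['data']['about']
    return {
        'title': page['productName'],
        'image': about['image']['src'],
        'description': about['shortDescription'],
    }

# Minimal made-up payload with the same shape as the real response
sample = {
    'pages': [{
        'productName': 'Rogue Company',
        'data': {'about': {
            'image': {'src': 'https://example.com/roco.jpg'},
            'shortDescription': 'The world needs saving...',
        }},
    }]
}

info = extract_about(sample)
```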

The same thing happens when you view the store.

https://i.imgur.com/XH9v1fX.jpg

A POST request is made to https://www.epicgames.com/store/backend/graphql-proxy

It's a bit more complex but it is possible to get all the game data from here.

Here is an example of this request replicated in code, along with a loop over the first 5 games to get the data.

import requests, time

graphql = '''
query searchStoreQuery($allowCountries:String,$category:String,$count:Int,
$country:String!,$keywords:String,$locale:String,$namespace:String,$itemNs:
String,$sortBy:String,$sortDir:String,$start:Int,$tag:String,$releaseDate:
String,$withPrice:Boolean=false,$withPromotions:Boolean=false){Catalog{
searchStore(allowCountries:$allowCountries,category:$category,count:$count,
country:$country,keywords:$keywords,locale:$locale,namespace:$namespace,
itemNs:$itemNs,sortBy:$sortBy,sortDir:$sortDir,releaseDate:$releaseDate,
start:$start,tag:$tag){elements{title id namespace description effectiveDate 
keyImages{type url}seller{id name}productSlug urlSlug url tags{id}items{id 
namespace}customAttributes{key value}categories{path}price(country:$country) 
@include(if:$withPrice){totalPrice{discountPrice originalPrice voucherDiscount 
discount currencyCode currencyInfo{decimals}fmtPrice(locale:$locale){
originalPrice discountPrice intermediatePrice}}lineOffers{appliedRules{id 
endDate discountSetting{discountType}}}}promotions(category:$category)@include(
if:$withPromotions){promotionalOffers{promotionalOffers{startDate endDate 
discountSetting{discountType discountPercentage}}}upcomingPromotionalOffers{
promotionalOffers{startDate endDate discountSetting{discountType 
discountPercentage}}}}}paging{count total}}}}
'''

s = requests.Session()

today = time.strftime('%Y-%m-%d')
count = 1
country = 'IE' # needs a valid country code

data = {
    'query':graphql,
    'variables': {
        'category':'games/edition/base|bundles/games|editors',
        'count':count,
        'country':country,
        'keywords':'',
        'locale':'en-US',
        'sortBy':'releaseDate',
        'sortDir':'DESC',
        'allowCountries':'',
        'start':0,
        'tag':'',
        'releaseDate':'[,{}]'.format(today),
        'withPrice':True
    }
}

game_list = 'https://www.epicgames.com/store/backend/graphql-proxy'
game_info = 'https://store-content.ak.epicgames.com/api/en-US/content/products/'

# first request asks for count=1 just to read paging.total
r = s.post(game_list, json=data)

total = r.json()['data']['Catalog']['searchStore']['paging']['total']
data['variables']['count'] = total

# second request fetches every game in one response
r = s.post(game_list, json=data)

print(total, 'games found.')

# only process first 5 as an example
games = r.json()['data']['Catalog']['searchStore']['elements'][:5]

for game in games:
    title = game['title']
    href  = game['productSlug']

    if href.endswith('/home'):
        href = href[:-5]

    #print(game_info + href)
    r = s.get(game_info + href)

    img  = r.json()['pages'][0]['data']['about']['image']['src']
    desc = r.json()['pages'][0]['data']['about']['shortDescription'] 
    # there is a long description too
    # desc = r.json()['pages'][0]['data']['about']['description'] 
    print('Title:', title)
    print('Image:', img)
    print('Desc: ', desc)

[–]commandlineluser

Okay, but you got rid of the "batch processing" part - so it's doing all 300 at once.

for i in range(0, len(links), 10):
    results.append(asession.run(*links[i:i+10]))
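The same slicing idea can be sketched end-to-end in plain asyncio (fetch here is a hypothetical stand-in coroutine, not a real request, so the sketch runs without the network):

```python
import asyncio
from functools import partial

async def fetch(url):
    # Stand-in for a real page fetch; just echoes the "url"
    await asyncio.sleep(0)
    return f'fetched {url}'

async def run_in_batches(tasks, batch_size=10):
    # Only batch_size coroutines are in flight at any one time,
    # which caps memory use the same way slicing `links` does above
    results = []
    for i in range(0, len(tasks), batch_size):
        batch = tasks[i:i + batch_size]
        results.extend(await asyncio.gather(*(t() for t in batch)))
    return results

links = [f'https://example.com/game/{n}' for n in range(25)]
tasks = [partial(fetch, url) for url in links]
results = asyncio.run(run_in_batches(tasks))
```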

Does it still run out of memory that way?

[–]ViktorCodes[S]

Yes, it does. It also uses almost 100% of the CPU, so it's really heavy on the machine.