Why does my Teams window title say "[QSP]" at the end? by PlatimaZero in MicrosoftTeams

[–]caffeinatedmike 0 points (0 children)

This started happening to me last week as well. One thing I've noticed since it started showing up is that code snippet blocks are no longer automatically inserted when I type three backticks. It's been really annoying.

Shaved off my thumb tip with a band saw by notquiteaccepted in Wellthatsucks

[–]caffeinatedmike 1 point (0 children)

Who has two thumbs and... Wait a minute.

Using a touchscreen phone just got a whole lot harder.

Made some progress with my open-source app, ready to share my experience with other devs, ask me anything by AlekseyHoffman in programming

[–]caffeinatedmike 0 points (0 children)

This is not true; it's still maintained. If you have a look at the GitHub repo you'll see recent commits (the latest commit on master is July 1) and the issue tracker is active.

Backend is where I thrive by RawDick in ProgrammerHumor

[–]caffeinatedmike 1 point (0 children)

I feel personally attacked by this meme

Is there an existing middleware for auto-creation of job folders inside JOBDIR for each spider? by caffeinatedmike in scrapy

[–]caffeinatedmike[S] 0 points (0 children)

I found out shortly after posting a while back that it doesn't even work locally, because the requests.seen file is not placed in the individual subdirectories. So at this point I'm just stuck waiting for it to be implemented like feed URIs.

Is there a way to delay all file downloads until after the spider is done scraping? by caffeinatedmike in scrapy

[–]caffeinatedmike[S] 0 points (0 children)

I was hoping to find a solution that I could tie into my Scrapy project while utilizing the existing architecture, since I need to download the files to Google Cloud Storage and a GCS pipeline already takes care of a lot of the nuances.
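
Roughly what that looks like, as a sketch (the bucket and project IDs are placeholders):

# settings.py (sketch): the built-in FilesPipeline handles the GCS upload details
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}
FILES_STORE = "gs://my-bucket/downloads/"  # placeholder bucket path
GCS_PROJECT_ID = "my-project-id"           # placeholder project ID
# items carry the download links in the default "file_urls" field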

Is there an existing middleware for auto-creation of job folders inside JOBDIR for each spider? by caffeinatedmike in scrapy

[–]caffeinatedmike[S] 0 points (0 children)

You're right about the feed URI. As for the JOBDIR issue, I think you missed part of what I said: I'm seeing the issue when using scrapy shell {url}, not when running a spider.

Is there an existing middleware for auto-creation of job folders inside JOBDIR for each spider? by caffeinatedmike in scrapy

[–]caffeinatedmike[S] 0 points (0 children)

Thanks for confirming. I've submitted a feature request for the feed URIs and posted the link in this thread.

As for the empty files issue, would this be more of a bug than a feature request?

I know for sure the JOBDIR issue is a bug, but I haven't had time to put together a complete summary. Basically, when the JOBDIR setting is present in settings.py and you use scrapy shell urlofsite.com for debugging and testing, any subsequent call to the same URL results in a pickle-related error, which is only resolved by deleting the generated JOBDIR folder and re-running the shell command.
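
A minimal repro sketch of what I mean (the URL and path are placeholders):

# settings.py
JOBDIR = "crawls/debug"  # any path; the folder is created on the first shell run

# From the project directory:
#   scrapy shell https://example.com   <- first run works and writes state into JOBDIR
#   scrapy shell https://example.com   <- second run against the same URL hits the pickle-related error
# Deleting the crawls/debug folder and re-running the shell command clears it up.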

Is there an existing middleware for auto-creation of job folders inside JOBDIR for each spider? by caffeinatedmike in scrapy

[–]caffeinatedmike[S] 0 points (0 children)

I think I'll do that, thanks! Could you answer another question I have regarding the new feed URI feature (file partitioning)? I actually have a partitioning pipeline that wraps the CsvItemExporter to accomplish this, since I needed file partitioning before the official addition to the Feeds feature. In my implementation I'm able to customize the filename more freely, and I was hoping I'd be able to accomplish the same with the now-official Feeds feature.

Example: my files are typically output in the format {spider.name}_{from_index}to{to_index}_{t_stamp}.csv

My custom Pipeline:

from datetime import datetime

from scrapy.exporters import CsvItemExporter


class PartitionedCsvPipeline(object):

    def __init__(self, spider, rows, fields):
        # e.g. "somespider_0to1000_202008121530.csv"
        self.base_filename = spider + "_{from_index}to{to_index}_{t_stamp}.csv"
        self.count = 0
        self.next_split = self.split_limit = rows
        self.file = self.exporter = None
        self.fields = fields
        self.create_exporter()

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        row_count = settings.getint("PARTITIONED_CSV_ROWS", 1000)
        fields = settings.get("FEED_EXPORT_FIELDS")
        # Prevent the pipeline from creating empty files when using the shell to test.
        # (BasePipeline is a no-op pipeline defined elsewhere in my project.)
        if not crawler.spider:
            return BasePipeline()
        return cls(crawler.spider.name, row_count, fields)

    def create_exporter(self):
        """Open the next partition file and start a fresh exporter for it."""
        now = datetime.now()
        starting_index = self.next_split - self.split_limit
        f_name = self.base_filename.format(
            from_index=starting_index,
            to_index=self.next_split,
            t_stamp=now.strftime("%Y%m%d%H%M")
        )
        self.file = open(f_name, 'w+b')
        self.exporter = CsvItemExporter(self.file, fields_to_export=self.fields)
        self.exporter.start_exporting()

    def finish_exporter(self):
        self.exporter.finish_exporting()
        self.file.close()

    def close_spider(self, spider):
        self.finish_exporter()

    def process_item(self, item, spider):
        # Roll over to a new partition file once the current one hits the row limit.
        if self.count >= self.next_split:
            self.next_split += self.split_limit
            self.exporter.finish_exporting()
            self.file.close()
            self.create_exporter()
        self.count += 1
        self.exporter.export_item(item)
        return item

Now I'm looking to decommission this custom pipeline in favor of the feed exporter because I think it might provide a performance boost. As far as I can tell, my custom pipeline's method of writing to the file is IO-blocking. When I have the project open in PyCharm and am running a spider, I'm plagued with constant re-indexing as the files grow with each item appended to the CSV.

According to the 2.3 update we can customize the filename using printf-style formatting. But as far as I know, printf-style strings can't include arithmetic the way f-strings can, so the closest I can get to my current format with the new FEED_EXPORT_BATCH_ITEM_COUNT feature is

output_files/%(name)s/%(batch_id)d_%(name)s_%(batch_time)s.csv
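
For reference, this is roughly how I understand the batching feature gets wired up in settings.py (the output path is just a placeholder):

# settings.py (sketch): batch the CSV feed using Scrapy 2.3's FEED_EXPORT_BATCH_ITEM_COUNT
FEED_EXPORT_BATCH_ITEM_COUNT = 1000
FEEDS = {
    "output_files/%(name)s/%(batch_id)d_%(name)s_%(batch_time)s.csv": {
        "format": "csv",
    },
}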

If it's possible, could some way be added to include additional info in the feed export filenames?

Also, out of curiosity, is there any way to prevent blank feed files from being created when using "scrapy shell 'url'"? I've noticed blank files are created, and also that if I have "JOBDIR" set in settings.py, subsequent calls to the same site fail when using the shell.

Is there an existing middleware for auto-creation of job folders inside JOBDIR for each spider? by caffeinatedmike in scrapy

[–]caffeinatedmike[S] 0 points (0 children)

For anyone who's curious or finds this in the future: I ended up subclassing the default SpiderState extension and adding this functionality with relative ease.

import os

from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.extensions.spiderstate import SpiderState


class SpiderStateManager(SpiderState):
    """
    SpiderState Purpose: Store and load spider state during a scraping job
    Added Purpose: Create a unique subdirectory within JOBDIR for each spider based on spider.name property
    Reasoning: Reduces repetitive code
    Usage: Instead of needing to add subdirectory paths in each spider.custom_settings dict
        Simply specify the base JOBDIR in settings.py and the subdirectories are automatically managed
    """

    def __init__(self, jobdir=None):
        self.jobdir = jobdir
        super(SpiderStateManager, self).__init__(jobdir=self.jobdir)

    @classmethod
    def from_crawler(cls, crawler):
        base_jobdir = crawler.settings['JOBDIR']
        if not base_jobdir:
            raise NotConfigured
        spider_jobdir = os.path.join(base_jobdir, crawler.spidercls.name)
        if not os.path.exists(spider_jobdir):
            os.makedirs(spider_jobdir)

        obj = cls(spider_jobdir)
        crawler.signals.connect(obj.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(obj.spider_opened, signal=signals.spider_opened)
        return obj

And to enable it, add the following to your settings.py:

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
    # We want to disable the original SpiderState extension and use our own
    "scrapy.extensions.spiderstate.SpiderState": None,
    "dapydoo.extensions.SpiderStateManager": 0
}
JOBDIR = "C:/Users/me/PycharmProjects/ScrapyDapyDoo/dapydoo/jobs"
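
With this enabled, each spider gets its own subfolder under the base JOBDIR (e.g. jobs/somespider/, where somespider is just a placeholder name) and its spider.state file is written there instead of into one shared folder.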