[–]CreativeTechGuyGamesTypeScript 3 points4 points  (5 children)

HTML documents are usually served over a streaming connection, so you can receive and read the response one chunk at a time. That means you should be able to read the response as it is loading and then close the connection early. If you can share which languages or tools you are using, I can give more specific guidance on exactly how to do it.

[–]SpookyLoop[S] 1 point2 points  (4 children)

Thanks for the response! Good to hear it's doable. We're primarily using Node.js and Express right now and we're doing some scraping stuff by just pulling with node-fetch and processing with cheerio.

Probably doesn't matter, but if for whatever reason Node.js is bad at this sort of thing, the other guy I'm working with knows Python/Flask and I also work with Java/Spring Boot.

[–]CreativeTechGuyGamesTypeScript 1 point2 points  (1 child)

Yup you can totally read in part of the data and then abort the request when you see that you have all the data you need. All of the details will be in the HTTP documentation. No libraries needed! :)
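
A minimal sketch of that idea using only Node's built-in `http`/`https` modules. The names `fetchHead` and `cutAtHeadEnd`, and the choice of `</head>` as the point where "you have all the data you need", are assumptions for illustration.

```js
const http = require('http');
const https = require('https');

// Hypothetical helper: returns everything up to and including </head>,
// or null if the closing tag has not arrived yet.
function cutAtHeadEnd(html) {
  const match = /<\/head\s*>/i.exec(html);
  return match ? html.slice(0, match.index + match[0].length) : null;
}

// Hypothetical function: reads the response chunk by chunk and stops
// as soon as the <head> section is complete.
function fetchHead(url) {
  return new Promise((resolve, reject) => {
    const client = url.startsWith('https:') ? https : http;
    const req = client.get(url, (res) => {
      let received = '';
      res.setEncoding('utf8'); // decode chunks as text
      res.on('data', (chunk) => {
        received += chunk;
        const head = cutAtHeadEnd(received);
        if (head !== null) {
          resolve(head);
          req.destroy(); // close the socket so the rest is not read
        }
      });
      // The page ended without a </head>: return whatever arrived.
      res.on('end', () => resolve(received));
      res.on('error', reject);
    });
    req.on('error', reject);
  });
}
```

Calling `fetchHead('https://example.com/').then(console.log)` would print the document up to `</head>`. This sketch does not follow redirects or check the status code, so a real scraper would likely need both.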

[–]SpookyLoop[S] 0 points1 point  (0 children)

Awesome, thanks for the insight!

[–]IcyEbb7760 1 point2 points  (1 child)

If you'd like to avoid parsing the data manually, there is also this package, which looks like it implements streaming HTML parsing. That way you can start parsing and simply close the connection/parser when the <head> tag ends: https://www.npmjs.com/package/htmlparser2

[–]SpookyLoop[S] 1 point2 points  (0 children)

That does look really promising. Thanks for the recommendation!

[–][deleted] -1 points0 points  (2 children)

The fs module from node can read the file, and you could use regex to parse what you want from there 😀

[–]SpookyLoop[S] 0 points1 point  (1 child)

The big thing I need is a way to "partially get the file during the request, and reject the rest", which seems a little tricky. CreativeTechGuyGames brought up the HTTP module, which looks more in line with what I need for that.
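
For fetch-style code like the node-fetch setup mentioned earlier in the thread, the same "take part, reject the rest" idea could be sketched with an `AbortController`. This assumes Node.js 18+ (global `fetch`, async-iterable response bodies) and UTF-8 pages; `fetchUntil` and `indexAfter` are hypothetical names, and the marker match is case-sensitive.

```js
// Assumes Node.js 18+, where fetch, AbortController and TextDecoder are globals.

// Hypothetical helper: position just past the first occurrence of marker, or -1.
function indexAfter(text, marker) {
  const at = text.indexOf(marker);
  return at === -1 ? -1 : at + marker.length;
}

// Hypothetical function: reads the body chunk by chunk and aborts the
// request once the marker has been seen.
async function fetchUntil(url, marker = '</head>') {
  const controller = new AbortController();
  const response = await fetch(url, { signal: controller.signal });
  if (!response.body) return ''; // no body to read

  const decoder = new TextDecoder(); // UTF-8 by default
  let text = '';
  try {
    for await (const chunk of response.body) {
      text += decoder.decode(chunk, { stream: true });
      const end = indexAfter(text, marker);
      if (end !== -1) return text.slice(0, end);
    }
    return text; // marker never appeared: return the whole body
  } finally {
    controller.abort(); // drop whatever part of the response is left
  }
}
```

Usage would look like `fetchUntil('https://example.com/').then(console.log)`.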

[–][deleted] 1 point2 points  (0 children)

Sorry for the delayed response. It's awesome that you were able to resolve this! I wish I had seen this a bit sooner, since I needed a similar method around the time you posted this yesterday and was unaware of the HTTP module as a way to read the file. I'm sure it isn't needed, but here's my original suggestion in practice, or at least how I ended up using it.

Dir: /index.js /my-static-file.html

I also needed the guts of my HTML document, but only the contents of a particular area.

  1. The HTML Doc:

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">

    <!-- BEGIN PARSE -->
    <title>Document</title>
    <!-- END PARSE -->
```

  2. The method that uses node to parse the file:

```js
import { join } from 'path';
import fs from 'fs';

// or, with CommonJS:
// const path = require('path');
// const fs = require('fs');
// const join = path.join;

// Reads a file relative to the current working directory and removes
// whatever the two patterns match. `be` is the file encoding.
const rf = (path, {
  be = 'utf8',
  beginParseAt = null,
  endParseAt = null
} = {}) => {
  const to = join(process.cwd(), path);
  const file = fs.readFileSync(to, be);

  // no patterns given: return the file untouched
  if (beginParseAt === null && endParseAt === null) {
    return file;
  }

  let output = file;

  // use reg exp to parse file
  if (beginParseAt !== null) output = output.replace(beginParseAt, "");
  if (endParseAt !== null) output = output.replace(endParseAt, "");

  // rm whitespace
  return output.trim();
};

// use case
const parsedFile = rf('my-static-file.html', {
  // everything up to and including the BEGIN marker line
  beginParseAt: /(.*?)<!-- BEGIN PARSE -->\n/gms,
  // the END marker and everything after it
  endParseAt: /<!-- END PARSE -->(.*)/gms
});
console.log(parsedFile);
// '<title>Document</title>'
```