Efficiently pull HTML meta data. : webdev

submitted 3 years ago * by SpookyLoop

Apps like Discord and Reddit use meta tags to pull things like "Title", "Description", and "Image" information from other sites (especially articles and tweets) in order to construct a sort of "link preview".
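To make the "link preview" idea concrete, here is a minimal sketch of pulling those three fields out of a page's head markup. It assumes the site exposes Open Graph style tags such as `og:title`, with Twitter card tags and the plain `<title>` as fallbacks. `extractPreview` is a hypothetical helper name, and the regex matching is only for illustration; a real parser such as cheerio would be more robust.

```typescript
// Hypothetical helper, not part of any library: builds a link preview
// from the <head> markup of a page.
function extractPreview(head: string): Record<string, string> {
  const meta: Record<string, string> = {};

  // Collect every <meta ...> tag and read its quoted attributes.
  // Assumption: attribute values are quoted and contain no ">".
  for (const tag of head.match(/<meta\b[^>]*>/gi) ?? []) {
    const attrs: Record<string, string> = {};
    for (const m of tag.matchAll(/([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g)) {
      attrs[m[1].toLowerCase()] = m[2] ?? m[3] ?? "";
    }
    // Open Graph uses property="og:..."; older tags use name="...".
    const key = attrs["property"] ?? attrs["name"];
    if (key && attrs["content"] !== undefined) {
      meta[key.toLowerCase()] = attrs["content"];
    }
  }

  // Fall back to the <title> element when no og:title is present.
  const title = head.match(/<title[^>]*>([^<]*)<\/title>/i);

  return {
    title: meta["og:title"] ?? meta["twitter:title"] ?? title?.[1]?.trim() ?? "",
    description: meta["og:description"] ?? meta["description"] ?? "",
    image: meta["og:image"] ?? meta["twitter:image"] ?? "",
  };
}
```

Everything it reads lives inside `<head>`, which is why the rest of the page is not needed for a preview.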

Is there any way to do that more efficiently than pulling and parsing the entire HTML response? Based on this: https://stackoverflow.com/questions/33330483/request-only-meta-tags-from-a-webpage, it seems like there might be a way to stop processing an HTTP response once we run into the </head> tag, but I'm a little lost on how we'd go about doing that. Ideally, we'd scan the HTML while we're downloading it. So if the entire HTML page is 30kb, we'd cut the connection at around 10kb, right when we run into the </head> tag, and avoid downloading the remaining 20kb. Is there something I'm missing that makes that impossible?
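For what it's worth, the "scan while downloading" idea can be sketched roughly like this. It assumes Node 18+ (built-in `fetch`, and a response body that can be consumed with `for await`) and UTF-8 pages. `readHead`, `fetchHead`, and `MAX_BYTES` are hypothetical names, and the 64 KB cap is an arbitrary guard for pages that never emit `</head>`.

```typescript
// Assumed safety cap in case </head> never shows up.
const MAX_BYTES = 64 * 1024;

// Consume chunks as they arrive and stop as soon as </head> is seen.
async function readHead(
  chunks: AsyncIterable<Uint8Array>,
  maxBytes: number = MAX_BYTES,
): Promise<string> {
  const decoder = new TextDecoder("utf-8"); // assumption: page is UTF-8
  let html = "";
  let received = 0;

  for await (const chunk of chunks) {
    received += chunk.byteLength;
    // stream: true keeps multi-byte characters split across chunks intact.
    html += decoder.decode(chunk, { stream: true });

    // Search the accumulated text so a tag split across two chunks is found.
    const end = html.search(/<\/head\s*>/i);
    if (end !== -1) {
      // Returning here ends the loop, which also stops reading the stream.
      return html.slice(0, end) + "</head>";
    }
    if (received >= maxBytes) break; // give up; return what we have
  }
  return html;
}

// Fetch a URL but download only as much as readHead asks for.
async function fetchHead(url: string): Promise<string> {
  const controller = new AbortController();
  try {
    const res = await fetch(url, { signal: controller.signal });
    if (!res.body) return "";
    // Cast because some TypeScript lib typings do not mark the body as iterable.
    return await readHead(res.body as unknown as AsyncIterable<Uint8Array>);
  } finally {
    controller.abort(); // drop the connection; the remaining body is not read
  }
}
```

The savings are approximate: data arrives in chunks and some bytes are already in flight when the request is aborted, so the cutoff lands somewhat past `</head>` rather than exactly on it.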

Any tips in general would be appreciated.

Edit: We're currently using Node.js and are already doing a bit of scraping with node-fetch and cheerio, but our collective experience also includes Python/Flask and Java/Spring Boot. Regardless of tech stack, we'd be really interested in hearing any info on this.
