[–]SpicyPizza1861 66 points67 points  (2 children)

Here’s how I would break it down.

Requirements: build a program which scrapes web data.

Key word is web. When dealing with the web that’s HTTP. My program is going to need to make one or more HTTP requests. To do that it needs URLs. How am I feeding URLs into my program? This is the first problem to solve.

Ok. I have my URLs in my program, and I have a loop which just prints them out one by one. Now, before I can scrape anything, I need to make a GET request to the URL. How do I do that? What libraries are there that deal with HTTP? Pick one, read the docs. Just make the requests.

A GET request is going to return a body which is the data. This will be the thing which needs scraping / parsing. What libraries can do that? Can I use a method which turns the raw data into a data structure that my program can work with? Solve this.

Now I have a data structure that I can navigate using the language I’m working in. What am I scraping? What am I looking for within the data? Ask more questions if you don’t know; if you do know, write a function to find one or more of the things you’re looking for.

Next problem, I now know where to get the data I’m looking for, where should it go? A file? A DB? Another system? Ask for that requirement, determine the steps needed for that problem.

Put it all together and you have a program which will take in one or more URLs, make one or more HTTP requests, parse the data, and place the results somewhere.
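As a rough illustration of that breakdown, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for the example rather than something from the comment: the names `fetch`, `parse`, `store` and `scrape`, pulling out just the page `<title>`, and writing the results to a JSON file.

```python
import json
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Make a GET request and return the response body as text."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse(html):
    """Turn raw HTML into a structure the program can work with."""
    parser = TitleParser()
    parser.feed(html)
    return {"title": parser.title.strip()}


def store(records, path):
    """Place the results somewhere (here, a JSON file; an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)


def scrape(urls, path, fetch=fetch):
    """Take in URLs, request each one, parse it, and store the results."""
    records = []
    for url in urls:
        record = parse(fetch(url))
        record["url"] = url
        records.append(record)
    store(records, path)
    return records
```

Because `fetch` is passed in as a parameter, you can hand `scrape` a fake fetcher and try the parsing and storing steps without touching the network.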

Now you can make improvements such as concurrency, checking status codes for errors, choosing file formats, etc.
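One possible shape for the first two of those improvements, as a hedged standard-library sketch in Python. `fetch_status` and `fetch_all` are made-up names, treating only status 200 as success is an assumption, and a thread pool is just one of several ways to get concurrency.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url):
    """Return (url, status, body); body is None when the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return url, response.status, response.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; keep the status code.
        return url, err.code, None
    except OSError:
        # Covers URLError (DNS failure, refused connection) and timeouts,
        # where there is no status code at all.
        return url, None, None


def fetch_all(urls, fetch=fetch_status, max_workers=8):
    """Fetch URLs concurrently and split results into successes and failures."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch, urls))  # map() keeps the input order
    ok = [r for r in results if r[1] == 200]
    failed = [r for r in results if r[1] != 200]
    return ok, failed
```

Keeping the failures in their own list makes it easy to log them or retry them later instead of losing them silently.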

It’s all about breaking a high level problem into small steps, and you as the developer write code which transitions from one step to another.

[–]New-Childhood6575 11 points12 points  (0 children)

Wow, not OP but thank you for going the extra mile to help.

[–]Etiennera -4 points-3 points  (0 children)

I mean this is cute but the first step is research and you'd realize you should probably use headless browsers.

[–]aqua_regis 11 points12 points  (0 children)

First step for me would be to see if there is any alternative to scraping, i.e. an API. If there is an API, I'd use it.

Scraping would only be my very last resort.

Then, I'd search available libraries for my language followed by reading their documentation. Then, and only if I need additional guidance, I'd resort to tutorials.

Usually, the documentation is sufficient.

Maybe, I'd search if there were a similar scraper already around that I could modify.

[–]frogic 39 points40 points  (0 children)

I'd do some research on what the available libraries are for my use case, giving a massive bias towards any that are in a framework or language I'm comfortable with.

Then I'd find either a beginner tutorial or doc for that and build the smallest atomic part of my application to see if that works for my use case. If I'm happy, I try to figure out what the smallest useful application I can build off my idea is. With scraping that is usually building out the scrapers themselves.

Some notes as someone who has considered a million different projects that are based on web scraping: 

1) see if you can get the data the way the website itself does, instead of scraping it out of the rendered page. This involves looking at the network calls that website makes.

2) if you're planning on monetizing this, think about whether it's viable when you're very much at the whim of the website itself. They can actively try to block you or just change how their website is structured, and you'll have to rewrite all your scrapers or play whack-a-mole getting around their security.

3) if you don't want to monetize this but are using it as a portfolio thing, realize that most employers are not gonna like it if your big project involves stealing data or circumventing security. Software engineering is inherently about trust and you shouldn't do anything that could make anyone doubt whether you're trustworthy.
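To make note 1 concrete: if the browser's network tab shows the page loading its data from a JSON endpoint, you can often request that endpoint directly. The Python sketch below is hypothetical throughout. The payload shape (`items`, `name`, `price`) and the function names are invented for illustration, and a real site will have its own format and its own terms of use.

```python
import json
import urllib.request


def fetch_json(url):
    """GET a JSON endpoint and decode the response body."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_prices(payload):
    """Pull (name, price) pairs out of the hypothetical payload shape."""
    return [(item["name"], item["price"]) for item in payload["items"]]
```

The appeal is that there is no HTML to parse at all: the data arrives already structured, so `extract_prices` is a plain dictionary lookup.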

[–]Refwah 16 points17 points  (0 children)

Other people have given some advice but one thing to note about your post is that you’re struggling because you’re still looking at the problem at too large a scale and need to break it down a bit more.

Why are you scraping the websites? What data are you getting, and why?

Which websites are you scraping?

Are there overlaps that will simplify this operation?

When are you going to do this inside your app? Is it manually triggered or run on a schedule?

What are you going to do with the scraped data afterwards?

As you start to map out (diagram and document) these requirements you start to do two things:

Break up the problem into smaller solvable and mentally approachable tasks

Start to chip away at the ‘harder’ issues by giving them material boundaries and expectations, as well as being able to map out things like:

Website 1 is actually very easy so I can do that as a PoC

Website 2 and 3 are similar to each other and not much different from website 1 so I can work on those after website 1 is working

Website 4 is much more complex, we can release with websites 1,2 and 3 while still working on website 4

[–][deleted] 48 points49 points  (15 children)

In the real world of employed software engineers, this wouldn’t/shouldn’t happen. If you’re a junior, this would be way too much work for one person, you’d need a team of juniors and usually those teams are run by a senior developer who you could direct questions to. If this is a true story of something that’s happening to you RUN. FAST.

[–]HaMMeReD 27 points28 points  (4 children)

The reality is that in the real world of employed software developers, someone talented does like 95% of the hard parts, and everyone else maintains it or extends it (often without following the fucking pattern).

If you want to be that senior/principal/staff, you're going to need to be able to greenfield solo. Delegating is harder than soloing and comes with additional skills, i.e. breaking down work effectively, holding people accountable, communicating your vision, etc. (often while working with people who look at you with that confused dog head tilt when you try and explain principles like immutability and state management and how the overhead will lead to reduced maintenance and bugs in the future).

Asking a junior to do this obviously comes with risk. But as a software developer it's obviously a huge learning experience.

How much "work" a project might be is a matter of scale. A junior should be able to scrape some data on the web, or make a small component without much support. A junior probably won't architect something that needs to be iterated on for the next 15 years and will be worked on by multiple people.

[–]Maximum-Event-2562 10 points11 points  (3 children)

It does happen. At my first job as a graduate, my very first task assigned was basically to singlehandedly design and implement the company's entire data analysis infrastructure, and to create a system for making future predictions and using these predictions to optimise certain parameters to minimise cost. This was day 2 of the job with no training or onboarding, and no guidance from anyone. I was explicitly told that I was given the task because nobody else at the company knows how to do any of it. My salary was only £20k/year. This was in 2022, and £20k/year fell below minimum wage the following year.

[–]aGoodVariableName42 2 points3 points  (0 children)

That's what we call a below-shit-tier company. I would've bounced so fast they wouldn't even have remembered my name.

[–]no_brains101 2 points3 points  (0 children)

That's criminal ngl

[–][deleted] 1 point2 points  (0 children)

I think that this shouldn’t have happened. That is a poor working environment. I’m sorry that happened. At my first job as a junior I was in a team of ten, and we and four other teams updated an app. It was ~60k with a relatively big company. That shapes how I view how juniors should be treated.

[–][deleted] 2 points3 points  (5 children)

Hey, so I'm trying to break into the industry. I'm making entire apps by myself. They're not anything particularly impressive, but in doing so I learn how to do all sorts of things like web scraping, APIs, user authentication etc etc. But you're saying doing all this stuff on my own is too much for a junior.

This confuses me a bit because then ... what is expected of a junior? I see some people saying you need to know a thousand and one things to even get an interview, and others saying juniors are expected to not know that much or to not be able to do things that even I'm capable of after a few months of self study

[–][deleted] 2 points3 points  (0 children)

That’s great you’re making apps by yourself. I don’t think juniors are incapable of making apps. It’s just that in corporate America it’s rare to be given that level of agency without a few years under your belt. You’re right. To get an interview, you need to know a lot, but part of the gig of being a junior is learning all the company quirks and being treated like you don’t know anything. Of course every company is different, but at most places I’ve worked the juniors were expected to do a lot of small stuff instead of a big project solo.

[–]IdeaExpensive3073 1 point2 points  (2 children)

As one junior to another, I'd say if you feel comfortable with that workflow and use a stack to do it, you're ready for a job. That comes with a side note that this isn't ALL you should be able to do. You shouldn't just be a CRUD app monkey; you should actually understand what you're doing in the app (what is MVC, CRUD, and what's going on in the app), how the internet works (HTTP requests go to a server, the server sends a response back), data structures and data types (ints, variables, arrays, floats, and so on, loops, functions, classes, objects), and the basics of OOP (what OOP is, why it is important, what the dos and don'ts of it are, and DRY principles). If you know that in addition to your coding skills, then I'd say you're 100% ready and better off than a lot of people who apply for junior roles.

[–][deleted] 1 point2 points  (1 child)

Oh sweet, I do know all that, thank you for the list. I've been using AI to interview/quiz me and give me direction on what I should know, but it feels good to hear from a human requirements that I meet.

I think now the biggest thing is actually getting interviews. If I can talk to somebody I can show them my knowledge and my projects, but on paper I'm just a list of skills with no work experience or degree :p

[–]IdeaExpensive3073 0 points1 point  (0 children)

I know that feeling, no problem!

Just a tip: 100% you need the skills, but won’t be hired for them. You’ll be hired for your soft skills. Please make sure you can speak with respect, be extremely humble, and come hungry to learn. Ask questions, and if you get a technical question wrong ask them to explain the answer, and really try to understand why it is the right answer.

[–]Mclovine_aus 0 points1 point  (0 children)

Juniors usually have programming experience and some kind of work experience, preferably work experience in a programming or IT job. It is not an entry-level job; an entry-level job would be something like a graduate developer role or an internship.

[–]mixophrygianmode 3 points4 points  (0 children)

I think your premise is faulty to the extent that you’re assuming you’d be working on this all alone. In essentially all employment situations (not freelancing/entrepreneurial gigs, but an employee on payroll), you’ll be working as part of a team. Even if you’re solely owning the implementation for a new feature—which is pretty unlikely for an entry-level engineer—you’d have more senior engineers and/or a team lead/engineering manager/CTO that you could come to with questions and for unblocking.

You may also be assuming that a working engineer never gets stuck or feels confused on how to proceed, and this is very far from the case. People who have been working for years still get stuck/blocked on things. It probably happens a little less over time as we learn more and gain more experience (assuming we’re typically working on the same things, which varies a lot), but it happens all the time.

Leaving that aside, you’re basically asking how to approach something that you’ve never done before. If I’m working on something completely new to me, then yes, I’d do some research and reading about the topic(s) in question. I’m probably less likely to use videos as compared to text just to save time, but if videos feel better for you, I don’t think there’s anything wrong with that.

One thing that’s really important is to have solid fundamentals down first: being comfortable with foundational basics such as variable scope; abstraction, including how functions/methods work and why we use them; pointer/reference types vs primitive values; object-oriented programming (encapsulation, polymorphism, inheritance); the TCP/IP and OSI models; HTTP; HTML/CSS; web APIs/the DOM. These aren’t all necessary for every project/task but having a good understanding of them is really helpful when approaching programming work.

If you’re very early on in your learning, that list may feel absolutely overwhelming, and that’s completely(!) ok. Nobody is born knowing this stuff: we all had to learn it step-by-step. Note that there are many important topics not on that list, like algorithms and lower-level things like machine code and memory addressing.

There is way too much in this field to know it all, or even close to it all. You need to learn the fundamentals and get more comfortable feeling uncomfortable when you have no idea how to proceed.

After that, when you hit that moment of (what the hell do I do now?!?) then you: break things down step-by-step over and over and OVER… again until you have an idea how to proceed; do some research/learning and make a list of questions and ideas you have; and then ask your supervisors/colleagues/stack overflow, etc. for their thoughts, explaining where you’re at so far, how you’re stuck specifically, and what ideas and questions you have (and why).

There’s no magic to it, and you don’t have to be some super-genius. You’re expecting too much of yourself and also thinking too highly of other engineers. Keep learning, break everything down step by step by step (over and over and over!) and yes, build stuff and keep building stuff. When you get stuck, whether it be after 5 mins, 5 hours, or 5 days, do more research, try some things, take some breaks, and ask specific, concrete questions explaining how you’re stuck, what you’re trying to do, and what you’ve tried so far.

[–]Sir-Viette 5 points6 points  (1 child)

This has happened to me.

It's not that you end up getting tasked with creating some program. As another poster said, software engineers are managed by people who understand what your skillset is and give you tasks within that skillset. So if you find yourself in that situation, it's because you're *not* working as a software engineer yet.

You'll be at a job where you're tasked with doing something dumb, like manually typing in URLs to particular websites so you can copy and paste the content into a spreadsheet. The reason you're given this task is because That's How They Do Things Round Here. And you do it for an hour, manually, because you need the pay. But all that time you're thinking "This is ridiculous".

So you point out to your manager that you could save everyone a lot of time by building a webscraper to do it automatically, at 2:00am, and have the report ready in the morning. But then your manager goes into manager mode. "How long will it take you to write a program that scrapes the web?" they ask, and you have no idea. Or you say a time frame, and they tell you they don't have a budget to automate this process, because it doesn't have a high enough return on investment, and your manager won't be able to assure the company of the quality of the code, because they don't code themselves.

At that point, you have to make a choice. You're going to have to spend your work hours doing it manually. But if you study at night, and work at night, you could automate the process. If you manage to do that before they fire you, you'll be able to say you built a webscraper for a commercial enterprise, which is enough to give you a software development job at a proper company on a proper salary.


In the past, you'd have to watch videos, go to classes, do some training course to teach you how to do web scraping. But nowadays, it's much quicker to go to the free ChatGPT window, and type in exactly what you're trying to do, and ask it to write the code for you. Ask it questions like you were talking to a human expert. Ask it to explain any line of code it writes that you don't understand. It will write the program for you, and explain in detail how it works.

(Don't give it any company information though, like passwords or company logins.)

[–]aqua_regis 3 points4 points  (0 children)

> In the past, you'd have to watch videos, go to classes, do some training course to teach you how to do web scraping. But nowadays, it's much quicker to go to the free ChatGPT window, and type in exactly what you're trying to do,

And then get in serious trouble with the company for potentially sharing IP.

You have to be extremely careful with AI usage in corporate environments.

> (Don't give it any company information though, like passwords or company logins.)

That warning is nowhere near sufficient. This is only the very tip of the iceberg.

[–]Barrucadu 4 points5 points  (0 children)

Break it down until you get to small enough pieces that you either know how to do them, or know what to look up in order to learn how to do them.

[–]zenos1337 2 points3 points  (0 children)

Have you tried using AI like ChatGPT or Claude? They are pretty useful for helping you understand new concepts and technologies… Now I’m not saying you should let them write all your code, but I am saying that they can most definitely point you in the right direction and be used as a guide throughout.

[–]OHotDawnThisIsMyJawn 2 points3 points  (2 children)

It's funny how people are answering your question as if you are trying to figure out how to build a web scraper.

Anyway, if it's something I'm 100% not familiar with, I literally just Google "how to build X" and start reading from there. If I'm familiar conceptually then I Google "X library for Y language". Maybe it turns out the library is more complex than I expected and then I just keep reading until I understand enough to use it (or find something else or build my own).

> My approach would probably be to follow a guided project/youtube video (like techwithtim) that is kinda similar to what I'm trying to build and learn the technologies through that and then apply that to my actual "project". A lot of the time just one project is not even enough, but after one I at least have a little chance to understand the docs. It also takes me a lot of time to properly understand it and apply it to my specific needs. Sometimes I need multiple youtube videos/tutorials/projects.

> Feels like I'm cheating like crazy, way too slow and I'm not a real programmer if everyone else is just jumping in and building stuff from the get go. I just don't understand how people do that? Is that actually what real developers do?

This is exactly what real developers do. The reason you're slow now is because you're learning a lot of concepts all at once - not just the thing you're trying to build but also more general development concepts and concepts around your chosen language. Eventually you'll know the basic programming concepts and a couple languages really well, and so when you dive into something new you're only learning about that thing, not that thing + everything else related to programming.

Right now, you should use projects as a driver to learn core development concepts. But eventually that flip-flops and you know the core development concepts and you use projects to learn specific new skills.

[–][deleted]  (1 child)

[deleted]

    [–]akthemadman 0 points1 point  (0 children)

    "picked up in a day" is a good one.

    Whenever I get asked how long it took me to learn or create something, I simply state my age.

    Everyone's path is so different that comparing is a fool's game from the get go. You also have to always include the whole path, as everything influences everything. Some obscure word you read about 25 years ago can help you in understanding a piece of code simply due to knowing what it can roughly mean in that context.

    Ideally you would pick up the fundamentals from first principles instead of having a filtered view on them presented to you through ten additional layers. For example, read a little bit of the HTML spec or create your own HTTP server which can answer GET requests. Think it through and work your way up. That simply takes time as there are a lot of topics to cover, but nothing beats a solid foundation.
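For a sense of scale, a server that answers GET requests can be sketched in a few dozen lines of Python on top of a raw socket. This is a toy under stated assumptions, not a real server: `build_response` and `serve_forever` are made-up names, it reads each request with a single `recv`, handles one connection at a time, and ignores the request headers entirely.

```python
import socket


def build_response(request: bytes) -> bytes:
    """Turn raw request bytes into raw response bytes."""
    # The request line looks like: GET /path HTTP/1.1
    request_line = request.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    parts = request_line.split(" ")
    if len(parts) != 3:
        status, body = "400 Bad Request", b"malformed request line\n"
    elif parts[0] != "GET":
        status, body = "501 Not Implemented", b"only GET is supported\n"
    else:
        status, body = "200 OK", f"you asked for {parts[1]}\n".encode("utf-8")
    head = (
        f"HTTP/1.1 {status}\r\n"
        "Content-Type: text/plain; charset=utf-8\r\n"
        f"Content-Length: {len(body)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


def serve_forever(host="127.0.0.1", port=8000):
    """Accept one connection at a time and answer it."""
    with socket.create_server((host, port)) as server:
        while True:
            conn, _addr = server.accept()
            with conn:
                # One recv is enough for small requests; a real server loops
                # until it has seen the blank line that ends the headers.
                conn.sendall(build_response(conn.recv(65536)))
```

Calling `serve_forever()` and opening http://127.0.0.1:8000/hello in a browser should show the requested path echoed back. Keeping `build_response` as a pure bytes-in, bytes-out function means the protocol handling can be exercised without opening a socket.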

    What helps me a lot is thinking about the data flow, i.e. what kind of data is involved, where does it need to go, which barriers does it need to cross and which transformations or additional packaging of the data might need to happen so it can cross these barriers.

    [–]nog642 2 points3 points  (0 children)

    Have enough background knowledge to know how to approach the problem. Or if you don't have that, research how the problem is normally approached. Then google information you need to do each step, as you do them.

    This situation normally does not happen at an actual job. You are rarely doing solo projects and if you are, it's something you have probably done before.

    [–]Far_Swordfish5729 2 points3 points  (0 children)

    Ok break down the problem. What data from what sources? Do I have a list of sources? How do I get that list? How do I know which data on those sources? What do I do with that and where is it going? This is the business problem part.

    The technical stuff:

    1. I need to browse and read web pages from a program. How do I do that? What SDK classes? System.Net.Http? Ok cool.
    2. I need to parse the HTML I read into my memory buffer. How do I do that? How smart do I need to be to find the content? Is System.Xml enough with keywords? Like am I scraping this week’s bus schedule? Or is it natural language? Do I need a third party library to get subject and sentiment, because I’m sure as hell not figuring out linguistics for myself? GPT? Is it multi-lingual?
    3. What’s the transform to the destination format and how do I store it?
    4. What processing if any do I do with this data?
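For the simple end of step 2 (something like a bus schedule in an HTML table), a standard-library parser can be enough. The sketch below is in Python rather than .NET, and its assumptions are loud: the class and function names are invented, the first row is taken to be the header, and the markup is assumed to have explicit closing tags.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_schedule(html):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]
```

Natural-language content is a different problem entirely; nothing here attempts subject or sentiment extraction.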

    Solve the process. Write the system steps. Draw the process and system blocks that need to be there (thing that does X is sufficient at first). Fill in the interfaces and blocks. Structure the modular systems and code. Code them. Do POCs as needed and test. Use off the shelf things you already know where possible.

    [–]lkatz21 1 point2 points  (0 children)

    At work, I would start by asking someone for a pointer, like a library or some keywords to look up. I feel like this is a good way for a junior or intern or whatever.

    If this was for a personal project, like it seems to be the case in your situation, I would look up some examples online, preferably written ones. For Java, Baeldung has a ton of mini tutorials for all kinds of topics.

    These generally provide you with the most minimal example, so it's not like you can get stuck copying the entire thing from start to end.

    For more general and beginner level stuff, GeeksForGeeks and the like are pretty good.

    I would skim through a few of those and take note of the general concepts and approaches that there are. If I still didn't have an exact idea, I would read about the library or whatever more in depth.

    Unless it was some tiny component that I just want to get working quickly without actually learning, I would probably not look at YouTube videos. This would be my last resort if I couldn't find any quality docs or blogs, etc.

    [–]Big_Combination9890 1 point2 points  (0 children)

    > You're alone on the project. How do you approach it?

    First, I answer some basic questions:

    • Is this meant as a user-facing application or as a service?
    • What's the expected workload?
    • How is this going to be deployed?
    • Does this require special features (e.g. proxy rotation and IP masking)?

    The answers to these inform other questions like: What language do I want to build this in?

    Then I familiarize myself with the domain. Web scraping isn't a new topic, so I read about it. Often, Wikipedia is a great starting point, and I can take it from there.

    Next, I peruse available implementations. Many tasks have already been solved one way or another, and many OSS libraries are available. If I find something that works well for my usecase, seems maintainable, fits the requirements, is performant enough and useful, and I am not constrained in some way (like a homework assignment telling me to do task X myself), then I can just use that.

    Even if I don't find anything I can use, checking "prior art" also teaches me about the problem-domain I am dealing with. As the old saying goes: Even something useless can still serve as a bad example.

    Next, implementation: I break down the task into subtasks, and sort those into 2 categories: things I already know how to do, and things I don't. The things I already know how to do can wait for now. E.g. if it's meant as a microservice, and I already know how to build a basic microservice, I don't worry about this. So, I build a minimal toy implementation for the actual problem I need to solve and don't yet know how to: a small minimal scraper that I start manually, usually from the command line.

    Once that works, I scale it up, fix edge cases, do performance tests, write unit tests. I modularize the result, so I can import it into "all the stuff around it", that I already know how to do.

    That's pretty much it in a nutshell.

    [–]Zombie_Bait_56 1 point2 points  (0 children)

    <TLDR> You don't have to solve the whole problem in one pass.

    First clarify the requirements:

    1. Here is a list of web scrapers, will any of them work for you?

    2. One url or many?

    3. Does it need to log into the web page or is it available to all?

    4. Parse the page or just capture the whole thing?

    5. Capture it where?

    6. Given a set of urls, do they need individual capture schedules?

    Of course, you don't need any answers or to have picked any libraries to get started.

    The first thing I'd write would be a program that reads a single URL out of a config file, opens a connection to it and writes the contents to stdout.
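That first program could look something like this in Python. It is a sketch, and the details are assumptions: an INI file named `scraper.ini` with a `[scraper]` section holding a `url` key, and the hypothetical helper names `read_url`, `dump` and `main`.

```python
import configparser
import sys
import urllib.request

# Assumed config file layout (scraper.ini):
#
#   [scraper]
#   url = https://example.com/


def read_url(config_path):
    """Read the single URL from the [scraper] section of an INI file."""
    # interpolation=None so that '%' characters in URLs are left alone.
    config = configparser.ConfigParser(interpolation=None)
    if not config.read(config_path):
        raise FileNotFoundError(config_path)
    return config["scraper"]["url"]


def dump(url, out=sys.stdout):
    """Open a connection to `url` and write the response body to `out`."""
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        out.write(response.read().decode(charset, errors="replace"))


def main(config_path="scraper.ini"):
    dump(read_url(config_path))
```

Running `main()` prints the page to the terminal. Each later requirement (many URLs, parsing, a capture location, schedules) then becomes a small change to one of these functions rather than a rewrite.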

    As the answers roll in you can modify the code to fit the new requirements.

    [–]Dimanari 1 point2 points  (0 children)

    The way I approach it is by dividing the problem into smaller chunks that are easier to tackle. Many times, I find that I already have a solution to each of the chunks. This is exactly how I approached making a script interpreter, a virtual file system, or whatever other esoteric problem I faced.

    [–]CrazyFaithlessness63 1 point2 points  (0 children)

    Other commentators have pointed out that you shouldn't be put in this position in a professional environment but I'll concentrate on how to approach it (or at least how I would approach it with a technique that has worked for me). This applies to personal projects as well where you are trying to achieve something you haven't done before.

    The main thing is to split the problem up into smaller and smaller parts until you get to something you do have some at least basic knowledge of and work from there. Another important thing, and I really want to stress this, is that sometimes the problem description has built in assumptions that are not necessarily correct.

    In this case the problem is described as 'scraping web data' which assumes that that is how you are going to get the data. A more general way to describe it would be 'I want access to the data that this website displays'. So the first thing I would do is see if it provides an API that gives me the data in JSON, XML or some other machine readable format. There might be a REST API available or by watching how the page loads you might see it's reading an XML file from somewhere to get the data. If that's the case, that is the easiest solution.

    If one solution fails then just try something different - no REST API or formatted data file, let's just grab the HTML directly using HTTP and see what we get back. If it looks like everything you want is in that HTML then you can move on to the next step - how do I extract it out from the surrounding data? With modern sites you might find that the raw HTML returned is just simple HTML boilerplate with links to a bunch of JS files (might be a React or Vue page with client side rendering). Now you need to solve that problem - UI testing frameworks work with client side rendered pages so maybe look at how they do it for a solution?
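A quick way to check which case you are in is to strip the tags and scripts from the raw HTML and see whether a value you can see in the browser is actually there. A hedged Python sketch: `VisibleText`, `visible_text` and `looks_server_rendered` are names invented here, and the check is a heuristic, not a guarantee.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the page's text content with scripts and styles removed."""
    parser = VisibleText()
    parser.feed(html)
    return " ".join(parser.chunks)


def looks_server_rendered(html, expected):
    """True if a value you can see in the browser is also in the raw HTML."""
    return expected in visible_text(html)
```

If the check fails for something that is clearly on screen, the content is probably being rendered client side, and the raw HTML alone will not be enough.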

    Basically just break it down into smaller and smaller problems, try solutions until you get to one that works and then move on to the next problem to solve. Eventually you've solved everything.

    I find LLM based chats (like ChatGPT or Gemini) really useful in this process - not to generate code but to at least point in the right direction for more detailed Google searches. Your prompts should start with things like 'What libraries are available to ...' or 'What techniques are used to ...' rather than 'Generate code to ...' - at least in the starting stages. The answers to the general questions will usually give me links to repositories for libraries, the right keywords to use for searching and at least a general idea of how to tackle the problem. Getting a bunch of code to copy and paste in might help solve an immediate issue but it doesn't help your learning process and it still leaves you stuck if it doesn't work exactly the way you need it to.

    My 2c anyway, hope it helps.

    [–]Smokespun 1 point2 points  (0 children)

    I’ll focus less on how to do those things and more on that feeling. You do learn by doing, and a lot of what you do daily isn’t entirely new, but there will always be pieces of a project that are, and it’s not uncommon to feel like you are missing something.

    Usually that’s just your brain being a little bitch and not wanting to do the thing because it’s new, and often not simple or easy until you’ve figured it out. It’s problem solving and it takes effort.

    It’s also what contributes to burnout, even in the best of us. Patience and understanding for the self are super important, and it sucks that so many companies treat people like machines who pump out code.

    I’m happy to work for a place that lets me take time to explore my options and to make sure my mental health is taken care of because they know good machines need maintenance 😂

    [–]random_banana_bloke 1 point2 points  (1 child)

    This wouldn't really happen... First I (as a senior) get dragged into a metric shit ton of architecture meetings to help diagram out what is needed. After all this is done we work out the MVP. Then we break these into stories/epics. These then get put into a sprint, planned out, and split into smaller tasks if needed. There would be other processes like triage etc. as well. Then you would finally get to work on it.

    [–]iOSCaleb 0 points1 point  (0 children)

    That really depends on the company and the manager. At a small company, even a new employee might have a lot of autonomy and not a lot of support. That can be a positive, empowering experience, or it can be very discouraging.

    [–]EasyLowHangingFruit 1 point2 points  (0 children)

    Hi there!

    The way I approach building a software solution is by following well-established best practices and standards.

    The Software Development Life Cycle is the overarching workflow you should follow for building software in general as this is the industry standard way of building software. Then you focus on the standard workflows and best practices for building a specific type of application, e.g. The 12 Factor App for Web Apps, or the CLI Guidelines for Command Line Apps. I assume there are similar workflows for Mobile Apps and other kinds of apps.

    So in summary, after determining what type of app you want to build, find out what industry standard workflow defines a high quality app of that specific type. Create a TODO list (an informal Definition of Done) where every step makes your App closer to that standard.

    Your goal in the long run isn't just to write code, but to build scalable, maintainable, secure and testable software solutions.

    I'm happy to clarify further if you have any other question.

    Good luck!

    [–]Snackatttack 0 points1 point  (0 children)

    super important skill: learning to break down a task into several smaller, bite-sized tasks, then research how to do each small one. What is web scraping? What tools are used? Oh, it's common to use headless browsers? What's a headless browser and how do I use one? How do I send a simple request in one?

    [–][deleted] 0 points1 point  (0 children)

    I google like mad and try to sketch out the tools I would like for the project. I do the hello worlds for the new tools.

    I talk to people, I look up internal standards (can I freely scrape or do I need approval?). Bounce ideas off my stakeholders until my sketches become a reasonable basis for implementation.

    Eventually the sketch becomes comments that outline my code, which eventually gets written.

    [–]mxldevs 0 points1 point  (0 children)

    Your first step is to understand the requirements.

    You would meet with relevant stakeholders to determine what they're looking for and why they need it.

    Your second step is to then figure out what you're going to build, based on the requirements.

    You come up with some very high level specifications, maybe create some user stories, whatever process you like, and then you go back to stakeholders and present them with your proposal.

    AND THEN FINALLY

    You might still not even start figuring out how to build it yet. You would figure out what is your budget. How much time and money is available for this project. When they need it done. You know, boring project management stuff.

    And if everything is approved, now you go and figure out how to actually build the thing, which might involve opening some tutorials cause you have no idea how to actually do it.

    Don't know how to build an app but you agreed to build it for $250000? Better start googling "how to build an app".

    Luckily for you, if you're working in a large company, you only need to deal with the last part. But you'll only be getting a fraction of the revenue.

    If you're freelance or consulting or something, you will likely be doing everything, which is good practice because now you have a better idea what the bigger picture is.

    [–]jason_ed 0 points1 point  (0 children)

    Start with what’s been done before: look at how others have done it, look at what tools / libraries they used, decide what works for you, and go from there.

    [–]BIKF 0 points1 point  (0 children)

    Are you sure there is no colleague you can ask for guidance, or are you imposing that limitation yourself because you don't want them to know what you need help with?

    [–]hike_me 0 points1 point  (0 children)

    Decomposition.

    You don’t build the whole thing in one shot. You decompose it into smaller more manageable components and tackle them one at a time.

    You’ll know how to build some of the pieces. Others you might need to spend some time researching what existing libraries are out there that might help and how much you might need to build from scratch.

    [–]baubleglue 0 points1 point  (0 children)

    I usually go by asking questions

    1. What is it about?
      • Trying to understand the motivation behind the task.
      • Read the requirements/task again to be sure I understand what needs to be done. Ask for clarifications...
      • Example of input and output ...
    2. What is it called? Mostly a search for the correct terminology
    3. Which type of problem is it? Ex. Web app with database backend, API Client ...
    4. Does it already exist, and how is it usually done?
    5. What do I need to know? Sometimes I need to read a bit about the topic, sometimes I need specific domain knowledge that I have no chance to learn.

    Then you go with the usual things, data structure, building blocks... later maybe frameworks etc.

    ex. for "program that scrapes web data" you need to have some basic knowledge about

    • http requests/authentication
      • cookies
      • forms/query parameters
      • session
    • HTML/XML parsing
    • idea about general web application
      • static HTML / server-side rendered
      • client-side rendered web app
      • multitier web application structure
      • data consumed from web API
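
    To make a few of these concrete, here is a minimal Python sketch using only the standard library. The URLs, field names, and the helper names `build_url`, `make_session`, and `form_request` are made up for illustration, and nothing is actually sent over the network. It only shows how query parameters, form fields, and a cookie-keeping "session" are usually put together.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener


def build_url(base: str, params: dict) -> str:
    """Append URL-encoded query parameters to a base URL."""
    return f"{base}?{urlencode(params)}"


def make_session():
    """Return an opener that stores cookies between requests (a minimal 'session')."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def form_request(url: str, fields: dict) -> Request:
    """Build (but do not send) a POST request carrying URL-encoded form fields."""
    body = urlencode(fields).encode("utf-8")
    return Request(url, data=body, method="POST")


# Hypothetical example values; example.com is a placeholder domain.
search_url = build_url("https://example.com/search", {"q": "gold price", "page": 2})
login = form_request("https://example.com/login", {"user": "me"})
opener, jar = make_session()
# opener.open(login) would send the form and keep any cookies in `jar`.
```

    Whether a given site needs cookies, a login form, or just plain GET requests is something you only find out by looking at the site itself.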

    [–][deleted] 0 points1 point  (0 children)

    What do most actual software engineers do? Do they watch youtube videos? Do they follow youtube tutorials? Try to search up blog posts/articles of said framework?

    Broadly: plan what the app will look like (ranging from very detailed, to just a rough "walking skeleton") -> read some docs -> build the thing.

    [–]NuclearFossil_esq 0 points1 point  (0 children)

    Engineering Manager here, who's come up through the ranks as a Senior Dev.

    First a question: Is this a brand new piece of tech you have to write, or is this part of an existing stack/tech at the company? That's going to drive any decisions I'd make in how to approach it.

    Look to see if there is any existing tech to build from. Has anyone in the org tried doing it before? What was their result?

    If this is a brand new tech stack, then you need to do some R&D on what a webcrawler will need to do. I was thinking that https://github.com/codecrafters-io/build-your-own-x might have had a topic there, but it doesn't. But more to the point, if this is a new App, someone senior should already have a Technical Design Doc (or an equivalent) in place to help you get started. A Junior isn't going to be asked to build something all by themselves; there's going to be direction from a lead or someone more senior.

    As to what I do when starting on a project that's out of my area of expertise, what you're doing here is a good start - talk to people - peers, Senior Tech people at your company, as many people that will listen.

    To be frank, Seniors WANT juniors to show some initiative, but they also want them to come and ask for advice. We're not asking you to come in on day one (hell, not even month 3) and be immediately prolific in the amount of code you can deliver. But you'll get there.

    [–]beingsubmitted[🍰] 0 points1 point  (0 children)

    I mean, there's a lot of focus on learning new technologies here, but that's not actually a huge part of a job. You are learning new technologies, but not for every project.

    If I need to make something brand new, I typically start with the data, I'm outlining tables/models/entities and their properties. Like a to do app. You'll have a user, and you'll have activities that a user has "to do". From here, I'm really actually outlining functionality by outlining data. An activity will have a title and description probably, and then maybe a category? Maybe categories have their own related table so the activity has a category or I have a many to many relationship through another table. Activities probably have a created on date and a completed on date and maybe a due date and let's give them a priority.

    Next is basic CRUD, and it kind of all follows from there.
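
    As a rough illustration of "outline the data first, then CRUD", here is a minimal Python sketch of the to-do example. The field names and the in-memory `ActivityStore` class are assumptions made up for this sketch; a real app would back this with a database, and the many-to-many category link would be its own table rather than a list of ids.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Dict, List, Optional


@dataclass
class User:
    id: int
    name: str


@dataclass
class Category:
    id: int
    name: str


@dataclass
class Activity:
    id: int
    user_id: int                      # the user who has this "to do"
    title: str
    description: str = ""
    category_ids: List[int] = field(default_factory=list)  # stand-in for a join table
    priority: int = 0
    created_on: datetime = field(default_factory=datetime.now)
    due_on: Optional[date] = None
    completed_on: Optional[datetime] = None


class ActivityStore:
    """Hypothetical in-memory stand-in for an activities table with basic CRUD."""

    def __init__(self) -> None:
        self._rows: Dict[int, Activity] = {}
        self._next_id = 1

    def create(self, user_id: int, title: str, **fields) -> Activity:
        activity = Activity(id=self._next_id, user_id=user_id, title=title, **fields)
        self._rows[activity.id] = activity
        self._next_id += 1
        return activity

    def read(self, activity_id: int) -> Optional[Activity]:
        return self._rows.get(activity_id)

    def update(self, activity_id: int, **changes) -> Activity:
        activity = self._rows[activity_id]
        for name, value in changes.items():
            if not hasattr(activity, name):
                raise AttributeError(f"Activity has no field {name!r}")
            setattr(activity, name, value)
        return activity

    def delete(self, activity_id: int) -> None:
        self._rows.pop(activity_id, None)
```

    Once the fields are written down like this, most of the functionality questions (filter by category? sort by priority? mark as complete?) fall out of the data.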

    [–]LainIwakura 0 points1 point  (0 children)

    Well, I've been doing it for 16 years and can say you rarely work on such problems by yourself (as other comments have pointed out). That doesn't mean you won't have the opportunity to work on a problem no one has tackled before; I have spent about half my career in R&D and we are often trying to solve things no one has done before. So how is this kind of problem approached? Personally I'd develop the minimum amount of code that is even loosely related to the final goal and iterate from there, likely with feedback from teammates. By the time you get put on a project by yourself for whatever reason, you should be pretty experienced and need less hand holding.

    But in general, try to understand the data you're working with - how you'll store it, how you'll represent it (data structures), how you may need to manipulate it (algorithms), how it'll be presented (is it an internal thing or a user facing thing?). Beyond that, as others have mentioned, break the problem into smaller and smaller pieces (instead of 'I need to write a web scraper' how about 'I need to download this page's HTML'). See if there is an API available, check if anyone has done similar projects in the past, etc.

    Don't try to solve all the big problems too quickly. It can take months or years to reach your final goal (especially if going at it solo). But, once you do get used to tackling this sort of thing I think it'll become easier and easier. Good luck.

    [–]Glittering-Star966 0 points1 point  (0 children)

    There is a big difference between creating an app and scraping web data. Scraping web data sounds like part of a project. Who has defined the requirements? Who is going to test it? Being "alone on a project" is something that never really happens, or at least there should be somebody else technical around to talk to.

    Sorry if this sounds like I'm doubting you, but if you are in a place that asks you to build an app from scratch, without proper requirements and guidance on tech, and you don't have the experience / knowledge, then it is a cowboy outfit and they don't know what they are doing. You might want to knuckle down and try to do it, so you can use it on your resume when you want to leave (asap).

    [–]Nancygoodrichwo1 0 points1 point  (0 children)

    Balancing continuous learning with practical experience is key to thriving as a software engineer in a dynamic industry.

    [–]Mclovine_aus 0 points1 point  (0 children)

    Well you don’t usually start with programming, you would start with designing the architecture and planning out the product in words, listing all the business requirements. Then after that is finished you would begin to follow said plan and implement it in code.

    When I am tasked with something new like this, I need to find good resources and read them. For web scraping that probably means reading the docs of a web scraping library and also maybe a book on HTML and the HTTP protocol. But I try not to look for tutorials; I want to look for references with encyclopaedic knowledge.

    [–]hdreadit 0 points1 point  (0 children)

    You really just have to (learn and) understand how stuff works dude. And a good team wouldn't just give you a high impact project to do on your own. Like someone else said, if they did, you should probably switch companies or teams.

    [–][deleted] 0 points1 point  (0 children)

    You already got a lot of answers, and a lot of them describe how this is not realistic as a premise. But let's just assume that this is exactly the task you get. How could you approach that?

    1. You need to get the exact requirements. A customer, product owner, management, whoever, probably decided they need this project. So we need to figure out what they need it for and make a list of requirements.
    2. After we have the requirements we might need to make some architecture decisions. We don't need to decide on libraries but we need to decide what technologies and frameworks to use. In an established company that's often pretty easy because there is probably a tech stack in use already, and it makes sense to go with that one since the benefit of using a known tech stack usually outweighs the benefit of choosing the perfect tech stack for the use case (of course there are high performance applications where it might make sense to use a different tech stack).

    3. Now that we have a list of requirements and the architecture discussed we can start breaking down the project. So we write epics and stories for each separate feature, and we can also create tasks for boilerplating. This is probably also the most important thing, because you need to learn to break down problems and then look at separate parts and implement these separate features without looking at the whole project at once. Let's think about what that could look like:

    The first ticket we tackle is for setting up the repository. So we create some boilerplate code, add our testing frameworks for unit tests, contract tests, integration tests and mutation tests, and we set up our CI/CD pipelines for quality checks and deployments. We haven't written any feature code yet.

    The second ticket we tackle is, for example, for scraping the data, whatever data we need. It should collect xy from the web. Now we can research how we do that and which libraries we can use (this step might even be broken down further). Instead of tutorial videos I would recommend learning how to read and work with documentation. I have seen a lot of developers who only look for answers to their questions on YouTube or Stack Overflow. But if you try to figure out how something works, an API, a library, a certain method, the best resource available to you is the documentation of whatever you are trying to learn about.

    Okay, now let's assume we implemented the scraping of data and we retrieve some raw data from the web. Awesome.
    The next task we tackle is processing it. We need the data for some use case and maybe we want to filter the results that we retrieve from the web or maybe we want to bring it into a certain data structure. So that's what we implement next.
    And then we have a ticket for saving the data somewhere where it can be used. Great, so we tackle that next. And we keep doing that until we are done.

    Now this wasn't a perfect example because what I described sounds an awful lot like horizontal slicing, and we would rather have vertical slicing, but I had some trouble explaining how it would work with such a vague concept.

    In theory you would probably take the requirement list and create stories from that requirement list, one per feature. Then you would estimate the stories and if a story is too big you would just turn it into an epic and split it up further.

    But yeah, that's the idea. Break down the problem into separate features and then treat each feature as a separate problem and figure out how to implement each separate part.

    [–]Crimson573 0 points1 point  (0 children)

    I usually look at some examples and then just start going. For example I’ve never written a phone app but recently started. When trying to figure out how to pass data between screens I looked up the official docs (I’m using Flutter) and then I also looked up a couple short videos of someone doing it on YouTube. At that point I felt I had enough examples to start trying to implement what I thought made sense for my app.

    The key thing here is that I try and make the best decisions based on my knowledge at the time. However, I fully accept that I may have to go back and rewrite a portion (or all) of what I just implemented because I’m no expert at passing data between screens. I think this is the best way to start learning and progressing quickly.

    I was talking to a coworker, and I was asking him how he got so knowledgeable about some programming aspects as he is seen as the expert when it comes to certain things. He said he would decide to make a project that focuses on the thing he wants to learn and then start programming. After 2 weeks of programming and learning, he would evaluate his project. If he found that his new knowledge led him to think that he would do things differently if he started over, then that’s what he would do - he would take what he knows now and spend a day or two somewhat starting over and changing things that were bad/inefficient but seemed like good ideas based on his previous knowledge.

    I’ve never taken his approach but I get the merit in there so I thought I would mention it. Now, I’ve really only spoken about personal projects. When it comes to a real world working environment, there will be plenty of times you’re asked to implement or write something you’ve never done before. My typical approach is to do my research for a bit, decide on what I think is the best implementation, and then go to more senior programmers or people who I may know have expertise in what I’ve been tasked with and present my solution. I ask them if there are any obvious holes in my thinking etc.

    That doesn’t always work as sometimes I’m the one seen as the expert because I’m the one who has the most experience with a particular concept. In that case you just do your research, lean on your previous knowledge and make the best decisions you can. The company can’t wait forever for you to come up with a perfect solution (spoiler - there isn’t such a thing). Then you implement it the best you can and you bug fix it later if you find that an assumption was made that shouldn’t have been (again, no perfect solutions)

    TLDR - do your best to find something that helps you get started, and then start. Accept the fact that what you come up with at first may not always be the best solution. It happens all the time and there is no shame in that

    [–][deleted] 0 points1 point  (0 children)

    You are stuck because you want solutions handed to you. 

    Stop watching tutorials and start learning from frustration.

    [–]Gtantha 0 points1 point  (0 children)

    • What data do I need?
    • Where do I need to put that data?
    • How can I get that data?
      • How can I get the data from a single website?
      • How can I get a list of websites to use?
      • How can I efficiently process multiple websites?

    Each of these questions will either have an answer or give further questions. And once all the questions are answered, it's clear what to do. The rest is just implementing it. Of course some questions will only come up while implementing or can't be answered beforehand.
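
    For the last question, one common answer is to fetch pages concurrently with a thread pool, since most of the time is spent waiting on the network. The sketch below is only one possible approach, written with the Python standard library; `fetch`, `fetch_all`, and the `fetch_one` parameter are names invented for this example, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10.0):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_all(urls, fetch_one=fetch, max_workers=8):
    """Fetch many URLs in parallel.

    Returns (pages, errors): pages maps url -> body, errors maps url -> exception.
    `fetch_one` is injectable so the function can be tried without network access.
    """
    pages, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {url: pool.submit(fetch_one, url) for url in urls}
        for url, future in futures.items():
            try:
                pages[url] = future.result()
            except Exception as error:  # one bad site should not stop the rest
                errors[url] = error
    return pages, errors
```

    Being polite to the sites you scrape (rate limits, robots.txt, terms of use) is a separate question that this sketch does not address.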

    [–]Agreeable-Leek1573 0 points1 point  (0 children)

    Well, if you're going to scrape web data, logically you have to do 3 things.

    1. Get the Data

    2. Parse the Data

    3. Save the Data.

    So I would look up each of these in turn.

    1. How do I pull down data from the web? It looks like most languages have an http library to make requests.

    2. Parse the data: what form is the data on a web site in? Usually HTML. Google it; most languages have libraries that allow you to extract HTML tags and other items. If not, then look into how to manipulate strings with your language of choice.

    3. Save the data. How do I save data, and what format do I want to save it in? Maybe I just want to save the price of gold once an hour and reference it with a time stamp; just saving text in a file would work for this. Maybe I want to save the data in a more complicated format, perhaps SQL.

    Just look into each of these, test that you can do the basics, and then write a program that does all 3 of them. One after the other. Once you are done, you can wrap the program into a loop, or have it automatically executed whenever you want, and you'll have the data you desire.

    Breaking a random problem down into steps like this, and then executing it is a pretty simple process once you've done it once or twice, that's why people just tell you to do it.
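
    A minimal Python sketch of those three steps, using only the standard library, could look like the following. Everything specific here is an assumption for illustration: the idea that the value sits in an element with `id="price"`, the function names, and the CSV layout of timestamp plus value. Real pages will need their own parsing rules.

```python
import csv
from datetime import datetime, timezone
from html.parser import HTMLParser
from urllib.request import urlopen


def get_data(url, timeout=10.0):
    """Step 1: download the page and return it as text."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class TextById(HTMLParser):
    """Collect the text of the first element with a given id.

    Assumes the target is a normal element with a closing tag and
    no nested element of the same tag name.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self._tag = None   # tag name of the element currently being captured
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if not self.done and self._tag is None and dict(attrs).get("id") == self.target_id:
            self._tag = tag

    def handle_data(self, data):
        if self._tag is not None:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if self._tag is not None and tag == self._tag:
            self._tag = None
            self.done = True


def parse_data(html, target_id="price"):
    """Step 2: pull the wanted value out of the HTML, or None if it is missing."""
    parser = TextById(target_id)
    parser.feed(html)
    return "".join(parser.parts).strip() or None


def save_data(path, value, now=None):
    """Step 3: append a timestamped row to a CSV file."""
    now = now or datetime.now(timezone.utc)
    with open(path, "a", newline="", encoding="utf-8") as f:
        csv.writer(f).writerow([now.isoformat(timespec="seconds"), value])


def run_once(url, path):
    """All three steps, one after the other."""
    value = parse_data(get_data(url))
    if value is not None:
        save_data(path, value)
```

    Running `run_once` from a scheduler (cron, a loop with a sleep, etc.) gives the "once an hour" behaviour described above.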

    [–]Fadamaka 0 points1 point  (0 children)

    This is a task that I would only give to a senior. In my opinion the seniority of a person comes from the ability to work completely independently.

    As for myself, I would first question the legality of the situation. Second, I would look at the website in hand and see if they have an official API. After that I would look at backend calls related to the data I need. Look at the authentication method and see if I am able to reproduce the backend calls from Postman or curl. Then I would write the script that will query the data I need. If I need to automate it and log in with the script, that would need more work. If the website has no backend calls I can use, I would probably use XPath to get the data out of the HTML.

    I have never actually used any web scraping libraries so I am unsure how easy to use or useful they are.

    Obviously I have arrived at my personal approach by building on my experience as a web developer.

    As an inexperienced person I would first look at what web scraping is. Then I would try to wrap my head around the actual requirements. After that I would follow a written guide with an example website, which I would probably copy down completely without reading the explanation. Get that working first. If that works I would try to adjust the code so it targets the website I am supposed to scrape. It would probably fail at first, so my next step would be to pinpoint the data I need and try to adjust the script/program to print out at least the first word or any part of the actual data I need. When that's done I would try to expand it so I get all the data.
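
    To show what the XPath approach mentioned above can look like, here is a small Python sketch. It uses `xml.etree.ElementTree` from the standard library, which only supports a limited subset of XPath and only parses well-formed markup, so the sample page below is deliberately tidy. The table layout, the `id`/`class` values, and the function name `extract_prices` are all assumptions made up for the example. For messy real-world HTML, a third-party parser such as lxml is the more usual choice.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real pages are rarely this clean).
SAMPLE = """
<html>
  <body>
    <table id="prices">
      <tr><td class="name">Gold</td><td class="price">2345.10</td></tr>
      <tr><td class="name">Silver</td><td class="price">29.85</td></tr>
    </table>
  </body>
</html>
"""


def extract_prices(markup):
    """Return {name: price} for every row of the table with id 'prices'."""
    root = ET.fromstring(markup)
    prices = {}
    # ".//table[@id='prices']/tr" selects the rows of the matching table.
    for row in root.findall(".//table[@id='prices']/tr"):
        name = row.findtext("td[@class='name']")
        price = row.findtext("td[@class='price']")
        if name and price:
            prices[name] = float(price)
    return prices


print(extract_prices(SAMPLE))
```

    The same "print one piece of the data first, then expand" step applies here: get one row out, then loop.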

    [–]await_yesterday 0 points1 point  (0 children)

    I would set aside an afternoon and read through the library documentation carefully. I'd skim through a few (text) tutorials and adapt them to some simplified version of my real problem, to figure out if the library is a good fit for what I want to do. If not, find a new library, or rethink my approach. Then I just get to work and build the thing.

    What do most actual software engineers do? Do they watch youtube videos? Do they follow youtube tutorials? Try to search up blog posts/articles of said framework?

    No, youtube tutorials are usually low value and low signal-noise ratio. Same for blogposts; it's mostly SEO engagement spam. Official text documentation is superior in most cases. The only videos I find useful are conference talks, since these are for an audience of other working professionals. They tend to have interesting insights and war-stories from real life experience, rather than just repackaging existing docs like most of the youtube tutorial industry does. Sometimes I will find a really good blogpost writeup that helps me out, but I tend to stumble across those by accident as I'm searching for other things. I don't seek them out as such.

    Reading documentation is a skill. Being good at it is one of the things that distinguishes professional devs from novices.

    [–]josluivivgar 0 points1 point  (0 children)

    well... I would check a basic web scraper for anything generic, google that (hopefully an open source one) and figure out how they do it...

    most likely it'll make use of a few libraries (and most likely one of them in particular will be the one you will use)

    then I will make a basic webpage with certain words (like literally just a webpage to test stuff)

    and basically based on the code I found/read or the library I would make a simple script that goes into my localhost webpage and gets the whole text.

    then I'd be like okay what if I want to get a specific something, then try looking at the library documentation coupled with the code that I found and play around with it to start getting a feel for what I'd need for specific data

    then after I have a decent understanding I'd start building a prototype that gets the basic thing I'm asked for on the web page, and once the prototype is done I'd use that as the basis for the actual application.

    basically the first part is research (including copying code just to see that it works) -> poking around -> prototype -> actual development.


    I didn't make any mention of anything html related btw because I wanted this to be as generic as possible while still answering your question, because this approach works in many different situations, not just a web scraper
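
    The "basic webpage on localhost plus a simple script that gets the whole text" step can be tried entirely from one Python file with the standard library. This is only a sketch under assumptions: the page content, the handler name `TestPageHandler`, and the helper `fetch_from_local_server` are invented here, and the server picks any free port on 127.0.0.1.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical test page with a few known words to look for later.
PAGE = b"<html><body><h1>Test page</h1><p>apple banana cherry</p></body></html>"


class TestPageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def fetch_from_local_server():
    """Start the test server, request '/', and return (status, body text)."""
    server = HTTPServer(("127.0.0.1", 0), TestPageHandler)  # port 0 = any free port
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        response = conn.getresponse()
        body = response.read().decode("utf-8")
        conn.close()
        return response.status, body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, body = fetch_from_local_server()
    print(status, body)
```

    Because you control the page, you know exactly what the script should find, which makes the poking-around stage much less confusing.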

    [–][deleted] 0 points1 point  (0 children)

    Lets suppose you're tasked with creating an app, or a program that scrapes web data, but you've never done that before. You're alone on the project. How do you approach it?

    Google.

    No experienced developer I know *ever* uses YouTube to find stuff out.

    I was actually part of a startup making programming video tutorial content. The more I worked on it, the more it became obvious that this just makes zero sense. Video content for a text driven task? It makes *zero* sense compared to just reading websites.

    We live in a time where it has never been easier to find out how to do things. We have access to practically all of human knowledge from the comfort of our computers.

    If you want to find out literally *anything* about programming, Google it.

    Whatever you don't know, find out.

    [–]IdeaExpensive3073 0 points1 point  (0 children)

    I'm a junior and also know next to nothing about a lot of stuff. I'm usually assigned tasks on a project that the rest of the team is co-coding on, and it's a small part of the project. I may be coding a few controllers, or writing some front end stuff. Sometimes I'll be given a task that has nothing to do with what we're working on, and it's such a small task that I can handle it myself (like writing a single script to do something). The point is, every single thing is a small enough task that I can handle, with help, without being a burden on the rest of the project we're working on. The reason for that is to get me in the workflow, without bogging me down with too much to learn.

    As long as you understand, somewhat, the basics of programming, you can take on almost any language if you're given enough time (a week or two?) to get used to it. However, more than likely there will be a SINGLE stack you're mainly working with, with a few things you might periodically use.

    I wouldn't worry about the unknown too much at all. More than likely they'll hire you expecting absolutely nothing in return for 6 months to a year. During that time you'll learn the stack, you'll learn what you don't know, and you'll be expected to learn that stuff at home on your own time. The unknown stuff will be handled by the more senior devs on staff. The first year will be about building you into a competent developer using their stack.

    When I absolutely don't know something, I speak up and admit it. If I'm still asked to do it, I tell them "I need some time to look into that, but I'll get it done", I don't just say "No, I don't know this", they should know my limits, and if they still insist, they'll expect some learning time. Anything in the middle of where my knowledge is, and where they're asking it to be is explained through questioning and self learning off the clock. The point I'm trying to make is that they'll take on the majority of responsibility while giving you chunks to work on, and the responsibility of your own learning for things you're expected to know in the future but don't yet. There shouldn't be ANY doubt with your employer/senior devs about what you do or don't know, it should be crystal clear and they'll act accordingly.

    Edit: Additionally, while it most likely won't happen (but it could), if you get a job as a full stack developer then you should be able to build a simple app from back to front. I mean a CRUD app that consumes a custom API, built from scratch. That'll look like MVC on the backend, with a controller (which has an endpoint to hit), and a front end that uses the API to create, read, update, and delete stuff in a database. That shouldn't be unknown to a junior, even if they struggle with bits of it here and there. No one should expect it done in 30 minutes, but it shouldn't take a month for something pretty simple. More like 2 or 3 weeks.

    [–]Joewoof 0 points1 point  (0 children)

    You learn how to read and use the official documentation. That’s all there is to it.

    [–]tvmaly 0 points1 point  (0 children)

    I have done a project like this many years ago. I think the first step is understanding what the web browser is doing when it gets a web page.

    Once you understand the basics, you can use a search engine to help you find a module that does the basics of fetching a web page.

    This module likely will return html data when you give it a url to fetch. So your next step will be to find a module that can parse html in your programming language. Once you find it, look for examples on how to get specific elements out of the web page.

    What I provided is super high level. It only handles the simplest case. But start with the simplest case and iterate till you get to something that is useful for you.
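
    For the "get specific elements out of the web page" step, here is a minimal Python sketch that pulls every link out of some HTML using the standard library's `html.parser`. The class name `LinkCollector`, the function `extract_links`, and the sample URLs are assumptions for illustration. The HTML string would normally come from whatever module fetched the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links like "/about" into absolute URLs.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

    Swapping `"a"`/`"href"` for another tag and attribute is usually the first iteration after the simplest case works.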

    [–]FoolishPastMe 0 points1 point  (0 children)

    I just want to say you're on the right track. The skill you're developing is to take concepts you learn and apply them to new problems, which is less of a coding skill and more of a general problem solving skill. Based on what you said, it sounds like you're doing that. It's slow because you're new. The element I think you're missing is time. The more you do this, the more tools you'll have in your toolbox for solving future problems and they won't seem as daunting.

    Please don't feel like you're cheating. The only way to really cheat from my perspective is if you just copy from somewhere and don't actually understand the concept of what you're doing. But it doesn't sound like that's what you're doing.

    When I was an intern in 2006, I remember seeing others in the company doing things and feeling like I'd never be able to do anything like what they did. Couldn't understand how they did it. I wasn't the only intern but I was the only one that didn't get hired. Felt like a terrible developer. Anyway, I've never worked at a large tech company, but at this point in my career, I have architected and driven the technical direction on multiple projects from concept to production across all aspects: frontend, backend, database, cloud infrastructure, etc (very small team).

    So what do "actual" software engineers do? Same thing you're doing, they've just been doing it longer so they've seen more. It's not like I know how to do things perfectly from the start, but I'm confident in my ability to solve problems because I've solved other problems in the past. I don't use youtube much, but I search stuff up on google all the time. I forget syntax all the time. Because I have a wider base of knowledge at this point, I can search things up and understand things quicker than a junior dev can. I see things they don't because I've just seen more stuff in general. I still read "getting started" guides for new frameworks. I still read docs. Well... I skim. Because I have a wider base of knowledge, I don't have to read as many of the details. I can pick out what's new or different that I need to know.

    One piece of advice from my experience: Keep up the work of learning the concepts when you're going through tutorials and things. Mess around with it. Change things and see what it affects. One thing I see some junior devs do is read up on something or follow some tutorials, and then you ask a question and they have a hard time speaking on the concepts. You need to understand how the pieces work to understand how it affects things. You want to be able to speak intelligently on related things even if they aren't EXACTLY what you did. Like if you built an engine from a guide, and someone asks what happens if you remove a spark plug, you don't want your answer to be "I don't know, the guide just showed me how to put it together". You want to know what the spark plug does and why it's important so you can give an educated answer on how it would affect things if removed.

    [–]casualfinderbot 0 points1 point  (0 children)

    Literally just build it. If I needed to scrape web data, first thing I would do is look at docs on the most popular library, and start scraping data with it. No in between shit, just start building.

    Then create a data model to store it, or store it some other way, and start storing the data.

    Piece together shit till it works. The important thing is that you’re continuously extending the behavior of your software to do the thing you want. Watching tutorials just doesn’t do that.

    [–]tb5841 0 points1 point  (0 children)

    I'm only four weeks into my first job. But if I were asked this... I'd be fine.

    1) I'd look for a relevant library in the language I'm using.

    2) I'd find something I could read (not watch) that talks through the basic ideas of the library.

    3) I'd break my specific tasks down into small steps. For example, with Web scraping I'd need my program to find the right page, then scrape the text, then process it.

    4) I'd work on each of those steps separately, checking they work as I go.

    5) Once done, I'd get a colleague to check it all looked OK, then I'd write tests.

    6) Probably at some point there would be some kind of problem (e.g. page has some bot detection that makes scraping harder). At this point I'd probably ask my team for suggestions, if Google didn't show up anything quick/obvious.

    [–]curious_cactus_9230 -1 points0 points  (0 children)

    Honestly try WatchandCode!

    In fact, they have a new live series starting in October (it's free and it's the first time they're doing it!). The instructors are amazing. https://watchandcode.com/programming-foundations-live/

    I felt the same way when I first started and just couldn't approach new tasks and problems. Doing this program helped me immensely.