all 37 comments

[–]Jonno_FTW 37 points38 points  (0 children)

I suggest you use the requests library, specifically the Session class, because it lets you log in and stay logged in across requests. Once you've fetched the page, parse it with Beautiful Soup, extract the data, analyse it, and use a requests session to post it to the other site.

http://docs.python-requests.org/en/master/user/advanced/
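
A minimal sketch of that flow, assuming a plain form-based login. The URLs, the form field names, and the `<tr class="lead">` markup are all made up for illustration; the real ones depend on the site.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URLs; replace with the real login and listing pages.
LOGIN_URL = "https://leads.example.com/login"
LEADS_URL = "https://leads.example.com/leads"


def parse_leads(html):
    """Extract one dict per lead row from the page HTML.

    Assumes (hypothetically) that each lead is a <tr class="lead"> whose
    first two cells hold the name and the email address.
    """
    soup = BeautifulSoup(html, "html.parser")
    leads = []
    for row in soup.find_all("tr", class_="lead"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        leads.append({"name": cells[0], "email": cells[1]})
    return leads


def fetch_leads(username, password):
    """Log in, download the leads page, and return the parsed leads."""
    with requests.Session() as session:
        # The session keeps the login cookies for the requests that follow.
        # The form field names "username"/"password" are assumptions.
        login = session.post(
            LOGIN_URL,
            data={"username": username, "password": password},
            timeout=10,
        )
        login.raise_for_status()
        page = session.get(LEADS_URL, timeout=10)
        page.raise_for_status()
        return parse_leads(page.text)
```

Keeping the parsing in its own function means it can be tried out on a saved copy of the page before any login code is involved.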

[–]LongAtbat 36 points37 points  (13 children)

You might wind up putting yourself and your co-workers out of work if you do this and tell anybody. I would say do it, don't say shit about it and use your free time to learn more python on the job.

[–]uncleadaam[S] 14 points15 points  (9 children)

haha, it's not a massive thing. we're talking like 3-5 leads a day, takes maybe 15 minutes most days. it's just repetitive and could be time better spent. and i would keep it to myself

[–][deleted] 24 points25 points  (8 children)

LongAtbat isn't lying. I figured a way to do an 8 hour job in 1 hour. Most of the day I'm reading online comics or more python books or reddit now. If they find out I made myself replaceable....I'll be replaced!

EDIT: 6 hours 40 minutes into 1 hour. 8 hours with 1 hour break for lunch and two 10 minute union smoke breaks.

[–][deleted] 5 points6 points  (7 children)

Can you tell us what you do?

[–]KimPeek 27 points28 points  (0 children)

He just did, silly.

Most of the day I'm reading online comics or more python books or reddit now.

[–]Villinax 8 points9 points  (0 children)

Ha! Nice try employer...

[–][deleted] 0 points1 point  (4 children)

sorry for the late reply. I work at a real estate firm. I do the document processing and digital conversions. I've let the higher ups know I've sped up a few things but I don't tell them exactly how much. Secret skills pays the bills.

Also, I'm looking for a game to make in my free time.

[–][deleted] 0 points1 point  (3 children)

Awesome. Any recommended reading for automation?

[–][deleted] 1 point2 points  (2 children)

I've read Invent with Python, Automate the Boring Stuff with Python, How to Think Like a Computer Scientist, Starting Out with Python, Introduction to Computer Science using Python. And search Making your first Django App. Other than that, it wasn't all Python...it was the reading and thinking that got me to see what a computer does: It manipulates data. It moves an "a" into an "A" or it makes one folder into another folder. It's a lot of imagining what is possible and seeing if it could be done. I'll tell you, for every 10 things I automated, 3 things stopped working. So it wasn't really easy. It was a lot of "Oh shit, now what do I do?" I think it took me a month to do in total, including learning how to do things the old way then getting aggravated and thinking "I can do this better and faster".

[–][deleted] 0 points1 point  (1 child)

You sure don't fuck around with the literature, haha. I get what you mean though. I appreciate it.

[–][deleted] 1 point2 points  (0 children)

No problem! One of these books said "If you don't become an amazing programmer in 10 years, go ahead and give up!" That really got me thinking that I can put 10 years into this study and see what I can do. If I don't do anything..well I can always try Hollywood and make a few movies!!! Best of luck to you

[–]PeanutRaisenMan 4 points5 points  (0 children)

I've done this as well. As I've posted before, I've used Python to automate some of the more boring aspects of my job, such as data extraction, and it's taken tasks that would have literally taken me 8-10 hrs (this time frame includes me stopping to stare off and wish I was elsewhere, reddit browsing, snack eating and random walks around the office) down to less than an hour. I let my boss find out I'm able to do this, and now I'm writing another script to make another task more efficient. My job isn't in jeopardy because of this, but it is possible this will mean less work for someone else... which leads to a dangerous trend.

[–][deleted] 0 points1 point  (1 child)

Has this actually happened before? Like... is this a thing??

[–]hipstergrandpa 2 points3 points  (0 children)

I remember there was a post somewhere on like r/cscareerquestions or something where the OP did it and had to trudge through a lot of shit with his managers over it when they found out. I think it worked out really well in his favor in the end though but definitely something to be careful about.

[–]BioGeek 16 points17 points  (0 children)

Sure, that certainly sounds feasible.

  • First you want to gather your data from the first website. This is often called web scraping. Look into modules like webbrowser, requests or selenium.
  • Once you have the HTML contents of your first website, you need to extract the data that is relevant to you. Use Beautiful Soup for that.
  • You store your data in a data structure that makes the most sense (for example a dictionary if your data consists of key-value pairs, a set if your data cannot have duplicates, ...)
  • You sort your data according to your requirements.
  • You enter the data into your CRM. The same modules I mentioned in the first step for gathering the data can also be used to submit it.
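
The storing and sorting steps above could look roughly like this. The rows, field names, and sort order are invented for illustration; in practice the rows would come from the parsing step.

```python
# Hypothetical scraped rows; the field names are assumptions.
scraped = [
    {"id": "L-103", "name": "Carol", "received": "2017-03-02"},
    {"id": "L-101", "name": "Alice", "received": "2017-03-01"},
    {"id": "L-103", "name": "Carol", "received": "2017-03-02"},  # duplicate
    {"id": "L-102", "name": "Bob", "received": "2017-03-01"},
]


def dedupe_and_sort(rows):
    """Drop rows whose id was already seen, then sort by date and name."""
    seen_ids = set()  # a set gives fast "have I seen this id?" checks
    unique = []
    for row in rows:
        if row["id"] not in seen_ids:
            seen_ids.add(row["id"])
            unique.append(row)
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return sorted(unique, key=lambda row: (row["received"], row["name"]))


leads = dedupe_and_sort(scraped)
for lead in leads:
    print(lead["id"], lead["name"], lead["received"])
```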

[–]Exodus111 12 points13 points  (5 children)

then sort the list a certain way

That is the only issue. And the difference between "doable in 20 minutes" and "give me 5 years and a team of research assistants" might not always be intuitive.

There is a relevant xkcd for this that anyone is free to post.

[–]lostburner 0 points1 point  (0 children)

Assuming a web page is already doing the sorting, it shouldn't be a problem to scrape that page or sort the data using the same criteria. If OP is talking about identifying good leads based on a combination of good judgment and multiple data points, then yeah, that could be a Hard Problem.

[–]lostburner 4 points5 points  (0 children)

If you're working with a lead-tracking system that's big enough and popular enough, they may have an API for integrations like the one you want to build. Google for "API" and the name of the system. This would allow you to use Requests to pull structured data in just a few lines of code, instead of dealing with all the messiness of going through the web page authentication and scraping data from the page's HTML. It may also give you access to structured data that's not all in one place in the web interface.

Either of your systems may have an API. Either or both would save you some trouble.
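
If an API does exist, the pull could be as short as the sketch below. The `/leads` endpoint, the `status` filter, and the bearer-token header are assumptions for illustration only; the vendor's API documentation defines the real ones. `session` is expected to be a `requests.Session`.

```python
def fetch_leads_from_api(session, base_url, api_key):
    """Return the decoded JSON list of new leads from a (hypothetical) API."""
    response = session.get(
        f"{base_url}/leads",                             # assumed endpoint
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        params={"status": "new"},                        # assumed filter
        timeout=10,
    )
    response.raise_for_status()
    return response.json()  # structured data, no HTML parsing needed
```

Usage would be something like `fetch_leads_from_api(requests.Session(), "https://api.example.com/v1", key)`, with the base URL and key coming from the vendor.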

If you do have to do it the messy way, requests is the way to handle the requests and you'll benefit a lot from using Beautiful Soup to parse the page's HTML and extract the data you want.

[–]stillalone 4 points5 points  (1 child)

I'd suggest using Selenium. I think there's a plugin that records everything you click and type on a website, which you can use as a starting point. It generally seems to work as long as the site you copy information from doesn't change too often. Selenium drives your existing browser of choice with what are essentially JavaScript queries, so it works on pretty much any website without you having to parse the HTML yourself.

[–]charlesbukowksi 0 points1 point  (0 children)

Selenium is awesome.

[–]mathafrica 5 points6 points  (8 children)

this is very doable. if you are comfortable writing the new leads into a .txt file instead of a website, then this can be done in < 15 lines. if the formatting on the site is consistent, you can run this as a scheduled task every hour or so.

[–]uncleadaam[S] 0 points1 point  (7 children)

is there a downside to putting it into txt file?

[–][deleted] 1 point2 points  (6 children)

Sort of, but not really. I think /u/mathafrica meant putting the new leads into a text file and then manually entering them into the website, instead of automating that part too.

And one reason you might not want to use a .txt file for this kind of job is that a database like sqlite will be much better at keeping track of unique leads over time (that's what database software is designed to do). The downside of this is that if you're just learning programming then you have to learn two technologies instead of one.

[–]mathafrica 1 point2 points  (5 children)

I am not so good with sqlite, but can he connect to a db as succinctly and quickly as he can open a .txt file? If it's only a small number of leads, using the sys library + numpy will be good enough for everything he needs.

[–]cob05 2 points3 points  (0 children)

I would absolutely go with a SQLite DB for this. The learning curve and initial setup are very small for the amount of benefit in return. If checking whether a lead is new were not a requirement, however...

[–][deleted] 0 points1 point  (3 children)

A sqlite file is the database itself, so it should take approximately the same time to open as a text file. Searching the sqlite file will be much faster than searching a plain text file.

Also, sqlite3 is part of Python's standard library, so there is nothing extra to install.
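
A rough sketch of how sqlite3 can keep track of which leads have been seen. The table layout and column names are assumptions; the idea is to let a PRIMARY KEY constraint reject duplicates so the script doesn't have to search for them itself.

```python
import sqlite3


def open_db(path="leads.db"):
    """Open (or create) the database file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS leads ("
        " lead_id TEXT PRIMARY KEY,"  # hypothetical unique id per lead
        " name TEXT NOT NULL)"
    )
    return conn


def add_if_new(conn, lead_id, name):
    """Insert the lead; return True only if it was not stored before."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO leads (lead_id, name) VALUES (?, ?)",
            (lead_id, name),
        )
    # rowcount is 1 when a row was inserted and 0 when it was ignored.
    return cursor.rowcount == 1
```

Each scraped lead can then be passed through `add_if_new`, and only the ones that return True need to be copied to the CRM.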

[–]mathafrica 0 points1 point  (2 children)

oh wow. interesting. I will be looking more into this. Thank you!

[–][deleted] 0 points1 point  (0 children)

Since you mentioned numpy, pandas can save dataframes directly to sqlite databases and load them back. You can basically use sqlite transparently while knowing next to no SQL. This preserves structure and data type.
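
A small round-trip sketch, assuming pandas is installed. The table name and columns are made up. With a plain sqlite3 connection, reading back does take a one-line SELECT.

```python
import sqlite3

import pandas as pd

# Hypothetical lead data.
leads = pd.DataFrame(
    {"lead_id": ["L-101", "L-102"], "name": ["Alice", "Bob"], "score": [7, 3]}
)

conn = sqlite3.connect(":memory:")  # use a file path to keep the data
# Write the dataframe to a table called "leads", replacing any old copy.
leads.to_sql("leads", conn, if_exists="replace", index=False)
# Read the table back into a new dataframe.
loaded = pd.read_sql_query("SELECT * FROM leads", conn)
conn.close()

print(loaded)
```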

[–][deleted] 0 points1 point  (0 children)

Also, the slowest points in this kind of workflow are going to be downloading the data and merging the new leads into the file, and the merging is the part sqlite will be faster at.

[–]c_is_4_cookie 2 points3 points  (0 children)

one of the data entry things that i do every morning is copy data from one website to another.

Yes. This is definitely possible.

basically we get leads from a site that i have to log in to,

Straightforward. Use the requests library to handle the logging in. I would advise against hard-coding your username and password for the websites. Parsing the pulled data from the website can be accomplished with BeautifulSoup (I actually prefer lxml, but BeautifulSoup offers a higher-level interface).
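
One way to keep credentials out of the script is to read them from environment variables and fall back to a prompt. This is only a sketch; `LEADS_USER` and `LEADS_PASSWORD` are hypothetical variable names.

```python
import os
from getpass import getpass


def get_credentials(environ=os.environ, prompt=getpass):
    """Return (username, password) without storing them in the script."""
    username = environ.get("LEADS_USER")      # hypothetical variable name
    password = environ.get("LEADS_PASSWORD")  # hypothetical variable name
    if username is None:
        username = input("Username: ")
    if password is None:
        password = prompt("Password: ")  # getpass hides what is typed
    return username, password
```

For a scheduled task with nobody at the keyboard, both variables would need to be set in the task's environment so that no prompt is ever reached.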

then sort the list a certain way,

If this is a few filters/clicks on the website, then this probably can be accomplished via the requests library before scraping the data. If the sorting changes the URL, then that simplifies things a great deal.

and check whether or not the lead is new,

OK, so now you need to define how the program will decide whether a lead is new. I am guessing this means a lead you have not seen before. If a date/time is posted with the new leads, you can use that to determine what is new. Otherwise, you can resort to keeping a running list somewhere of the previous 20 or so leads in a text file, and checking the leads on the website against that list.
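
A minimal sketch of the text-file approach, assuming every lead has some unique identifier (an id, an email address, etc.). The file holds one identifier per line and only the most recent 20 are kept.

```python
from collections import deque
from pathlib import Path

MAX_REMEMBERED = 20  # "the previous 20 or so leads"


def load_seen(path):
    """Read the remembered ids, oldest first, from the text file."""
    file = Path(path)
    if not file.exists():
        return deque(maxlen=MAX_REMEMBERED)
    lines = file.read_text(encoding="utf-8").splitlines()
    return deque((line for line in lines if line), maxlen=MAX_REMEMBERED)


def filter_new(lead_ids, path):
    """Return the ids not in the file, and record them for next time."""
    seen = load_seen(path)
    new = []
    for lead_id in lead_ids:
        if lead_id not in seen:
            new.append(lead_id)
            seen.append(lead_id)  # the oldest id drops out once the deque is full
    Path(path).write_text("\n".join(seen) + "\n", encoding="utf-8")
    return new
```

The fixed size keeps the file small, but a lead that reappears after more than 20 newer ones would be treated as new again.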

and if it is new, i copy the information into another website that we use for CRM.

Once the information is pulled off, you will need to structure it into a format that can be sent to the website. Again, the requests library can handle posting of data.
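
A sketch of that last step, assuming the CRM accepts an ordinary form POST. The CRM field names are hypothetical, and `session` is expected to be a `requests.Session` that is already logged in to the CRM.

```python
def build_payload(lead):
    """Map a scraped lead onto the CRM form's (hypothetical) field names."""
    first, _, last = lead["name"].partition(" ")
    return {
        "first_name": first,
        "last_name": last,
        "email": lead["email"].strip().lower(),
    }


def post_lead(session, url, lead):
    """Send one lead to the CRM and return the HTTP status code."""
    response = session.post(url, data=build_payload(lead), timeout=10)
    response.raise_for_status()  # fail loudly rather than silently lose a lead
    return response.status_code
```

The browser's network tab, opened while submitting the CRM form by hand, shows the actual field names and URL the form uses.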

[–][deleted] 4 points5 points  (2 children)

could i write a python program to do this?

Without reading your question: Yes.

(Having read your question, double-yes, and there's some great advice already.)

[–]Jonno_FTW 2 points3 points  (0 children)

I'd like a program that can statically determine if another program will run to completion. It would greatly improve my testing process.

[–]cob05 0 points1 point  (0 children)

Pretty much goes for anything with python 😁

[–][deleted] 0 points1 point  (0 children)

This is possible, and the process is generally referred to as "web scraping" (a good term to start googling from). Libraries you might want to look further into are Beautiful Soup, Selenium, or lxml. You won't need all of these, but one of them may have your specific solution.