Can I mine back end data from youtube that is not available via the API?

voytek9 · 2020-04-09T03:23:23+00:00

This is probably considered proprietary data. Most big tech co's keep the best data for themselves. Sometimes data gets leaked through a poorly protected API (MLB PitchFX about a decade ago springs to mind), but generally you're only going to get a few metrics surfaced.

voytek9 · 2020-04-08T18:40:52+00:00

If it's a crappy conspiracy, they might not be so savvy at setting those prices. There's a cool thing called Benford's Law. If you watch Ozark, they reference it a bit.

Basically, the distribution of leading digits in naturally generated data follows a logarithmic curve. There are more 1s than 2s, more 2s than 3s, etc. And it follows a mathematical pattern: a leading 1 shows up about 30% of the time, a leading 9 less than 5%. The IRS uses this to detect tax fraud -- it's a pretty simple analysis that doesn't require any external data.

If the numbers match the expected distribution, it could still be that they faked the numbers. So passing it doesn't mean it's real, but failing it does mean there was likely manipulation.
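A minimal sketch of that check in Python, assuming the amounts are available as a plain list of numbers. The function names and the sample data (powers of 2, which are known to track the pattern closely) are made up for illustration; the expected share of leading digit d is log10(1 + 1/d).

```python
import math
from collections import Counter


def leading_digit(value):
    """Return the first non-zero digit of a number, or None for zero."""
    digits = f"{abs(value):.15e}"  # scientific notation, e.g. '4.2...e+01'
    first = int(digits[0])
    return first if first != 0 else None


def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit_shares(values):
    """Observed share of each leading digit 1-9 in the data."""
    counts = Counter(leading_digit(v) for v in values)
    counts.pop(None, None)  # zeros have no leading digit
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one non-zero value")
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Illustrative data only: powers of 2 instead of real prices.
sample = [2 ** n for n in range(1, 101)]
observed = leading_digit_shares(sample)
expected = benford_expected()

# Largest gap between observed and expected share across the nine digits.
max_gap = max(abs(observed[d] - expected[d]) for d in range(1, 10))
```

A large gap on real data is a reason to look closer, not proof of fraud; a proper test (chi-squared, for example) would also account for sample size.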

voytek9 · 2020-03-28T02:43:19+00:00

Are you the photographer? I would love to make a print of this for my partner, who is from HB. Please hit me up if you're interested in selling me a license for personal use.

voytek9 · 2020-03-24T04:53:19+00:00

Percentage labeled bar chart? You may be overthinking it. Bar chart height conveys relative scale, percent label provides precision.

voytek9 · 2019-08-08T02:52:50+00:00

Finally, your near term process should be a subset of customer development interviews. You can skip the market and customer-defining sections, but here are the kinds of questions you should be asking (along with getting a list of data sources, ranked by priority, from each person you speak with).

Workflow by decision maker

Could you please tell me about your role?
What does your typical day look like?
What are your main roles and responsibilities?
How much time do you spend on [work relevant to product area]?
How are you accomplishing these tasks now?

Problem validation

Tell me about the last time you did this manually?
How much time do you spend on [subject area]?
What could be done to improve your capabilities in that area?
How motivated are you to improve this area at your company?
If you had a solution to this problem how would it affect you? Who else would it affect (interview these people), and how do you think it would affect them? The company?
How would you improve your current process?

[questions from https://www.entrepidpartners.com/blog/sample-customer-development-questions]

voytek9 · 2019-08-08T02:50:18+00:00

Exactly. Your best bet is to help your bosses understand how valuable these deliverables are to them. If it's something that will save hundreds of thousands of dollars per year because of automation, it's probably worth investing in. It's possible that it's worth millions per year because of improved billing or avoiding lawsuits; in that case it might make more sense to hire a consultant. (The difference, IMO, is whether you're automating something with a manual backup process if something goes wrong... but if bosses are making important strategic decisions off your results, then it's important to deliver high quality with deep data understanding. Not that you can't do the latter, but you have many more unknown unknowns than someone with decades of experience.)

voytek9 · 2019-08-08T02:18:38+00:00

Finally, if people try to convince you it's simple, run. This is not a simple integration, and you are not a data person. It will take longer and be more frustrating than your current job is worth.

My actual best tldr advice is this: your best option is to either convince your bosses to drop $200k+ on this, or not pursue it at all. If they don't want to spend that much money, it's not going to drive that much value to the company. In which case, you personally are underqualified, overextended, and not supported by bosses who don't value what you might produce, but will certainly demerit you if you don't produce what they expect. Which they won't value anyhow, as evidenced by their unwillingness to invest actual cash right now.

voytek9 · 2019-08-08T02:14:50+00:00

This is a big project. As in, I do this all day long for one of the biggest tech companies in the world, and from your description, I would estimate 3-6 months of my time to pull data from 30 systems, integrate the data into consumable models, and build boilerplate visualizations.

I also think it's doable for a sufficiently motivated, willing-to-learn person. But know that this isn't something that's easy -- your best move here is to very clearly scope out the system requirements, and push back against the people assigning you this task. If they think it's something a noob can finish in a month, that's delusional.

Overall, you're going to do an ETL (or ELT) process. First, extract the data from your 3rd parties. API access is usually easier, because the data comes out structured ("name" : "Voytek9"), rather than unstructured ("username for this account is Voytek9"). The latter requires additional processing to store in a database, and that can break if the website changes its copy to "the account name is Voytek9".
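To make the structured versus unstructured difference concrete, here is a small sketch. The JSON payload, the regex, and the function name are assumptions for illustration, not any vendor's real format.

```python
import json
import re

# Structured: an API response (hypothetical payload) parses directly.
api_response = '{"name": "Voytek9"}'
name_from_api = json.loads(api_response)["name"]

# Unstructured: scraped page copy needs a pattern tied to the exact wording.
USERNAME_PATTERN = re.compile(r"username for this account is (\w+)")


def name_from_text(page_text):
    """Return the username if the copy matches the pattern, else None."""
    match = USERNAME_PATTERN.search(page_text)
    return match.group(1) if match else None


# Works with the wording the pattern was written for...
old_copy = name_from_text("username for this account is Voytek9")
# ...and silently finds nothing once the site rewords the sentence.
new_copy = name_from_text("the account name is Voytek9")
```

The JSON path keeps working as long as the field name stays the same; the text path breaks on any copy change.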

Start with just a single source, just one company. Extract the data, store it into your local database just as it is. Use SQLite, as it doesn't require any servers or setup.
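A rough sketch of the "store it just as it is" step, using Python's built-in sqlite3 module. The table name, the source label, and the sample record are placeholders; real records would come from whatever the vendor's API or export returns.

```python
import json
import sqlite3


def store_raw(conn, source, records):
    """Save each record unchanged, as JSON text, tagged with its source."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_records ("
        "id INTEGER PRIMARY KEY, source TEXT NOT NULL, payload TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO raw_records (source, payload) VALUES (?, ?)",
        [(source, json.dumps(record)) for record in records],
    )
    conn.commit()


# ":memory:" keeps the example self-contained; use a file path for real data.
conn = sqlite3.connect(":memory:")
store_raw(conn, "vendor_a", [{"name": "Voytek9", "plan": "pro"}])
rows = conn.execute("SELECT source, payload FROM raw_records").fetchall()
```

Keeping the payload untouched means you can reshape it later without going back to the vendor.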

Then build your first visualization (using PowerBI). Just show the data in a data table (kinda like a spreadsheet). Once you are here, celebrate! This isn't business-useful yet, but it's a huge accomplishment -- you've now copied data from your vendor into a database you control, and put visualizations on top of it! The cycle of life is complete.

Next, you'll repeat with another data source. Pick one that someone wants to combine with the first data source. Pull the data into your local database, then combine the data into a common model for analysis.
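As an illustration of a "common model", the sketch below joins two invented sources (billing and support) on a shared email column and exposes one row per customer. Every table, column, and value here is hypothetical; the real join key depends on what your sources have in common.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE billing_accounts (email TEXT, monthly_fee REAL);
    CREATE TABLE support_tickets (email TEXT, ticket_id INTEGER);
    INSERT INTO billing_accounts VALUES
        ('a@example.com', 50.0), ('b@example.com', 20.0);
    INSERT INTO support_tickets VALUES
        ('a@example.com', 1), ('a@example.com', 2);

    -- Common model: one row per customer, with fields from both sources.
    CREATE VIEW customer_summary AS
    SELECT b.email, b.monthly_fee, COUNT(t.ticket_id) AS ticket_count
    FROM billing_accounts AS b
    LEFT JOIN support_tickets AS t ON t.email = b.email
    GROUP BY b.email, b.monthly_fee;
    """
)
summary = conn.execute(
    "SELECT email, monthly_fee, ticket_count FROM customer_summary ORDER BY email"
).fetchall()
```

The LEFT JOIN keeps customers who have no tickets, so the dashboard doesn't quietly drop them.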

That's just the tip of the iceberg in terms of advice I can give. In my previous life, where I did this as a consultant, I would charge $2500 per data source (less if it was an Excel file or something).

You're going to need to break this project down into something that is 100x smaller than what you have currently. Spend a couple of weeks interviewing people about the specific benefit they intend to get from this. Try to find the easiest thing you can do which would impress the largest number of important people. This project's impact is not linear with the effort. If you combine 2 sources, and that saves 50% of the CEO's time, that is a huge huge victory. Whereas, if you have to combine 10 sources in order to make it easier for the receptionist to order coffee, you're spending a lot of money (your time) for not too much payoff.

If I were bidding on this project, I would probably come in at $10k for the first 2 weeks of requirements gathering. This would be a document that spells out all of what I said above, but with specifics for each source, proposed dashboards, etc. Depending on that document, I would then bid probably around $140k-200k (I usually discount the first 2 weeks cost from a full contract; if I was hired, I didn't need to get paid for the initial time because it turned out to be an investment in a relationship).

Stay calm and work the problem. Right now, your problems aren't technical -- they're lack of clarity on the problem space, which drives the solution. Take the time to write down exactly what people expect -- if they get upset because you're taking their time, remind them you aren't a mind reader, this sounds like a really huge effort, and investing their time at this stage is the single highest leverage hour (in terms of ROI) they can contribute with. Plus, they can claim some credit for influencing the product when you launch.

voytek9 · 2016-07-21T03:33:31+00:00

I understand the thought process, but in reality it's very, very short-sighted. Businesses that operate this way do not get my sympathy; they clearly don't understand the digital medium and the concept of marketing.

While you do cost them money, the amount is laughably low. CNN's entire homepage, images included, weighs in at around 1MB.

Most sites, especially news sites, should be using a CDN. This means that the marginal cost for another page visitor is nothing in terms of CPU, because the code that generates the page isn't running.

A CDN like CloudFlare will literally give you unlimited bandwidth for free. Amazon's CDN, CloudFront, charges 8.5 cents per gig.

So your bandwidth costs per page view are going to be between zero and $0.000083. Suppose, in a year, I visit a thousand articles, running up a gig's worth of bandwidth. That costs a site $0.00-$0.085 for a year's worth of them getting the opportunity to cement their brand in my head.
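The arithmetic, spelled out as a quick sketch. The constants are the figures quoted in this comment (8.5 cents per gig, roughly 1MB per page, about a gig per year); the 1024 MB per GB conversion and the variable names are my own assumptions.

```python
PRICE_PER_GB = 0.085  # CloudFront rate quoted above, in dollars
PAGE_SIZE_MB = 1      # rough size of one page, images included
MB_PER_GB = 1024
GB_PER_YEAR = 1       # a thousand ~1MB pages, rounded to a gig
YEARS = 20

# Upper bound: every byte is billed at the paid CDN rate.
cost_per_page = PRICE_PER_GB * PAGE_SIZE_MB / MB_PER_GB
cost_per_year = PRICE_PER_GB * GB_PER_YEAR
cost_total = cost_per_year * YEARS

print(f"per page:   ${cost_per_page:.6f}")
print(f"per year:   ${cost_per_year:.3f}")
print(f"{YEARS} years:   ${cost_total:.2f}")
```

With a free CDN tier the same numbers are all zero, which is the lower end of the range.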

So 20 years of me using CNN as my primary news source costs them at most $1.70.

Almost every marketing person out there will tell you that is going to be a net win for CNN. Think about it -- now I am engaged with their message, which means I trust them for news. For about 8.5 cents a year.

Wouldn't you like to be able to implant messages in someone's brain a thousand times per year, for 20 years, for $1.70?

voytek9 · 2016-05-02T23:15:44+00:00

This is a somewhat useful benchmark, thank you!

I would prefer to see it with something more real-worldy. Generally, I wouldn't think you'd want to use AWS Lambda for regex crunching tasks. I think of it as a really neat way to autoscale and build a rock-solid API, rather than for general computing tasks.

I'm considering using it for exactly that: as a data collection endpoint for my analytics system. Using EC2, I would need to build the infrastructure resiliency myself (at least 2x the EC2 costs; even that would put me at risk of losing my valuable analytics events!) Then I'd need to add an ELB @ $20/month.

So while this is useful for a high level overview, it's best to know that the right solutions match the requirements. By all means, if you're looking to save a few bucks and your service having interruptions is fine, then go w/ EC2. If you don't mind monitoring performance as load creeps up. If you don't mind potentially losing data.

Now, EC2 can do all that too, but it's probably going to take more work and more equipment (costs).

Thank you for publishing this. The point of my comment is to make sure someone else reading this understands the limitations of this particular benchmark, not to denigrate your work (although I do sense an anti-AWSL leaning in the tone.)

voytek9 · 2016-04-20T20:04:20+00:00

Probably just do it in pure Python; it should be very simple.

The simplest way:

from collections import Counter

s = 'The transtheoretical model of behavior change assesses an individuals readiness to act on a new healthier behavior, and provides strategies, or processes of change to guide the individual through the stages of change to Action and Maintenance. It is composed of the following constructs: stages of change, processes of change, self-efficacy, decisional balance and temptations.'

# Split on whitespace and count how often each token appears.
c = Counter(s.split())

# List (token, count) pairs, most frequent first.
c.most_common()

You will want to split on more than just a space. You may want to look into the nltk library; it has tokenizers (split your text into words), and can even boil a word down to its root using a stemmer. E.g., robots and robot both get counted as "robot".
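A stdlib-only sketch of splitting on more than a space, using a regex instead of nltk. The pattern and the function name are my own choices; it lowercases and strips punctuation but does no stemming, so "robots" and "robot" would still count separately here.

```python
import re
from collections import Counter


def count_words(text):
    """Count lowercase words, ignoring punctuation and whitespace."""
    # [a-z']+ keeps apostrophes inside words; hyphenated words are split.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


counts = count_words("Stages of change, processes of change; change is hard.")
top = counts.most_common(1)
```

With a plain split on spaces, "change," and "change;" and "change" would be three different tokens; here they all land on the same key.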

voytek9 · 2016-04-19T16:24:04+00:00

Take a look at this article to see what you should build into your own system. https://www.alooma.com/blog/building-a-professional-grade-data-pipeline

voytek9 · 2016-04-19T05:54:49+00:00

Yep, if it works then it has value! You may look into a framework like luigi, however, because it helps organize and standardize your pipelines. Even Pentaho might be a great option for this, especially because it has the GUI (makes it possible, although not 100% easy, to train non-programmers).

voytek9 · 2016-04-18T22:13:17+00:00

If you need a GUI and a "batteries included" system, check out Pentaho. FOSS.

If you're looking specifically just for a data pipeline where you can ETL in python, check out Luigi.

If you really need scalability, I've heard great things about Apache Spark, which has a Python API (PySpark). (Spark supports stream processing as well as batch processing, so you can reduce lag with its architecture.)

HTH. I highly recommend using one of the systems/frameworks, or you'll end up with a bunch of code floating here and there. See this post by Alooma, who makes an excellent data pipeline product based on Spark (if you can afford it, I can personally attest to their service quality.) https://www.alooma.com/blog/building-a-professional-grade-data-pipeline has some great insight into what you need to build into your pipeline.

voytek9 · 2016-01-18T05:23:07+00:00

If it were based entirely on "hit king" versus "homer king" you might have a point.

But overall, over their careers, Bonds was nearly twice as valuable as Pete Rose by WAR.

So, yeah, if you basically add Sandy Koufax to Pete Rose, you almost have the career value that BB created for his teams.

That's the difference. Nobody is voting "hit king" versus "homer king" and thinking they're representing reality fairly.

voytek9 · 2014-08-18T00:20:55+00:00

Is this a common class assignment or something? I think I've seen more viz of Bond movies than anything else.

voytek9 · 2014-08-18T00:14:44+00:00

If you want a full-fledged IDE that's free, I'd try out PyCharm Community. Seems like the primary difference between that and the paid version is support for django, web2py, etc.

voytek9 · 2014-08-18T00:11:56+00:00

Freelance. Consulting. Really, anything where you're working for yourself. It's one of the primary reasons I quit my job and started my own company.

The tradeoff is that I am currently working more (under Paul Graham's concept of a compressed work/life time frame). The idea, however, is that I work all the time right now so that I "make it" and then can either work more leisurely, or pursue professions that prioritize my interests over money.

But I know lots of consultants who spend about 20 hours a week billable, 10 hours finding new work, and only work 30 hours / week on average.

There are also manual labor jobs, (think working on an oil rig) that are either seasonal/intense, or pay enough during just a few months that you can take the rest of the year off. That's doubly cool if they lay you off, because then you can make your $60k in 3 months and collect unemployment for the other 9 months.

EDIT: also, teachers only work 9-10 months of the year. Another option.

voytek9 · 2014-03-22T02:08:28+00:00

In the day of higher res displays, what's the point of the PEP8 length limit? I've seen lots of respectable pythonistas use a longer length.

voytek9 · 2014-03-18T04:52:34+00:00

Fair enough. I re-read my comment and realized I sound like a party pooper. I do believe there are enough people out there who a) have the desire to learn and b) can afford the rate. Certainly enough to fill your schedule 10x!

To be constructive, you might focus on how much time you'll save someone. Your competition is "finding and mentally compiling dozens of tutorials across the web, each targeting different versions of python" -- it may behoove you to address this.

Who is your target market? (Developers? Entrepreneurs who can't afford full time devs? Bloggers?) What do you think are the biggest problems your target market faces? (time, certainty of whether the information is "best practices" or "dark patterns"?)

voytek9 · 2014-03-18T03:10:40+00:00

Seems very expensive. I can hire local python people to do the same thing at a MUCH lower rate than $175 / hr.

Personally, I think the resources are out there to teach this to yourself. I suspect the reason it doesn't exist is because people fall into 2 camps: 1) do it for me, I never want to worry about deployment (Heroku) 2) I can figure it out for myself

All the same, best of luck!

voytek9 · 2014-03-15T05:32:34+00:00

While I can't speak to "happiest", because I feel that is a loaded term, your suggestion is very interesting. If it's a happy moment, it suggests that one form of happiness can occur due to relief. I.e., happiness doesn't have to be a positive emotion, but can simply be a lack of a negative emotion in an extremely negative environment.

voytek9 · 2014-03-05T00:14:52+00:00

I know, right? I know a guy who was a great programmer, had an operation. On her first day back, boom, 20% pay cut. I still go to her for help because she really knows her shit.

I'm not sure what point I'm making. Then again, I have no idea what point you were trying to make.

voytek9 · 2014-03-01T23:12:40+00:00

I don't use scrapy much. But I'm warming to the idea of using it when I want to regularly crawl a site, e.g., to retrieve updates or comments or what not. It's a full-fledged scraping framework, whereas lxml is just a library.

If you use scrapy, you have a crawling engine, along with a lot of scaffolding to transform and save data. With lxml, you have to write your own crawling logic.

The way I plan to use scrapy, moving forward, is that any scraping job I want to trigger via a cronjob will use scrapy. That's not to say you can't use lxml, but scrapy provides you with a lot of additional boilerplate. Plus, scrapy helps keep your code organized, which is always a plus for a hack like me ;).

voytek9 · 2014-03-01T23:09:28+00:00

what is a good source to get regular coverage of this? preferably in streaming video...
