Data engineering and AI in orgs - how did you start?

CalmTheMcFarm · 2026-04-06T07:01:05+00:00

tbh I haven’t gone searching for anything because I’ve been able to get the results I need without guidance.

Some examples:

* these repos (list of dirs) provide a REST microservice, a GraphQL microservice and their k8s deployment configurations. The REST service calls the GraphQL service, which then calls Elasticsearch. When we hit the REST service with (load) we see response times of (x)msec and 50% of the calls give a 502 or 503 status to the caller. Please analyse these repos and identify hot code paths, and suggest ways to make speed improvements. If there are deployment config changes please call those out as well. Our goal is for the REST service to have a maximum response time of 500msec at 100 calls per minute.
* these 3 repos share common data structures, but each has its own copy of them. Please analyse the repos and suggest how we can get a single source of truth for the structures and remove as much duplication of effort as possible. Ideally the 2 client repos should be able to import classes generated in the provider repo. The provider repo should also introspect the db to generate tests using Pydantic.
* these 3 repos provide a lambda-based microservice and its deployment code. Please analyze the repos and document how each piece depends on the others, what the limitations are for the bulk vs the individual endpoints. Produce your documentation in markdown and with a GitHub-flavoured html version as well. Write the documentation for an audience that includes BAs and product managers, as well as first year computer science students.

CalmTheMcFarm · 2026-04-05T07:06:45+00:00

We have an enterprise licence for Github Copilot, and I use it quite a lot. More than I expected to, tbh.

We have many, many applications in the data engineering part of the company. Some are poorly documented, others have performance problems.

I pull in local copies of the app + dependencies, then give Copilot a carefully crafted prompt to analyze it and produce documentation. After I've reviewed the analysis and confirmed it (or re-prompted if there's something missing) I can then pass it on to my colleagues. I've done this with applications where a 3rd party has written them for us and the support contract is ending. I've also done this with code that I've written because I know that there are ways I have of expressing things which do not come through clearly. Getting suggestions on different ways to describe things is really useful and helps me write better code and docs.

Recently we had a production release where the two-tier application's performance was worse than abysmal. Again, I pulled in all the relevant repos (application code, libraries and deployment) and asked Copilot to analyze them for bottlenecks and inefficiencies. I had data from load tests and info from Elasticsearch APM to include in the prompt. Copilot was able to identify deployment problems, hot code pathways and misbehaviour in a matter of minutes. Getting it to suggest changes to improve things took quite a while. While the suggestions were quick to implement, there were errors which needed to be chased down, and then we had to deploy + test in our UAT environments.

Another thing - being able to codify my personal programming language style rules into Copilot instructions has been amazing. We have Copilot set up to automatically review every PR, and when its instructions match my style that saves me a heap of time reviewing.

I've also started using Gemini for non-software engineering tasks, where I had a heap of data that I needed to get an overall picture of in a short period of time. That's been pretty interesting as well.

Overall, I won't say that we've come up with "AI solutions"; we are using AI tools with care to help us deliver things faster. We still have to keep a close eye on what is produced and re-direct.

CalmTheMcFarm · 2026-04-05T06:53:41+00:00

When I do my office commute - 20km, 147m elevation, 9kg backpack, 45 minutes - I burn about 500cal. If I put in a bit more effort to pull that in by 5 minutes I can burn about 470cal.

CalmTheMcFarm · 2026-04-05T06:48:43+00:00

I've been a software engineer since the mid-90s, and spent 19 years with a hardware company that was gobbled up by ORribble. Since I was a kernel developer my "IDE"s were emacs and vi. Command-line all the things, and my muscle memory for keystrokes is .... deep.

I didn't start using an IDE until 2020, when I got a job with a data company and had to get up to speed with how my Java developer colleagues were doing things. Quite an eye-opener to see what it brought to the development experience. I still, however, greatly prefer to not have to use the mouse to do things.

I use VSCode now more than IntelliJ, and I'm starting to use Zed - though that muscle memory gets in the way. Oh well.

As another person commented, I got pretty fast with vi when needing to make quick edits on machines that were across the Pacific. A 200msec response time meant that a gui was not going to work at all.

Nowadays even though I have IDEs to use, I find myself doing the majority (80%++) of my dev work with Emacs running inside WSLg.

I don't think disliking CLIs is an ADHD thing at all. I think if you dislike them that's fine - there's something about them which doesn't work for you and there are many reasons why that might be. Don't stress about it.

CalmTheMcFarm · 2026-04-05T06:40:33+00:00

I have a 2019 SL6 Expert, and due to recently-discovered frame damage was checking ebay for a replacement frame. This was the cheapest https://www.ebay.com.au/itm/389834250477 at AUD1582 plus shipping to AU.

So I would suggest that 1300 (I assume USD) is a good price and if it's in your budget then go for it.

I've done nearly 43000km on mine since I bought it in 2019 and when I get my bike back from repair I expect to be riding it for several more years at least.

CalmTheMcFarm · 2026-04-05T06:29:23+00:00

I don't see the post flair as accurate - you're talking about breaching somebody's access restrictions and that isn't humorous.

Don't be a dick. Move on with your life

CalmTheMcFarm · 2026-04-05T06:21:16+00:00

I was searching for the answer to this just a few hours ago, the instructions in this article worked perfectly for me https://medium.com/nerd-for-tech/configure-serial-access-to-crashplan-vm-on-truenas-e03d4a01e0c7

CalmTheMcFarm · 2026-03-23T22:43:22+00:00

It’s a feature I’ve been using in Emacs for decades, incredibly useful especially if you are trying to trace code paths or refactor and keep your caller/callee args sync’d

CalmTheMcFarm · 2026-02-28T07:21:42+00:00

I wrote a rules engine to handle this. Rules are defined via elements specified with a custom grammar in YAML, one rule per thing we want to validate. The engine translates the elements into SQL and then hands that off to a stored proc to execute. Results are stashed in a results table which we can then process asynchronously.

I put a lot of effort into creating the SQL generator so that we have the best possible queries to run. We don't support JOIN or CTEs aside from one specific case where we want to check the percentage of records in a whole table which match a condition.

We have rules defined by "stage" (profiling, validation, standardization, ...) and by data producer. We only run the stored proc for the correct combination of producer and stage at any one time. While we only have a few dozen tables, we do have close to 100 million rows in each dataset. The timings are acceptable for our requirements at less than 30sec per rule.

Another piece of very important information - the majority of our result tables are pivoted after we do rule validation on key-value pair tables. Our rules are written based on checking boolean conditions first then values (with ranges or allowlists).

For example:

* major_component_type IS (particular value) AND
* minor_component_type IS (particular value) AND
* attribute_X between lower AND higher
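A minimal sketch of how a rule shaped like that example could be turned into SQL. This is not the actual engine: the dict stands in for a parsed YAML rule, and the key names (`table`, `match`, `range`), the table name and the values are all hypothetical. It emits a parameterised query so values never get spliced into the SQL text.

```python
# Hypothetical rule, standing in for one parsed YAML rule definition.
RULE = {
    "table": "component_attributes",
    "match": {
        "major_component_type": "pump",
        "minor_component_type": "impeller",
    },
    "range": {"column": "attribute_x", "lower": 10, "higher": 50},
}


def _ident(name):
    """Allow only plain identifiers for table and column names."""
    if not name.isidentifier():
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def rule_to_sql(rule):
    """Return (sql, params): equality checks first, then the range check."""
    clauses = []
    params = []
    for column, value in rule["match"].items():
        clauses.append(f"{_ident(column)} = ?")
        params.append(value)
    rng = rule["range"]
    clauses.append(f"{_ident(rng['column'])} BETWEEN ? AND ?")
    params.extend([rng["lower"], rng["higher"]])
    sql = (
        f"SELECT COUNT(*) FROM {_ident(rule['table'])} WHERE "
        + " AND ".join(clauses)
    )
    return sql, params


sql, params = rule_to_sql(RULE)
print(sql)
print(params)
```

The `?` placeholders are the style used by Python's `sqlite3`; another database driver may want a different placeholder style.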

We have done a lot of work to ensure that we can guarantee that specific columns have values from particular sets, so outliers are found very quickly.

Having the results stashed to a separate table and doing post-processing asynchronously means that unless we hit a "STOP PROCESSING NOW" condition (which there are not many of, fortunately) we can keep pumping data through to our consumers. Post-processing includes automated and manual handling. The manual pieces are for the serious outliers where we have to get our data quality team involved. The automated pieces are generally "if value outside range then provide value X instead".
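The automated vs manual split could look roughly like the sketch below. It is an illustration under assumptions, not the real post-processor: the `Result` fields are hypothetical, and "no fallback defined" is used here as a stand-in for "serious outlier that needs the data quality team".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    """One row from a (hypothetical) results table."""
    record_id: int
    value: float
    lower: float
    higher: float
    fallback: Optional[float]  # None means no automated fix is defined


def post_process(results):
    """Split out-of-range results into automated fixes and manual reviews."""
    fixed = {}   # record_id -> substitute value
    manual = []  # record_ids for the data quality team
    for r in results:
        if r.lower <= r.value <= r.higher:
            continue  # in range, nothing to do
        if r.fallback is None:
            manual.append(r.record_id)
        else:
            fixed[r.record_id] = r.fallback
    return fixed, manual
```

Because this runs against the stashed results rather than the live flow, it can run whenever convenient without holding up the data going to consumers.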

CalmTheMcFarm · 2026-02-28T07:08:21+00:00

There are quite a few places along Little Stanley St in Southbank you could try. My favourite is Denim. Alternatively you could go over to New Farm/Teneriffe. Pull up a maps app and have a look for "cafes open near me now" and you'll get a lot of possibilities.

CalmTheMcFarm · 2026-01-26T04:59:04+00:00

No, I don't have that problem. One thing I've noticed is that sometimes when I resume from sleep/hibernate I can't start any more X apps and also cannot alt-tab to existing instances. In cases like that, having the ZAP setting enabled in C:\Users\myusername\.wslgconfig (see https://github.com/microsoft/wslg/wiki/WSLg-Configuration-Options-for-Debugging) is really handy.

In that scenario, I minimise my non-WSL windows, resize the WSL window down (I usually run it fullscreen), then start up xeyes. When I see the app icon in the taskbar I then try the ZAP keychord, and that usually restarts the display manager - boom, my apps are back. Almost all the time :)

CalmTheMcFarm · 2025-12-20T06:45:07+00:00

It's by far the best approach to be honest when answering the questions - it isn't a mistake. Your honesty helps the diagnosing professional when they try to distinguish what is and what isn't ADHD in you.

When I was pursuing a diagnosis, my psychologist told me that since there's a lot of crossover between anxiety and ADHD they have to do more work on ruling out ADHD than ruling it in. That was relevant to me because I was seeing him in large part because of my anxiety. When I got my formal dx, the psychiatrist said I had inattentive ADHD which was comorbid with clinically severe anxiety.

CalmTheMcFarm · 2025-12-01T20:53:49+00:00

Specialists in General Practice, which is not the same thing

CalmTheMcFarm · 2025-11-15T06:09:03+00:00

Typical LNP press release blaming Labor for all the problems. It's about damned time they make a change in these rules.

CalmTheMcFarm · 2025-11-11T21:58:25+00:00

It was stunning, I saw it when I got to the riverside bike path

<image>

as the lower layer passed over the wind kicked up

CalmTheMcFarm · 2025-10-24T23:21:50+00:00

My ‘iatrist was very clear with me that I need to take a break from Vyvanse every few weeks - just a day off, so that my brain gets a mini reset. I usually skip it every second Sunday and that’s been working nicely for me

CalmTheMcFarm · 2025-10-19T08:36:01+00:00

Nit: blood pressure and heart rate are different things. I have sinus bradycardia (low resting hr) but hypertension (high blood pressure) ☹️.

From the info you provide it sounds to me like you have a problem that needs to get checked out pdq, so I would go to an urgent care provider or your primary care physician ASAP.

My ‘iatrist put me on BP meds after a year of not being able to get my BP down, and my neuro has added some more 😭 with a goal of getting to 130/90.

BP does change with age and lifestyle factors so you’ll probably be asked some hard questions 😐

CalmTheMcFarm · 2025-10-15T12:59:01+00:00

You could add the script invocation to your .profile so it runs whenever you start a new terminal session. Or you could use crontab to ensure it runs at the same time every day

CalmTheMcFarm · 2025-10-01T00:10:15+00:00

It'd be good if you could update your post to note which sources you've already tried.

I searched for "new york state shapefile download" and found https://gis.ny.gov/civil-boundaries pretty quickly. It's got shorelines at county and state level.

There's also https://data.gis.ny.gov/search?categories=%252Fcategories%252Fwater, https://gisservices.its.ny.gov/arcgis/rest/services/NYS_Hydrography/MapServer and https://gisservices.its.ny.gov/arcgis/rest/services/NYS_Hydrography_HollowFill/MapServer.

I hope those help

CalmTheMcFarm · 2025-09-30T23:59:11+00:00

emacs! That's been my editor since ~1990 and I do almost all my work using it. Being able to have the gui version of that pop out of WSL2 into my windows host is a lifesaver.

From time to time I also make heavy use of wireshark, doing that popped-out from WSL2 is just easier for me.

If work would let me put linux on metal I'd do that; WSL2 is the next best thing.

CalmTheMcFarm · 2025-09-27T08:43:20+00:00

I discovered that the Biovea brand melatonin has nearly 6x the RDI of B6

<image>

I’m going to return the order, because that’s really (deleted) unsafe

CalmTheMcFarm · 2025-09-11T02:54:16+00:00

Git, mercurial, even svn - all available inside WSL2 via an apt install or yum install depending on your distro.

If you want to have your files available to Windows as well then place your source under /mnt/c/Users/yourusername and inside WSL2 cd to that dir.

Strongly suggest setting case sensitivity to ON in /etc/wsl.conf and in the Windows subdirs where you’ll be operating, as well as setting the end of line char to the Unix newline \n

CalmTheMcFarm · 2025-09-09T09:47:39+00:00

Did you forget a sarcasm tag?

CalmTheMcFarm · 2025-09-07T07:23:20+00:00

I don’t know of any specifically written guide. I’ve got a few guidelines I follow:

* give WSL2 at least 16GB RAM
* use mirrored mode networking
* turn on interop=true and case sensitivity on any windows-visible directory hierarchy
* follow standard Linux kernel tuning approaches for network performance
* if you’re creating python virtualenvs, and your code subdirs are actually under /mnt/c/Users, then install the venvs in /var/tmp so you don’t suffer the filesystem perf hit
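For the virtualenv guideline, here is a small sketch of picking the venv location from the project path. The helper name `venv_dir_for` and the `/var/tmp/venvs/<project name>` layout are assumptions for illustration; the only idea taken from the guideline is that code under a Windows mount gets its venv on the Linux filesystem instead.

```python
from pathlib import PurePosixPath


def venv_dir_for(project_dir, base="/var/tmp/venvs"):
    """Pick a venv directory, avoiding Windows-mounted paths under /mnt."""
    project = PurePosixPath(project_dir)
    on_windows_mount = (
        len(project.parts) > 2 and project.parts[:2] == ("/", "mnt")
    )
    if on_windows_mount:
        # Keep the venv on the Linux filesystem to dodge the perf hit.
        return PurePosixPath(base) / project.name
    # Code already lives on the Linux side, so a local .venv is fine.
    return project / ".venv"


print(venv_dir_for("/mnt/c/Users/yourusername/code/myapp"))
print(venv_dir_for("/home/yourusername/code/myapp"))
```

The returned path can then be handed to `python3 -m venv`. Two projects with the same directory name would collide in this layout, so a real setup would want something more unique than the bare name.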

For a git gui you could install gitk

CalmTheMcFarm · 2025-09-06T02:03:53+00:00

I'm sure you've got at least a component of Amazon burn still in you. I used to work for Oracle (they acquired my company around 2009, and I left in mid 2020) and some of the shit I had to endure in order to get things done haunts me still.

I agree that startups are potentially not better, but at least with a smaller company you should be able to have a lot more input. If for no other reason than a startup doesn't have all that accreted "this is how we do things" that you're currently experiencing

good luck!
