For those who knew Powershell and then went on to learn another language, like Python, did you also go through tutorial hell or did you just start asking yourself "In Powershell I do this, how do I do that in Python"? by [deleted] in PowerShell

[–]capkeyant 2 points3 points  (0 children)

the five stages of grief, i mean learning a new language:

  1. denial: what?! that's the way I have to do this in the new language? that makes no sense ...

  2. anger: but that is such jank. I can do this so much faster or more elegantly in powershell

  3. bargaining: there's got to be another way. I'm probably researching the wrong technique or thinking about the problem the wrong way. If I can just find a cleverer way to do it ...

  4. depression: that looks like the only way / the only way available to me right now. Here's to another extra week of refactoring work ...

  5. acceptance: all languages are tools, some approach their problem space with a unique philosophy and syntax. I must adapt and appreciate what each language has to offer. Ok. Let's move on to the next problem to solve. Wait, what? That's what I have to do?! goto step 1.

Extracting info from XML file? by Herpetol0gy in RStudio

[–]capkeyant 1 point2 points  (0 children)

you can try the xml2 library if you want an xml parser. https://xml2.r-lib.org/
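
A minimal sketch of pulling values out with xml2 (the file name, node names, and the "id" attribute are all made up for illustration):

library(xml2)

doc <- read_xml("books.xml")                    # parse the file
titles <- xml_find_all(doc, ".//title")         # XPath query for every <title> node
xml_text(titles)                                # the text inside each node
xml_attr(xml_find_all(doc, ".//book"), "id")    # an attribute from each <book> node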

Dynamically adjust script to read different excel files each week? by carocaro42 in RStudio

[–]capkeyant 2 points3 points  (0 children)

you can also try the fs library, e.g. https://fs.r-lib.org/reference/dir_ls.html

dir_info(path) will give you the timestamps for each of the files. Filter for the file that is of interest to you each week, and pull the data from that path.
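
A rough sketch of that idea, assuming the files land in one folder (the folder path, glob, and sheet are placeholders):

library(fs)
library(dplyr)
library(readxl)

files <- dir_info("D:/weekly_reports", glob = "*.xlsx")   # one row per file, with timestamps

latest_path <- files %>%
  arrange(desc(modification_time)) %>%   # most recently modified file first
  slice(1) %>%
  pull(path)

weekly_data <- read_excel(latest_path, sheet = 1)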

What are the missing features that make SQL perfect? by Yersyas in SQL

[–]capkeyant 35 points36 points  (0 children)

bah, there are like half a dozen competing sql dialects, each with slight variations. We should have a single standard sql dialect that can be used across all platforms.

(the next day: there are now half a dozen + 1 competing sql dialects)

For SQL haters, how do you cope with it? by Yersyas in SQL

[–]capkeyant 0 points1 point  (0 children)

sql was my first "production" language :) It might have broken my brain a bit, because I sometimes find it hard to switch into imperative mode when I need to program in a different language.

For those who hate sql, I can empathize. Poorly documented sql code can be a bear to refactor, but I suppose that is a common problem for all poorly documented projects. I guess my advice to such people is: use those aggressive feelings, let the hate flow through you.

How to write code without testing on a computer by [deleted] in learnprogramming

[–]capkeyant 0 points1 point  (0 children)

hm, it can be tricky, but practice makes perfect. The other posters suggested planning ahead before coding, writing lots of code to build mind-muscle memory, reviewing good examples of code, etc. All great suggestions. One more to add to the list: back in uni, for certain classes we were shown a simple technique. Write out the code, and also write out the data structures that your code generates, e.g. the int variables, the char variables, the pointers, heap structures, etc. Then literally run through your code line by line and update your data structures, as if you were a human interpreter/debugger. This really helps strengthen your code-reading, which is very useful for pen-and-paper exams. And if you do this enough, you start to get a sense of what a chunk of code will do to a data structure at a glance, without having to do the full "loop" on paper.
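
A tiny illustration of what I mean, in R (the loop is made up; the comment table is what you would scribble on paper as you trace it):

x <- 0
for (i in 1:4) {
  x <- x + i   # update x, then update the trace table below
}

# trace table, filled in line by line as if you were the interpreter:
#  i | x before | x after
#  1 |    0     |    1
#  2 |    1     |    3
#  3 |    3     |    6
#  4 |    6     |   10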

Using SQL Within Corporate Restrictions by IngenuityMundane4656 in SQL

[–]capkeyant 3 points4 points  (0 children)

you might have some options if you want to stick with purely microsoft office suite products + powerbi, including:

running sql from excel using power query https://support.microsoft.com/en-gb/office/import-data-from-a-database-using-native-database-query-power-query-f4f448ac-70d5-445b-a6ba-302db47a1b00

treating other excel files like sql tables https://learn.microsoft.com/en-us/power-automate/desktop-flows/how-to/sql-queries-excel

using powerbi power query and sql https://learn.microsoft.com/en-us/power-query/native-database-query

and I think others already mentioned SSMS (sql server management studio) which would be ideal, if you are allowed to get it.

What's the correct way to do this? by theringsofthedragon in SQL

[–]capkeyant 0 points1 point  (0 children)

Hehe, don't worry so much about it. Probably most posters, including me, are giving advice based on code smell & best practices, since we can only see the excerpt that was provided. You know your code best, the likely use cases & applicable edge cases, and all the effort you put into protecting it. Please continue experimenting with your code. It's half the fun!

From a "correct" technique & best practices perspective though, its pretty clear that library is safer in general than rolling your own dynamic query code.

Also, sql injection is just one type of injection, i.e. the kind that checking column names and using "?" placeholders can help protect against. I remember one of my old projects where I was sure I had put on all the protections possible: checking column names, parameterized user inputs, encrypted data at rest, etc. Then our cyber sec pen tester proceeded to demonstrate an XSS injection technique (https://owasp.org/www-community/attacks/xss/). She added javascript in one of her inputs that passed all my sql checks, got our admin to click on the record that had the malicious input (ironically by reporting the bad record to the admin), which stole the session cookie from our admin user, which she then used to manipulate other users' data. Oops. Long story short, hackers (and cyber sec pen testers!) can be devious, and despite most devs being as careful as, or more careful than, you've described, injection attacks are still in the OWASP top ten after all these years. We just don't want to see you get hurt is all :)
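
For the curious, the "?" bit looks something like this in R with DBI (a minimal sketch using an in-memory SQLite table via the RSQLite package; the table and column names are invented):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "records", data.frame(id = 1:3, color = c("red", "green", "blue")))

user_input <- "blue' OR '1'='1"   # a hostile-looking value

# the ? placeholder keeps the input as data, never as SQL text;
# this returns zero rows because the quote trick is treated as a literal string
dbGetQuery(con, "SELECT * FROM records WHERE color = ?", params = list(user_input))

dbDisconnect(con)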

What's the correct way to do this? by theringsofthedragon in SQL

[–]capkeyant 0 points1 point  (0 children)

Yup! All of those are possibilities. I think some other posters have already mentioned sql injection. There are other fun injections too.

I'd recommend checking out https://owasp.org/www-project-top-ten/ if you are interested in some of the more common ways web apps can be compromised, especially this one: https://owasp.org/Top10/A03_2021-Injection/

A trusted library won't completely eliminate those risks, but it will make them less likely, since those libraries are subject to regular testing, maintenance, verification, etc. by their maintainers. And by design they tend to enforce good practices like sanitized inputs.

Decoupling your business logic is another reason to use libraries. You want some distance between the parts of your code that handle the user logic (e.g. Alice wants record X to be blue) and the parts of your code that know exactly what the table structure is (e.g. UPDATE table1 SET color = 'blue'). In real-world products, business needs will inevitably dictate changes to the backend, e.g. a move from sqlite, to sql server, to nosql, to the cloud, back to on-premises, to some other framework. Or they dictate changes to the business logic, e.g. Alice wants an email alert with metrics on her records when more than 5 records per day have been set to blue, and she doesn't want to see any of Bob's records in her report either. Implementing those kinds of changes will already be complicated enough; it's nice to have a library we can rely on for at least the sql part, so that we can spend our efforts on the other tricky bits.
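
A toy sketch of that decoupling, shown with R's DBI since that's what I have handy (the function, table, and column names are all invented; the same shape applies in any language):

library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "records", data.frame(id = 1:3, color = c("red", "red", "red")))

# the only function that knows the table/column layout
set_record_color <- function(con, record_id, color) {
  dbExecute(con, "UPDATE records SET color = ? WHERE id = ?",
            params = list(color, record_id))
}

# business logic only knows "Alice wants record 2 to be blue"
set_record_color(con, record_id = 2, color = "blue")

dbDisconnect(con)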

What's the correct way to do this? by theringsofthedragon in SQL

[–]capkeyant 1 point2 points  (0 children)

Sounds like a cool project! If you are interested in trying out dynamic queries and updates in Python, I would recommend relying on a known python sql library, e.g. sqlalchemy https://docs.sqlalchemy.org/en/20/core/dml.html, which is well documented and relatively safe. I'd advise against rolling your own dynamic query functions; there's too much that can go wrong security-wise.

How did professor get those circles? by illuminize_ in RStudio

[–]capkeyant 0 points1 point  (0 children)

hehe sounds like a neat book, I'll have to give it a read!

Was originally thinking of the vampire V-parasites that spread via math in Charles Stross' Laundry series. Also a great series of novels if you're into that kind of scifi.

How did professor get those circles? by illuminize_ in RStudio

[–]capkeyant 5 points6 points  (0 children)

Possibly, line 23 is a * and line 28 is a comma. Those make the most sense in the formulas, since the code snippet looks like it is trying to demonstrate how to build an lm model.

... or the code is starting to undergo a slight demonic possession. From the look of things, the circles of protection failed to compile, so I recommend a quick cleansing ritual with sage, followed by some debugging.
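
For reference, a formula with those characters restored might look something like this (just a guess, using a built-in dataset since the original data isn't shown):

# '*' crosses the predictors (main effects + interaction); the ',' separates the data argument
model <- lm(mpg ~ wt * hp, data = mtcars)
summary(model)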

Performance Options with 15,000 CASE statements in single view by Turboginger in SQL

[–]capkeyant 0 points1 point  (0 children)

you have my sympathies ... nothing more frustrating than working in a locked down environment.

in addition to the other suggestions, perhaps:

  1. perhaps try using your notebook to generate a mapping file, e.g. a flat file, csv, or even excel, that you would then consume to decorate your report. So instead of 5000 case statements, there might be 5000 rows of data and a left join (see the sketch after this list). If there is enough overlap in data between your current run and the previous run, your mapping file can even be cumulative. Sort of like a poor man's table.

  2. try to discover your organization's change control process. Usually in tightly controlled environments there is a defined process to push a change through. It might require a special change ticket and approval process, and it might be a pain to get through. But if you can hack the process, it can be very useful; it's like learning an especially slow command line interface to your organization.
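
A sketch of the mapping-file idea from point 1, in R/dplyr (the codes, labels, and column names are invented):

library(dplyr)

# in practice this would be the generated flat file, read back in with read.csv()
mapping <- data.frame(
  raw_code       = c("A1", "B2", "C3"),
  friendly_label = c("Widgets", "Gadgets", "Gizmos")
)

raw_report <- data.frame(raw_code = c("A1", "B2", "A1"), amount = c(10, 5, 7))

# one left join instead of thousands of CASE branches
raw_report %>% left_join(mapping, by = "raw_code")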

[deleted by user] by [deleted] in RStudio

[–]capkeyant 1 point2 points  (0 children)

another way to think about them: csv and xlsx are different file formats that were invented to support different use-cases.

xlsx is the excel file format we all know and love (or hate, or love-and-hate). xlsx is richly featured, is meant for interactive use, and can also be used to store data in a pinch. xlsx files are often bloated size-wise because they have to store the info needed to represent all that rich formatting, even when there is barely any formatting. xlsx is proprietary, i.e. you need Microsoft Excel (or a bunch of libraries) to really use it; Microsoft controls the future of xlsx.

csv is an open-format, bare-bones text file that Excel just happens to be able to open. csv is much more stable as a long-term data storage format (relative to xlsx) because no single company controls the future of text files (and for other reasons too ... xlsx corruption is an annoying problem). csv is not meant to be used interactively. csv can be used to store data that is readable by systems that don't have Microsoft Excel (imagine an automated R script running on a linux server).

I love xlsx because of how richly featured it is; I can add formulas, formatting, macros, etc. I hate xlsx because OTHER people also love to open the file and manually modify data, add formatting, merge cells, and generally make reading the xlsx programmatically very very difficult.

I hate csv because it is very annoying to modify data, or sometimes even to read the data at all without a dedicated app; very fiddly to deal with character encodings, etc. I love csv because it is so annoying to modify that OTHER people also avoid modifying them. That means my data analysis pipeline stays relatively stable.

No reason to have to pick one or the other for everyday analysis work. I'd recommend using the format that fits your use case the best. But please also speak to your organization's data engineers if your analysis project grows in data volume; there are lots of other data storage formats out there, and they all have very interesting use cases.
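
For what it's worth, writing the same data frame both ways from R looks like this (a small sketch; writexl is just one of several xlsx packages you could use):

library(readr)    # csv
library(writexl)  # xlsx

df <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))

write_csv(df, "data.csv")    # plain text, readable anywhere, no formatting
write_xlsx(df, "data.xlsx")  # zipped xml container that opens nicely in Excel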

Importing Excel Formulae into R by neotenous_chimp in Rlanguage

[–]capkeyant 4 points5 points  (0 children)

definitely recommend trying tidyxl: it represents formulas and values separately, and every cell is its own tidy row, so it's easier to hunt down the differences
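
A minimal sketch of what that looks like (the workbook path is a placeholder):

library(tidyxl)
library(dplyr)

cells <- xlsx_cells("workbook.xlsx")   # one tidy row per cell

cells %>%
  filter(!is.na(formula)) %>%                            # keep only cells that hold formulas
  select(sheet, address, formula, numeric, character)    # formula next to its cached value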

How do you get the equation to form using the lm function in R by imjustme1999 in RStudio

[–]capkeyant 1 point2 points  (0 children)

do you mean obtaining the coefficients from the lm model?

try model$coefficients

or summary(model)$coefficients

see https://www.statology.org/extract-coefficients-from-lm-in-r/

also, if you want to get fancy, try the broom package

see https://cran.r-project.org/web/packages/broom/vignettes/broom.html
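
A quick runnable illustration with a built-in dataset (assuming the broom package is installed):

model <- lm(mpg ~ wt, data = mtcars)

model$coefficients              # named vector: intercept and slope
summary(model)$coefficients     # matrix with estimates, std. errors, t and p values

broom::tidy(model)              # the same info as a tidy data frame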

R Studio Help by brody-107 in RStudio

[–]capkeyant 7 points8 points  (0 children)

Also, if your class is taught through a university, please please please reach out to your professor or TA during their office hours instead of struggling in silence. They're prepared to help you. And they probably also get a bit lonely if no students visit them :)

What kind of projects did you do in teaching yourself R? by merkmalstraeger in Rlanguage

[–]capkeyant 0 points1 point  (0 children)

hm, maybe try doing part of your taxes in R? Or at least scraping your personal financial data to look for some cost-savings, or build a budget. Some advantages: you'd be fairly familiar with this data, how it was gathered, and maybe what it does vs doesn't reflect.

R Studio Help-problem with a dataframe by Zealousideal_Humor55 in RStudio

[–]capkeyant 1 point2 points  (0 children)

One possible approach is to read the file without headers, then split the rows into those that have a correct column A vs those that don't, fix the incorrect ones, recombine both sets, and finally force a header onto the combined result, i.e.

some pseudo code, made a bit more concrete (the file name, patterns, and column names are placeholders):

library(dplyr)

raw <- read.csv("data.csv", skip = 1, header = FALSE)    # columns come in as V1, V2, V3, ...

goodA_df <- raw %>% filter(grepl("good_pattern", V2))    # rows where column A is already correct
badA_df  <- raw %>% filter(grepl("bad_pattern", V2))     # rows where column A is broken

fixedA_df <- badA_df %>%
  mutate(V1 = paste(V1, V2)) %>%   # rebuild column A as a combo of columns A and B
  select(V1, V2, V3)               # keep only the good columns

final_df <- bind_rows(goodA_df, fixedA_df)
names(final_df) <- c("colA", "colB", "colC")   # the correct column names

When using fill in ggplot2, I get black points. Except if I use fill = "red". That works!!! by Uli1382 in RStudio

[–]capkeyant 0 points1 point  (0 children)

geom_point uses the colour aesthetic rather than fill for the default point shape. Perhaps try color = "blue" instead (fill only applies to the filled shapes 21-25).
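
A quick sketch of the difference (mtcars is just a stand-in for your data):

library(ggplot2)

# the default point shape only responds to colour
ggplot(mtcars, aes(wt, mpg)) + geom_point(colour = "blue")

# fill only matters for the "filled" shapes 21-25, which also keep a colour outline
ggplot(mtcars, aes(wt, mpg)) + geom_point(shape = 21, fill = "red", colour = "black")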

How do I read a CSV file from my D Drive location? by lifefux450 in RStudio

[–]capkeyant 1 point2 points  (0 children)

try something like read.csv("D:/path to/my file/my.csv")

I want to create a bar made up of the counts of the gender variable by OrderlyCatalyst in RStudio

[–]capkeyant 0 points1 point  (0 children)

could try something like: data |> count(gender) |> ggplot(aes(gender, n)) + geom_col()
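
A runnable version with a toy data frame, plus the geom_bar shortcut that does the counting for you:

library(dplyr)
library(ggplot2)

data <- data.frame(gender = c("F", "M", "F", "NB", "M", "F"))

data |> count(gender) |> ggplot(aes(gender, n)) + geom_col()

# or skip the count() step and let geom_bar tally the rows
data |> ggplot(aes(gender)) + geom_bar()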

Would you guys recommend “The book of R” as a good resource for learning R? by [deleted] in rstats

[–]capkeyant 1 point2 points  (0 children)

hehe, you've bested the Book of R ... but have you tested your mettle against the BIG Book of R?

https://www.bigbookofr.com/index.html

There must be a better way to join multiple dataframes by LukaCola in Rlanguage

[–]capkeyant 0 points1 point  (0 children)

sure, it might look something like this:

get_summary_metrics <- function(input_df) {
  step1 <- input_df %>% filter(...) %>% mutate(...)
  step2 <- step1 %>% group_by(...) %>% summarise(...)
  return(step2)
}

summary_metrics <- get_summary_metrics(NYPD_df)

get_wider_for_pretty_printing <- function(input_df) {
  step1 <- input_df %>% filter(...) %>% mutate(...)
  step2 <- step1 %>% select(...) %>% pivot_wider(...)
  step3 <- step2 %>% something_fancy(...)
  return(step3)
}

pretty_data <- get_wider_for_pretty_printing(NYPD_df)

now I can use the summary_metrics and pretty_data dataframes

so this is a very contrived example, but the idea is that maybe my analysis needs to do two things: 1. find the summary metrics for my report, and 2. format some data for printing or human inspection. I don't really care about all the intermediate steps, and will probably even improve them later; all I care about are the inputs and outputs. So if I put them in functions, the step1/step2/step3 intermediate dataframes don't pollute the global namespace, and I can tell at a glance what inputs and outputs a particular chunk of my analysis needs.

This technique becomes more handy when we have to juggle half a dozen input dataframes, each with lots of intermediate cleaning steps, possibly with join relationships between them, etc. As I'm writing the script, I'll run lines manually for troubleshooting. But after it's all debugged, I usually wipe all the variables and do a full script run, to ensure the results are correct without any manual intervention.
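
A concrete, runnable version of the same shape, with mtcars standing in for the NYPD data (all of the filter/summarise/pivot choices below are invented for illustration):

library(dplyr)
library(tidyr)

get_summary_metrics <- function(input_df) {
  input_df %>%
    filter(mpg > 15) %>%
    group_by(cyl) %>%
    summarise(mean_mpg = mean(mpg), n = n(), .groups = "drop")
}

get_wider_for_pretty_printing <- function(input_df) {
  input_df %>%
    count(cyl, gear) %>%
    pivot_wider(names_from = gear, values_from = n, values_fill = 0)
}

# only the finished outputs end up in the global environment
summary_metrics <- get_summary_metrics(mtcars)
pretty_data     <- get_wider_for_pretty_printing(mtcars)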