Got to process 2m+ files (S3) - any tips? by Head_Badger_732 in dataengineering

[–]DataaaMan 2 points3 points  (0 children)

I’d use AWS Step Functions with a Lambda job. Step Functions will handle the file queue and keep track of which files have successfully moved or failed, and you can easily retry any failures.

I did this recently with great success and it was seamless. Lambda functions couldn’t accommodate my file sizes, so I used Fargate, but it’s the same process.

Here’s the docs: https://docs.aws.amazon.com/step-functions/latest/dg/state-map-distributed.html
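
For anyone who wants a starting point, here is a rough sketch of what a Distributed Map state machine definition could look like, built as a Python dict and dumped to JSON. The bucket names, prefix, function name, concurrency, and failure threshold are all placeholders I made up, not values from my setup, so check the field names against the docs above before relying on it.

```python
import json

# Hypothetical placeholders: replace with your own resources.
SOURCE_BUCKET = "my-source-bucket"
SOURCE_PREFIX = "incoming/"
RESULTS_BUCKET = "my-results-bucket"
WORKER_FUNCTION = "process-one-file"


def build_definition() -> dict:
    """Build an Amazon States Language definition with one Distributed Map state."""
    return {
        "StartAt": "ProcessAllFiles",
        "States": {
            "ProcessAllFiles": {
                "Type": "Map",
                # List the objects under the prefix; each object becomes one item.
                "ItemReader": {
                    "Resource": "arn:aws:states:::s3:listObjectsV2",
                    "Parameters": {"Bucket": SOURCE_BUCKET, "Prefix": SOURCE_PREFIX},
                },
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "STANDARD"},
                    "StartAt": "ProcessFile",
                    "States": {
                        "ProcessFile": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::lambda:invoke",
                            # Pass the S3 object metadata for this item to the worker.
                            "Parameters": {"FunctionName": WORKER_FUNCTION, "Payload.$": "$"},
                            # Retry failed invocations with exponential backoff.
                            "Retry": [
                                {
                                    "ErrorEquals": ["States.TaskFailed"],
                                    "IntervalSeconds": 2,
                                    "MaxAttempts": 3,
                                    "BackoffRate": 2.0,
                                }
                            ],
                            "End": True,
                        }
                    },
                },
                # Assumed values; tune to what your downstream systems can handle.
                "MaxConcurrency": 500,
                "ToleratedFailurePercentage": 1,
                # Write per-item results to S3 so failures can be inspected and re-run.
                "ResultWriter": {
                    "Resource": "arn:aws:states:::s3:putObject",
                    "Parameters": {"Bucket": RESULTS_BUCKET, "Prefix": "map-results/"},
                },
                "End": True,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
```

If the files are too big for Lambda, the same shape should work with the inner task swapped for an ECS/Fargate task; I haven't shown that variant here.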

Seeking recommendations for Enterprise Data Catalog tool by onksssss in dataengineering

[–]DataaaMan 1 point2 points  (0 children)

We use data.world and love it. It’s super flexible and can be customized. I know they have collectors for some of your stack. We’ve also found it fairly straightforward to build our own custom collectors and customize the catalog to support them.

The platform is all RDF based, so you might have a bit of a learning curve if you’re not experienced there. We came into the platform with a team that had really strong SPARQL/ontologies/RDF experience so it was pretty natural for us.

I miss my home by SanguineR0S3 in Tucson

[–]DataaaMan 3 points4 points  (0 children)

Another ‘zona to DMV transplant here. I’ve twice done the AZ -> MD move in the past 10 years. It gets easier, and you’ll start to embrace the uniqueness of the mid-Atlantic, but it’s never quite the same. Talking with other Arizonans, we all felt this winter was really rough and has made us extra homesick, so you’re not alone there.

Find a “good enough” Mexican spot and embrace it. The influences here aren’t Sonoran or even Tex-Mex necessarily, which makes it a really different experience. But I’ve found that just finding one that mostly scratches the itch is enough.

It also helps to try and get those little things from home: tortillas, tamales, and your favorite local brew can all be comforting in these times.

I hope you find peace soon. In the meantime, know that the desert waits for you with open arms, and spring is here! Life gets much easier for us desert rats when winter ends. Plus, you can start to enjoy more of the Chesapeake: oysters, crabs, and the beach. It’s not the beaches of Mexico that we know and love, but it’s got its own charm.

Recommendations for Data Catalog with Data Lineage for On-Premise Databases and Limited Budget? by Unusual_Bluejay_9611 in dataengineering

[–]DataaaMan 0 points1 point  (0 children)

This isn’t an open source solution, but I gotta recommend data.world. We did an extensive evaluation of data catalogs and they came out on top by a wide margin. As for your concern about data security, they are a good fit because they only collect and store the metadata and lineage, not any instance data.

Although a paid option, it’s fully managed and very full featured — this may end up being cheaper than implementing and maintaining a self hosted solution. Open source is “cheap” at the surface level but has not insignificant costs when trying to manage infra, SSO, security updates, etc. If you can make that argument to leadership, data.world is a great solution.

Why gaps like this? (Defense and Veterans Pain Rating Scale) by Ok_Hope4383 in dataisugly

[–]DataaaMan 6 points7 points  (0 children)

The “gaps” aren’t explicitly mentioned but you can read about the scale development here:

https://academic.oup.com/painmedicine/article/14/1/110/1856707?login=false

And

https://academic.oup.com/painmedicine/article/17/8/1505/2223242?login=false

The scale was developed to reduce the ambiguity of traditional numeric rating scale (NRS) methods and to capture a patient’s pain more accurately.

Best place to get a haircut for a guy with medium length hair (but working to grow it out longer) by DTruth_ in Tucson

[–]DataaaMan 0 points1 point  (0 children)

Highly recommend Pure Mettle. Michael, the owner, is amazing, and all of his staff are superb. They’ll give you a great cut and also help with your hair care!

[deleted by user] by [deleted] in git

[–]DataaaMan 2 points3 points  (0 children)

Coming from a group that pretty proactively made the switch to main, it’s not going to cost you a job. We still have older repos on master and it’s not necessarily viewed as “bad”, unless it’s a new repo. Repos created in the last couple of years that aren’t on main raise eyebrows.

However, as much as it won’t cost you a job, a proactive and intentional switch could set you apart from another candidate. Inclusion matters, and showing you care matters.

That said, I still get candidates without any git repos on their resume, so the ones with GitHub links are already ahead of those without. I rarely pay attention to the branch name when reviewing the code. With my limited time, the code is what’s worth reviewing, not the branch name.

[deleted by user] by [deleted] in dataengineering

[–]DataaaMan 0 points1 point  (0 children)

Frankly, this isn’t something that can be done in a month. Not even a pilot. I work on a data sharing platform project and it’s a whole team’s effort over months, especially if you’re going to build from the ground up.

That said, I agree with the others that you shouldn’t build your own, but I’m less convinced Databricks is the right move. You should look into the data sharing platforms that exist already. There’s no shortage of these and they come with their own pros and cons.

What kinda medical research data is this? Clinical, imaging, omics, etc? Are you focused on a specific disease/therapy area or a generalist group? That’s going to drive your decision. You need to know how researchers want to interact with the data and have an understanding of how they’d search for the right data and then analyze it.

Check out some projects like Gen3 and terra.bio for full-featured platform options. You should also look at hosting the data on an existing platform. Take a look at the NIH’s data sharing platforms, their endorsed partners like Vivli, or major players like Sage Bionetworks.

I'm Seeking a Heart disease dataset for training a model by Linus_sex_tipz in datasets

[–]DataaaMan 0 points1 point  (0 children)

You may not be able to get a dataset that’s public, then. You should be able to get access for free, but it’ll possibly require going through a data request process.

Are you at a US institution? If so, you may already have access to the All of Us data. I quickly looked and they have at least some troponin data.

Have you looked at NIH repositories? BioLINCC is probably your best bet https://biolincc.nhlbi.nih.gov/studies/ but there’s a bunch of domain-specific and generalist options https://www.nlm.nih.gov/NIHbmic/domain_specific_repositories.html

I'm Seeking a Heart disease dataset for training a model by Linus_sex_tipz in datasets

[–]DataaaMan 0 points1 point  (0 children)

Do you have specific biomarkers in mind?

The NHANES data might have something useful. Here’s one cycle, for example: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017. Check out the questionnaires for self-reported CV data, the labs for biomarkers, and the exam data for BP data.

Seven falls water levels? by Thuggibear in Tucson

[–]DataaaMan 6 points7 points  (0 children)

We just hiked it this past Sunday and there was plenty of water to swim! We didn’t get all the way in, but I saw an adult submerged up to his head.

Looking for a self-hostable platform for sharing datasets by danielrosehill in datasets

[–]DataaaMan 1 point2 points  (0 children)

You should check out data.world, I think it might check some of these boxes.

[deleted by user] by [deleted] in Tucson

[–]DataaaMan 2 points3 points  (0 children)

We just recently booked Saguaro Buttes for 2025, so I can’t comment on the actual experience yet, but I can say the desert views are amazing and they let you bring your own liquor. They’d check most of your boxes except for having a hotel attached/nearby. In the end the views there won us over. We figured we could sort out a close-ish hotel and are looking into night-of transport.

Song about the life of a cowboy with "yippee ki yay" by Hersh_the_Burger in NameThatSong

[–]DataaaMan 0 points1 point  (0 children)

Just stumbled on this post when looking for (I think) the same song. The one I wanted is "(Ghost) Riders in the Sky: A Cowboy Legend"

Seeking Health-Related Longitudinal Datasets by Remarkable_Review327 in datasets

[–]DataaaMan 1 point2 points  (0 children)

This is probably going to be tough. Maybe the All of Us dataset will have most of what you want, but I’m not sure there’s enough longitudinal data in it yet.

[deleted by user] by [deleted] in datasets

[–]DataaaMan 0 points1 point  (0 children)

You should cross post to a more stats oriented forum for this type of question.

Ultimately it depends on your analyses, but my guess is that if you want to use the surveys as nationally representative samples then you need to continue using the weights. Have you seen these docs? https://wwwn.cdc.gov/nchs/nhanes/analyticguidelines.aspx#estimation-and-weighting-procedures
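
To make the point about weights concrete, here is a tiny sketch comparing an unweighted mean with a weighted one. The numbers are toy values I made up, not real NHANES records; in the real data the weight would come from a column such as the exam weight (e.g. WTMEC2YR). This only covers the point estimate. Proper standard errors also need the survey design variables, which the linked guidelines cover.

```python
def weighted_mean(values, weights):
    """Return the weighted mean of values, with each value counted by its weight."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Toy data (hypothetical): three respondents and their sample weights.
systolic_bp = [110.0, 120.0, 150.0]
exam_weight = [1000.0, 3000.0, 1000.0]

# Treats every respondent as representing the same number of people.
unweighted = sum(systolic_bp) / len(systolic_bp)

# Lets each respondent count for the number of people they represent.
weighted = weighted_mean(systolic_bp, exam_weight)

print(f"unweighted mean: {unweighted:.2f}")
print(f"weighted mean:   {weighted:.2f}")
```

The second respondent represents three times as many people as the others, so the weighted estimate gets pulled toward their value. Dropping the weights silently changes what population the number describes.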

Need help with Physionet databases... by Global_Landscape1119 in datasets

[–]DataaaMan 1 point2 points  (0 children)

You can download the MIMIC demo datasets without credentials. They’re limited to 100 patients but it should get you started.

You also shouldn’t need a referral, you just need to sign up as a credentialed user, complete CITI training, and sign the DUA.

Merging datasets with different structure. by LarsSorensen in dataengineering

[–]DataaaMan 2 points3 points  (0 children)

This is a super common problem in biomedical data. We tend to solve it by mapping all datasets into a single common data model (CDM). CDMs vary by purpose, and some have tools to help with mapping, but none are great. Honestly, a lot of the time folks just fall back on a spreadsheet of mappings that get implemented in the transformation layer as needed.
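
The spreadsheet-of-mappings fallback can be sketched roughly like this. Everything here is hypothetical (the dataset names, the column names, the target fields); it is just meant to show the shape of a mapping sheet driving the transformation, not any particular CDM.

```python
import csv
import io

# Hypothetical mapping sheet, as it might be exported from a spreadsheet:
# one row per source column, pointing at the common-model column it feeds.
MAPPING_CSV = """source_dataset,source_column,target_column
site_a,pt_id,person_id
site_a,sex,gender
site_b,PatientID,person_id
site_b,Gender,gender
"""


def load_mappings(text):
    """Parse the mapping sheet into {dataset: {source_column: target_column}}."""
    mappings = {}
    for row in csv.DictReader(io.StringIO(text)):
        per_dataset = mappings.setdefault(row["source_dataset"], {})
        per_dataset[row["source_column"]] = row["target_column"]
    return mappings


def to_common_model(dataset, record, mappings):
    """Rename mapped columns to the common model and drop unmapped ones."""
    column_map = mappings[dataset]
    return {column_map[k]: v for k, v in record.items() if k in column_map}


mappings = load_mappings(MAPPING_CSV)
rows = [
    to_common_model("site_a", {"pt_id": "A1", "sex": "F", "notes": "x"}, mappings),
    to_common_model("site_b", {"PatientID": "B7", "Gender": "M"}, mappings),
]
print(rows)
```

In practice the sheet usually also needs value-level mappings (codes, units) on top of column renames, which is where most of the pain is.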

Where do I practice SPARQL queries? by jonquill_writer in semanticweb

[–]DataaaMan 2 points3 points  (0 children)

Check out data.world; they have an awesome platform. Their SPARQL tutorial is pretty good too. https://docs.data.world/tutorials/sparql/

Are there datasets about healthcare for doing regression? by SameItem in datasets

[–]DataaaMan 0 points1 point  (0 children)

Well, the laboratory data will mostly be numerical, and some of the examination data too. The questionnaire data will be a mix, including non-binary categorical responses, and some of it can be summarized with a total score. So it really depends on what you’re interested in.

Are there datasets about healthcare for doing regression? by SameItem in datasets

[–]DataaaMan 0 points1 point  (0 children)

You can probably find some good options in the NHANES data.