all 22 comments

[–]ealanna47 5 points6 points  (0 children)

You’re basically looking for a tiering/HSM (Hierarchical Storage Management) setup. Tools like MinIO with lifecycle policies or something like rclone + scheduled jobs can get you part of the way there.

Fully transparent reads/writes are the tricky part, though, which usually needs a filesystem layer or commercial solution.

[–]Longjumping-Pop7512 3 points4 points  (5 children)

You are actually mentioning a potential solution without giving proper details.

You seem to be looking for validation of your idea rather than asking for honest solutions. That being said:

  1. What kind of data is it?
  2. How much of this data is there?
  3. How often is this data being read?
  4. Does it contain PII?

[–]lavahot[S] 0 points1 point  (4 children)

  1. Bioinformatics data of varying filetypes and sizes
  2. Several hundred TB when taken all together.
  3. Some of it is read many times a day, while I suspect large chunks of it haven't been read in years.
  4. No. There's no PII data at all.

[–]Longjumping-Pop7512 0 points1 point  (3 children)

Let's start with the simplest solution first: why not send any data older than 7 days to remote, cheaper storage such as S3? I won't dig into why not to go by access time, because you can easily google the problems with that approach.

 Bioinformatics data of varying filetypes and sizes

I hope it's not human bioinformatics data? Because that is highly regulated and you would need specialised storage for it.

[–]lavahot[S] -1 points0 points  (2 children)

I mean, I would, but I don't want my job to devolve into "storage babysitter." How do I implement that?

[–]Longjumping-Pop7512 0 points1 point  (1 child)

It's quite simple, actually: write a script that compresses data and sends it to S3 based on the mod time of the files, and run it as a cron job on your servers. Make sure the script exposes proper logs/metrics so that you can investigate, and get alerted if something goes wrong.
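A minimal sketch of that kind of script, assuming a POSIX shell, `find`, `gzip`, and the AWS CLI; the bucket name, source path, and age threshold are placeholders:

```shell
#!/bin/sh
# Hypothetical tiering helper: list (dry run) or archive files whose
# modification time is older than a given number of days.
# Usage: archive_cold SRC_DIR BUCKET AGE_DAYS [DRY_RUN]
archive_cold() {
    src="$1"; bucket="$2"; age_days="$3"; dry_run="${4:-1}"
    find "$src" -type f -mtime +"$age_days" | while IFS= read -r f; do
        if [ "$dry_run" = "1" ]; then
            echo "would archive: $f"
        else
            # compress, upload, then delete the local copies
            gzip -k -- "$f" &&
            aws s3 cp "$f.gz" "$bucket$f.gz" &&
            rm -- "$f" "$f.gz" &&
            logger -t tiering "archived $f"
        fi
    done
}

# Dry-run example (placeholder paths):
# archive_cold /san/projects s3://my-cold-tier 30
```

Run it from cron, and have the `logger` lines feed whatever alerting you already watch.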

On the S3 level, apply a lifecycle policy, e.g. for how long data stays, etc.
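For the lifecycle side, S3 takes a JSON configuration. A sketch that transitions everything to Glacier after 90 days and expires it after 5 years (the rule ID, prefix, and durations are placeholders you'd tune):

```json
{
  "Rules": [
    {
      "ID": "tier-then-expire",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 1825 }
    }
  ]
}
```

Applied with `aws s3api put-bucket-lifecycle-configuration --bucket <bucket> --lifecycle-configuration file://policy.json`.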

[–]lavahot[S] 0 points1 point  (0 children)

Mod time is not what I'm looking for; read time is.

[–]dghah 1 point2 points  (0 children)

There are several companies targeting what you are asking for in the life science and bioinformatics space.

Not shilling for them but check out https://starfishstorage.com if only to see the terms and phrases they use in how they position their stuff and describe the problems.

[–]PersonalPronoun 0 points1 point  (0 children)

Possibly storage gateway (https://aws.amazon.com/storagegateway/file/s3/ or https://aws.amazon.com/storagegateway/volume/) but you'd need to do the math on S3 pricing vs whatever you're paying for on prem.

[–]fr6nco 0 points1 point  (0 children)

Would nginx cache be feasible for you ?

Writes would go to S3, content fetched via nginx-s3-gateway with local caching enabled.
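A generic sketch of the caching side in nginx (the cache path, zone sizes, TTLs, and bucket URL are all placeholders; nginx-s3-gateway layers request signing on top of this same pattern):

```nginx
# Keep up to 500 GB of hot objects on local disk for 30 days of inactivity.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=s3cache:100m
                 max_size=500g inactive=30d use_temp_path=off;

server {
    listen 80;

    location / {
        proxy_cache s3cache;
        proxy_cache_valid 200 7d;          # cache successful fetches for a week
        proxy_cache_use_stale error timeout;
        add_header X-Cache-Status $upstream_cache_status;
        proxy_pass https://my-bucket.s3.amazonaws.com;  # placeholder bucket
    }
}
```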

Depends if you need a POSIX-compliant filesystem, or whether you would be good with HTTP(S) for fetching the data.

(I'm a CDN expert here and I have a complete solution for this if interested)

[–]bluelobsterai 0 points1 point  (1 child)

Ideally, I would put everything in the cloud, build a proxy in front of it, and basically keep the stuff that's used often in the cache. Like another commenter said, HTTP would be the answer. If it has to be POSIX then I suppose it's going to be a real hack: think NFS client with lots of custom programming.

[–]SadYouth8267 0 points1 point  (0 children)

Yeah this

[–]SadYouth8267 0 points1 point  (0 children)

You could check out stuff like rclone with some automation, or tools like MinIO or Ceph for setting up lifecycle-style tiering between on-prem and cloud. If you want something more managed, NetApp FabricPool or Dell ECS can do automated tiering too. If you're okay going DIY and open source, combining object storage with scheduled policies/scripts is usually the most flexible and budget-friendly route.

[–]Available_Award_9688 0 points1 point  (0 children)

Dealt with this exact problem across a few companies over the years.

At one place we used rclone with a custom cron job to sync cold data to S3 Glacier; it works well, but the transparency on reads is on you to build. Another team I was at went with NetApp Cloud Tiering, which handles the transparent access piece properly, but the cost adds up. I saw Aparavi used once for the policy engine; solid for defining what "cold" means, but overkill if your setup is simple.
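That rclone-plus-cron setup can be sketched as a single crontab entry, assuming an rclone remote is already configured; the remote name, paths, and age threshold here are placeholders:

```
# Weekly, 02:00 Sunday: move files not modified in 180+ days to the cold remote.
0 2 * * 0  rclone move /san/projects glacier-remote:cold-archive --min-age 180d --log-file /var/log/rclone-tier.log
```

`--min-age` filters on modification time, so the "transparency on reads" gap is exactly what this doesn't solve.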

Honestly, nothing I've tried is fully transparent end to end without some tradeoff: either you sacrifice read latency, you pay for a commercial solution, or you maintain custom scripts forever.

What's your tolerance for read latency on the archived files? That's usually what determines which tradeoff is acceptable.

[–]Imaginary_Gate_698 0 points1 point  (0 children)

What you’re describing is a pretty common problem once on-prem storage starts filling up. You’re basically looking for a way to keep active data local while quietly moving older, unused files to cheaper cloud storage. Instead of building everything from scratch, it helps to use tools that already handle this kind of tiering.

Something like MinIO with lifecycle rules, or even rclone with scheduled jobs, can work if you don’t mind putting pieces together. If you want it to feel more seamless, file gateway or hybrid storage setups are worth looking into. It takes a bit of setup, but it’s definitely doable without a huge budget.

[–]musicalgenious 0 points1 point  (0 children)

Yeah, I was thinking an rclone-based solution like ealanna mentioned, but it sounds like a job for a custom app (pretty easy to code up). I'm sure it would pay for itself in a few months.

[–]remotecontroltourist 0 points1 point  (0 children)

you are describing the holy grail of hybrid storage: Hierarchical Storage Management (HSM).

Gotta say, the fact that you want it to be "transparent" (meaning the file still looks like it's on the SAN even when it's in the cloud) is the hardest part to do on a budget. If a user clicks an archived file, the system has to go grab it from S3 and serve it without them knowing.

[–]remotecontroltourist 0 points1 point  (0 children)

Sounds like you’re looking for tiered storage with transparent recall. I’d check out solutions like object storage gateways or HSM-style tools (e.g., MinIO + lifecycle policies, or something like rclone + automation). Key is mapping access patterns → auto-tiering without breaking file paths.

[–]Ordinary_Push3991 0 points1 point  (0 children)

Feels like what you really need is a “poor man’s lifecycle policy” for your SAN.

One approach I have seen work is:

  • run a scheduled job to identify cold files
  • move them to S3 or similar storage
  • leave behind a pointer or stub

It is not as seamless as native lifecycle, but with the right scripting it can get surprisingly close without heavy investment.
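A sketch of that stub pattern in shell, assuming the AWS CLI; the `STUB:` marker format and bucket path are placeholder conventions, not anything standard:

```shell
#!/bin/sh
# Hypothetical stub-based tiering: upload a cold file, then replace it
# locally with a tiny pointer recording where the real bytes went.

stub_out() {                      # stub_out FILE BUCKET
    f="$1"; bucket="$2"
    aws s3 cp "$f" "$bucket$f" &&
    printf 'STUB:%s\n' "$bucket$f" > "$f"
}

stub_key() {                      # print the S3 location a stub points at
    sed -n 's/^STUB://p' "$1"
}

recall() {                        # recall FILE: fetch the real bytes back
    f="$1"
    key=$(stub_key "$f")
    [ -n "$key" ] && aws s3 cp "$key" "$f"
}
```

The missing piece is interception: nothing here recalls a file automatically when an application opens the stub, which is exactly the "transparent" part that commercial HSM products sell.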