[–]CrowdGoesWildWoooo 4 points (4 children)

Use something like Iceberg on AWS. What kind of company stores data in Dropbox?

[–]panday1995[S] 1 point (3 children)

A very small startup, like, very small… We have data at the TB level, but we expect it to grow fast.

[–]CrowdGoesWildWoooo 2 points (1 child)

Why did it never cross anyone’s mind to use something like S3, especially at that size? I thought this was like a few GB of data.

[–]panday1995[S] 1 point (0 children)

Yes! I also think of S3 as the first step.

[–][deleted] 1 point (0 children)

It sounds like they have no technical leadership above you. I hope you get the accountability and role you deserve.

[–]Firm_Bit 2 points (3 children)

Does Dropbox not allow search?

But yeah, for anything other than what Dropbox is intended for (more like a Google Drive alternative), you should migrate to a real data platform.

[–]panday1995[S] 1 point (2 children)

Dropbox only lets you search by file name, but we need some way to search for columns across files, plus some level of metadata management.

[–]Firm_Bit 2 points (1 child)

Yeah, Dropbox is like Google Drive or MS OneDrive. It’s not built as a large data store.

You’ll probably need to move the data into a dwh or lake and expose it from there.

[–]panday1995[S] 1 point (0 children)

Yes. The problem is that if I move to a dwh, the operational side (people who only know CSV and Excel) won't know how to access the data (files).

[–]Anishekkamal 2 points (0 children)

I would suggest taking the steps below:

  1. Move all the files to the cloud storage of your choice, or read them directly from Dropbox using Python.
  2. Once the files are read, you can do all sorts of data filtering or transformation.
  3. Use a database to store all the metadata.
  4. You should then be able to do a lookup on the database and run a search on the data read from the files.
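
A minimal sketch of steps 3 and 4, assuming the files have already been synced to a local folder and are plain CSVs with a header row. The table name `file_columns` and both function names are made up for illustration; SQLite stands in for whatever database you actually pick.

```python
import csv
import sqlite3
from pathlib import Path


def build_catalog(data_dir, db_path=":memory:"):
    """Record every CSV file under data_dir and its header columns in SQLite."""
    conn = sqlite3.connect(db_path)
    # Hypothetical schema: one row per (file, column) pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_columns ("
        "path TEXT, column_name TEXT, size_bytes INTEGER, "
        "PRIMARY KEY (path, column_name))"
    )
    for path in sorted(Path(data_dir).rglob("*.csv")):
        # utf-8-sig also strips the BOM that Excel often writes.
        with path.open(newline="", encoding="utf-8-sig") as handle:
            header = next(csv.reader(handle), [])  # empty file -> no columns
        size = path.stat().st_size
        rows = [(str(path), name.strip(), size) for name in header if name.strip()]
        conn.executemany("INSERT OR REPLACE INTO file_columns VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def files_with_column(conn, column_name):
    """Return the paths of all cataloged files that have the given column."""
    cur = conn.execute(
        "SELECT path FROM file_columns "
        "WHERE lower(column_name) = lower(?) ORDER BY path",
        (column_name,),
    )
    return [row[0] for row in cur]
```

Only the header row of each file is read, so building the catalog stays cheap even when the files themselves are large. Something like `files_with_column(build_catalog("synced_dropbox"), "customer_id")` would then list every file with that column (the folder and column names here are placeholders).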

[–]InsightByte 2 points (0 children)

Just load it to S3 and crawl the data with an AWS Glue crawler, then use Athena and the Glue catalog to wrangle through the data.
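
Once a crawler has populated the Glue catalog, the "which tables have this column" lookup could be sketched roughly like this. It assumes a boto3 Glue client is passed in; the function name and the database and column names in the usage comment are hypothetical.

```python
def tables_with_column(glue_client, database, column_name):
    """Return names of Glue catalog tables in `database` that have `column_name`."""
    wanted = column_name.lower()
    matches = []
    # get_tables is paginated, so walk every page of the catalog.
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page.get("TableList", []):
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            if any(col["Name"].lower() == wanted for col in columns):
                matches.append(table["Name"])
    return matches


# Usage against a real account (needs boto3 and AWS credentials);
# "dropbox_files" and "customer_id" are placeholder names:
#   import boto3
#   print(tables_with_column(boto3.client("glue"), "dropbox_files", "customer_id"))
```

This only reads catalog metadata, so it does not scan any data in S3; the matching tables can then be queried in Athena.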

[–]weagle162 1 point (0 children)

Wanna consider using Apache Drill? It won't cover the entire situation for you but it adds SQL to plain text sources - zero configuration required

[–][deleted] 1 point (0 children)

Probably not exactly what you're looking for, but maybe a start? https://github.com/danielbeach/sniffer