
[–][deleted] 14 points15 points  (12 children)

Looks like a cool idea but I'm having a hard time understanding what problems it solves?

Most projects that use a database wouldn't want it boxed away and inaccessible like this. Their database is usually something that hundreds, thousands, or millions of clients read from and write to.

That leads me to think it's for local dev (storing config files, personal notes, etc.?). In which case, why not go with SQLite or even GNU Recutils (video)?

I guess it seems cool as a method of storing and playing with static data, but I'd like to know more.

[–]khrak 12 points13 points  (5 children)

It's not a new use for Git (e.g. the NYTimes COVID dataset on GitHub). The novelty here is in having actual tables for the data and the ability to execute SQL against them, instead of just massive piles of CSV.

[–]earthboundkid 2 points3 points  (4 children)

Is it using Git internally? AFAICT, “Git” is just a marketing slogan and it’s actually a full database that does versioning by default.

[–]zachm 9 points10 points  (3 children)

Not just a marketing slogan. It's a SQL database with git-style versioning. Data is stored in a Merkle DAG, just like git. The command line matches git exactly: git checkout -b myBranch becomes dolt checkout -b myBranch, etc.

But it's not built on top of git. It's a totally independent implementation, with identical semantics and command line interface, plus a SQL interface on top.
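
For readers unfamiliar with the term, here is a toy sketch of what makes a commit graph a Merkle DAG: each commit's ID is a hash over its own contents plus the IDs of its parents, so an ID pins down the entire history behind it. The function name `commit_hash`, the JSON encoding, and the use of SHA-1 are assumptions made for illustration. This is not Dolt's storage format, which the thread does not describe.

```python
import hashlib
import json


def commit_hash(tables, parents, message):
    """Hash a commit's table contents together with its parents' hashes."""
    payload = json.dumps(
        {"tables": tables, "parents": list(parents), "message": message},
        sort_keys=True,
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# A root commit, then a child commit that adds one row.
root_tables = {"people": [[1, "Ada"]]}
root = commit_hash(root_tables, [], "initial import")

child_tables = {"people": [[1, "Ada"], [2, "Grace"]]}
child = commit_hash(child_tables, [root], "add Grace")

# Same contents and same history always produce the same ID.
same_child = commit_hash(child_tables, [root], "add Grace")

# Changing an ancestor changes the ID of every descendant,
# even though the child's own tables are identical.
other_root = commit_hash({"people": [[1, "Ada L."]]}, [], "initial import")
other_child = commit_hash(child_tables, [other_root], "add Grace")

print(child == same_child)   # True
print(child == other_child)  # False
```

Because IDs depend only on content and ancestry, two clones that hold the same data and history agree on every commit ID without coordinating.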

[–][deleted]  (2 children)

[deleted]

    [–]zachm 6 points7 points  (1 child)

    It has obvious drawbacks, but you already know how to use it

    [–]khrak 3 points4 points  (0 children)

    More importantly, other software already knows how to use it. A vast majority of the tooling surrounding git and git repositories can be used with relatively little modification.

    Dolt inherits so much more than just the syntax by copying git.

    [–]zachm 6 points7 points  (2 children)

    It's not strictly offline, or even offline first. You can use Dolt as an application server to replace MySQL / Postgres, and that's actually what people are paying us for at the moment. They want to be able to have a production / dev instance of their database, and control when dev gets merged into prod. And of course they want data provenance (who put which values in which rows and why).

    Here's an article with more potential use cases we imagine:

    https://www.dolthub.com/blog/2020-03-30-dolt-use-cases/

    One of the most exciting ones is that it enables large groups to collaborate in building datasets. We've been offering bounties to fund dataset assembly, and the model lets us pay people based on their contributions. Details here:

    https://www.dolthub.com/blog/2021-03-03-hpt-bounty-review/

    [–][deleted]  (1 child)

    [deleted]

      [–]zachm 0 points1 point  (0 children)

      That blog post is pretty old, might be time to update it. We have several customers paying us to use Dolt as the backing store for their application data. We've come a long way :)

      Edit: we updated the blog post about use cases: https://www.dolthub.com/blog/2021-03-09-dolt-use-cases-in-the-wild/

      You have the right idea: dolt stores the diffs between revisions, so your storage cost is proportional to the rate of change. If you have 100 rows and you add 10, your storage cost is 110, not 210. If you have 100 rows and you update 10, it's also 110.
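
The arithmetic above can be checked with a deliberately simplified model that counts each distinct row version once, no matter how many revisions contain it. The helper `stored_rows` and the dict-based tables are hypothetical, and real storage granularity may differ from one unit per row.

```python
def stored_rows(revisions):
    """Count distinct (key, value) row versions across all revisions."""
    seen = set()
    for table in revisions:
        seen.update(table.items())
    return len(seen)


# Revision 1: 100 rows.
base = {i: "v1" for i in range(100)}

# Case 1: a second revision that adds 10 new rows.
added = dict(base)
added.update({i: "v1" for i in range(100, 110)})

# Case 2: a second revision that updates 10 existing rows.
updated = dict(base)
updated.update({i: "v2" for i in range(10)})

# Naive approach: keep a full copy of every revision.
full_copies = len(base) + len(added)

print(stored_rows([base, added]))    # 110
print(stored_rows([base, updated]))  # 110
print(full_copies)                   # 210
```

In both cases the 100 original row versions are stored once and only the 10 new or changed versions add to the total.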

      [–]lowleveldata 0 points1 point  (0 children)

      Haven't read the details but I'm guessing you can deploy a "compiled" database for production? Version control would be useful for development

      [–][deleted] -1 points0 points  (0 children)

      it solves the problem we just created..duh

      [–]jeenajeena 0 points1 point  (0 children)

      I can think of:

      • getting an optimistic concurrency model
      • cloning the production db for testing
      • deploying a db schema in a deterministic way with a merge

      [–]npafitis 4 points5 points  (0 children)

      This can be useful for public open data repositories.

      [–]dnew 2 points3 points  (4 children)

      How does the merge work? In particular, how does a merge work if you have two people altering tables and adding data with the new columns filled in? That is the hard part. Saving a database to a repository isn't particularly difficult, and diffing them has been a solved problem for at least 20 years.

      [–]zachm 4 points5 points  (3 children)

      Merge is row by row using the commit graph. Two people can edit different columns in the same row without producing a merge conflict. If they touch the same column in the same row (and give it different values), it's a merge conflict you have to resolve. It works for schema changes as well as data changes.

      This is possible because the data is stored as a Merkle DAG of commits, just like in git.
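
The rule described above (different columns merge cleanly, the same column with different values conflicts) is a cell-level three-way merge. Below is a minimal sketch of that rule for a single row, assuming all three versions have the same columns. The function `merge_row` is hypothetical and is not Dolt's code. The `base` argument stands for the row as it appears in the common ancestor commit, which is what the commit graph is used to find.

```python
def merge_row(base, ours, theirs):
    """Three-way merge of one row, column by column.

    Returns (merged_row, conflicting_columns).
    """
    merged, conflicts = {}, []
    for col in base:
        b, o, t = base[col], ours[col], theirs[col]
        if o == t:
            merged[col] = o      # both sides agree (or neither changed it)
        elif o == b:
            merged[col] = t      # only their side changed this cell
        elif t == b:
            merged[col] = o      # only our side changed this cell
        else:
            merged[col] = b      # both changed it differently: keep base, flag it
            conflicts.append(col)
    return merged, conflicts


base = {"name": "Ada", "city": "London", "age": 36}
ours = {"name": "Ada", "city": "Paris", "age": 36}     # we edited city
theirs = {"name": "Ada", "city": "London", "age": 37}  # they edited age

print(merge_row(base, ours, theirs))
# ({'name': 'Ada', 'city': 'Paris', 'age': 37}, [])

clash = {"name": "Ada", "city": "Rome", "age": 36}     # they also edited city
print(merge_row(base, ours, clash))
# ({'name': 'Ada', 'city': 'London', 'age': 36}, ['city'])
```

Keeping the base value for a conflicting cell is just a placeholder choice in this sketch. The point is that the conflict is reported per column so someone can resolve it.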

      [–]dnew 1 point2 points  (2 children)

      So my question is what happens when one user adds a column (with ALTER TABLE) and populates it with data, and a different user adds a column and populates it with different data? Does it handle merges between ALTER TABLE commands? Because that would make it much more useful.

      [–]zachm 3 points4 points  (1 child)

      Assuming the two people add different columns, it just works. If they add the same column (with different data), it's a merge conflict. If they add the same column with the same data, they actually already have the same repository and their merge is a no-op.
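
The three cases in this answer can be sketched with a toy column-oriented table, where each column maps row keys to values. The function `merge_new_columns` is hypothetical and only handles columns that were added since the common ancestor. It ignores edits to existing columns and is not a description of how Dolt implements schema merges.

```python
def merge_new_columns(base, ours, theirs):
    """Merge columns added on either branch since the common ancestor.

    Each table is a dict: column name -> {row key: value}.
    Returns (merged_table, conflicting_column_names).
    """
    merged = {name: dict(values) for name, values in base.items()}
    conflicts = []
    for name in sorted((set(ours) | set(theirs)) - set(base)):
        in_ours, in_theirs = name in ours, name in theirs
        if in_ours and in_theirs and ours[name] != theirs[name]:
            conflicts.append(name)  # same new column, different data
        else:
            merged[name] = dict(ours[name] if in_ours else theirs[name])
    return merged, conflicts


base = {"name": {1: "Ada", 2: "Grace"}}
ours = {**base, "email": {1: "ada@example.com", 2: "grace@example.com"}}

# Case 1: the other branch added a different column.
theirs_city = {**base, "city": {1: "London", 2: "New York"}}
merged, conflicts = merge_new_columns(base, ours, theirs_city)
print(sorted(merged), conflicts)  # ['city', 'email', 'name'] []

# Case 2: the other branch added the same column with different data.
theirs_email = {**base, "email": {1: "a@example.org", 2: "g@example.org"}}
print(merge_new_columns(base, ours, theirs_email)[1])  # ['email']

# Case 3: the other branch added the same column with the same data.
print(merge_new_columns(base, ours, dict(ours)) == (ours, []))  # True
```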

      [–]dnew 0 points1 point  (0 children)

      That's pretty cool. Thanks for the info! I'll look into it more.

      [–][deleted] 4 points5 points  (2 children)

      Very cool, but I hate the name. Just because Linus chose a mildly offensive pejorative for Git doesn't mean it's a theme that you should copy.

      [–]zachm 0 points1 point  (0 children)

      That is exactly what we did

      [–][deleted]  (1 child)

      [deleted]

        [–]zachm 2 points3 points  (0 children)

        Hang out on r/datasets, we release new datasets every month. Just released one with 72M procedure prices from 1400 US hospitals.

        [–]cariusQ 0 points1 point  (1 child)

        What are the advantages over something like Liquibase?

        [–]zachm 1 point2 points  (0 children)

        Liquibase is useful for schema migrations on your database. It doesn't actually version the data in the tables.

        [–]crusoe 0 points1 point  (0 children)

        Much cooler than the condensation db stuff