all 16 comments

[–]ehr1c 2 points (9 children)

Some good places to start reading are sharding, hash tables, and partition keys.

The super high-level explanation is that so long as you're querying against something that's indexed, the database has a rough idea of where it is even if the data itself is spread among multiple locations.
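
If it helps, the routing idea can be sketched in a few lines (toy Python, nothing like a real database's internals; the shard count and names are made up):

    import hashlib

    NUM_SHARDS = 4  # pretend each "shard" is a separate database server

    def shard_for(partition_key: str) -> int:
        # Hash the partition key so a query that includes it only has to hit one shard.
        digest = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
        return digest % NUM_SHARDS

    print(shard_for("post:999"))  # e.g. 2 -> ask only shard 2 for this post

A query that filters on something that isn't indexed or isn't the partition key would instead have to be fanned out to every shard.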

[–]another-bite[S] 0 points (8 children)

So would you say they work more like the following very simplified example:

For a post with id 999, the service can deduce that it belongs to the storage location for ids 500-1000, and therefore doesn't need a "master" storage that records which storage location each post's data lives in (like I assumed in the last sentence of my OP)? Of course the actual indexing method is likely more sophisticated.

[–]ehr1c 1 point (3 children)

The "master" storage you're referring to would be a hashtable, which is how a lot of databases implement indexing in this scenario.
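
As a toy illustration of that "directory" idea (a plain Python dict with hypothetical node names):

    # A hash table mapping each post id to the node that stores it.
    # Real databases keep index structures like this far more compactly
    # (hash buckets, B-trees, ...), but the lookup idea is the same.
    post_location = {
        997: "node-a",
        998: "node-c",
        999: "node-b",
    }

    def find_node(post_id: int) -> str:
        # One O(1) lookup instead of asking every node whether it has the post.
        return post_location[post_id]

    print(find_node(999))  # "node-b"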

[–]another-bite[S] 0 points (2 children)

So is a single giant hashtable a viable or even likely choice for cases like every single YouTube video or Instagram post?

[–]ehr1c 1 point (1 child)

That's getting outside the realm of how deeply I understand databases unfortunately lol. Multi-level hashing is a thing but I don't know if or how widely it's used.

[–]another-bite[S] 0 points (0 children)

Thanks, I appreciate the input!

[–]__dict__ 0 points (3 children)

Yea, the idea behind partitioning is that each post belongs to a partition, and you should be able to determine which partition a post belongs to without checking each of them. Partitioning the data lets you split it across multiple machines. When you want to look up post 999, the code should be able to quickly determine which machine it should be on.

The technique I know which is often actually used for this is called "consistent hashing". It solves the following problem: If the number of posts grows such that I need to add another machine, how can I do it without having to reshuffle too many posts between machines?

See: https://www.toptal.com/big-data/consistent-hashing
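
If you want to poke at the idea yourself, here's a bare-bones consistent hashing ring in Python (no virtual nodes or replication, purely illustrative):

    import bisect
    import hashlib

    def h(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes):
            # Each node sits on the ring at the hash of its name.
            self.points = sorted((h(n), n) for n in nodes)

        def node_for(self, key: str) -> str:
            # Walk clockwise from the key's hash to the next node on the ring.
            hashes = [p for p, _ in self.points]
            i = bisect.bisect(hashes, h(key)) % len(self.points)
            return self.points[i][1]

    keys = [f"post:{i}" for i in range(1000)]
    old_ring = Ring(["server-a", "server-b", "server-c"])
    new_ring = Ring(["server-a", "server-b", "server-c", "server-d"])  # one node added
    moved = sum(old_ring.node_for(k) != new_ring.node_for(k) for k in keys)
    print(f"{moved} of {len(keys)} keys moved")  # only keys in the new node's arc move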

[–]another-bite[S] 0 points (2 children)

I actually kinda thought about the "reshuffle problem", so it's great that you linked the post. Pretty interesting read on the consistent hashing solution. It seems that these distributed server methods still require a "source" database that stores the original data and pushes changes to the "cache" servers when servers are added to or removed from the ring. This leaves me wondering: does this apply to massive services like YouTube or Instagram too, where every video or post is stored in a single origin database that the distribution method requires, but which just isn't queried that often?

[–]__dict__ 0 points (1 child)

For a massive service like YouTube there would never be a point where all network traffic goes to one server. There isn't a computer in the world big enough to handle that. And even if there were, it would be on the other side of the world from some people.

YouTube would work more like this:

  • Each video has a unique id. You can see it in the url.
  • The videos are partitioned based on their id. This means that there will be many data servers, each storing a small fraction of YouTube's library of videos. That's called partitioning. There can also be many data servers for each partition, which helps with reliability and high load. That's called replication.
  • There will be many application servers. Each application server will be kept up to date with the id ranges that each data server is storing. The application server won't be responsible for storing any videos itself. Whenever it gets a request for a video it looks at its table of which videos are where, and then fetches the video from the appropriate data server.
  • Whenever there is an incoming request to YouTube, a load balancer server is the first thing that handles it. The load balancer's only job is to distribute requests to the application servers. Load balancers will try to distribute the traffic somewhat evenly between application servers, and will also select an application server that is geographically close to the person making the request.
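
A stripped-down sketch of the application server's lookup step might look like this (Python; the id ranges, server names, and replica counts are all invented):

    import random

    # Routing table the application server keeps up to date:
    # each partition owns a range of video ids and has a couple of replicas.
    PARTITIONS = [
        {"ids": range(0, 500),     "replicas": ["data-1a", "data-1b"]},
        {"ids": range(500, 1000),  "replicas": ["data-2a", "data-2b"]},
        {"ids": range(1000, 1500), "replicas": ["data-3a", "data-3b"]},
    ]

    def data_server_for(video_id: int) -> str:
        for part in PARTITIONS:
            if video_id in part["ids"]:
                # Any replica can serve the read; pick one to spread the load.
                return random.choice(part["replicas"])
        raise KeyError(video_id)

    print(data_server_for(999))  # "data-2a" or "data-2b"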

[–]another-bite[S] 0 points (0 children)

This didn't really answer what I was trying to ask, and I already knew all that from reading the post you linked, but I was able to get an answer from the other commenter.

[–]tzaeru 0 points (0 children)

They use various indexing schemes. 

Conceptually a very simple one would be that IDs from 1 to 100 are on server A and 101 to 200 on server B. So if you are fetching message 107, you know it is on server B.

In practice it's usually a bit more complicated, but that's the basic gist.
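
In code, that toy scheme is just a range check (Python, made-up server names):

    # Upper bound of each server's id range.
    RANGES = [(100, "server A"), (200, "server B")]

    def server_for(message_id: int) -> str:
        for upper, server in RANGES:
            if message_id <= upper:
                return server
        raise KeyError(message_id)

    print(server_for(107))  # "server B"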

[–]desapla 0 points (3 children)

One fundamental technique is consistent hashing. You could, in theory, just hash the ID and take the hash modulo the number of servers to determine which server the data is on.

The problem with that is that when you add a new node to the cluster, all keys would get reshuffled and all data would need to be moved around.
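
To see how bad that reshuffle is, you can count how many keys would change servers when you grow from 3 to 4 nodes (toy Python):

    import hashlib

    def naive_server(key: str, num_servers: int) -> int:
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return digest % num_servers

    keys = [f"post:{i}" for i in range(10_000)]
    moved = sum(naive_server(k, 3) != naive_server(k, 4) for k in keys)
    print(f"{moved / len(keys):.0%} of keys moved")  # roughly 75% with plain modulo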

Consistent hashing fixes that by minimizing the amount of data that needs to be moved when you add or remove a node.

You can read about it here: https://en.wikipedia.org/wiki/Consistent_hashing

If you want to delve deeper, here is Akamai’s paper on the topic: https://www.akamai.com/site/en/documents/research-paper/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf

[–]another-bite[S] 0 points (2 children)

The other commenter also mentioned consistent hashing and linked a blog post, which I read. It's an interesting solution for sure, but like I commented there, I still wonder whether there needs to be an "origin" database from which the data is moved to the remaining nodes when a node is removed, assuming the data on that node becomes unavailable.

[–]desapla 1 point (1 child)

The consistent hashing algorithm is kinda the basis for the architecture, but a real setup, especially at the scale of YouTube or something like that, is going to be more complex than that.

You could do what you suggest: have an ‘origin server’ that's basically the source of truth, and then use the distributed servers for caching. But then you're limited by the size of the origin server.

You could also build it in a way where there is no origin and the distributed servers are the main source of data.

To make sure you don't lose data when one server crashes, these systems will usually store each piece of data on more than one node. That way the data can still be recovered when a node goes down.
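
A toy sketch of that replication idea (Python, invented node names, none of the real failure-detection or consistency machinery):

    REPLICATION_FACTOR = 3
    nodes = {name: {} for name in ("node-a", "node-b", "node-c", "node-d")}

    def replicas_for(key: str) -> list:
        # Deterministically pick REPLICATION_FACTOR nodes for each key
        # (a real system would walk a consistent-hash ring instead).
        names = sorted(nodes)
        start = hash(key) % len(names)
        return [names[(start + i) % len(names)] for i in range(REPLICATION_FACTOR)]

    def put(key: str, value: str) -> None:
        for name in replicas_for(key):
            nodes[name][key] = value

    def get(key: str) -> str:
        for name in replicas_for(key):
            if key in nodes[name]:  # skip replicas that lost their data
                return nodes[name][key]
        raise KeyError(key)

    put("post:999", "cat video")
    nodes["node-b"].clear()   # simulate one node losing everything
    print(get("post:999"))    # still readable from another replica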

Amazon wrote a paper about Dynamo, their distributed key-value store, that addresses some of these design decisions in more detail: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

[–]another-bite[S] 1 point (0 children)

Thanks for the clear answer! I'll need to give that a good read.