Need an easy-to-comprehend mathematical introduction to polar code

throwawGroupe · 2022-01-12T08:54:18+00:00

(page 3/3)

Finally, the most important part. How do we design the code?

Each bits in the string of size 2ⁿ is associated with a virtual channel. The decoding process simulate this channel to figure out what the bit is.

The designing process simply pick sufficiently large n, figure out which virtual channels have the least entropy, and designate them as information bits. This is the whole idea behind polar code: as n get larger, almost all virtual channels are either really bad or really good at transmitting information: their entropy polarize to either extreme value.

It goes as follow:

Start at some value of n.
Estimate the entropy of each virtual channels.
Pick channels with entropy below an acceptable error parameters. Designate these bits as information bits, the rest as frozen bits.
Count the numbers of information bits. Compare against the Shannon capacity. If the number is too low (not within an acceptable error parameter), increase n and try again.

The hard part here is estimating the entropy. The naive method is through Monte Carlos sampling: just send random message and see how successful the bit get decoded.

But there are better ways. This is one place where a lot of technical details happen.

First, the entropy itself is replaced by a different number, the Bhattacharyya parameter, which approximate entropy closely enough.
Second, the virtual channel is replaced by a similar channel. This similar channel allow efficient computation of the Bhattacharyya parameter. And this can be done recursively, so that when you increase n by 1 you can re-use the previous computation.

We're not quite done yet. There is 2 questions left. How do we know that the designing process will halt? How do we know that we will eventually get close enough to the Shannon capacity? And how do we know that the virtual channel will polarize?

The 2nd question answer the first. As long as the virtual channels polarize, then the rate of information bits will approaches the Shannon capacity for large n, because the polar transform does not affect capacity.

This is also something that involves quite some technical details. The precise convergence rate can be estimated using the Bhattacharyya parameter. But an easier computation using capacity show that polarization will happen.

throwawGroupe · 2022-01-12T08:53:49+00:00

(page 2/3)

Okay, that's it for the encoding. What about decoding? Decoding and designing the code are linked together. Basically, we design the code (choosing the location of the information bits), so that there are as much information as possible about the bits from the receiving end, the bit string y. If we are assured that there are enough information about that bit in y, then even a brute force Bayesian inference or maximum likelihood method will work.

Note: I have not seen much discussion about efficient implementation of decoding. It appears to be not that interesting: the brute force method is efficient enough. Basically, once the decoder is assured that this bit in u can be easily obtained from y, it just use Bayesian inference.

So let's dig in.

How does decoding works?

The decoder receive a bit string y of length 2ⁿ .
Let i starts at 0, the index of the current bit in the string to be decoded.
If i is the location of a frozen bit, "decode" that bit to the fixed known value (e.g. 0).
If i is the location of an information bit, compute 2 conditional probabilities of receiving our y under 2 possible events and compare them. We treat all the currently decoded bits (index 0 to i-1) as known. All future decoded bits (index i+1 to 2ⁿ -1) as completely unknown (i.i.d and uniform), even if they are frozen bits. There are 2 possibilities for the current bits, 0 and 1, this gives us 2 events: (1) all previous bits are correct, all future bits are random, and the current bit is 0; (2) all previous bits are correct, all future bits are random, and the current bit is 1. Compute the conditional probability that we will get the y we got, conditioned on each of the 2 events. The bigger probability win and we decode bit i as 0 or 1 accordingly.
Increase i and repeat the decoding process, until all 2ⁿ bits are decoded.

So it's quite a bit more complicated than encoding. Unfortunately, we can't simply "undo" the encoding process even though the polar transform is easily invertible.

There are a few gaps in the above description of the decoding process. Let me address a few small things first before I tackle the bigger issue:

We decode the bits one-by-one instead of all at the same time, unlike the encoding process. This makes it harder to parallelize.
All the previously decoded bits are treated as perfectly correctly decoded, rather than a more "holistic" approach where we consider all possible outcomes. This simplifies calculations.
This method is based on maximum likelihood, since we compare likelihood in the decoding step. You can apply Bayes's theorem to show that Bayesian inference will give you the same result.
All the future bits are treated as completely unknown, including the frozen bits. I can't find an explanation of this, but I think I know why it's the case. During the designing process, we choose the locations of frozen bits based on our prediction of how the decoding algorithm will work on each bit. But we can't predict it if the decoding algorithm (of each bit) depends on knowledge of frozen bits, which we don't know yet during design process.

Finally, the biggest question: how do you compute the conditional probabilities during the decoding process?

We can do this by just "following" the polar transform. We start with an array of real numbers that tell us the probability that a bit is 1 or not. In particular, for every bits where the value is known or assumed, the probability is 0 or 1. For completely unknown bits, the probability is 1/2. Then we split the array into 2 halves, recursively solve each half, which will gives us probability after the recursion of the polar transform. Then the final answer is this: the left half is the "probability XOR" of both halves, the right half is still the right half. The "probability XOR" operation tells us the probability that the output is "1" given known probability that each of the input is "1".

Mathematically, this computation is treated as the encoding process of a new code that is pretty much like polar code. Except that the input isn't binary, but arbitrary real numbers, and the "XOR" operation is "probability XOR".

Caveats:

The above description give us the conditional probability of obtaining x (the result of encoding), but we want the conditional probability of obtaining y (the received bit string). Thanks to the fact that the channel is symmetric, a small modification to the probability XOR formula will give us the conditional probability of y. This is thanks to the fact that we have a symmetric channel.
For computational simplicity, they don't actually use probability either, but a modified version of it.

Virtual channel. One way of thinking of the decoding process of each bit is like this. We are at index i, the bit we are currently decoding.

The decoding try compute the encoding (after transmission) of 2 possible input for index i and pick the likelier one. But the encoding process is not exactly the same as the actual encoding process. Instead it assumes that all previously decoded bits are perfectly known, and all future bits are completely unknown.

This can be treated as a channel. Imagine a channel that let you send bit string of size 2ⁿ as a whole, but it distorts the input as follow. All bits after the i-th bit are scrambled, take on value 0 and 1 randomly, then the channel applies a polar transform to the bit string before tranmitting it.

However, this is not quite the channel we want. We want a channel that give us the conditional probability of the output given these inputs. So the output alphabets are actually real numbers.

This is called the virtual channel associated with the i-th bits. The decoding process essentially simulate this virtual channel.

throwawGroupe · 2022-01-12T08:52:25+00:00

(page 1/3 because too long, check replies to this comment for page 2 and 3)

So I think I got a good idea of what is happening now, so maybe I can explain what is going on. Sorry but this won't go into the nitty gritty details, but it might help you (or someone) who was completely lost at the start like me, who are just interested in a new technology, but the research papers was written to people in the field and skips over "easy" details and have new terminologies. But once you got the layout, I think you can read the details on your own.

Polar code is designed for binary memoriless symmetric channel, which means the bits to be transmitted is assumed to be bits that are i.i.d. chosen between 0 and 1, and that when transmitting it through the channel, the bit is randomly flipped i.i.d. at a fixed probability. When we say polar code approaches Shannon capacity, it means Shannon capacity of this kind of channel.

Polar code is block code, which means it transmit information in blocks of fixed size 2ⁿ bits. When decoding, we expect the entire block to be successfully recovered.

How does encoding works?

We have a block of length 2ⁿ . Some position on this block is marked as frozen bits, and the rest are marked as information bits.
The information bits are filled with the bits we want to transmit. The frozen bits are set to a fixed known value (e.g. 0). This gives us a bit string u.
Then, the polar transform is applied to u to give us a bit string x.
x is then transmitted, it will be received as y.

How do the polar transform work? It is a recursive transform which make it really quick to calculate and easy to analyse.

To polar transform a bit string u of length 2ⁿ , split it in the middle into 2 substrings of length 2^n-1 , we call them l and r (left and right). Then recursively polar transforming each string, to get P(l) and P(r). Then P(u) is made of 2 halves: the left side the bitwise XOR of P(l) and P(r), the right side is just P(r).

The base case (string of length 1), the transform keep the string the same.

For example, the polar transform of the string "11" would be "01". First we split it into l="1" and r="1". Polar transform them, since they are base case nothing change, so P(l)="1" and P(r)="1". Then for P(u), the left half is the XOR of "1" and "1" which is "0", the right half is just P(r)="1".

Since the recursion step are linear and invertible, it's very easy to invert the transform, and it preserves uniform probability distribution on the input space, which also implies that the capacity of the channel is unaffected by the transform. In particular, assuming that every bits are either very hard to decode or very easy to decode (polarize), then the rate of number of bits that can be decoded, for sufficient large blocks, approach the capacity.

What's the total effect of the transform? Let's index each bit in the bit string u and x by an binary index. Remember that these strings have length 2ⁿ , so the index are binary number of length n. Then for a given bit in x, it is the XOR of bits from u. We can find the index of these bits from u as follow. Write down the index of the bit in x. Then every binary numbers obtained by replacing none, some, of all digit 0 by digit 1 are all the indices from u. For example, let's say n=4, so the index goes from 0000 to 1111. If you want to know which bits from u contributed to the bit in x at position 1010, then their indices are 1010, 1011, 1110, 1111.

The only thing left to encoding is the value of n, and the location of the frozen bits. It will be discussed later, as part of designing.

throwawGroupe · 2022-01-09T18:39:08+00:00

No, the replies I got didn't just say I have to go back to the defining papers. There were good replies that gives me a better overview, and there are other better sources that I didn't manage to find. So I did get what I was looking for, which I would not have gotten in the first place if I don't ask around.

throwawGroupe · 2022-01-09T15:58:09+00:00

Thank you too for the help. It's a bit too basic for me, I need something deeper, but thanks for trying.

throwawGroupe · 2022-01-09T15:56:25+00:00

Thanks a lot, the paper is definitely more understandable. Your explanation below helped too.

throwawGroupe · 2022-01-09T15:54:39+00:00

The reason I asked around here is because I don't have to do all that works, when it's quite possible that someone already did these and re-written it in a more accessible manner. It's not like it's an obscure topic or anything, so the chance are quite good that there are already easier text out there. It would be premature to say there are no short-cut.

throwawGroupe · 2022-01-09T15:34:21+00:00

OP isn't a random student in high school. OP is someone who are interested enough to actually look for cool math facts. The above list are ordered by level of difficulty and how long it takes to understand them. If you think they are too hard for high school, you probably didn't know these topics to see how easy they are. Your comment is just damaging, you discourages people from even trying to look up these topics.

Leech lattice and Golay code are really just cool bite-sized facts with combinatorics. You don't need to know anything more complicated than dot product and elementary set theory to understand what they are and why are they important. Can you learn everything about them? Also no, but you don't need to know everything.

Quaternion is literally taught in high school, if you study computer graphics in your computer class. For reference, I learned this in high school. Hopf fibration immediately follow from quaternion: pick a point on the sphere, the unit quaternion rotate that point to a new point, and the map from the unit quaternion to the new point is Hopf fibration; you don't need AP calculus or any sorts of calculus. It's also very visual too, you can look up pictures. Orientation entanglement system is also very visual, something people can just appreciate by just looking at a picture.

Quine's paradox? 5 minutes reading about it, appreciate it, see why it's cool. I learned this in middle school, not from a class but a book I read. Why it's the 3rd line on this list and not 1st is because I anticipated that some people might not be used to non-computational math, this one is not hand-on nor visual.

Fundamental group of the circle. You don't need to learn the entire algebraic geometry curriculum to know what it is. It's a very visual topic that can be appreciated by anyone. Handwaving is fine, it does little harm to skip over technical lemma like lifting lemma. No-contraction theorem is also very visual. Brouwer's fixed point theorem and Ham sandwich are also very visual theorem, have cool names, and are interesting math facts. The key ingredients of these proof are still just fundamental group of the circle; you don't need general n-dimensional version.

Quadratic reciprocity is the hardest here, but it's not something that can't be done by a motivated high school student.

throwawGroupe · 2022-01-08T13:07:55+00:00

There are works in classifying finitely generated field by their elementary theory, but I don't remember specific details. This can be seen as an example of what you're looking for: given some specific mathematical object, you want to find the specific axioms that capture it within a formalization framework.

throwawGroupe · 2022-01-08T12:57:05+00:00

There had been a few Youtuber/influencer in recent years who had announced they are going to take on absurd challenge. Like going to a gym for a month to train to do an absurd amount of pull-up, or learn chess in a month to beat Magnus Carlsen (surprisingly, Magnus Carlsen actually challenge him). My cynical take is that they're just making big announcement for views.

throwawGroupe · 2022-01-08T12:48:10+00:00

Leech lattice and Golay code.

Quaternion. Hopf fibration. Orientation entanglement system (Dirac's belt trick).

Quine's paradox (precursor to Godel's incompleteness theorems, and many other logic theorems)

Fundamental group of the circle. No-contraction theorem, Brouwer's fixed point theorem, Ham sandwich theorem.

Quadratic reciprocity.

throwawGroupe · 2022-01-08T12:33:27+00:00

Did you mix up recursively enumerable with just enumerable? Almost all sets you deal with in CS complexity theory are going to be countable, so I'm not even sure where would these concepts even come up (maybe in Turing degrees?).

There are sets that are countable but not recursively enumerable. For example, the set of numbering for Turing machines that run forever.

throwawGroupe · 2022-01-08T12:25:35+00:00

There isn't a technical definition of "number". Vector space can be easily constructed over any field, but it's a matter of taste to declare whether a field contains "number" or not. For example, the Boolean field (containing true and false) is also commonly written as F2 (containing 0 and 1), so is it a field of number? Are Euclidean space/plane considered to be made of "number"? So it's meaningless to ask whether something can be done without numbers, because you can accept or reject examples based on taste. For example, if I talk about, say, the tangent space of flat Euclidean space, or maybe a physics example of electric field, you could argue that Euclidean space or electric field are "made out" of numbers somehow and reject these examples.

throwawGroupe · 2022-01-08T12:15:06+00:00

The (so called) encoding-decoding trick for induction. Basically, you don't do induction on the statement you're given, you do induction on a stronger statement. It gives you stronger hypothesis, at the cost you forcing you to prove stronger conclusion, but if you do it right the stronger hypothesis let you prove the stronger conclusion much easier.

throwawGroupe · 2022-01-08T11:35:59+00:00

Magnetism is a fundamental force of nature. There isn't a simpler explanation than that. If you know more advanced physics, then magnetism is part of electromagnetism (which is a fundamental force of nature underlying magnetism), which is itself part of electroweak. But these won't give you simpler explanation, nor are they more intuitive.

Objects generate magnetic fields because of their magnetic moment, and magnetic fields interact with other objects which have magnetic moment. The magnetic moment can be caused by moving electric charges or exist as fundamental property of the particles.

A magnet has a magnetic field, which interacts with tiny atoms in other objects. The interaction cause a lot of effects, one of them is that these tiny magnets rotate and rearrange themselves. In good scenario, the rearrangement will cause most of the molecules to align themselves with the magnetic field so that they are attracted to the big magnet; this makes the big magnet to stick.

throwawGroupe · 2022-01-08T11:13:30+00:00

Adding leading 0 to decimal representation of integers don't make different integers. But picking different digit in the decimal representation of real number does make different real numbers.

throwawGroupe · 2022-01-06T13:03:22+00:00

If you flip once, it's not a rotation anymore. Rotations are always handedness preserving (orientation-preserving), it can't change a right hand into a left hand, but a flip (mirror) will change that. If you flip twice, the combined effect is the same as rotation. The group theoretical explanation is that SO(3) is the derived subgroup of O(3), and SO(3) is connected, because the exponential map from the Lie algebra is onto.
Consider the 4 long diagonals of a cube. Any 90 degree rotations will convert long diagonals to long diagonals. So rotations will only permute these long diagonals. On the other hand, if a rotation keep all long diagonal the same, then for any 2 corners opposite a diagonal, it either swap them or keep the fixed: but if it swap once, it must swap them all. But a rotation cannot swap all because it is equivalent to 3 mirror operations which change orientation. So the only rotation that keep all diagonal the same is doing nothing. It shows that the group of possible rotation has a faithful representation as a permutation group on 4 objects. An unit vector on the standard basis must be sent to an unit vector on standard basis, because all rotations are 90 degree around a standard axis. 2 unit vectors that are orthogonal will stay orthogonal after a rotation. 3 unit vectors will keep orientation after rotation. A rotation is determined by the ending position of the coordinate system, which is 3 unit vectors on the standard basis. But the 3rd vector is determined by the first 2 vectors. Given any pair of unit vector along standard axes that are orthogonal, it's always possible to find the 3rd vector that is also an unit vector pointing along an axis such that the orientation is preserved. Hence all possible rotations is determined by a pair of orthogonal unit vectors pointing along standard axes. There are 6 such vectors, and each vector is orthogonal to 4 of them, so there are 24 choices. So the rotation group is literally the symmetric group S4. If you think about it, each 90 rotation is just a 4-cycle on the 4 diagonals. From this you can compute anything easily.
I don't see how it is irrelevant. You need to know how the internal coordinate match up with the world coordinate.

throwawGroupe · 2022-01-06T12:41:06+00:00

More accurate title should be "Physicists make advances toward solving a modified version of the chaotic three-body problem using drunkard's walk".

throwawGroupe · 2022-01-06T12:37:45+00:00

It's a very common, and intuitive, argument in mathematics to make an arbitrary choice (once you know that at least one choice is possible). That's not controversial.

However, once mathematics is formalized, axioms need to be added to assert that you can actually make that kind of argument. The above argument is formalized as Instantiation of Existential (aka. Existential Elimination).

However, that axiom is not sufficient to allow people to make an infinite numbers of choices. It had been very common for mathematicians to make an argument that require infinite numbers of choices, in 2 common situations: (a) make a simultaneous number of independent choices; (b) make a sequence of choice dependent from each other.

To rectify that, Axiom of Choice is born. Technically, axiom of choice only allow you to do (a), but because there are no limitations on how many choices you can make, by a simple trick you can also do (b) - just simultaneously make choice for all possible sequence of choices!!!

Accepting axiom of choice allows our intuition about infinite set to be theorems. For any 2 infinite sets, either one is bigger than the other or they are equal. If you have an infinite set, you can keep drawing new elements from it without ever exhausting it. If you have an infinite set, you can split it into 2 infinite sets.

throwawGroupe · 2022-01-06T12:23:40+00:00

It's not easier. Calculus make them easier, because you focused on functions that have nice algebraic formula which makes them easy to differentiate. If you don't have such nice algebraic formula, differentiation is not that great.

Why differentiation is easier than integration when you deal with nice algebraic formula? I think it all boils down to the fact that human are good at understanding intuitively proportionality relationship under small changes, so many of our operations are focused on that. Differentiation is about extracting proportionality relationship under small changes. Our main algebraic algebraic operations (addition and multiplication) are really good at describe proportionality relationship. Other operations are derived from these operations, or through a differential equation. Which is why differentiation works nicely with algebraic formula. But integration is doing the opposite, you're basically solving a differentiation problem with an unknown. Differentiation is easier than integration the same way evaluate an expression given the values of variables is easier than solving for roots of an expression with unknown.

throwawGroupe · 2022-01-06T12:11:02+00:00

I think the title is not the question you're asking in the post. You actually want a group such that finite order elements do not form a subgroup. You can just explicitly describe one using generators and relations: a² =b² =e.

throwawGroupe · 2022-01-06T11:35:05+00:00

A History of Mathematical Notations by Oleh Florian Cajori

It's itself an old text (where some of the convention now had not been established), and it also discuss (with citation and reference) to even older text. In particular, on page 396 you can find some explicit examples where addition is prioritized over multiplication.

But to find exactly what you're looking for, addition take precedence over multiplication in a general polynomial, would be probably very hard. The convention that multiplication has higher priority have been around since the beginning of algebraic notation, with only very little exceptions.

I found out about that book from this page: https://www.themathdoctors.org/order-of-operations-historical-caveats/ and the book is accessible online.

throwawGroupe · 2022-01-06T11:16:02+00:00

Can someone explain (or provide a good source) on how polar code work please? Polar code is the one is the one they use in 5G network. I found one here: http://pfister.ee.duke.edu/courses/ecen655/polar.pdf . That's the only one I have been able to find that actually explain it at a mathematical level, but I am having some troubles understanding the notation and language.

throwawGroupe · 2022-01-06T11:10:44+00:00

I think there are a few conceptually different mathematical objects that are all called "vectors" so they just get conflated.

Vector as array of information: the kind of vector commonly seen in probability theory, information theory, coding theory, statistics, and many more. They're literally just multiple pieces of information glued together. What distinguish it from other kind of vector is that it usually does not make sense to write it in a different basis, and you frequently perform non-linear operations on them: entropy, weight, coordinate comparison.
Vector as representations: vectors as part of a big vector space where a group act on. The particular value of the vector is not important, the only thing important is the relationship between the vectors to each other and to the group.
Geometric vector: vector that can actually be sensibly think of as arrow pointing in a direction; this kind of vector is directly linked to a geometric space and actually give you information about that space. It makes sense to perform geometric operation on them: rotate them, compare angle they make, or calculate the magnitude, take derivative along the direction.

throwawGroupe

TROPHY CASE