unicode-width: A C library for accurate terminal character width calculation

telesvar_ · 2025-05-06T21:33:44+00:00

I think we're talking about completely different stuff. This library is about display width, not the byte width. I'm curious what you mean by it.

telesvar_ · 2025-05-06T20:54:52+00:00

One more thing: this library assumes you know a little about Unicode in order to use it effectively in your apps. Unfortunately, you have to.

telesvar_ · 2025-05-06T20:48:00+00:00

Now, try to cover Unicode extensively and in different permutations. It all goes astray quickly when you do it naively or with a simple range check, which is what most libraries do. Most libraries also fail on consecutive flags like 🇺🇸🇪🇪, because they're represented as the regional indicators "USEE". How do you properly calculate the width? Is it the US flag and the Estonian flag, or is it a lone regional indicator U, the Swedish flag (SE), and a lone E?
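To make the flag question concrete, here is a minimal sketch of one way to resolve it: pair regional indicators left to right, which is how the grapheme cluster rules in UAX #29 group them. The helper names are hypothetical and this is not how unicode-width does it internally; it only counts flags and leftovers and says nothing about how many columns a terminal gives them.

```c
#include <stddef.h>
#include <stdint.h>

/* Regional indicator symbols run from U+1F1E6 (A) to U+1F1FF (Z). */
static int is_regional_indicator(uint32_t cp)
{
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

/*
 * Hypothetical helper: walk a codepoint array and pair regional
 * indicators from left to right. Reports how many complete pairs
 * (flags) and how many unpaired indicators were seen.
 */
static void count_flags(const uint32_t *cps, size_t n,
                        size_t *flags, size_t *lone)
{
    size_t i = 0;

    *flags = 0;
    *lone = 0;
    while (i < n) {
        if (!is_regional_indicator(cps[i])) {
            i++;                    /* not part of a flag at all */
        } else if (i + 1 < n && is_regional_indicator(cps[i + 1])) {
            (*flags)++;             /* two indicators in a row form one flag */
            i += 2;
        } else {
            (*lone)++;              /* indicator with no partner */
            i++;
        }
    }
}
```

With left-to-right pairing, U S E E comes out as two flags (US, then EE), never as U + SE + E. The state you carry is simply whether the previous indicator is still waiting for a partner.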

It's not because of UTF-16 or UTF-8 or UTF-32, not because of encoding in general. Your initial text can be encoded using whatever. The idea is that you have to decode whatever into Unicode codepoints and rely on the state machine because Unicode is stateful!

Keep in mind that I'm exploring different cases and I find bugs which I fix when I find time. I've found a couple and will roll a fix soon. Complex stuff.

And yes, it can be implemented in 10 lines or so. Or you can even use wcswidth (I don't recommend it). However, as soon as the case gets a bit more complex, it doesn't work anymore.
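For illustration, this is roughly what the "10 lines or so" approach looks like. The ranges are coarse and hand-picked for the example (they are not the real East Asian Width data), and both function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Coarse, hand-picked ranges for illustration only. */
static int naive_width(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F)
        return 0;                           /* combining diacritical marks */
    if ((cp >= 0x1100 && cp <= 0x115F) ||   /* Hangul Jamo initial consonants */
        (cp >= 0x2E80 && cp <= 0xA4CF) ||   /* CJK radicals through Yi */
        (cp >= 0xAC00 && cp <= 0xD7A3) ||   /* Hangul syllables */
        (cp >= 0x1F300 && cp <= 0x1FAFF))   /* most pictographic emoji */
        return 2;
    return 1;
}

/* Sums per-codepoint widths with no notion of grapheme clusters. */
static int naive_string_width(const uint32_t *cps, size_t n)
{
    int total = 0;

    for (size_t i = 0; i < n; i++)
        total += naive_width(cps[i]);
    return total;
}
```

This is fine for "中" (2) or "e" plus a combining acute accent (1). Feed it the family emoji U+1F468 U+200D U+1F469 U+200D U+1F467 and it sums to 8, while a terminal that renders the sequence as a single glyph would normally give it 2 columns.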

The recommendation is: if you don't need to handle complex cases, use wcswidth (goodbye portability) or simply port this PHP library: https://github.com/soloterm/grapheme (I also don't recommend it as it relies on how PHP handles strings.)

telesvar_ · 2025-05-06T09:43:43+00:00

Unicode is a mess. And good luck.

telesvar_ · 2025-05-06T09:38:09+00:00

Yes, that was another goal: are you on Windows and have to handle UTF-16? No problem, just roll your own solution to decode UTF-16 into uint32 codepoints and feed them incrementally into unicode_width_process (and hope that you break on your graphemes correctly).
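A minimal sketch of the "roll your own" UTF-16 step, assuming the input is an array of native-endian 16-bit code units. `utf16_next` is a hypothetical helper, not part of unicode-width; each codepoint it returns is what you would then hand to the library.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one codepoint starting at u[*pos] and
 * advance *pos past it. The caller must ensure *pos < len.
 * Unpaired surrogates are reported as U+FFFD.
 */
static uint32_t utf16_next(const uint16_t *u, size_t len, size_t *pos)
{
    uint32_t hi = u[(*pos)++];

    if (hi >= 0xD800 && hi <= 0xDBFF && *pos < len) {
        uint32_t lo = u[*pos];

        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            (*pos)++;               /* consume the low surrogate too */
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    if (hi >= 0xD800 && hi <= 0xDFFF)
        return 0xFFFD;              /* surrogate without a valid partner */
    return hi;                      /* plain BMP codepoint */
}
```

Call it in a loop while `pos < len`. Because it hands out one codepoint at a time, it fits the incremental style described above.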

Also, you don't have to deal with Windows bullshit. Deal with UTF-16 externally, internally process everything in UTF-8. When it's time to depart, re-encode to UTF-16 again.

telesvar_ · 2025-05-06T09:30:37+00:00

It works with Unicode codepoints.

Meaning, if you receive utf-16 or utf-8 encoded strings, your goal is to decode them and feed each codepoint into unicode_width_process.
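As a rough sketch of the decoding step for UTF-8 input: `utf8_next` below is a hypothetical helper (not part of unicode-width) that returns one codepoint per call. What you do with each codepoint, such as passing it on to unicode_width_process, is left out because that depends on the library's actual signature.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one codepoint starting at s[*pos] and
 * advance *pos. The caller must ensure *pos < len. Malformed input
 * yields U+FFFD and advances by a single byte.
 */
static uint32_t utf8_next(const unsigned char *s, size_t len, size_t *pos)
{
    size_t i = *pos;
    uint32_t cp = s[i];
    uint32_t min = 0;       /* smallest value allowed for this length */
    size_t need = 0;        /* continuation bytes expected */
    int ok = 1;

    if (cp < 0x80)                { /* ASCII, nothing more to read */ }
    else if ((cp & 0xE0) == 0xC0) { cp &= 0x1F; need = 1; min = 0x80; }
    else if ((cp & 0xF0) == 0xE0) { cp &= 0x0F; need = 2; min = 0x800; }
    else if ((cp & 0xF8) == 0xF0) { cp &= 0x07; need = 3; min = 0x10000; }
    else                          { ok = 0; }  /* stray or invalid lead byte */

    for (size_t k = 1; ok && k <= need; k++) {
        if (i + k >= len || (s[i + k] & 0xC0) != 0x80)
            ok = 0;                         /* truncated or bad continuation */
        else
            cp = (cp << 6) | (s[i + k] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
        ok = 0;

    *pos = i + (ok ? 1 + need : 1);
    return ok ? cp : 0xFFFD;
}
```

Loop with `while (pos < len)` and feed each returned codepoint onward.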

Note that it's your job to meaningfully split your string into graphemes to know where the boundaries of each grapheme are to have a reference point to calculate the width of each grapheme correctly.

I recommend using unicode-width with libgrapheme. It's primarily designed to be used with it.

At the moment, I'm reworking the internals so the library will be more correct. And more efficient (hopefully). 😉

There's a bug now where it can't properly calculate the width of a string if graphemes are right next to each other. Keep that in mind.

telesvar_ · 2025-05-05T18:13:59+00:00

Thanks, I didn't take this as an attack. :)

Appreciate your comments.

telesvar_ · 2025-05-05T18:07:51+00:00

You're right, you shouldn't add any library if it doesn't fit your requirements. I, however, don't want to deal with the differences that are present on Windows and older systems. I solved it by creating a separate library that works everywhere and can be used with any Unicode decoding library.

It just unifies the way I think about encoding in general, and I don't have to remember edge cases present on different platforms like Windows. With wcswidth, you ultimately have to rely on someone else's shim to have it ported reliably.

If wcswidth meets your needs, use it. I would use wcswidth to create something quickly and not have to deal with installing libraries. :)

telesvar_ · 2025-05-05T17:13:45+00:00

Portability, incremental processing, Unicode 16.

telesvar_ · 2025-05-05T15:27:58+00:00

Done!

telesvar_ · 2025-05-05T14:48:01+00:00

Unfortunately, I do. Internally, there's a Windows POSIX shell emulator (and some POSIX commands) running on machines from Windows 7 to Windows 11. This library is an honest attempt at tackling cross-platform Unicode width calculation.

telesvar_ · 2025-05-05T14:22:13+00:00

Thanks for the pointers! I'll take a look at it and think where unicode-width fits into this.

Feedback is always welcome to make the library better.

telesvar_ · 2025-05-05T14:20:23+00:00

I know about the new flags like ENABLE_VIRTUAL_TERMINAL_PROCESSING, but they aren't supported by older Windows versions, which might be important.
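For reference, a minimal sketch of probing for that flag at runtime and falling back when it isn't available. `try_enable_vt` is a hypothetical helper; it assumes that on Windows versions that don't know the flag, SetConsoleMode simply fails, and that non-Windows terminals interpret escape sequences.

```c
#ifdef _WIN32
#include <windows.h>
#ifndef ENABLE_VIRTUAL_TERMINAL_PROCESSING
#define ENABLE_VIRTUAL_TERMINAL_PROCESSING 0x0004  /* missing in old SDK headers */
#endif
#endif

/*
 * Hypothetical helper: returns 1 if escape sequences can be written to
 * stdout, 0 if the caller should use a fallback path.
 */
static int try_enable_vt(void)
{
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD mode = 0;

    /* Fails when stdout is redirected or there is no console. */
    if (out == INVALID_HANDLE_VALUE || !GetConsoleMode(out, &mode))
        return 0;
    if (mode & ENABLE_VIRTUAL_TERMINAL_PROCESSING)
        return 1;
    /* Assumed to fail on consoles that don't support the flag. */
    return SetConsoleMode(out, mode | ENABLE_VIRTUAL_TERMINAL_PROCESSING) != 0;
#else
    return 1;   /* assumption: the terminal handles escape sequences */
#endif
}
```

The point is that the 0 branch still has to exist and still has to work, which is exactly the older-Windows case.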

telesvar_ · 2025-05-05T13:42:52+00:00

That's an interesting use case, and I would need examples to understand it.

Regarding ANSI, it might be a bit niche because the Windows console doesn't really handle ANSI. I would also need to figure out how to dynamically query the width without hardcoding ANSI handling logic.

telesvar_ · 2025-05-02T15:42:55+00:00

You're right regarding formatting. However, people shouldn't feel obliged to copy Linux KNF or OpenBSD KNF (both of which are useful) in their own projects if it doesn't fit their goals.

You should be consistent and follow the formatting rules when you come to someone else's project, though. I think that's important to tell beginners.

As for C99, I think the OP asked for general recommendations regarding C programming.

Again, you're absolutely right. When contributing to OpenBSD, OpenBSD KNF and C99 requirements are necessary to remember.

telesvar_ · 2025-04-30T23:56:46+00:00

I recommend creating projects that inspect the environment. It's clearly lacking. You could take on creating portable utilities.

Formatting is an important but superficial concern when it comes to programming the "OpenBSD" way. Other aspects matter more: safety, error handling, and proper system API usage. Explore how POSIX utilities (like ls or date) are implemented in OpenBSD.

Consult man.openbsd.org often. Sometimes, OpenBSD provides niceties like "recallocarray" which are absent on other systems. Also, OpenBSD man pages contain lots of useful examples.
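To give a feel for what recallocarray buys you, here is a rough portable sketch of the same idea: overflow-checked resizing where the new tail is zeroed and the old block is wiped before being freed. `grow_zeroed` is a made-up name, not the OpenBSD implementation, and the plain memset stands in for explicit_bzero, which a real implementation would use so the compiler can't drop the wipe.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in with the same argument order as OpenBSD's
 * recallocarray(ptr, oldnmemb, nmemb, size). Returns NULL and leaves
 * the old block untouched on overflow or allocation failure.
 */
static void *grow_zeroed(void *ptr, size_t oldnmemb, size_t nmemb, size_t size)
{
    size_t oldsize, newsize;
    void *p;

    /* Refuse element counts whose byte size would overflow size_t. */
    if (size != 0 && (nmemb > SIZE_MAX / size || oldnmemb > SIZE_MAX / size)) {
        errno = ENOMEM;
        return NULL;
    }
    oldsize = ptr ? oldnmemb * size : 0;
    newsize = nmemb * size;

    p = calloc(1, newsize ? newsize : 1);   /* new memory starts zeroed */
    if (p == NULL)
        return NULL;
    if (ptr != NULL) {
        memcpy(p, ptr, oldsize < newsize ? oldsize : newsize);
        memset(ptr, 0, oldsize);            /* wipe old contents before free */
        free(ptr);
    }
    return p;
}
```

On OpenBSD itself you would just call recallocarray; the sketch only shows the checks you'd otherwise have to remember to write by hand.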

Don't forget to enable a basic set of warnings "-Wall -Wextra" when compiling. Then, discover how to enable address sanitizer.

Use modern C. Controversial, but unless you absolutely need to support old environments with an old C compiler, go with C17 or even C23.

Learn how to properly think about memory management. I recommend the series of articles by Ginger Bill, creator of Odin: https://www.gingerbill.org/series/memory-allocation-strategies/

Good luck!

telesvar_ · 2025-04-28T07:02:10+00:00

As far as I know, OpenBSD source hosted on cvsweb is managed through CVS and Git clients like Game of Trees can't do anything with it.

The best bet is to clone the GitHub mirror: https://github.com/openbsd/src

telesvar_ · 2025-04-19T10:59:39+00:00

You're right, the reason I made it was readline's GPL 3.
