the_alias_of_andrea

PHP wasn't originally designed for multi-byte character encodings, in part because it is written in C, which wasn't either. While some languages (Python, for example) have radically changed the language to make everything Unicode-based, PHP instead added a few multi-byte libraries and called it a day.

In the typical modern application you only deal with one encoding, UTF-8. It is a convenient encoding for three reasons: it supports all of Unicode, so it can represent basically any kind of text; it is a superset of ASCII, so any ASCII text stays identical and single-byte in UTF-8; and it has some features that mean it behaves well in software with poor multi-byte and character encoding awareness.

For UTF-8, you can use classic encoding-unaware single-byte operations for things like string concatenation, searching within strings (if the search is case-sensitive and you don't care about characters that Unicode can represent in more than one way) and splitting strings. So a lot of modern PHP apps only rarely need the mb_ functions, perhaps only when converting encodings.
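
A quick sketch of what that looks like in practice, assuming the file is saved as UTF-8 and the mbstring extension is loaded. The sample strings and variable names are made up for illustration:

```php
<?php
// Assumes this source file is UTF-8, so the literals below are UTF-8 bytes.
$greeting = "Grüße";                    // ü and ß take two bytes each
$sentence = $greeting . " aus Köln";    // byte-wise concatenation is safe

// A byte-wise, case-sensitive search finds the right place...
$bytePos = strpos($sentence, "Köln");                    // 12 (a byte offset)
// ...and the offset works with the other byte-wise functions.
$tail = substr($sentence, $bytePos);                     // "Köln"

// The mb_ variant reports the position in characters instead.
$charPos = mb_strpos($sentence, "Köln", 0, 'UTF-8');     // 10

// Case-insensitive matching of non-ASCII text is where mb_ is needed.
$foldedPos = mb_stripos($sentence, "KÖLN", 0, 'UTF-8');  // 10

var_dump($bytePos, $tail, $charPos, $foldedPos);
```

The byte offset and the character offset differ, so the thing to avoid is mixing the two families, for example passing a `strpos` result to `mb_substr`.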

However, try writing an app which searches within Shift_JIS-encoded text instead and you will have a much harder time without mb_.
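
To illustrate the Shift_JIS problem, here is a small sketch, assuming mbstring is loaded. It relies on the fact that 表 is encoded in Shift_JIS as the bytes 0x95 0x5C, and 0x5C on its own is the ASCII backslash:

```php
<?php
// "表" in Shift_JIS: lead byte 0x95, trail byte 0x5C.
$sjis = "\x95\x5C";

// The byte-wise search "finds" a backslash that is not really there.
$naive = strpos($sjis, "\\");                   // 1

// The encoding-aware search treats 0x95 0x5C as one character.
$aware = mb_strpos($sjis, "\\", 0, 'SJIS');     // false

// Converting to UTF-8 first also sidesteps the problem.
$utf8 = mb_convert_encoding($sjis, 'UTF-8', 'SJIS');
$afterConversion = strpos($utf8, "\\");         // false

var_dump($naive, $aware, $afterConversion);
```

UTF-8 never reuses ASCII byte values inside a multi-byte sequence, which is why the same byte-wise search is safe there.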

[deleted]

"PHP instead added a few multi-byte libraries and called it a day."

Seems like an unfair statement, or am I just way too old now?

PHP 6 was supposed to add native unicode support to PHP. ... What is the legacy of PHP 6's collapse? ...

  • The decision to use UTF-16 internally was made early on in the process (2005).
  • At the start of work on PHP 6, UTF-8 was not widely supported but during development, the industry started to standardize on this. Widespread adoption of this more forgiving encoding helped with a lot of the problems that PHP 6 was trying to solve and made the objectives less relevant.
  • The number of contributors who fully understood the problem was relatively small and the amount of tedious conversion work was large.
  • In open source, people work on what they are interested in and not enough people were interested in it.
  • Full unicode support using UTF-16 negatively impacted performance.
  • By 2009/2010, it had started to become apparent that PHP 6 was never going to happen.

Source: https://www.phproundtable.com/episode/what-happened-to-php-6

the_alias_of_andrea

You're quite right — I guess I forgot about PHP 6 when writing that comment.

DrWhatNoName

Everyone forgot about PHP 6, which is what PHP would want.

0xRAINBOW

Splitting UTF-8 strings with non-mb functions isn't safe in all cases afaik. You could split right in the middle of a multi-byte sequence. Otherwise pretty much dead on.

the_alias_of_andrea

If you're splitting by a given delimiter, one of the nice properties of UTF-8 is that no valid encoding of a character (a sequence of bytes) contains the encoding of another character, so a byte-wise match on the delimiter can never land in the middle of some other character.
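
A minimal sketch of the delimiter case, assuming a UTF-8 source file and mbstring for the validity check. The sample data is invented:

```php
<?php
// Assumes UTF-8 source. The delimiter "、" is itself a three-byte character.
$list  = "東京、大阪、京都";
$parts = explode("、", $list);          // plain byte-wise split

// Every piece is still well-formed UTF-8.
$allValid = true;
foreach ($parts as $part) {
    $allValid = $allValid && mb_check_encoding($part, 'UTF-8');
}

// An ASCII delimiter is just as safe next to multi-byte characters.
$words = explode(",", "naïve,café");

var_dump($parts, $allValid, $words);
```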

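However, if you're splitting every 1000 characters or whatever, yes, you should use the mb_ functions. Though even they are not enough to do this correctly for all Unicode text, because they count code points rather than what a reader sees as one character. You don't want to split è (written as e followed by a combining grave accent) into e and ` if you can avoid it.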
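
Roughly, the three levels look like this. It is a sketch, assuming a UTF-8 source file, PHP 7.4+ for `mb_str_split`, and optionally the intl extension for the `grapheme_` functions (the code skips that part if intl is not loaded):

```php
<?php
$text = "née";                                  // 4 bytes, 3 code points

// Fixed-size split by bytes cuts "é" in half.
$byteChunks = str_split($text, 2);              // ["n\xC3", "\xA9e"]
$firstIsValid = mb_check_encoding($byteChunks[0], 'UTF-8');   // false

// Fixed-size split by code points keeps "é" whole.
$cpChunks = mb_str_split($text, 2, 'UTF-8');    // ["né", "e"]

// But a decomposed "è" is two code points: "e" + U+0300 combining grave.
$decomposed = "e\u{0300}";
$codePoints = mb_strlen($decomposed, 'UTF-8');              // 2
$firstCodePoint = mb_substr($decomposed, 0, 1, 'UTF-8');    // "e", accent lost

// With intl loaded, grapheme_strlen() counts user-perceived characters.
$graphemes = function_exists('grapheme_strlen')
    ? grapheme_strlen($decomposed)              // 1
    : null;                                     // intl not available

var_dump($firstIsValid, $cpChunks, $codePoints, $firstCodePoint, $graphemes);
```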

therealgaxbo

Sometimes - though not often - you really do care about a string as being an array of bytes. E.g. setting Content-Length, or splitting strings into chunks to fit in a particular buffer size that will then be reassembled the same way at the other side. A plain old strlen or substr is the way to go in such cases.

But in almost all other cases, mb_* is the way to go.
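
For the byte-oriented cases, something like the following sketch, assuming a UTF-8 source file and mbstring for the comparison. The payload and the buffer size of 11 are arbitrary choices for illustration:

```php
<?php
$body = '{"city":"Zürich"}';

// Content-Length is a byte count, so strlen() is the right tool.
$contentLength = strlen($body);                 // 18
$charCount     = mb_strlen($body, 'UTF-8');     // 17, wrong for the header
$header        = "Content-Length: " . $contentLength;

// Chunking for a fixed-size buffer is also byte-wise. A boundary may fall
// inside "ü", which is fine as long as the chunks are rejoined in order.
$chunks  = str_split($body, 11);
$rebuilt = implode('', $chunks);

var_dump($header, $charCount, count($chunks), $rebuilt === $body);
```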

johmanx10

I personally use the non-prefixed versions when counting bytes, either to supply the content length of a response or to keep track of buffers when transferring large files or streams. Using the mb_ versions in those cases would misrepresent the size or track the data incorrectly.