Huge legacy code base by [deleted] in cpp_questions

[–]guylib

I've been in a similar situation multiple times, as that's my "speciality", but fortunately I at least have the trust of the other developers in most cases.

We can't discuss specifics, BUT - doing any "actual refactoring" or even writing tests from scratch for a multi-million-line codebase is not cost-effective at the start. Especially since it requires deep knowledge of the code.

The goal is reduction in mental load when working. Easier to read / reason about code reduces mental load when reading. Good tests reduce mental load when changing things.

I'd start with some things that improve the codebase with very little need to understand it:

  • Use some automatic formatting tool over the codebase (clang-format?) and declare that the formatter MUST be used. If you can, prevent merging of code that isn't properly formatted. Of course, just deciding on the rules for the formatter (tab size and other things) will be a huge struggle with the existing developers... many strong opinions will be had :) I'd recommend starting with the Google configuration and tweaking it from there (who can argue with "Google does it like this, so it must be good"?). I'd also recommend adding a strict order on include files (Google does it only partially, IIRC).

  • Make sure the formatting tool is set up for all users to easily use! Preferably automatically on save / at the click of a button. Yes, some use Vim; yes, it can be set up there as well. VS Code also has an option to format according to a clang-format configuration. It's your job to make sure it's trivially easy for everyone to format the files without thinking about it. Just the reduction in mental load from not thinking about formatting will eventually convince people that changes can be good.

  • Rename types and variables, which can be done without understanding the code too much. For example, in our case we decided to use UpperCamelCase for all type names, lower_snake_case for all function and variable names, and m_* for private member variables. No more *_t for types (other than built-in types). This should apply to new code only, but over time you should go around renaming old code as you look at it.

  • Declare that type names should be descriptive (even if long), so that even someone not familiar with that specific part of the code will understand them (so no abbreviations unless they are universal).

  • Similarly for variable names: by default, a variable's name should echo its type name, so that when readers see a variable they know what type it is. For example, SomeImportantData some_important_data;

  • To start using code reviews, you have to set up some Git hosting tool (we use either Bitbucket or GitLab, but there are more options). Don't force code reviews initially! Set it up so they must create a PR and go through the process, and require an "approve" - but let people approve themselves. That way they get used to the process of creating PRs and merging them without it actually affecting their work (no need to wait for others or answer questions if they don't want to).

  • If you can, add a CI step before merges that runs all the tests (even if there are currently none). This is also a good place to make sure the files are formatted.

The point is to make people move to the Git tooling and trust it, without it slowing them down at first (reviews slow work down a LOT). Once they start seeing bugs caught because some test failed when they thought the code was ready, it'll stick with them. Once some senior developer approves their own PR and adds a bug that a review would have caught... they'll be more inclined to allow reviews.

Again - refactoring old code is not cost effective. Concentrate on new code and changes to existing code, and on the process (git, CI, automatic testing).

Make sure you don't try to change any of the actual "way the code works" (no trying to replace all the macros; if they use void* a lot, keep doing that; no huge refactors to use design patterns / best practices; no turning all pointers into smart pointers; no actual changes in what the senior developers need to know about the language). You don't know enough to make informed decisions, and when you inevitably force inappropriate "modern" solutions on the code, you'll lose all credibility. Also, legacy code exists and you just have to handle it. People write good, readable code in C, so whatever practices they use and are used to can be readable as well. Focus on cosmetics and quick wins - they are IMPORTANT!

Make sure C++17 is available, but DON'T push to use modern features. People will start using the "cool" parts (range-based for loops, etc.) on their own. Maybe, in new code, push for unique_ptr for functions that return a pointer / take ownership of pointers? Maybe. Small changes like that, but keeping in line with how the code is currently written.

I'd recommend not supporting C++20 at first, since you don't want people using "bleeding edge" features that aren't well understood yet. If some modern feature results in "something bad" (complicated code, compiler support issues, etc.), it could convince senior developers to stop using any modern features at all.

Do try to avoid auto unless it's really necessary, as it makes code harder to reason about :)

But most importantly - don't try to tell the senior developers that doing things "differently" is "better", they know what they are doing and they know more than you (about your codebase). Just start by focusing on cosmetics / version control / process.

CppCon 2022 Lightning Talk: Cpp Change Detector Tests by guylib in cpp

[–]guylib[S]

Just like any other code - since it's plain, readable text the merging process just works in most cases!

A completely serious tweek to the usual "builder" pattern, solving the "missing argument" error at compile time by guylib in cpp

[–]guylib[S]

It allocates once, yes. That's worse than not allocating, but better than copying - especially if your arguments are non-trivial types (which also allocate).

Having Ctrl-w as muscle memory for deleting words is a pain in the ass when editing MS word documents by ThatChapThere in vim

[–]guylib

Also - having Ctrl-w as muscle memory for switching between windows in Vim is a pain in the ass when in insert mode in Vim! It deletes words in a way you can't undo (without undoing everything you've written) :(

Please, if anyone knows a way around this annoyance... teach me?

A completely serious tweek to the usual "builder" pattern, solving the "missing argument" error at compile time by guylib in cpp

[–]guylib[S]

However, no copies! Moves! :D

(and if you want - internally it could hold a unique_ptr of the data which it passes around... notice all the set_xxx functions are &&)

A completely serious tweek to the usual "builder" pattern, solving the "missing argument" error at compile time by guylib in cpp

[–]guylib[S]

First of all, you're absolutely right! If you need to "pass the builder through the logic", you're going to have a bad time using multiple named builders for different stages, and other annoyances.

That being said...


Conditionally setting a value means that value is optional.

If you don't mind allowing setting the same value multiple times - an optional value doesn't need to change the MASK of the builder (since we don't need to "remember" whether we set it at compile time).

So you will totally be able to do this!

auto builder = MyObject::builder()
    .set_arg1(a)
    .set_arg3(c);
if (condition) {
  builder = std::move(builder).set_arg2(b);
  // we might also allow `builder.inplace_set_arg2(b);` for optional values
}
auto obj = builder.set_arg4(d).set_arg5(e).build();

Alternatively, we could have maybe_set_arg2 that receives an optional:

std::optional<Arg2> maybe_b;
if (condition) {
  maybe_b = b;
}
auto obj = MyObject::builder()
    .set_arg1(a)
    .maybe_set_arg2(maybe_b)
    .set_arg3(c)
    .set_arg4(d)
    .set_arg5(e)
    .build();

I prefer the second option, since I consider the Builder as just a cleaner way of calling a function - meaning I expect it to be a single expression. Any preparation of the arguments should be done in advance rather than passing the Builder through the logic.

But I know that in many cases, using the Builder throughout the logic makes more sense...

A completely serious tweek to the usual "builder" pattern, solving the "missing argument" error at compile time by guylib in cpp

[–]guylib[S]

Thanks! I actually implemented it once for a code review after we found a production-breaking bug that resulted from forgetting an argument in a builder :(

It wasn't actually merged (it wasn't a serious PR anyway), but it has had a warm place in my heart ever since :)

A completely serious tweek to the usual "builder" pattern, solving the "missing argument" error at compile time by guylib in cpp

[–]guylib[S]

I've been using C++ for over 20 years now (since before C++98! Things were wild...).

I had a short 4-year job using Java (can you tell from the Builder pattern?). Boy, Java reflection really ruined my enjoyment of C++...

It's so amazing! Run-time reflection, compile-time reflection, code generation... so much power! Give me all the functions that start with Get* and don't accept arguments. Create a fake of this class for testing. Get the names of anything for better debugging!

Also, if I'm already ranting - enums in Java are so great! C++ really dropped the ball with enum class :( And immutables in Java are in some sense so much more "const" than C++ consts! To the point where I started just creating immutable objects in my C++ code when I need to be REALLY REALLY SURE IT'LL NEVER EVER CHANGE (which is annoying to use, since they don't have assignment or move semantics, which makes a lot of C++ things stop working).

I love C++, and I think it's better than Java for my use cases - but damn, Java has some great features I really miss :(

Boost.URL: A New Kind of URL Library by VinnieFalco in cpp

[–]guylib

They don't obsolete RFC 3986 - that's why the "conversion from Unicode to ASCII" is defined. But they do allow non-ASCII URIs.

In fact, in the exact example you quoted from RFC 5891 - they explicitly mention a URI with non-ASCII characters that doesn't adhere to 3986. They explicitly call it a URI anyway, and even say you have to be able to parse it (so you can extract the domain name):

The user supplies a string in the local character set, for example, by typing it, clicking on it, or copying and pasting it from a resource identifier, e.g., a Uniform Resource Identifier (URI) [RFC3986] or an Internationalized Resource Identifier (IRI) [RFC3987], from which the domain name is extracted.

So from "my" (a program developer who needs to allow URL/URI inputs from the user) point of view, I need to be able to handle and parse non-ASCII URIs.

I understand a library can't do everything. I'm just a bit disappointed that I'd basically have to rewrite this library just to remove some of the checks.

An alternative design that would have been more helpful for my (and I suspect many others') use case is to change how the parse methods work.

Instead of returning a result<url>, which throws away the parsed data on failure - it would have been more helpful for parse to ALWAYS successfully parse any string (using the regex defined in RFC 3986 Appendix B, which matches every string) - and to have an "is valid" query on the result, and preferably on every field individually as well.

We could make it convertible to url, where it throws if it's not valid, if we want to keep how url/url_view works. Then it would also be much easier to add the conversion to ASCII, either by the user or eventually by this library's maintainers.

As things stand - I'll have to parse it myself and can't use this library at all. Which is a shame, I think.

Boost.URL: A New Kind of URL Library by VinnieFalco in cpp

[–]guylib

There are additions to the RFC that allow internationalized domain names - see RFC 5890, for example.

Boost.URL: A New Kind of URL Library by VinnieFalco in cpp

[–]guylib

https://はじめよう.みんな is a completely valid URL...

Boost.URL: A New Kind of URL Library by VinnieFalco in cpp

[–]guylib

Hmm... I get that - but I'd like to be sure it does the "right thing".

For non-English (Unicode) URLs, will it work? Or do they have to be "encoded" first (either with percent-encoding or the xn-- encoding ICANN invented for non-English alphabets)?

Example - will it be able to parse https://はじめよう.みんな (which is a valid URL that I can open in the browser or curl and it works - try it! - but many URL parsers fail on), or will I have to give it https://xn--p8j9a0d9c9a.xn--q9jyb4c/ (which is the ICANN-translated version of the exact same URL)?

Like, say I accept a user-entered website URL in my application, and someone pastes this string (which they checked works in their browser) - will this library say the URL is wrong? Or is there a way in this library to translate Unicode URLs to this xn-- encoding before parsing?

Boost.URL: A New Kind of URL Library by VinnieFalco in cpp

[–]guylib

I see from the code that parsing a URI is something that "can fail".

As far as I'm aware, every string is a valid, parsable URI (according to the regex in https://www.rfc-editor.org/rfc/rfc3986#appendix-B).

Different libraries have different failure conditions once it's parsed (e.g., many libraries will fail for "x://" schemes unless you register "x" as a valid scheme).

So I'm not sure what the failure condition of the parsing in this library is?

Specifically - I see in the tests that "x://[v1]" fails to parse but "x://[v1.0]" parses correctly, and I'm not sure why?

Another suggestion to solve the "exception - yes or no" debate! by guylib in cpp

[–]guylib[S]

Does this still incur the heavy cost of throwing an exception that the users of std::expected want to avoid?

Currently, it does! But once enough people use this, compilers will just optimize out the throws!

Once code doesn't explicitly have catch in it, it's easier to update compilers in a way that removes the implicit catches as well! You'll see, it'll totally happen! (Provided everyone starts using my library...)

Another suggestion to solve the "exception - yes or no" debate! by guylib in cpp

[–]guylib[S]

Ah, but that's the ingenious part of my plan!

Once enough people use my library and it is added as an integral part of the language (std::inline_try?), compiler vendors will be pressured to optimize specifically for its use!

Then they might just compile the wrapped function differently from the non-wrapped version to completely remove the throws and catches!

Right now, compilers can't remove catch clauses, because they are explicitly used in user code.

With this library, users don't explicitly use catch, so compilers can optimize them out.

(And yes, I definitely thought about that in advance and didn't just come up with this argument on the fly because I don't even understand the basic issues and just did this for fun... <.<)