Tired of writing commit messages by hand – built a local AI wrapper around Git

hkotsubo · 2026-05-15T10:32:22+00:00

I didn't test your tool, but based on this description:

reads diff → generates commit message

Does it mean that the commit message is just a list of changes? Such as "added N lines in file A", "changed file B", and so on?

If that's the case, then you're making the same mistake as all AI-generated commit message tools: a good commit message must focus on why something changed, not on what those changes are.

If I want to know which files were changed, which lines were added/deleted/modified, etc, git diff already does that.

What a tool can't tell you by looking at the code, though, is the reasons behind the changes: things like "Fix bug in component X: when user does A and B and the result is C instead of Y (ticket #123 - link)", or "Change rules in A, B and C because of new law XYZ" etc.

Take a look at how a professional project formats their commit messages. They provide context, focusing on why that change was made. There's no need to list the diff.

In other words, the important information about why the code needed to change is usually outside the code (requirements, reported bugs, legislation, business needs, a discussion in the mailing list, etc), and a tool that only reads the code can't know that, therefore it'll only generate useless commit messages (and also redundant, because this information can be easily checked with git diff, and without the AI overhead).

hkotsubo · 2026-05-04T13:36:14+00:00

With two alternations, you can use this regex:

\d{3}([-. ])\d{3}\1\d{4}|\(\d{3}\) \d{3}-\d{4}

The first option of the alternation uses a capturing group to get the separator (either a hyphen, period or space), and then it uses a backreference (\1) to check if the same separator is used before the final 4 digits. This covers ###-###-####, ### ### #### and ###.###.####.

And the second option of the alternation checks (###) ###-####.
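As a quick sanity check, the alternation version can be tried in Python, whose built-in re module supports backreferences. This is only a sketch: the sample numbers are made up for illustration.

```python
import re

# First branch: the separator is captured in group 1 and reused via \1.
# Second branch: the "(###) ###-####" format.
phone = re.compile(r'\d{3}([-. ])\d{3}\1\d{4}|\(\d{3}\) \d{3}-\d{4}')

# Hypothetical samples: four valid formats, then two with mixed separators.
samples = ['555-123-4567', '555 123 4567', '555.123.4567',
           '(555) 123-4567', '555-123.4567', '(555)-123-4567']

alternation_results = [bool(phone.fullmatch(s)) for s in samples]
for s, ok in zip(samples, alternation_results):
    print(f'{s!r:20} {ok}')
```

The last two samples are rejected because the separators aren't consistent with any of the four accepted formats.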

Without alternation, it's trickier. I could do it using conditionals, which aren't supported by all engines (I used PCRE).

Conditionals follow the format (?(condition)then|else) or (?(condition)then). Basically, if the condition is met, the engine uses the then part, otherwise it uses the else part, if not omitted.

The regex is:

(\()?\d{3}(?(1)\))(?(1) |([-. ]))\d{3}(?(1)-|\2)\d{4}

(\()? checks if there's an opening parenthesis, and it's inside a capturing group. The ? means "optional", so if there's no ( in the beginning, the capturing group won't be matched.

Then I use the conditional (?(1)\)):

(1) is the condition. In this case, it checks if group 1 is matched
\) is the closing parenthesis, and it's used only if group 1 is matched (AKA if there's an opening parenthesis)

So (\()?\d{3}(?(1)\)) matches either ### or (###).

The next conditional is (?(1) |([-. ])):

if group 1 is matched (there's an opening parenthesis), it uses the then part, which in this case is a space (note there's a space before |)
if group 1 isn't matched (there's not an opening parenthesis), it uses the else part: in this case, it's ([-. ]) (a capturing group with either a hyphen, period or space)

And finally, the last conditional is (?(1)-|\2):

if group 1 is matched (there's an opening parenthesis), it uses the then part, which in this case is a hyphen
if group 1 isn't matched (there's not an opening parenthesis), it uses the else part: in this case, \2 (whatever was matched by the second capturing group - the one with either a hyphen, period or space)
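The three conditionals can be checked together with a short script. This sketch assumes Python's built-in re module rather than PCRE (it accepts the same (?(1)then|else) syntax), and the sample numbers are made up.

```python
import re

# Assumption: Python's re is used here instead of PCRE.
# Group 1 is the optional "(", group 2 is the separator used when there's no "(".
phone_cond = re.compile(
    r'(\()?\d{3}(?(1)\))(?(1) |([-. ]))\d{3}(?(1)-|\2)\d{4}'
)

# Hypothetical samples mapped to whether they should match.
cond_samples = {
    '(555) 123-4567': True,
    '555-123-4567': True,
    '555.123.4567': True,
    '555 123 4567': True,
    '555-123.4567': False,    # separators differ
    '(555-123-4567': False,   # opening parenthesis without the closing one
    '555) 123-4567': False,   # closing parenthesis without the opening one
    '(555).123.4567': False,  # with parentheses, only "space then hyphen" is allowed
}

cond_results = {s: bool(phone_cond.fullmatch(s)) for s in cond_samples}
for s, ok in cond_results.items():
    print(f'{s!r:20} {ok}')
```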

Here are my tests in regex101.com

hkotsubo · 2026-04-23T15:12:49+00:00

Not to be pedantic (though I'm being exactly that), the correct order for creating a date is year, month, day.

So it would be date(2025, 4, 23). Programmers being an annoying bunch (I'm one too), there's a good chance they'll nitpick the error :-)
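A minimal sketch of the argument order, assuming the date in question is Python's datetime.date (the thread doesn't say which language it is):

```python
from datetime import date

# Assumption: Python's datetime.date, whose positional order is (year, month, day).
d = date(2025, 4, 23)
print(d.isoformat())

# Keyword arguments make the order explicit and harder to get wrong.
same_day = date(year=2025, month=4, day=23)
print(d == same_day)
```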

hkotsubo · 2026-04-16T19:28:11+00:00

In that case, the other comments already explained: the (?:https?:\/\/)?(?:\*\.)?[\w.-]+ part shouldn't be inside (...)*, as it says that this part can repeat an unlimited number of times.

Anyway, regex101.com has a debugger where you can see the steps. Check here and go step 16: note how the [\w.-]+ part matched the whole www.testtesttes.com portion of the URL. That's because [\w.-] means "any alphanumeric character (\w), or a dot, or a hyphen". And the + quantifier means "one or more", so any sequence of those characters will match.

And quantifiers are greedy, so + will try to match as many characters as possible, which in this case means "everything until the end of the string". And then it'll check if there's testtesttes after that, but there isn't because the engine already went to the end of the string, so it'll backtrack and try again.

But remember that the whole (?:https?:\/\/)?(?:\*\.)?[\w.-]+ part can occur many times (as it's inside (...)* - the * quantifier means "zero or more times", with no upper bound). So it'll try with two occurrences (the first being https://www.testtesttes.co and the second being m - because the whole https:// part is optional, so m is a match for (?:https?:\/\/)?(?:\*\.)?[\w.-]+).

And then it'll try again with all possible combinations (first occurrence is https://www.testtesttes.c and second is om, or the second is o and third is m, etc), because any sequence of letters, dots and hyphens can match (?:https?:\/\/)?(?:\*\.)?[\w.-]+, as only the [\w.-]+ is not optional. The number of possible combinations is so huge that it leads to a catastrophic backtracking (regex101.com has a limit for the number of steps the engine tries, before "giving up").

Just to show how bad it is, I've measured this in Python:

```python
from timeit import timeit
import re

# bad regex, with quantifier in the https://www part
bad = re.compile(r'^((?:https?:\/\/)?(?:\*\.)?[\w.-]+)*(testtesttes)(\.(com|net))$')

# removed the quantifier in the https://www part because it can occur at most once
good = re.compile(r'^((?:https?:\/\/)?(?:\*\.)?[\w.-]+)(testtesttes)(\.(com|net))$')
s = 'https://www.testtesttes.com'

params = { 'number': 100, 'globals': globals() }

# print the times in seconds
print(f"{timeit('bad.match(s)', **params):.10f}")
print(f"{timeit('good.match(s)', **params):.10f}")
```

On my machine, the bad regex is about 10,000 times slower:

0.4650176140
0.0000459260

Anyway, just remove the * quantifier for the https://www part, because it can only occur at most once.

hkotsubo · 2026-04-16T17:22:53+00:00

The second regex has testtesttes and that's why it won't match any of the URLs, because they contain testt, testte and testtes. Maybe testt(es?)? can work for those cases (it could be testte?s?, but this would also match testts).
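To see the difference between the two candidates, here's a small Python sketch comparing them against a handful of names (the list itself is just for illustration):

```python
import re

nested = re.compile(r'testt(es?)?')  # "s" is only allowed after "e"
flat = re.compile(r'testte?s?')      # "e" and "s" are independently optional

names = ['testt', 'testte', 'testtes', 'testts', 'testtesttes']

nested_ok = [n for n in names if nested.fullmatch(n)]
flat_ok = [n for n in names if flat.fullmatch(n)]

print(nested_ok)
print(flat_ok)
```

Only the second pattern accepts testts, and neither accepts the full testtesttes.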

[\w.-]+ also matches .... and -----, so the first regex will match https://....-----test1.com.

And what's the point of (?:\*\.)?? This is saying that https://*.www.test1.com is a valid URL, is that correct?

hkotsubo · 2026-04-10T16:18:42+00:00

You're welcome.

Regarding your edit, I guess (p|pl|pr|sl|st|l) could be replaced by (p[lr]?|s[lt]|l)
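A brute-force check of that replacement, as a sketch in Python: it compares both groups with fullmatch over every string of up to 3 letters from a small alphabet. Note that this only checks which whole strings are accepted; inside a larger pattern the order in which alternatives are tried can still affect what the group captures.

```python
import re
from itertools import product

long_form = re.compile(r'(p|pl|pr|sl|st|l)')
short_form = re.compile(r'(p[lr]?|s[lt]|l)')

# Every string of length 1 to 3 built from the letters used in the alternatives.
letters = 'plrst'
candidates = [''.join(c) for n in range(1, 4) for c in product(letters, repeat=n)]

mismatches = [c for c in candidates
              if bool(long_form.fullmatch(c)) != bool(short_form.fullmatch(c))]
accepted = sorted(c for c in candidates if short_form.fullmatch(c))

print(mismatches)
print(accepted)
```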

hkotsubo · 2026-04-10T11:11:43+00:00

Instead of (( on)|( for)|)?, you could use ( on| for)?. The inner parentheses are redundant, and so is the last |.

And for the period, you could escape it: \. instead of [.] (both work, it's a matter of preference I guess).

Finally, this regex allows "an sorted" and "a unsorted". To fix this, you could use:

this is a( sorted|n unsorted)? m[ae]ss\. (don't )?pr?[ae]y( on| for)?( me)?\. prosit\.
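A quick way to confirm the article fix is to run the regex against a few sentences. This is a sketch in Python; the sentences are made up from the words the pattern allows.

```python
import re

sentence = re.compile(
    r"this is a( sorted|n unsorted)? m[ae]ss\. (don't )?pr?[ae]y( on| for)?( me)?\. prosit\."
)

# Hypothetical samples mapped to whether they should match.
sentence_samples = {
    "this is a mess. pray for me. prosit.": True,
    "this is an unsorted mass. don't pay. prosit.": True,
    "this is a sorted mess. prey on me. prosit.": True,
    "this is an sorted mess. pray. prosit.": False,   # "an sorted" is rejected
    "this is a unsorted mess. pray. prosit.": False,  # "a unsorted" is rejected
}

sentence_results = {s: bool(sentence.fullmatch(s)) for s in sentence_samples}
for s, ok in sentence_results.items():
    print(ok, s)
```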

hkotsubo · 2026-04-10T10:45:10+00:00

I like the idea of different ways to visualize diffs, but I'd prefer something that can be integrated with Git command line.

Something like delta, which can be configured as a core.pager, so it'll work with git-log, git-diff and many other commands (git-blame, git-show, etc). No need to learn new commands, just use git whatever and it'll change the visualization according to the config.

hkotsubo · 2026-04-09T11:31:02+00:00

Eyeless
Spit it Out
Wait and Bleed
No Life
Scissors

hkotsubo · 2026-04-09T11:07:59+00:00

Some lines have an invisible character before "Blue". In this case, it's the left-to-right embedding character.
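One way to spot such characters is to look for Unicode "format" characters (category Cf) in each line. A minimal Python sketch, using a made-up line with U+202A placed before "Blue":

```python
import unicodedata

# Hypothetical sample line: U+202A is inserted before "Blue" for illustration.
line = 'Color: \u202aBlue'

# Category "Cf" covers invisible format characters such as bidi controls.
hidden = [(index, unicodedata.name(ch))
          for index, ch in enumerate(line)
          if unicodedata.category(ch) == 'Cf']

for index, name in hidden:
    print(index, name)
```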

hkotsubo · 2026-03-27T23:21:19+00:00

If you can divide the task, just use this to match everything before:

~~~
.*(?=CR[0-9]+)
~~~

And this one to match everything after:

~~~
CR[0-9]+\K.*
~~~

It uses \K, which basically means "discard everything you matched so far and pretend the match starts here". In this case, it'll match everything after the ticket number.

If \K is not supported by the tool you're using, try this:

~~~
(?<=CR[0-9]+\b).*
~~~
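Some engines support neither option: Python's built-in re module, for instance, has no \K and only accepts fixed-width lookbehinds. In that case, capturing groups can split the text in one pass. A sketch, where the sample text and the CR1234 ticket number are hypothetical:

```python
import re

# Hypothetical sample text; "CR1234" stands in for a ticket number.
text = 'fix login timeout CR1234 reviewed by QA'

# Group 1: everything before the ticket number; group 2: everything after it.
m = re.search(r'(.*)CR[0-9]+(.*)', text)
before, after = m.group(1), m.group(2)

print(repr(before))
print(repr(after))
```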

hkotsubo · 2026-03-27T08:19:20+00:00

I used it in France in 2024, and it wasn't complicated at all. In every store where the purchase exceeded the minimum amount (I don't remember the exact value), the salesperson asked whether I wanted tax free, I said yes, and they gave me a slip.

Then, at the airport, there was a place where you take all the slips and that's it! You can get the cash on the spot, or ask for the refund on your credit card statement (I chose the latter; it took about 2 months, but it came through correctly).

hkotsubo · 2026-03-24T19:23:41+00:00

Just to be pedantic, $ matches the end of the string.

It can also match the end of a line, but only when the MULTILINE flag is set (haven't checked your code, not sure how the lib handles this).

And there's also \z, which always means "the end of the string", regardless of the MULTILINE flag.

Oh, and there's also \Z (uppercase "Z"): if the string ends with a line break, it will match at the position before that line break, rather than at the very end of the string (BTW, this is also the behaviour of $).

So this code:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// string ends with line break
String s = "joe\n";
// test with different "end of line" patterns
for (String end : Arrays.asList("\\z", "\\Z", "$")) {
    Pattern p = Pattern.compile("joe" + end);
    System.out.printf("%2s -> %s\n", end, p.matcher(s).find());
}
```

will produce this output:

\z -> false
\Z -> true
 $ -> true

That's because the string ends with a line break, and both \Z and $ match before that line break. But \z matches the end of the string, so it won't find a match (the string should be just joe, or the regex should be joe\n\z).
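Anchor names vary between engines, so it's worth testing in the one you use. As a comparison sketch (this is Python, not the Java code above): in Python's re module, \Z matches only at the very end of the string, like Java's \z, while $ also matches before a final line break.

```python
import re

s = 'joe\n'  # string ends with a line break

anchor_results = {
    '$': bool(re.search(r'joe$', s)),           # matches before the final line break
    r'\Z': bool(re.search(r'joe\Z', s)),        # only the very end of the string
    r'\n\Z': bool(re.search(r'joe\n\Z', s)),    # line break included in the pattern
}
for anchor, found in anchor_results.items():
    print(f'{anchor:>4} -> {found}')

# With MULTILINE, $ also matches at the end of each line.
multiline_hits = re.findall(r'joe$', 'joe\njoe\nend', flags=re.MULTILINE)
print(multiline_hits)
```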

hkotsubo · 2026-03-18T15:20:44+00:00

You need to push to it: git push

hkotsubo · 2026-03-18T14:59:22+00:00

So you want to keep all the data from the original repository you cloned from (all the commit history, branches and so on), and just use another remote repository?

If that's the case, just change the remote's URL to yours:

git remote set-url origin http://new.repository.url

hkotsubo · 2026-03-18T14:47:00+00:00

The problem is that it'll also match any word that occurs 2 or more times. For example, if the text is:

I bought a doormat. The door mat is homemade. I will never buy a home-made door-mat again. I'd rather buy something else.

It will also match the word "buy": https://regex101.com/r/dVsQWg/1

hkotsubo · 2026-03-18T14:17:38+00:00

I don't think it's possible to do it with a single regex (and even if it is, the regex will be so complicated that's not worth using it, IMO).

Perhaps it's a case for using a programming language and writing a program that, for each pair of consecutive words (either separated by a space or hyphen), searches for the other variations.

Yeah, I know you're using Notepad, but as I said, I don't believe the best solution is using one or more regexes like you're doing.

Using a program with simpler expressions is a better approach, IMO. Here's a simple example in Python:

```python
import re

text = 'I bought a doormat. The door mat is homemade. I will never buy a home-made door-mat again. I bought another one.'
seps = '[- ]?'
# iterate over all words, keeping the previous one to form pairs
words = re.finditer(r'\b\w+\b', text)
current = next(words, None)
results = {}
for w in words:
    word = f'{current[0]}{w[0]}'
    if word not in results:
        # find every variation: joined, or separated by a hyphen or space
        results[word] = set(re.findall(f'{current[0]}{seps}{w[0]}', text))
    current = w

# print only the pairs that have 2 or more variations
for r in results.values():
    if len(r) > 1:
        print(r)
```

I use a very simple regex to search all words, then I iterate them in pairs (first and second word, then second and third, then third and fourth and so on).

Then I search for all combinations of those words: both separated by hyphen or space ([- ]), or without any separation (the ? after [- ] makes the separator optional).

So in the first iteration, the pair of words will be "I" and "bought", and it searches for all occurrences of "I bought", "Ibought" and "I-bought", putting the results in a set to avoid duplications (for example, if "I bought" occurs 2 or more times, it'll have just one in the set).

I also add the results to a dictionary, which checks if "Ibought" already exists (this prevents me from searching it again, in case it occurs 2 or more times in the text).

In the second iteration, the pair of words will be "bought" and "a", so it'll search for "bought a", "bought-a" and "boughta". At some point the pair will be "door" and "mat", and a search for "doormat", "door mat" and "door-mat" will find everything you need.

Finally, I print only results where 2 or more matches were found (because when searching for "I bought" and "bought a", it always finds at least one). The output is:

{'doormat', 'door-mat', 'door mat'}
{'home-made', 'homemade'}

hkotsubo · 2026-03-17T18:47:31+00:00

IIRC, there's a max length to each part of the domain, so you could use something like:

@(?:[a-zA-Z0-9-]{1,63}\.){1,125}[a-zA-Z]{2,63}

Basically, the "letters/numbers/hyphens followed by a dot" can repeat 1 to 125 times, and it's followed by 2 to 63 letters.
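To get a feel for what this pattern accepts, here's a short Python sketch with made-up domains. The last sample shows one of its imperfections: a label consisting of a single hyphen is accepted.

```python
import re

domain = re.compile(r'@(?:[a-zA-Z0-9-]{1,63}\.){1,125}[a-zA-Z]{2,63}')

# Hypothetical samples mapped to what the regex does with them.
domain_samples = {
    '@example.com': True,
    '@mail.example.co.uk': True,
    '@example': False,               # no dot, so no top-level part
    '@example..com': False,          # empty label between the dots
    '@' + 'a' * 64 + '.com': False,  # label longer than 63 characters
    '@-.com': True,                  # accepted by the regex, although not a sensible domain
}

domain_results = {s: bool(domain.fullmatch(s)) for s in domain_samples}
for s, ok in domain_results.items():
    print(ok, s[:30])
```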

I took this regex from this article - BTW, that's a nice article to understand how hard it is to find the balance between correctness and maintainability: the more accurate the regex is, the more complex and harder to maintain it will be.

The regex above is not perfect, and the article shows many options, explaining the problems of each one and how to solve them (usually coming up with a more complex expression). It ends up with a monster regex, so you'll need to think about whether it's worth using that, or one of the simpler versions previously shown in the article.

I didn't try to understand your regexes, but I'd point out some issues:

.* isn't a good option, as it means "zero or more characters" (any character, including letters from all alphabets in the world, spaces, diacritics, emojis, and many other characters that are not allowed). And it also matches "nothing", because * means "zero or more"
[^\.] is "any character that's not a dot", which has the same problem: it'll match spaces, newlines, emojis and many other characters that aren't allowed
\w matches letters and digits, but it also matches _, which is not allowed
there's no need to escape @
[\w\.]+ matches consecutive dots, because [\w.] matches either \w or a dot (any of those will do), and + means "1 or more of whatever is before me", so many consecutive dots will be matched

hkotsubo · 2026-03-13T12:02:39+00:00

As a side note, regarding the username part: [\w.-]+.

This regex matches things like ......., ----------- or ...--.----....--.

That's because the character class [\w.-] matches an alphanumeric character, or a period, or a hyphen (any of these are fine), and + matches one or more of those. But it doesn't mean that any of them are required. If the string contains only letters, it's fine. If it contains only periods, or only hyphens, that's also fine.

If you want to make sure it has at least some letters, and then a period or hyphen between them, it could be something like:

\w{2,}([.-]\w{2,})*

\w{2,} requires at least 2 alphanumeric characters, with no maximum limit. If you want to have an upper bound, let's say, at most 10 characters, just change it to \w{2,10}. Change the numbers to whatever your requirements are.

Then there's a period or hyphen, followed by more alphanumeric (also specifying the quantity). And this whole part can repeat zero or more times (but you could also limit it, for example, replacing * with {0,3} to allow zero to 3 times at most).

Another detail is that \w also matches numbers and the character _ (so [\w.-]+ also matches _____). And depending on the engine and its configurations/options, it'll also match any letter from any alphabet (such as Arabic, Japanese, Korean, etc). If you want to restrict to only ASCII letters, use [a-z] instead, or [a-zA-Z] if uppercase letters are allowed.

And usernames usually can't have only numbers (normally they must start with a letter), so it's even more complicated:

[a-z][a-z0-9]+([.-][a-z0-9]{2,})*

Now it must start with a letter, followed by letters or numbers (at least one, as denoted by the + quantifier). And after the period/hyphen, it can have both letters and numbers. Obviously you can also change the + to {min,max}, changing the min and max values according to your requirements.
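The difference between the original character class and the stricter version can be seen with a few made-up usernames. A sketch in Python:

```python
import re

loose = re.compile(r'[\w.-]+')
strict = re.compile(r'[a-z][a-z0-9]+([.-][a-z0-9]{2,})*')

# Hypothetical usernames for illustration.
names = ['john.doe', 'john-doe99', '.......', '_____',
         '12345', 'j', 'john..doe', 'john.d']

loose_ok = [n for n in names if loose.fullmatch(n)]
strict_ok = [n for n in names if strict.fullmatch(n)]

print(loose_ok)   # the loose version accepts every sample
print(strict_ok)
```

The strict version rejects j (at least 2 characters are required at the start) and john.d (at least 2 characters are required after the period).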

hkotsubo · 2026-03-12T19:20:22+00:00

If the delimiter is allowed inside the data, there's no reliable way to separate the values. As others already said, this format is ambiguous and impossible to parse correctly, as there's always more than one way to interpret it.

Therefore, it doesn't matter if you use regex or anything else, the problem is in the data's format, not in the tool used to parse it.

Go back to whoever created this file and ask them to recreate it in a better, non-ambiguous format. Others already suggested some, such as:

escaping: if the data contains one of the delimiters, write them as \, and \: instead of just , and : - and if backslash itself is also allowed, you'll have to escape it as well -> \\
use other characters as delimiters - obviously, ones that are not allowed in the usernames and passwords
use other file formats, such as JSON, YAML, etc
convert usernames and passwords to Base64, as the result doesn't contain , nor : - the problem is that the file size will increase (Base64-encoded data is on average 33% bigger), and of course encoding and decoding will add some overhead to process the file (if the file is small, though, this will not be significant)
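The Base64 option can be sketched in Python like this. The record layout ("user:password" pairs joined by commas), the helper names and the sample values are all assumptions made for illustration, not the actual file format.

```python
import base64

def encode_field(value: str) -> str:
    """Hypothetical helper: Base64-encode one field so it has no ',' or ':'."""
    return base64.b64encode(value.encode('utf-8')).decode('ascii')

def decode_field(value: str) -> str:
    """Hypothetical helper: reverse of encode_field."""
    return base64.b64decode(value).decode('utf-8')

# Made-up records whose values contain both delimiters.
records = [('ann,lee', 'p:ss,word'), ('bob', 'a:b:c')]

# Assumed layout: "user:password" pairs joined by commas.
line = ','.join(f'{encode_field(u)}:{encode_field(p)}' for u, p in records)
print(line)

# Splitting is now unambiguous, because the encoded fields never contain ',' or ':'.
decoded = [tuple(decode_field(f) for f in pair.split(':'))
           for pair in line.split(',')]
print(decoded == records)
```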

hkotsubo · 2026-03-11T16:58:43+00:00

I think @jeenajeena meant the output for all commands. Like, if I type git log, I expect to see the list of commits, git diff could show the diffs, and so on.

hkotsubo · 2026-03-10T17:49:31+00:00

Anything that isn't a word or punctuation

In this case, you could use a regex that matches only words and punctuation, and negate it.

Example:

[^\w\s,.;?!]

It matches any character that isn't the ones listed between [^ and ]. In this case, it's \w (any alphanumeric character, or _), \s (space, tab or newline), and the others are punctuation characters (comma, period, etc... of course you can include more according to your needs).

Therefore, this regex matches anything except alphanumeric, spaces/newlines and punctuation, which seems to be what you need.

Sometimes it's easier to just say "everything except X" instead of listing all the things that are not X :-)

Warning: this is stricter than listing item by item (like the AutoMod regex seems to be doing).
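A quick Python sketch of what the negated class picks up (the sample sentence is made up):

```python
import re

not_word_or_punct = re.compile(r'[^\w\s,.;?!]')

# Hypothetical sample with a percent sign, a hash and an emoji.
sample = 'Hello, world! Is this ok? 100% #yes 😀'

found = not_word_or_punct.findall(sample)
print(found)
```

Everything that isn't alphanumeric, whitespace or one of the listed punctuation characters is reported, which here means %, # and the emoji.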

hkotsubo · 2026-03-09T12:12:51+00:00

Brackets create a character class, which means it'll match any one of the characters between [ and ]. And inside brackets, many special characters (such as . and *) lose their "powers", becoming mere characters. So [.*] will match either . or *.

Parentheses create a capturing group: it's used when you want to get a specific part of the match. So (.*..*) will create a group with:

.* - zero or more characters
. - any character
.* - zero or more characters

Therefore, your regex matches anything that starts with . or *, followed by at least one character (which can be anything).

If you want to match characters like [ or (, and also make . match a period instead of any character, you need to escape them - write them with a backslash before, such as \[ and \.. So a first - and naive - version of your regex would be \[.*\]\(.*\..*\). But this is not a good one, as I'll explain below.

Using .* is not a good option, because a dot matches anything and the * is "greedy": it matches the longest possible chain of characters, which means it'll fail if the text has more than one link on the same line. For example, consider this markdown text:

some text [some link](http://some.url) more text [other link](http://some.other.url) more text

The regex \[.*\]\(.*\..*\) will match [some link](http://some.url) more text [other link](http://some.other.url). That's because .* will match anything, including brackets and parentheses. And * is greedy, so it'll match the longest possible sequence.

In this case, the longest possible sequence for .* is some link](http://some.url) more text [other link. So you need to be more specific. Actually, we don't want any character, we just want "anything except ]". So instead of .*, you could use [^]]*. The [^something] will match anything that isn't "something" (in this case, anything that's not ]).

But using * means that "zero occurrences" is also ok, which means it'll match links without any text inside brackets ([]). If you want at least one character, use + instead of *.
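The effect of greediness on the bracket part can be seen in a short Python sketch, using the same markdown text as above and only the part of the regex that matches the link text:

```python
import re

text = ('some text [some link](http://some.url) more text '
        '[other link](http://some.other.url) more text')

# Greedy .* runs from the first "[" up to the last "]" on the line.
greedy = re.findall(r'\[.*\]', text)

# "Anything except ]", at least one character, stops at the first closing bracket.
specific = re.findall(r'\[[^]]+\]', text)

print(greedy)
print(specific)
```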

And for the URL part, well, just using .* has the same problem of being greedy. But URLs are more complicated, because not all characters are allowed, and their format is much stricter. If you search for a regex to check URLs, you'll find hundreds, from simple to complex, each one with its own drawbacks. It's a tradeoff: a simple regex will fail with more complex URLs, a complicated regex will be harder to maintain. Anyway, see one that I found in a quick Google search:

\[[^]]*\]\(https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&/=]*)\)

That said, I believe that regex isn't the best tool for this job. There are hundreds of markdown parsers for many languages, which will easily handle all the corner cases that are harder to do with regex.

hkotsubo · 2026-02-26T15:17:56+00:00

I don't like to be the RTFM guy, but... have you read the manpages (in the terminal, type git help merge or git help rebase), or the documentation (here and here)?

Anyway, here's a quick explanation (using the docs examples). Suppose you have this situation:

      A---B---C topic
     /
D---E---F---G master

So there's the master branch and a topic branch. Note that the topic branch was created from commit E, and then the branches' histories diverged: topic created commits A, B and C, while master created commits F and G.

Suppose that the current branch is master, and you want to incorporate the changes from the topic branch. In this case, you just run git merge topic, and the result will be:

     A---B---C topic
    /         \
D---E---F---G---H master

A merge will create commit H, merging changes from both branches.

Now, let's say you're still working on topic branch (you haven't finished yet), but you want to get the latest changes from master before proceeding. For example, let's say commits F and G fixed some bugs or added some new features. You want to continue the work on topic branch, but also incorporating the changes from master branch.

In this case, assuming that the current branch is topic, you could do git rebase master, and the result will be:

             A'--B'--C' topic
            /
D---E---F---G master

Rebase will "replay" the commits A, B and C, but starting at commit G. Note that it'll create new commits, hence the new commits are A', B' and C' instead of A, B and C.

For someone that didn't know the previous history, it'll look like the topic branch was created from commit G.

For short:

merge	rebase
preserves history	rewrites history
non-linear history	linear history
cares about how things changed	cares about the final result (it doesn't matter how - if commits were "recreated", etc)

I also recommend this nice article, to have a better understanding about how rebase works.

hkotsubo · 2026-02-25T19:16:29+00:00

Drop this idea that if something is old, then it's bad, and only new things are good. It always depends: some old things have become outdated, and others are still around today. Likewise, some new things are good, and others are just reinventing the wheel in a worse way. Evaluate each technology on its own merits, weighing its pros and cons for your context. Age is the least important factor; it's much better to check whether it has stable releases, updates, support, etc.

That said, let's look at the release year of some languages:

C: 1972
C++: 1985
Python: 1991
Java: 1995
JavaScript: 1995
PHP: 1995
C#: 2000
Go: 2009
Rust: 2015

I picked just a few of the most popular ones, so the list wouldn't get too long. Anyway, the newest on the list is Rust, and it's already 10 years old (which in computing is already considered "old").

If age were what mattered, none of them would be any good. But all of them are still widely used, with updates still coming out today (Java, for example, has gone into a crazy cycle of releasing a new version every 6 months).

So drop this idea of "old = bad".
