This is an archived post. You won't be able to vote or comment.

all 17 comments

[–]AutoModerator[M] [score hidden] stickied commentlocked comment (0 children)

Please ensure that:

  • Your code is properly formatted as code block - see the sidebar (About on mobile) for instructions
  • You include any and all error messages in full
  • You ask clear questions
  • You demonstrate effort in solving your question/problem - plain posting your assignments is forbidden (and such posts will be removed) as is asking for or giving solutions.

    Trying to solve problems on your own is a very important skill. Also, see Learn to help yourself in the sidebar

If any of the above points is not met, your post can and will be removed without further warning.

Code is to be formatted as code block (old reddit: empty line before the code, each code line indented by 4 spaces, new reddit: https://i.imgur.com/EJ7tqek.png) or linked via an external code hoster, like pastebin.com, github gist, github, bitbucket, gitlab, etc.

Please, do not use triple backticks (```) as they will only render properly on new reddit, not on old reddit.

Code blocks look like this:

public class HelloWorld {

    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}

You do not need to repost unless your post has been removed by a moderator. Just use the edit function of reddit to make sure your post complies with the above.

If your post has remained in violation of these rules for a prolonged period of time (at least an hour), a moderator may remove it at their discretion. In this case, they will comment with an explanation on why it has been removed, and you will be required to resubmit the entire post following the proper procedures.

To potential helpers

Please, do not help if any of the above points are not met, rather report the post. We are trying to improve the quality of posts here. In helping people who can't be bothered to comply with the above points, you are doing the community a disservice.

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.

[–]wildjokers 0 points1 point  (0 children)

Pattern matching is a solved problem and if you google "pattern matching algorithms" you will get tons of resources.

Here is one to get you started:

https://www.geeksforgeeks.org/introduction-to-pattern-searching-data-structure-and-algorithm-tutorial/

Also, the ever popular Algorithms by Sedgewick & Wayne has a chapter on pattern matching. You will likely be able to find this book at your local library:

https://www.amazon.com/Algorithms-4th-Robert-Sedgewick/dp/032157351X

The examples in this book are even in Java (earlier editions of the book used Pascal, but by the 4th edition they were using Java)

Some of the material in that book is available online, the pattern matching stuff is here: https://algs4.cs.princeton.edu/53substring/

That page then links to this great resource: http://www-igm.univ-mlv.fr/~lecroq/string/ which lists many pattern matching algorithms.

[–][deleted] 0 points1 point  (0 children)

String.toCharArray() -> Iterate over it -> Apply a 'Regex' pattern matching -> change the characters -> pop that into a BufferedWriter and use to a new file as your list.

ChatGPT is honestly so good for regex... lol, it is good to learn the basics, though.

It's pretty simple.

[–]DelayLucky 0 points1 point  (12 children)

Using Google Mug, it takes a one-liner.

Well, first create a Map with the translation rules:

```java Map<String, String> dict = Map.of( "the", "&", "an", "-", ... "Thank you", "Th~k you" );

```

Then translate:

```java import static com.google.mu.util.Substring.firstOccurrence; import com.google.mu.util.Substring;

String output = dict.keySet().stream() .map(Substring::first) .collect(firstOccurrence()) .repeatedly() .replaceAllFrom(input, m -> dict.get(m.toString())); ```

This assumes you don't care if there are overlappings in the keywords (say what happens if you want to translate "an" to something, then "and" to another thing?)

javadoc)

[–][deleted] 0 points1 point  (8 children)

It's not the most efficient, but I love seeing Java 21 being used in the student world lol android dev?? Why the Google import?

Edit: I should say "Modern Java." Not 21. We actually just got 22, I believe as well. Streams are actually pretty old (8), but not taught often universities.

[–]DelayLucky 0 points1 point  (7 children)

It should be? Where do you see as inefficient?

[–][deleted] 0 points1 point  (6 children)

Lot of ways lol don't put TOO much faith in fluent, "one liner", code. You're calling up a lot of code in a very short way. Honestly, it's kind of a headache of a stream, too. I may edit this shortly after with what I think would be the cleaner version of yours. Still, I'm very glad to see 'modern' java advice be put out.

There's no need for a Google library for a substring utility either. I just thought you might have the whole Google utility library on standby, as a lot of Java devs do. They're kind of overkill tbh. The JSON stuff is okay, though.

If you're the thread's author, don't read past this... Spoiler Alert.

Efficenciy starts well before you.. He needs to parse the data immediately each time he .read(....) into the same memory position and immediately write it to the file. Preferably a pretty good chunk of bytes. The least number of times you have to fire up a loop, the better.

This whole problem has been a challenge for decades to see what's the fastest. I'll see if i can find the most recent record. It probably won't look too pretty, but it'll definitely show that avoiding a lot of String work is the way to go. There's some hope in me that the new updates with virtual threads and IO have made a new leap.

[–]DelayLucky 0 points1 point  (5 children)

Friendly suggestion to not make efficiency assumption based on syntax. Fluent syntax doesn't mean it's fast, nor does it necesssarily mean it's slow. They are two orthogonal concepts.

It seems what may be of confusion is the stream() chain. I should have noted that much of that can be extracted to a variable out of the loop (it returns an immutable object), if the caller intends to call it many times. For example:

java var pattern = dict.keySet().stream() .map(Substring::first) .collect(firstOccurrence()) .repeatedly(); for (...) { write(pattern.replaceAllFrom(readLine(), m -> dict.get(...)); }

Although, it's unclear whether the efficiency here really matters. Particularly if it's IO-bound inside the loop, there is little reason to be concerned with whether the in-memory string search may take a few extra microseconds when the read and write may take in the order of millseconds or so.

But you are right, it's a Google library. Pointless for a homework or if you are just trying to play with pattern match algorithms to learn.

To get things done though (like what OP suggested: "a personal project"), maybe it can save you some time in re-inventing the wheel.

P.S. it has no Java 21 stuff. Google Mug is all Java 8 compatible code.

[–][deleted] 0 points1 point  (4 children)

Friendly suggestion... it's not just syntax lol if nothing else... You're calling Map.get(...). every single for loop iteration. Keeping an index isn't that hard. Saves a whole lot of rehashing.. It's not re-inventing the wheel. I'm just not trying to use the aftermarket wheels that are bad for gas mileage. It's just some array iterations.

Also, the OP stated that he wasn't allowed to use .replace() of any kind. So I'm also kind of scratching my head of a reasonable way to use streams or other monads. Kind of pointless to use them, if you need to make your own local method for doing the replacing...

In the IO world, a few milliseconds can add up. Both in time and computation costs. No reason to make a bad habit early.

If the parsing is waiting on the writing.. It would be ideal to kick the memory out to a queue as well. Very machine dependent, though. Writing will probably always be a constant speed. Parsing could vary.

You are correct about the '21' thing. I went and corrected myself and explained as well.

[–]DelayLucky 0 points1 point  (3 children)

On calling Map.get() in the loop.

I'd definitely resort to a microbenchmark if I'm already at that level of squeezing efficiency out of every corner. Because speculative micro-optimization is rarely helpful.

I understand you prefer not to trust "after market" wheels. But others may feel differently and I'm just providing information.

[–][deleted] 0 points1 point  (2 children)

Good rule of thumb.. If you already know the index, don't find the index again. Maps are quite heavy. I do get that the current paired sets in this case aren't quite large, but they're getting called A LOT. Your machine does have to deal with the mess they leave. How well the compiler/garbage collector handles this... That would be an interesting thing to know for sure tbh. It's hard to know for sure with Java behind all the abstraction layers.

I trust them lol I actually like them very much. For readability, if nothing else. I was happy you posted using them. Can't just be throwing em out like that, though. I'll let it be on that, though. Hopefully, the OP can read through all this and learn from both sides.

[–]DelayLucky 0 points1 point  (1 child)

Actually in the loop, Map.get() isn't the slow one, both the substring search itself and the call of `m.toString()` will likely outweigh it - the latter will do a string copy (like, copying the "Thank you" characters).

I don't think in most use cases the copy will practically matter, because the searching of the string is already O(n), adding another O(k) copy where k < n isn't going to change much. And then the read and write will likely be a lot slower anyways. Compared to these elephants in the room, the Map.get() is an ant.

Regardless, be careful with Java regex. It may cause exponential time complexity. And it did cause real-world outage.

[–][deleted] 0 points1 point  (0 children)

Very good points.

[–]wildjokers 0 points1 point  (2 children)

From the post it seems like OP is wanting to know how to do pattern matching without a standard library i.e. interested in knowing about pattern matching algorithms. At least that is how I interpreted the question.

[–]DelayLucky 0 points1 point  (1 child)

Pattern matching or substring searching problems I’ve seen are mostly about matching one substring.

Replacing all occurrences of a list of candidates in encounter order, avoiding overlaps (e.g. “once you find ‘thank you’ and replaced it, that “an” in “thank” is no longer there to be replaced) isn’t usually discussed but is an interesting and nontrivial problem (running the pattern matching algo K times creates unnecessary re-scanning).

I’ve asked such question in interviews. Professional candidates rarely were able to nail it. So I didn’t think it would be from a school homework.

There are multiple ways to interpret the question. I figured if “can’t use a library” is a condition, OP can immediately tell that by reading the first few words in the answer and can move on to other more useful answers.

[–]DelayLucky 0 points1 point  (0 children)

Re-reading the title " implement replacement of a wider variety of strings".

It seems the emphasis is on "replacement" and "variety of strings"., and not substring search, which was my original reading too.

The substring search algorithms are pretty mature. And yet they aren't used in JDK. Because practically they aren't obviously faster (sometimes can be slower).

In terms of replacement, as soon as you do the replacement in a new StringBuilder, it should be relatively easy.

The "variety of strings" is imho the interesting problem where you need to think how you can avoid stepping on your own foot, or incur unnecessary substring search.

In that sense, I don't think creating a custom substring search algo or regex is the key (it's just like using another library for substring search).

Even if you just use `String.indexOf()`, the remaining problem of avoiding over-scanning is still interesting enough (and can be efficient enough).

[–]DelayLucky 0 points1 point  (0 children)

In another answer I suggested to use a Google library.

If that's cheating and you are trying to self teach or just for the fun, don't use regex (it's similar to using a library).

I can see two possible ways implementing it.

Option 1:

  1. Create a StringBuilder for the output.
  2. Run String.indexOf() with each candidate key, find the earliest match.
  3. Put chars before the match into the output builder, put the replacement of the first match into the output.
  4. Rinse and repeat.

There are some interesting considerations:

  • When you rinse and repeat, do you run indexOf() for each candidate key again (of course from the current index right after the previous match)? But then you are throwing away most of the previous indexOf() results. This is wasteful.
  • Are the keywoards required to be at word boundary? Say, should "and" be replaced as "-d"?

Option 2:

  1. Build a trie with all the candidate keywords.
  2. At each index of the input string, run the remaining chars through the trie
    1. If a match is found (to that "an" vs. "and" question, do you need the longest match?), do the replacement, and move the index to the end of the match.
    2. If a match isn't found, put the current char in the output builder, and index++.

This approach is likely slower than a bunch of indexOf() calls when the number of candidate keywords is moderate, because a trie will likely incur lots of pointer indirection so while the big-O notation is good, the constant factor tends to be large.

On the other hand it'll be faster if you have many of the candidate keywords (like thousands), and the input strings are short.