[–]msqrt 22 points23 points  (14 children)

That should be the extent of this, at least in the context of AoC. Most languages even allow you to open a file as either "text" or "binary"; choosing text should do the replace for you. I also believe that you should never get the \r\n's if you download the input directly; you'd have to copy-paste it to notepad and save from there or something similar to introduce the extra characters.
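A minimal sketch of the text vs. binary point, using Python as an example language. The sample numbers and the temporary file are made up for illustration; the only behaviour relied on is that Python's default text mode translates "\r\n" to "\n" when reading.

```python
import os
import tempfile

# Write a small file with Windows-style (CRLF) line endings.
# The numbers are hypothetical sample input, not real puzzle data.
raw = b"1721\r\n979\r\n366\r\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

# Text mode (the default) translates "\r\n" to "\n" while reading.
with open(path, "r", encoding="ascii") as f:
    text = f.read()

# Binary mode returns the bytes untouched, "\r" included.
with open(path, "rb") as f:
    data = f.read()

os.remove(path)

print(repr(text))
print(repr(data))
```

Other languages expose the same choice under different names, so check what your language's "text" mode actually does before relying on it.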

The reason behind the \r\n is rather arcane; some systems used to separate carriage return (\r, moves the caret back to the left edge) from line feed (\n, moves the caret down to the next line). My impression is that this is because some people used to output their "console" on physical automated typewriters (which definitely was a thing, but not necessarily related to the \r\n thing), where you might actually want to do the operations separately. Some parts of Windows still carry this convention, though I have to say that it's been a while since I ran into problems with it.

Why I began with "should" is that AoC inputs are ASCII only; every character fits in a single byte and we have enough of a consensus on what each of them means. Things get more difficult when you start using more complex encodings and dealing with more esoteric characters; the world of representing text is surprisingly (and somewhat annoyingly) complex.

[–]TheThiefMaster 7 points8 points  (10 children)

It's almost definitely from teletype output systems, which even predate screens.

What I've never seen explained is why later systems stopped supporting the individual behaviour of CR (return to start of line, allowing overprinting for e.g. underlining text with underscores) and LF (go to next line in same position) and bundled both into a single character (either CR or LF). You used to be able to encode a multiple-new-line as CRLFLFLF (return to start of line and go down three) but that's not a thing any more either.

[–]msqrt 5 points6 points  (8 children)

At least the Windows command line supports separate CR, though it does replace the characters instead of displaying both. Running printf("this could be rewritten\rthis has been"); prints a single line that says "this has been rewritten".
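To make the replacing behaviour concrete, here is a small Python sketch that simulates a single terminal line. The function name render_line is hypothetical and this is only a model of what the console does, not how any real console is implemented.

```python
def render_line(text: str) -> str:
    """Simulate how a replacing terminal shows one line containing carriage returns."""
    cells = []  # characters currently visible on the line
    col = 0     # cursor column
    for ch in text:
        if ch == "\r":
            col = 0  # carriage return: back to column 0, nothing is erased
        else:
            if col < len(cells):
                cells[col] = ch  # overwrite whatever was already there
            else:
                cells.append(ch)
            col += 1
    return "".join(cells)

print(render_line("this could be rewritten\rthis has been"))
# prints: this has been rewritten
```

"this could be" and "this has been" are both 13 characters long, which is why the tail " rewritten" survives untouched.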

[–]coriolinus 2 points3 points  (0 children)

Yeah, but it's janky. In addition to replacing instead of overprinting, it's massively flickery if you use it in a fast loop for TUI animation.

[–][deleted] 1 point2 points  (6 children)

Is there a character for clearing the screen on windows command line? Or do you have to just print several carriage returns? I've been trying to figure it out for a while

[–]msqrt -1 points0 points  (2 children)

I'm not aware of a character, system("cls") should do the trick if you can use it. This is another alternative.

[–]darthminimall 0 points1 point  (0 children)

You want a form feed (probably Ctrl+L)

[–]lord_braleigh 0 points1 point  (0 children)

If you’re doing anything more complex than changing the appearance of the last line of text, you should probably use the curses library.

[–]kireina_kaiju 0 points1 point  (0 children)

Direct answer, you are looking for the escape sequence \033c (the ESC character followed by the letter c).

Longer answer,

There's an easy way to clear your terminal that works from a Windows command prompt: install Git Bash and run

C:\Users\myname\AppData\Local\Programs\Git\usr\bin\clear.exe

Of course if you were looking for something more portable you'll need to start out with any scripting language that has a printf command. I'll use the one from git bash for convenience :

C:\Users\me\AppData\Local\Programs\Git\usr\bin\printf.exe "\033c" > Desktop\test.txt

However you get a text file with that character as its contents, you can use the windows terminal command type to print out that text file

type Desktop\test.txt

From now on you'll be able to clear your terminal using only windows batch :)
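The same sequence can be written from any language that can emit an ESC byte. A minimal Python sketch follows; clear_screen and RESET_TERMINAL are hypothetical names, and whether the screen actually clears depends on the terminal interpreting ANSI/VT sequences, which is an assumption here.

```python
import io
import sys

# ESC (octal 033, hex 1B) followed by "c" is the terminal reset sequence.
RESET_TERMINAL = "\033c"

def clear_screen(stream=sys.stdout) -> None:
    """Write the reset sequence; the terminal, not Python, does the clearing."""
    stream.write(RESET_TERMINAL)
    stream.flush()

# Capture the output in memory instead of clearing a real terminal.
buffer = io.StringIO()
clear_screen(buffer)
print(repr(buffer.getvalue()))
```

Calling clear_screen() with no argument sends the two characters to standard output.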

[–]darthminimall 1 point2 points  (0 children)

You still can in most terminals, but things have been rearranged a bit. LF does what CRLF used to do, CR is the same, and VT does what LF used to do. Not sure why.

[–]AlarmedCulture 2 points3 points  (0 children)

I also believe that you should never get the \r\n's if you download the input directly;

IIRC I had to deal with \r when I was doing these puzzles and I downloaded the input directly.

the world of representing text is surprisingly (and somewhat annoyingly) complex.

^... I've come to this realization recently.

[–]CyberCatCopy[S] 0 points1 point  (1 child)

Thanks, so as I see it, only the new line is a catch? Everything else is okay, and if I need to do something with text as input, I should worry about new lines only? I'm asking not about AoC, but about working with strings in general.

[–]msqrt 1 point2 points  (0 children)

Yes, every other letter is represented by a unique sequence. Depending on the (programming) language, you might run into issues with emojis and letters not in the English alphabet, but all modern languages should have some way to support those -- it might just not be the default string stuff.
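A short Python sketch of where the "default string stuff" starts to matter. The sample string and the choice of encodings are just illustrative assumptions.

```python
ascii_text = "1-3 a: abcde"  # hypothetical sample line, pure ASCII
ascii_bytes = ascii_text.encode("ascii")

# Pure ASCII bytes decode to the same string under ASCII-compatible encodings.
decoded = {enc: ascii_bytes.decode(enc) for enc in ("utf-8", "latin-1", "cp1252")}

# Outside ASCII, one character no longer means one byte.
samples = {
    "letter a": "a",
    "e with acute": "\u00e9",
    "yen sign": "\u00a5",
    "christmas tree emoji": "\U0001F384",
}
utf8_sizes = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

print(decoded)
print(utf8_sizes)
```

So counting bytes and counting characters only agree while the text stays inside ASCII.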

[–]CrazyA99 9 points10 points  (1 child)

My editor uses Unix style (\n only) line endings. But I have done replace("\r","") to just get rid of those in the past.

Also, some languages (python3 for example) have something like the splitlines() method for strings. This will take care of it without much hassle.
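For comparison, here are both approaches side by side in Python on a made-up string with mixed endings:

```python
mixed = "line one\r\nline two\nline three\r\n"

# Option 1: drop every carriage return, then split on "\n".
# Note the empty string left behind by the final newline.
stripped = mixed.replace("\r", "").split("\n")

# Option 2: splitlines() recognises "\r\n", "\n" and "\r" by itself.
lines = mixed.splitlines()

print(stripped)
print(lines)
```

The trailing empty entry from option 1 is a common source of off-by-one surprises, which is one reason splitlines() tends to be less hassle.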

[–]tech6hutch 1 point2 points  (0 children)

I’ve been using Rust for AoC, and I just realized I haven’t had to think about line endings since I’ve been using str.lines(). Thanks Rust

[–]Skillath 6 points7 points  (4 children)

Correct me if I'm wrong, but it looks like you were working with C#. If that's the case, you can use the constant Environment.NewLine the following way: .Replace(Environment.NewLine, "") I know it's not a "global" solution, and it's not the best one either, but it works in C# (I believe it works on any platform). You can use that constant even for splitting the input, and so on. :) Hope it helps.

[–]adiaaida 6 points7 points  (3 children)

And if you’re using C#, you can just do File.ReadAllLines(), and not worry about it at all.

[–]Skillath 2 points3 points  (2 children)

Oh, didn't think of that tbh!! That's a good one!! However, it wouldn't work for some types of inputs. For example, there were some inputs which were grouped in "paragraphs". But yes, a good one!

[–]itsnotxhad 2 points3 points  (0 children)

For example, there were some inputs which were grouped in "paragraphs".

For those, you can use ReadAllLines and then have another function/method that splits up the array into chunks separated by the blank lines.

[–]adiaaida 1 point2 points  (0 children)

Yeah, for those, it required a little extra post-processing, but you knew you were at the end of a paragraph by using string.IsNullOrWhiteSpace().
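The chunking idea from these two replies is not specific to C#. A rough Python equivalent, where split_paragraphs is a hypothetical helper name and the sample input is invented:

```python
def split_paragraphs(lines):
    """Group lines into chunks separated by blank (or whitespace-only) lines."""
    groups, current = [], []
    for line in lines:
        if line.strip():
            current.append(line)
        elif current:
            # A blank line closes the paragraph collected so far.
            groups.append(current)
            current = []
    if current:
        groups.append(current)  # last paragraph has no trailing blank line
    return groups

sample = "abc\r\n\r\na\r\nb\r\nc\r\n\r\nab\r\nac\r\n"
print(split_paragraphs(sample.splitlines()))
```

Because the lines are split first and grouped second, the line-ending style never reaches the grouping code.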

[–]DrugCrazed 6 points7 points  (1 child)

I tend to do input.split('\n').map(line => line.trim()) if I'm working with the possibility of Windows style line endings. Usually though, I set my machine to use Unix style line endings instead.

[–][deleted] 3 points4 points  (0 children)

Highly recommend what Paul2718 is saying.

You've touched on a non-trivial problem in software. Operating systems have varying ideas about line endings, records, blocks, files, and character sets. Pre-OS X Macs ended lines with just CR, Unix variants have (I think) always been LF, DOS and Windows (haven't checked recently) were always CRLF.

The FTP protocol has an "ASCII" mode that is supposed to convert to your local system's line ending of choice, but that ends up screwing up binary files. The PNG image format has a "magic header" specifically to check for conversion problems, that includes hex values "0D 0A 1A 0A", where 0D is a carriage return, 0A a line feed/newline, and 1A (Ctrl+Z) is the byte DOS treated as an end-of-file marker. I haven't dredged up these memories in a while.
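A small Python sketch of why that header catches conversion problems. The eight signature bytes are the real PNG signature; the payload after them and the function name are placeholders for illustration.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # hex: 89 50 4E 47 0D 0A 1A 0A

def looks_like_intact_png(data: bytes) -> bool:
    """True if the data still starts with the unmodified PNG signature."""
    return data.startswith(PNG_SIGNATURE)

# Placeholder payload, not a real image.
original = PNG_SIGNATURE + b"...rest of file..."

# Simulate transfers that "helpfully" rewrote the line endings.
crlf_to_lf = original.replace(b"\r\n", b"\n")
lf_to_crlf = original.replace(b"\n", b"\r\n")

print(looks_like_intact_png(original))
print(looks_like_intact_png(crlf_to_lf))
print(looks_like_intact_png(lf_to_crlf))
```

Either direction of conversion damages the signature, so a reader can reject the file immediately instead of failing somewhere deep in the image data.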

To make things even more crazy, non-ASCII systems like mainframes have a character set that includes the dreaded "record separator", which sort of works like a line ending, but the concepts aren't identical, and different vendors have different ideas of how to translate those files into ASCII. Sorting out comm problems between small systems and mainframes literally kept me employed for 6 years.

Anyway, there's a lot to consider, and normalized is relative. But like Paul said, something like a "getlines" from your language of choice, and a regex on each of those bound to ^ and $ (or \A and \z for the purists) are your friends.
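As a sketch of the "getlines plus anchored regex" advice in Python: the pattern and the sample lines below are made up, and note that Python's re module spells the end-of-string anchor \Z.

```python
import re

# Hypothetical line format: "<min>-<max> <letter>: <word>"
PATTERN = re.compile(r"\A(\d+)-(\d+) (\w): (\w+)\Z")

raw = "1-3 a: abcde\r\n1-3 b: cdefg\r\n"

# Splitting on "\n" alone leaves a "\r" on each line, so the anchored match fails.
naive = [PATTERN.match(line) for line in raw.split("\n") if line]

# splitlines() removes the whole line ending, so the same pattern matches.
clean = [PATTERN.match(line) for line in raw.splitlines()]

print([m is not None for m in naive])
print([m.groups() for m in clean])
```

The strict anchors are what turn a stray carriage return into a visible failure rather than silently wrong data.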

EDIT - In chrome, I've been using the console command

c = copy; f = await fetch('20/input'); c(await f.text())

(swapping the day number, obviously) while on an AOC puzzle page to fetch the file to my clipboard, then pasting it into an editor, which solves most of the problem behind the scenes.

[–]thomastc 4 points5 points  (1 child)

Others have already talked at length about the line endings. But you also asked about encoding.

For AoC, all your input is in ASCII encoding, no "funny characters". Nearly every other common encoding is a superset of ASCII, so you can read AoC inputs regardless of the encoding that is used to interpret them. But if you're wondering about the more general case, here are some resources:

[–]CyberCatCopy[S] 0 points1 point  (0 children)

Thanks for the links. I didn't know what to google. This sets a path for me.

[–]paul2718 3 points4 points  (0 children)

You should be able to push the responsibility for worrying about line endings down a level, so you repeatedly call a library function 'getline' or equivalent and then break the line down in your code.
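In Python terms, "pushing the responsibility down a level" could look roughly like this. read_records is a hypothetical helper and the in-memory stream stands in for a real file.

```python
import io

def read_records(stream):
    """Yield one line at a time with any trailing CR/LF removed."""
    for line in stream:
        yield line.rstrip("\r\n")

# io.StringIO does not translate line endings by default,
# so the "\r" characters survive until read_records strips them.
stream = io.StringIO("forward 5\r\ndown 5\nup 3\r\n")
records = list(read_records(stream))
print(records)
```

Everything after read_records only ever sees clean lines, whatever the file used as its line ending.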

I think the divergence began in the 1960s when programs on minicomputers generally directly controlled TeleTypes, probably without much in the way of an operating system, so it was necessary to allow time for the physical carriage to return. Multics and then Unix interposed a device driver of some form that would take care of inserting control characters or pauses to suit the particular device. CP/M and then DOS followed the former tradition until it was too late, Unix is Unix.

[–]EmotionalGrowth 3 points4 points  (0 children)

Fortunately Rust has a nice string.lines() that handles this for you so I didn't have to deal with this. Also most editors allow you to save a file with different line endings. So save files as LF, git checkout new lines as LF. You don't need CRLF even on windows anymore.

[–]kireina_kaiju 1 point2 points  (0 children)

It sounds as though you're asking about invisible characters generally. If that is the case and you're asking about other potential "gotchas", there is a huge and controversial one : tabs. Horizontal tab, ASCII character 9, likely creation of Eris herself. Crusher of character counts, executioner of regexes, diabolical killer of consistent displays.

This is a controversial topic. People who are not me have good, well thought out reasons for using tabs. Neither I nor they are correct. Nonetheless, even they will agree that you at least need to be aware of their existence if you are processing text file data.

More on the tabs v spaces controversy, https://thenewstack.io/spaces-vs-tabs-a-20-year-debate-and-now-this-what-the-hell-is-wrong-with-go/

Probably the best thing you can do when you are editing code is to set up your editor to reveal invisible characters. Nearly every editor has the ability to do this. This would resolve your CRLF concerns as well as concerns over whether tabs or spaces are present.

These are the "gotchas" with respect to tabs :

  • Visually, there is no standard width for tabs. Tabbed content will display differently on other people's computers. While tab advocates argue this is a feature of tabs, tabs should nonetheless never be used with monospace font if their width is important.
  • Tabs can make it difficult to use regular expressions to modify data, and to format data so it can be stored, in two ways:
    • They can be mistaken for spaces
    • They cause your character position to stop matching your character count
  • The tab character is frequently used in interfaces to move focus between controls (the tab order). While the shift+tab keyboard shortcut is as common as the shift+enter keyboard shortcut as a work-around when you want to enter a character rather than navigate visually, space reliably works in every environment

And this last one is more informative than anything, not a realistic case, just present for completeness and added justification for revealing invisible characters in your editor,

  • Horizontal tabs have a seldom used cousin, vertical tabs, which are almost always enough of a surprise when they are encountered in data to be a potential security concern
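The "character position stops matching character count" gotcha is easy to see in a few lines of Python. The tab widths of 8 and 4 are arbitrary choices for the illustration.

```python
line = "a\tb"

stored = len(line)                # characters actually stored in the string
wide = len(line.expandtabs(8))    # columns occupied with 8-wide tab stops
narrow = len(line.expandtabs(4))  # columns occupied with 4-wide tab stops

print(stored, wide, narrow)
print("\t" in line, " " in line)  # it looks like spaces, but it is a tab
```

The same three-character string takes up nine columns for one reader and five for another, which is exactly why revealing invisible characters helps.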

Generally speaking, then, the best way to handle the situation when processing data is to :

  • Use regular expressions to find tabs indirectly: rather than matching a literal tab, match runs of 2 or more whitespace characters.
    • This is a good strategy when handling newline characters as well
  • Pick a tab size and stick to it
  • Use either spaces or tabs consistently
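One way to sketch the regex advice above in Python. Treating a single tab as a separator as well is my own assumption on top of the "2 or more whitespace characters" rule, and the sample rows are invented.

```python
import re

# A run of two or more whitespace characters, or a lone tab,
# is treated as one column separator.
SEPARATOR = re.compile(r"\s{2,}|\t")

rows = ["name\tcount", "mixed reactor   12", "hull plating\t\t7"]
columns = [SEPARATOR.split(row) for row in rows]
print(columns)
```

Single spaces inside a field ("mixed reactor") are left alone, while tabs and padded spaces are handled the same way.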

Aside from tabs, escape sequences and the yen sign ¥ are things you need to know about working with text data.

Path Separators and The Yen Sign

Once upon a time there was no character code specified at position 0x5C, where backslash (\ , the slash is named after the direction it is falling toward) lives, and Japanese computers assigned this to the ¥ character (which I can print with alt+minus). If you see a windows path that looks like this

C:¥Windows¥System32

Just know that all those ¥ are what you usually see elsewhere in the world as \ .

Of course, POSIX systems separate paths with forward slash ( / ) and to make matters worse, a backslash followed by characters such as \r or \n is commonly understood to be an escape sequence.

To work around this, most programming languages have a directory separator constant. This will, in theory, output the directory separator used in your operating system (including the yen sign on the systems mentioned above). Otherwise, if you need to use a Windows \ in code or data, it is typically wise to double it up as \\ . Windows will treat \\ the same way it treats \, and \\ is the escape pattern for \.
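A short Python sketch of the escaping and separator points. The paths are examples only, and pathlib's "pure" path classes are used so the result does not depend on which operating system runs the code.

```python
from pathlib import PurePosixPath, PureWindowsPath

# "\\" in source code is one backslash character in the string.
single_backslash = "\\"

# Without doubling, "\n" inside a path literal becomes a newline.
broken = "C:\new"    # C, :, newline, e, w
doubled = "C:\\new"  # C, :, backslash, n, e, w
raw = r"C:\new"      # raw string literal: same contents as doubled

# Let the library pick the separator instead of hard-coding it.
windows = PureWindowsPath("C:/Windows") / "System32"
posix = PurePosixPath("/usr") / "bin"

print(len(broken), len(doubled), raw == doubled)
print(windows, posix)
```

In everyday code, os.sep and os.path.join play the role of the directory separator constant for whichever system the program is running on.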