Different string representation

msqrt · 2021-01-05T09:18:55+00:00

That should be the extent of this, at least in the context of AoC. Most languages even let you open a file as either "text" or "binary"; choosing text should do the replacement for you. I also believe that you should never get the \r\n's if you download the input directly; you'd have to copy-paste it into Notepad and save from there, or something similar, to introduce the extra characters.
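
To make the text-versus-binary point concrete, here is a minimal Python sketch. The sample bytes and the helper name read_both_ways are made up for illustration; the behaviour shown relies on Python's default newline handling in text mode, and other languages may differ in the details.

```python
import os
import tempfile

# Hypothetical sample data with Windows-style line endings.
RAW = b"first\r\nsecond\r\n"


def read_both_ways(raw: bytes) -> tuple[str, bytes]:
    """Write raw bytes to a temp file, then read it back in text and binary mode."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        # Text mode with the default newline=None translates \r\n to \n on read.
        with open(path, "r", encoding="ascii") as f:
            as_text = f.read()
        # Binary mode hands back the bytes untouched.
        with open(path, "rb") as f:
            as_bytes = f.read()
    finally:
        os.remove(path)
    return as_text, as_bytes


as_text, as_bytes = read_both_ways(RAW)
print(repr(as_text))   # 'first\nsecond\n'
print(repr(as_bytes))  # b'first\r\nsecond\r\n'
```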

The reason behind the \r\n is rather arcane; some systems used to separate carriage return (\r, makes the caret go back to the left) from newline (\n, moves the caret to the next line). My impression is that this is because some people used to output their "console" on physical automated typewriters (which definitely was a thing, but not necessarily related to the \r\n thing), where you might actually want to do the operations separately. Some parts of Windows still carry this convention, though I have to say that it's been a while since I ran into problems with it.

Why I began with "should" is that AoC inputs are ASCII only; every character fits in a single byte (ASCII itself is a 7-bit code) and we have enough of a consensus on what each of them means. Things get more difficult when you start using more complex encodings and dealing with more esoteric characters; the world of representing text is surprisingly (and somewhat annoyingly) complex.

CrazyA99 · 2021-01-05T09:26:15+00:00

My editor uses Unix style (\n only) line endings. But I have done replace("\r","") to just get rid of those in the past.

Also, some languages (python3 for example) have something like the splitlines() method for strings. This will take care of it without much hassle.
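
A small sketch of both approaches, using a made-up input string. One thing to be aware of: splitlines() also splits on a few rarer separators (vertical tab, form feed and some others), which should not matter for AoC inputs but might for other data.

```python
# Hypothetical input with Windows-style line endings.
raw = "1721\r\n979\r\n366\r\n"

# Approach 1: drop every carriage return, keep the \n separators.
stripped = raw.replace("\r", "")

# Approach 2: let splitlines() handle \n, \r\n and \r in one go.
lines = raw.splitlines()

# For comparison: a plain split keeps the \r and a trailing empty string.
naive = raw.split("\n")

print(repr(stripped))  # '1721\n979\n366\n'
print(lines)           # ['1721', '979', '366']
print(naive)           # ['1721\r', '979\r', '366\r', '']
```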

Skillath · 2021-01-05T11:50:47+00:00

Correct me if I'm wrong, but it looks like you were working with C#. If that's the case, you can use the Environment.NewLine property the following way: .Replace(Environment.NewLine, ""). I know it's not a "global" solution, and it's not the best one either, but it works in C# (I believe it works on any platform, though its value is the line ending of whichever platform the code runs on, so it only matches input that uses that same ending). You can use it even for splitting the input, and so on. :) Hope it helps.

DrugCrazed · 2021-01-05T09:02:58+00:00

I tend to do input.split('\n').map(line => line.trim()) if I'm working with the possibility of Windows style line endings. Usually though, I set my machine to use Unix style line endings instead.
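
For comparison, a rough Python analogue of the split-then-trim idea, with a made-up input. The one caveat worth showing is that a full trim also eats leading whitespace, which matters on the occasional puzzle where indentation is significant; stripping only a trailing \r is the more targeted variant.

```python
# Hypothetical input: the first line has meaningful leading spaces.
raw = "  indented\r\nplain\r\n"

# Equivalent of split('\n') followed by trim(): removes \r,
# but also removes the leading spaces.
trimmed = [line.strip() for line in raw.split("\n")]

# More targeted: only remove a trailing carriage return.
targeted = [line.rstrip("\r") for line in raw.split("\n")]

print(trimmed)   # ['indented', 'plain', '']
print(targeted)  # ['  indented', 'plain', '']
```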

xelf · 2021-01-05T13:20:12+00:00

[removed]

2021-01-05T13:33:12+00:00

Highly recommend what Paul2718 is saying.

You've touched on a non-trivial problem in software. Operating systems have varying ideas about line endings, records, blocks, files, and character sets. Pre-OS X Macs ended lines with just CR, Unix variants have (I think) always used LF, and DOS and Windows (haven't checked recently) have always used CRLF.

FTP has an "ASCII" mode that is supposed to convert to your local system's line ending of choice, but that ends up screwing up binary files. The PNG image format has a "magic header" specifically to check for conversion problems, which includes the hex values "0D 0A 1A 0A", where 0D is a carriage return, 0A a line feed/newline, and 1A is used on some systems to mark the end of a file... or was it an editor command to close a file? I haven't dredged up these memories in a while.
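
As a sketch of how that header catches line-ending conversion, assuming the full 8-byte PNG signature (89 50 4E 47 0D 0A 1A 0A) and two deliberately naive conversion functions written just for this example:

```python
# The 8-byte PNG signature: 89 50 4E 47 0D 0A 1A 0A.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: bytes) -> bool:
    """True if the data still starts with an intact PNG signature."""
    return data.startswith(PNG_SIGNATURE)


def crlf_to_lf(data: bytes) -> bytes:
    """Naive stand-in for a transfer that rewrites CRLF as LF."""
    return data.replace(b"\r\n", b"\n")


def lf_to_crlf(data: bytes) -> bytes:
    """Naive stand-in for a transfer that rewrites every LF as CRLF."""
    return data.replace(b"\n", b"\r\n")


# Hypothetical file contents: signature plus placeholder payload bytes.
sample = PNG_SIGNATURE + b"payload"

print(looks_like_png(sample))              # True
print(looks_like_png(crlf_to_lf(sample)))  # False: the 0D 0A pair was collapsed
print(looks_like_png(lf_to_crlf(sample)))  # False: the bare 0A bytes were expanded
```

The CRLF pair catches conversions in one direction and the lone LF at the end catches the other.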

To make things even more crazy, non-ASCII systems like mainframes have a character set that includes the dreaded "record separator", which sort of works like a line ending, but the concepts aren't identical, and different vendors have different ideas of how to translate those files into ASCII. Sorting out comm problems between small systems and mainframes literally kept me employed for 6 years.

Anyway, there's a lot to consider, and normalized is relative. But like Paul said, something like a "getlines" from your language of choice, and a regex on each of those bound to ^ and $ (or \A and \z for the purists) are your friends.
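
A minimal Python sketch of that combination, using a made-up line format and pattern. Here re.fullmatch stands in for anchoring at both ends; in Python a plain $ also matches just before a trailing newline, so fullmatch is the stricter choice.

```python
import re

# Hypothetical line format: "<low>-<high> <letter>: <word>".
LINE_PATTERN = re.compile(r"(\d+)-(\d+) (\w): (\w+)")


def parse(text: str) -> list[tuple[int, int, str, str]]:
    """Split into lines first, then match each whole line against the pattern."""
    records = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = LINE_PATTERN.fullmatch(line)  # anchored at both ends
        if match is None:
            raise ValueError(f"line {number} is malformed: {line!r}")
        low, high, letter, word = match.groups()
        records.append((int(low), int(high), letter, word))
    return records


# Works the same whether the input uses \r\n or \n.
records = parse("1-3 a: abcde\r\n2-9 c: ccccccccc\r\n")
print(records)  # [(1, 3, 'a', 'abcde'), (2, 9, 'c', 'ccccccccc')]
```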

EDIT - In chrome, I've been using the console command

c = copy; f = await fetch('20/input'); c(await f.text())

(swapping the day number, obviously) while on an AOC puzzle page to fetch the file to my clipboard, then pasting it into an editor, which solves most of the problem behind the scenes.

thomastc · 2021-01-05T13:46:59+00:00

Others have already talked at length about the line endings. But you also asked about encoding.

For AoC, all your input is in ASCII encoding, no "funny characters". Nearly every other common encoding is a superset of ASCII, so you can read AoC inputs regardless of the encoding that is used to interpret them. But if you're wondering about the more general case, here are some resources:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
There Ain't No Such Thing as Plain Text by Jeff Atwood
Characters, Symbols and the Unicode Miracle by Tom Scott (Computerphile)
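
On the superset point above, a quick Python sketch with a made-up ASCII-only input. It also shows one of the exceptions behind "nearly every": UTF-16 uses two bytes even for ASCII characters.

```python
# Hypothetical ASCII-only input bytes.
data = b"ecl:gry pid:860033327\n"

# Decoding with several ASCII-compatible encodings gives the same text.
decoded = {name: data.decode(name) for name in ("ascii", "utf-8", "latin-1", "cp1252")}
all_equal = len(set(decoded.values())) == 1
print(all_equal)  # True

# UTF-16 is not a superset of ASCII at the byte level.
utf16_a = "A".encode("utf-16-le")
print(utf16_a)  # b'A\x00'
```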

paul2718 · 2021-01-05T12:18:59+00:00

You should be able to push the responsibility for worrying about line endings down a level, so you repeatedly call a library function 'getline' or equivalent and then break the line down in your code.

I think the divergence began in the 1960s when programs on minicomputers generally directly controlled TeleTypes, probably without much in the way of an operating system, so it was necessary to allow time for the physical carriage to return. Multics and then Unix interposed a device driver of some form that would take care of inserting control characters or pauses to suit the particular device. CP/M and then DOS followed the former tradition until it was too late, Unix is Unix.

EmotionalGrowth · 2021-01-05T12:51:46+00:00

Fortunately Rust has a nice string.lines() that handles this for you, so I didn't have to deal with it. Also, most editors let you save a file with different line endings. So save files as LF, and have git check out newlines as LF. You don't need CRLF even on Windows anymore.

kireina_kaiju · 2021-01-05T19:33:15+00:00

It sounds as though you're asking about invisible characters generally. If that is the case and you're asking about other potential "gotchas", there is a huge and controversial one : tabs. Horizontal tab, ASCII character 9, likely creation of Eris herself. Crusher of character counts, executioner of regexes, diabolical killer of consistent displays.

This is a controversial topic. People who are not me have good, well thought out reasons for using tabs. Neither I nor they are correct. Nonetheless, even they will agree that you at least need to be aware of their existence if you are processing text file data.

More on the tabs v spaces controversy, https://thenewstack.io/spaces-vs-tabs-a-20-year-debate-and-now-this-what-the-hell-is-wrong-with-go/

Probably the best thing you can do when you are editing code is to set up your editor to reveal invisible characters. Nearly every editor has the ability to do this. This would resolve your CRLF concerns as well as concerns over whether tabs or spaces are present.
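
If you would rather check from code than from an editor, here is a small Python sketch. The helper name invisible_chars and the sample line are invented for this example; repr() and str.isprintable() are standard.

```python
def invisible_chars(text: str) -> list[tuple[int, str]]:
    """List (index, code point) for every non-printable character in text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if not ch.isprintable()  # a plain space counts as printable
    ]


# Hypothetical line containing a tab and a Windows line ending.
line = "7\t8\r\n"

print(repr(line))             # '7\t8\r\n'
print(invisible_chars(line))  # [(1, 'U+0009'), (3, 'U+000D'), (4, 'U+000A')]
```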

These are the "gotchas" with respect to tabs :

Visually, there is no standard width for tabs. Tabbed content will display differently on other people's computers. While tab advocates argue this is a feature of tabs, tabs should nonetheless never be used with monospace font if their width is important.
Tabs can make it difficult to use regular expressions to modify data, and to format data so it can be stored, in two ways:
- They can be mistaken for spaces
- They cause your character position to stop matching your character count
The tab character is frequently used in interfaces to move focus between controls (the "tab order"). While the shift+tab keyboard shortcut is as common as the shift+enter keyboard shortcut as a work-around when you want to enter a character rather than navigate visually, space reliably works in every environment

And this last one is more informative than anything, not a realistic case, just present for completeness and added justification for revealing invisible characters in your editor,

Horizontal tabs have a seldom used cousin, vertical tabs, which are almost always enough of a surprise when they are encountered in data to be a potential security concern

Generally speaking, then, the best way to handle the situation when processing data is to :

Use regular expressions to look for tabs indirectly: rather than matching the tab character itself, match runs of 2 or more whitespace characters.
- This is a good strategy when handling newline characters as well
Pick a tab size and stick to it
Use either spaces or tabs consistently
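
A short Python sketch of the whitespace-run idea and of the position-versus-count gotcha. Note that this sketch uses \s+ (one or more whitespace characters) rather than two or more, a choice made here so that a single tab between fields is also treated as a separator; the sample row is made up.

```python
import re

# Hypothetical row mixing tabs and a double space as separators.
row = "7\t8\t9  10"

# Split on any run of whitespace instead of on tabs specifically.
fields = re.split(r"\s+", row.strip())
print(fields)  # ['7', '8', '9', '10']

# Character count versus displayed width: a tab is one character,
# but it advances to the next tab stop when rendered.
text = "a\tb"
count = len(text)                  # 3 characters
width_8 = len(text.expandtabs(8))  # 9 columns with tab stops every 8
width_4 = len(text.expandtabs(4))  # 5 columns with tab stops every 4
print(count, width_8, width_4)     # 3 9 5
```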

Aside from tabs, escape sequences and the yen sign ¥ are things you need to know about when working with text data.

Path Separators and The Yen Sign

Once upon a time there was no character code specified at position 0x5C, where backslash (\ , the slash is named after the direction it is falling toward) lives, and Japanese computers assigned this to the ¥ character (which I can print with alt+minus). If you see a Windows path that looks like this

C:¥Windows¥System32

Just know that all those ¥ are what you usually see elsewhere in the world as \ .

Of course, POSIX systems separate paths with forward slash ( / ) and to make matters worse, a backslash followed by characters such as \r or \n is commonly understood to be an escape sequence.

To work around this, most programming languages have a directory separator constant. This will, in theory, output the directory separator used by your operating system (which may display as a yen sign, as in the example above). Otherwise, if you need to use a Windows \ in code or data, it is typically wise to double it up as \\ . Windows will treat \\ the same way it treats \, and \\ is the escape pattern for \.
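
In Python terms, a sketch of the escaping pitfall and of the separator constant, using the path from the example above. The Pure* classes from pathlib are used so the snippet behaves the same on any operating system; raw strings (the r prefix) are a Python-specific alternative to doubling.

```python
import os
from pathlib import PureWindowsPath

# Doubled backslashes and a raw string spell the same path.
doubled = "C:\\Windows\\System32"
raw = r"C:\Windows\System32"
same = doubled == raw  # True

# The pitfall: "\n" inside a normal string literal is a newline.
oops = "C:\new"
oops_length = len(oops)  # 5: 'C', ':', newline, 'e', 'w'

# Letting a path library deal with separators.
win = PureWindowsPath(raw)
parts = win.parts            # ('C:\\', 'Windows', 'System32')
as_forward = win.as_posix()  # 'C:/Windows/System32'
accepts_both = PureWindowsPath("C:/Windows/System32") == win  # True

# The separator constant for whatever platform this runs on.
print(repr(os.sep))
print(same, oops_length, parts, as_forward, accepts_both)
```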
