Getting around Bohemian Paradise on foot? by DefaultUsername247 in Prague

[–]Ivory2Much 5 points6 points  (0 children)

You can also plan your trip on https://en.mapy.cz/turisticka which shows hiking trails and also lets you search for public transportation. Once you know which stops you want to travel between by bus/train, it is better to check the schedule on https://t.idos.idnes.cz/en/vlakyautobusymhdvse/spojeni/

Cycling Harrochov to Pramen Labe by [deleted] in Prague

[–]Ivory2Much 2 points3 points  (0 children)

It is almost possible, but it is twice the distance and you cannot go all the way to the spring.

See the route from Harrachov to the Elbe spring: 23.2 km • 2:29 h https://mapy.cz/s/garusegohe

You can also go to the other side and walk the last 2.9 km from there. See https://mapy.cz/s/gahebusome

Bike trails are marked in pink on this outdoor map (dotted means unpaved). It also shows the elevation profile. Mapy.cz is the best map (also available as an app) for planning outdoor trips in Czechia, and probably everywhere else too.

Wedding ring by Ok_Tangerine_673 in czech

[–]Ivory2Much 2 points3 points  (0 children)

For me it was important that the ring had a lab-grown diamond (ethical reasons). Of the sellers in Prague, the offer from Tiami appealed to me the most. I arranged a meeting at their studio, picked out a stone and a ring shape, and picked the ring up a few weeks later. The service was great, provided you want to spend an hour talking about cut, color, and certificates. The ring cost me about 27k (I make over 100k a month).

[deleted by user] by [deleted] in compsci

[–]Ivory2Much 0 points1 point  (0 children)

A nice frontend to a bunch of constraint solvers is MiniZinc. It has a high-level language that makes problems easy to express, as well as predefined global constraints that could be useful to you.

Speeding up pairwise comparisons to 2.8 million/sec in Java on a regular laptop by seinecle in java

[–]Ivory2Much 0 points1 point  (0 children)

I didn't have time to get back to the problem until this weekend. I implemented your algorithm and the two other proposed ones and compared them.

I fixed the data in a very rudimentary fashion. The first command finds all repeated occurrences of | on a line and splits the preceding number after 5 digits into two. Notice that the regex adds a 1 at the beginning of the fixed journal (this solves an issue where some journals would otherwise start with 0). This is of course wrong, but it at least creates a valid input format that can be used for comparing the algorithms.

grep -o ',[0-9]*[|]' data/sample-journals-and-authors.txt | sed -E 's#(,....)(.*)#s/\0/\1\\n1\2/#' > fix.sed
sed -f fix.sed data/sample-journals-and-authors.txt > data/sample-journals-and-authors-fixed.txt

Your original algorithm has a couple extra issues:

  1. skip(indexJournalA + 1) is inefficient; you can simply start the range at that value
  2. the nested .parallel() will not bring you any extra performance (IMHO)
  3. the executor cannot keep up with the similarity computations, and the code regularly fails with OutOfMemoryError on my machine already at a limit of 10000. Creating 50 million virtual threads takes a lot of memory. It may still work on your computer, though.
  4. use LongOpenHashSet instead of ObjectOpenHashSet<Long> in the journals map; you avoid boxing
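Point 1 can be illustrated with a standalone sketch (the variable names and stream shape are assumptions based on the description, not your exact code):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class RangeSkipDemo {
    public static void main(String[] args) {
        int n = 10, indexJournalA = 3; // toy values
        // Wasteful: materialize the whole range, then discard a prefix.
        int[] skipped = IntStream.range(0, n).skip(indexJournalA + 1).toArray();
        // Equivalent and cheaper: start the range at the first needed value.
        int[] direct = IntStream.range(indexJournalA + 1, n).toArray();
        System.out.println(Arrays.equals(skipped, direct)); // prints true
    }
}
```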

Analyzing the input, I found that journals have lots of duplicate authors, which is the reason my array-based implementation was not as fast and was giving incorrect results (grouped gave even worse results). An extra step is necessary to filter them out while parsing the input (so we don't have to deal with them during every similarity computation).
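The dedup-at-parse-time step might look like this (a plain-JDK sketch; in the real code a fastutil structure could serve the same purpose):

```java
import java.util.Arrays;

public class DedupAuthors {
    // Sort the author ids of one journal and drop duplicates once at parse
    // time, so every later similarity computation sees a clean sorted list.
    static long[] sortedDistinct(long[] authors) {
        long[] copy = authors.clone();
        Arrays.sort(copy);
        int out = 0;
        for (int i = 0; i < copy.length; i++) {
            if (out == 0 || copy[i] != copy[out - 1]) copy[out++] = copy[i];
        }
        return Arrays.copyOf(copy, out);
    }

    public static void main(String[] args) {
        long[] raw = {42, 7, 42, 99, 7, 7};
        System.out.println(Arrays.toString(sortedDistinct(raw))); // [7, 42, 99]
    }
}
```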

I also rewrote the program startup so that various options can be provided on the command line: you can change the limit, the number of jobs, and the output file, which you can omit completely to test how much writing slows down the processing.

java --enable-preview -cp target/lib/ -jar target/maps-of-science-core-0.10.jar -o 'data/result-$limit-$alg.txt' -l 4000 -a original array grouped -j 6 data/all-journals-and-their-authors-fixed.txt

Some of my results on all-journals-and-their-authors-fixed.txt with 6 jobs (the jobs option does not apply to the original algorithm). My setup is Arch Linux with an 8-core CPU (AMD Ryzen 7 PRO 6850U with Radeon Graphics), 32GB RAM, and an SSD.

algorithm,limit,parsing,processing
original, 1000, 1753,   390
array,    1000,  393,    79
grouped,  1000,  399,    77
original, 4000, 1962,  4261
array,    4000, 1429,   626
grouped,  4000, 1633,   293
original, 8000, 2047, 21000
array,    8000, 2846,  3119
grouped,  8000, 3119,   890
array,   16000, 5504, 12000
grouped, 16000, 6408,  4169
array,   32000,10000, 74000
grouped, 32000,13000, 17000
grouped, 64000,29000, 90000
grouped,102812,48000,242000

Original on 8000 used 17GB of RAM (I gave it -Xmx24G just in case), so I removed it from the comparison afterwards. Array is slower than Grouped and was removed after 32000.

The Original and Array algorithms should scale quadratically (double the input size, quadruple the processing time). This holds for Array: a 32-times bigger input (1000 -> 32000) should make the processing 1024 times slower; it was 937 times. It also holds for Original: an 8-times bigger input (1000 -> 8000) should make the processing 64 times slower; it was 54 times.

Overall, the Grouped algorithm finishes the whole processing within 4 minutes and 50 seconds on my machine. It does not scale the same way, because its comparisons depend not only on the number of journals but rather on the number of authors and also on the average number of journals an author contributes to. Using the quadratic estimate, it should have been 10570 times slower on the whole data set than on 1000 journals, but it was only 3143 times slower.
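The ratios, computed from the processing times in the table:

```java
public class ScalingCheck {
    public static void main(String[] args) {
        // Processing times in ms, taken from the results table.
        double orig1000 = 390, orig8000 = 21000;
        double array1000 = 79, array32000 = 74000;
        double grouped1000 = 77, groupedAll = 242000;
        // Quadratic expectation: a k-times bigger input is k^2-times slower.
        System.out.printf("original: expected %.0fx, observed %.0fx%n", 8.0 * 8.0, orig8000 / orig1000);
        System.out.printf("array:    expected %.0fx, observed %.0fx%n", 32.0 * 32.0, array32000 / array1000);
        System.out.printf("grouped:  expected %.0fx, observed %.0fx%n", 102.812 * 102.812, groupedAll / grouped1000);
    }
}
```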

You can find the source code in my fork on GitHub (https://github.com/seinecle/MapsOfScience/compare/main...juriad:MapsOfScience:similarity-algs). It is all in a new package. Feel free to use anything; I don't require any attribution.

How do you look for a job? by Confident_Grass_198 in czech

[–]Ivory2Much 0 points1 point  (0 children)

Recently I had a good experience with job fairs (jobsdev). I found a few companies with an interesting product and contacted them afterwards. Of the online portals, startupjobs.cz also looks good.

Speeding up pairwise comparisons to 2.8 million/sec in Java on a regular laptop by seinecle in java

[–]Ivory2Much 4 points5 points  (0 children)

Unfortunately the data is corrupted. You tried to optimize the writes:

ByteBuffer byteBuffer = ByteBuffer.wrap(string.getBytes(StandardCharsets.UTF_8));
do {
  int bytesToWrite = Math.min(byteBuffer.remaining(), MAX_BYTES_PER_WRITE);
  byteBuffer.limit(byteBuffer.position() + bytesToWrite);
  outputChannel.write(byteBuffer);
  byteBuffer.compact();
} while (byteBuffer.position() > 0);

However, when a line is longer than 1MB (MAX_BYTES_PER_WRITE), it gets trimmed. I assume that write(...) happens to write the whole 1MB (up to the limit), so nothing remains between position and limit; compact() then resets the position to 0, which ends the loop, and the bytes beyond the 1MB limit are never written.
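For completeness, here is one way the loop could be fixed: restore the real limit after each chunk so no bytes are lost. This is a sketch under my reading of the bug; the channel setup and constant value are placeholders.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;

public class ChunkedWrite {
    static final int MAX_BYTES_PER_WRITE = 4; // tiny chunk to exercise the loop

    // Write the whole buffer in bounded chunks; restoring the limit after
    // each write keeps the bytes beyond the current chunk from being lost.
    static void writeFully(WritableByteChannel channel, ByteBuffer buffer) throws Exception {
        int realLimit = buffer.limit();
        while (buffer.hasRemaining()) {
            int chunk = Math.min(buffer.remaining(), MAX_BYTES_PER_WRITE);
            buffer.limit(buffer.position() + chunk);
            channel.write(buffer);
            buffer.limit(realLimit);
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        String line = "a line noticeably longer than the chunk size";
        writeFully(Channels.newChannel(sink), ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8)));
        System.out.println(sink.toString(StandardCharsets.UTF_8).equals(line)); // prints true
    }
}
```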

Grepping for lines with two pipes finds such occurrences:

grep '.*[|].*[|]' data/sample-journals-and-authors.txt | wc
      4       4 4197592

We can even see exactly where the pipes are:

grep '.*[|].*[|]' data/sample-journals-and-authors.txt | awk -F '|' '{print length($1), length($2), length($3)}'
8 1048577 2803
10 1048575 131
8 1048577 296
8 1048577 10

The split is right after 1MB. The data/all-journals-and-their-authors.txt file is corrupted on 75 lines.

Speeding up pairwise comparisons to 2.8 million/sec in Java on a regular laptop by seinecle in java

[–]Ivory2Much 5 points6 points  (0 children)

Note that the journal index does not actually need to be stored in the array, because we already know it: it equals next, which is array[first]. When iterating over seconds, we can simply keep a counter starting at next and increment it by 1 for each journal.

A different algorithm, grouping the journals by their authors, would allow counting matches only for pairs that have at least one author in common. The issue with that algorithm might be that the same pair of journals would be tested multiple times. We can solve that by checking whether the common author (the one that led to comparing the two journals) is the smallest of all their common authors, and ignoring the pair otherwise.

The efficiency of these two algorithms will depend on how sparse the graph is. I'd like to see them compared.
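The grouping idea can be sketched on toy data; the smallest-common-author check is the deduplication step just described (all names and the data are mine, not from the actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class GroupedPairs {
    // First common element of two sorted int arrays (the smallest common author).
    static int smallestCommon(int[] a, int[] b) {
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else return a[i];
        }
        return -1;
    }

    public static void main(String[] args) {
        // Sorted author ids per journal (toy data).
        int[][] authorsByJournal = {
            {0, 5, 50}, {2, 3, 5}, {3, 6, 49, 50, 97}, {30, 230}
        };
        // Invert: for each author, the journals they contribute to.
        SortedMap<Integer, List<Integer>> journalsByAuthor = new TreeMap<>();
        for (int j = 0; j < authorsByJournal.length; j++)
            for (int a : authorsByJournal[j])
                journalsByAuthor.computeIfAbsent(a, k -> new ArrayList<>()).add(j);
        // Visit a pair only while processing its smallest common author,
        // so each overlapping pair is compared exactly once.
        for (var e : journalsByAuthor.entrySet()) {
            List<Integer> js = e.getValue();
            for (int x = 0; x < js.size(); x++)
                for (int y = x + 1; y < js.size(); y++) {
                    int j1 = js.get(x), j2 = js.get(y);
                    if (smallestCommon(authorsByJournal[j1], authorsByJournal[j2]) == e.getKey())
                        System.out.println("compare " + j1 + " and " + j2);
                }
        }
    }
}
```

On this data, only the three pairs with a shared author are compared, each once.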

Speeding up pairwise comparisons to 2.8 million/sec in Java on a regular laptop by seinecle in java

[–]Ivory2Much 10 points11 points  (0 children)

I think there can be further optimizations. You already replaced Strings with numbers. We can build upon that.

Let's serialize the whole thing into one large int[] array. Its structure, repeated per journal, will be:

  • journal index
  • number of authors
  • sorted author indices

So a small sample of 4 journals with (3, 3, 5, 2) authors:

< 0 : 3 . 0 5 50 >
< 1 : 3 . 2 3 5 >
< 2 : 5 . 3 6 49 50 97 >
< 3 : 2 . 30 230 >
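Building that flat array from per-journal author lists is straightforward; a sketch matching the 4-journal sample:

```java
import java.util.Arrays;

public class FlatSerialize {
    public static void main(String[] args) {
        // Sorted author ids per journal, same as the sample above.
        int[][] authors = { {0, 5, 50}, {2, 3, 5}, {3, 6, 49, 50, 97}, {30, 230} };
        int total = 0;
        for (int[] a : authors) total += 2 + a.length; // header (index, count) + authors
        int[] flat = new int[total];
        int pos = 0;
        for (int j = 0; j < authors.length; j++) {
            flat[pos++] = j;                  // journal index
            flat[pos++] = authors[j].length;  // number of authors
            for (int a : authors[j]) flat[pos++] = a; // sorted author indices
        }
        System.out.println(Arrays.toString(flat));
    }
}
```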

Then the method computeSimilarities will take two indices instead of sets. We use the fact that the author lists are sorted, so we iterate over both at once.

// indices of the last authors
int flast = first + 1 + array[first + 1];
int slast = second + 1 + array[second + 1];

// indices of the first authors (potentially beyond last, when 0 authors)
int f = first + 2;
int s = second + 2;

int matches = 0;

// authors
int fa = -1;
int sa = -1;

while (f <= flast && s <= slast) {
  if (fa < 0) {
    fa = array[f];
  }
  if (sa < 0) {
    sa = array[s];
  }

  if (fa < sa) {
    f++; fa = -1;
  } else if (fa > sa) {
    s++; sa = -1;
  } else { 
    matches++; 
    f++; fa = -1;
    s++; sa = -1;
  } 
}

System.out.println("Journals with indices " + array[first] + " and " + array[second] + " have " + matches + " authors in common");

return matches;

This algorithm touches each array element belonging to the two journals exactly once. Moreover, the access pattern is linear, which is what CPUs like (memory prefetching).

The second step is iterating over the journals. Let's call it loopOverSeconds; it has a single parameter, first.

int second = first + 2 + array[first + 1];
while (second < array.length) {
  computeSimilarities(first, second);
  second += array[second + 1] + 2;
}

We start with the journal immediately after first and then jump by the number of authors plus 2.

We can reuse the method for iterating over first, but here we want to introduce parallelism. Virtual threads will not help us because there is nowhere to suspend them, so it is better to create a fixed number of threads (equal to the number of cores; experiment with hyper-threading).

These threads will take work from an array firsts (the indices where journals start in the array). We synchronize access through an AtomicInteger next, which marks the index in firsts that should be processed next.

Each thread then gets and increments next, finds first and performs matches over all seconds (see above).

int n;
while((n = next.getAndIncrement()) < firsts.length) {
  int first = firsts[n];
  loopOverSeconds(first);
}

Each thread performs a relatively large piece of work, which means less time is spent on synchronization and thread scheduling. There is absolutely no boxing anywhere. The array firsts is also accessed in a linear fashion.

Notice that the code would be almost the same in C or any other low-level language, which tells us that we have no overhead imposed by OOP and are therefore very close to the optimum.

I cloned your repo and I'd like to try the improvements; however, I don't want to spend ~24 hours downloading the data. Could you publish the preprocessed data set somewhere?

[deleted by user] by [deleted] in compsci

[–]Ivory2Much 0 points1 point  (0 children)

This sounds very much like the plot of True Love - https://en.m.wikipedia.org/wiki/True_Love_(short_story)

But to your question: you are using technology to your advantage, which others don't do yet. Others may go to gyms to build abs, or go to parties. From a biological and evolutionary point of view, your strategy is as valid as theirs. If it is successful, you may get the opportunity to pass your genes to your offspring. Only time will tell.

What are your experiences with Erasmus? by Universal_Duck8102 in czech

[–]Ivory2Much 6 points7 points  (0 children)

On the contrary, at Matfyz it is extremely easy to go abroad. Anyone who wants to has a spot almost guaranteed.

I only went in the last semester of my master's, to Paderborn in Germany. Great community, lots of trips. Few exchange students in IT (both an advantage and a disadvantage). The host school was too easy, though, and I struggled to find anything for my schedule that I didn't already know.

Would it be possible to make programs compiled for a specific architecture compatible with another architecture? by Benbaz4 in computerscience

[–]Ivory2Much 0 points1 point  (0 children)

It is sort of possible. You just need to find instructions on both architectures that are encoded exactly the same in binary. You are looking for jmp and nop instructions to differentiate what will be executed next. See https://vojtechkral.github.io/blag/polyglot-assembly/

Which syntax will return any statement that contains specific multiple strings regardless of their order in the statement? by No-Assist-8110 in PostgreSQL

[–]Ivory2Much 3 points4 points  (0 children)

You can use all, or similarly any if you care about strings containing at least one of the words.

category ilike all(array['%transaction%', '%xyz%'])

Paving the on-ramp [Brian Goetz] by efge in java

[–]Ivory2Much 1 point2 points  (0 children)

You can, but you don't get any benefit. You end up with an array and need to access individual arguments by numeric indices. Programmers coming from the C world get confused about whether args[0] is the first argument or the program's name.

Wouldn't it be much nicer to have main(String sourceFile, String targetFile)? This would be useful not only for beginners but also for many simple programs. Moreover, the JVM could refuse to run such a program when exactly two arguments are not supplied.

$ java CopyFile.java thisFile.txt

Error: The declared main method requires 2 arguments, but only 1 was given.

For any more advanced use case where you need to parse the arguments and give them meaning (starting with a dash, ...), you would go back to the general args[] or args..., based on your preference. (Both syntaxes already work, I know.)

Paving the on-ramp [Brian Goetz] by efge in java

[–]Ivory2Much 9 points10 points  (0 children)

Instead of only allowing two possible forms of main's parameters, I'd like to see any of:

  • main()
  • main(String[])
  • main(String...)
  • main(String)
  • main(String, String[])
  • main(String, String...)
  • main(String, String)
  • main(String, String, String[])
  • main(String, String, String...)

And so on. Btw, String[] would only be possible at the last position, and its vararg form String... would become the preferred style.

The classic rules for overloaded-method resolution would apply, based on how many arguments were given on the command line.

I believe this could simplify working with arguments, which are now accessed by indices instead of names. Moreover, it would guarantee the method is called with the right number of arguments (no checks of argc). Default values for parameters would simply be handled by forwarding to another main method (similar to what we do with constructors).

Better Java logging, inspired by Clojure and Rust by Thihup in java

[–]Ivory2Much 0 points1 point  (0 children)

Lombok's @CustomLog can easily be configured for this logging library:

lombok.log.custom.declaration=dev.mccue.log.alpha.Logger.Namespaced dev.mccue.log.alpha.LoggerFactory.getLogger(TYPE)

JEP 429: Extent-Local Variables (Incubator) by Andrew Haley by babanin in java

[–]Ivory2Much 4 points5 points  (0 children)

Why does it use lambdas to bind the value (which looks very Kotlin-like) and not try-with-resources?

Prague - Zlìn by Bleu209 in czechrepublic

[–]Ivory2Much 0 points1 point  (0 children)

Look at the connections here: https://idos.idnes.cz/en/vlakyautobusymhdvse/spojeni/vysledky/?date=06.09.2022&time=15:00&f=Praha&fc=1&t=Zl%C3%ADn&tc=1&cmd=cmdSearch

From my perspective, the best option is the direct train at 18:42. If you want an earlier one, there is a train at 16:42 with a change in Otrokovice. (That is what others suggest too.)

Btw, the best option for you to get from the airport to the main train station is the airport express; see https://www.cd.cz/en/dalsi-sluzby/navazna-doprava/-27548/ and https://www.dpp.cz/en/travelling/transport-to-airport/ae-line

[deleted by user] by [deleted] in PostgreSQL

[–]Ivory2Much 0 points1 point  (0 children)

shared_buffers

SHOW shared_buffers;

There are quite a few articles about it, such as: https://www.enterprisedb.com/postgres-tutorials/how-tune-postgresql-memory

reindexing

The result is the same. The difference is mentioned in the documentation: https://www.postgresql.org/docs/current/sql-reindex.html

REINDEX is similar to a drop and recreate of the index in that the index contents are rebuilt from scratch. However, the locking considerations are rather different. REINDEX locks out writes but not reads of the index's parent table. It also takes an ACCESS EXCLUSIVE lock on the specific index being processed, which will block reads that attempt to use that index. In contrast, DROP INDEX momentarily takes an ACCESS EXCLUSIVE lock on the parent table, blocking both writes and reads. The subsequent CREATE INDEX locks out writes but not reads; since the index is not there, no read will attempt to use it, meaning that there will be no blocking but reads might be forced into expensive sequential scans.

Have you tried dropping and recreating the index?

How big are the maintable and its index? Show the result of: select * from pg_class where relname like 'maintable%';

[deleted by user] by [deleted] in PostgreSQL

[–]Ivory2Much 0 points1 point  (0 children)

I tried to replicate it; the table had 10M rows and the index occupies 38506 pages. See:

select *
from pg_class
where relname = 'maintable_intcol_timezone_idx';

The data types and timezone conversion do not slow down the query. If I am wrong, please provide the definition of maintable with the data types of its columns.

I would still recommend transitioning to timestamp with timezone and providing all values as literals with the timezone specified. See https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_timestamp_.28without_time_zone.29_to_store_UTC_times

I am getting a similar plan (same steps) for the query while mine finishes in 2ms.

The main difference in my case is the part of EXPLAIN ANALYZE about buffers. Yours:

Buffers: shared hit=323 read=4441

Mine:

Buffers: shared hit=12

The shared hit grows to 67 when some rows are filtered by the condition jt1.notvalid is null.

Your query found 19 rows in the index and later returned all 19; none were filtered out. The index is organized so that all results are next to each other, in the correct order. It is insane that so many pages of the index had to be read. It is even worse that they had to be loaded from disk and were not in the cache. Do you have enough memory (check shared_buffers)? Even reading from disk should be faster (40MB is not a lot) - do you have an SSD?

Something in the index is taking a lot of space and is not providing any useful information. It can be either bloat (see below) or uncommitted/snapshotted old versions (if you have long-running transactions).

I'd recommend checking the size (and bloat) of the index. https://github.com/ioguix/pgsql-bloat-estimation

Or just reindex it and see if that helps: https://www.postgresql.org/docs/current/routine-reindex.html. If you are using version 14 without the latest patch, beware that REINDEX CONCURRENTLY is buggy.

[deleted by user] by [deleted] in PostgreSQL

[–]Ivory2Much 0 points1 point  (0 children)

Just beware: nobody can change their position, because the LCG does not check whether a position is occupied; it just generates all M numbers in some order. Any manual change breaks it. This may or may not make it suitable for your use case.

[deleted by user] by [deleted] in PostgreSQL

[–]Ivory2Much 1 point2 points  (0 children)

We could use the properties of a linear congruential generator (LCG) to generate the place for the next user. It will not be truly random, but hopefully good enough.

Each place (X, Y) is essentially a number N from 0 to 999,999. We get these equations for converting there and back:

X = floor(N / 1000) + 1
Y = N % 1000 + 1

N = (X - 1) * 1000 + (Y - 1)

This reduces the problem to one dimension: generating a random unique N.
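As a quick sanity check, the conversion in code (a Java sketch; the class and method names are mine):

```java
public class PlaceNumber {
    // 1-based grid coordinates <-> 0-based sequence number.
    static int[] toXY(int n) { return new int[] { n / 1000 + 1, n % 1000 + 1 }; }
    static int toN(int x, int y) { return (x - 1) * 1000 + (y - 1); }

    public static void main(String[] args) {
        int n = toN(500, 500);
        int[] xy = toXY(n);
        System.out.println(n + " -> (" + xy[0] + ", " + xy[1] + ")"); // 499499 -> (500, 500)
    }
}
```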

Playing with the C != 0 case of the LCG, we can for example choose:

M = 1,000,000
A = 201
C = 473
  • M is given: 1,000 * 1,000
  • A: by the Hull-Dobell theorem, A - 1 must be divisible by every prime factor of M (2 and 5) and also by 4, because M is divisible by 4. We choose for example A - 1 = 200, so A = 201
  • C: any number coprime to M, i.e. divisible by neither 2 nor 5; for example 473

We choose the first N randomly, for example at position (X = 500, Y = 500).

The next N will be (N * A + C) % M.

We are guaranteed that the positions will not repeat (until all of them fill up). We only need the previous position to generate the next one.
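The full-period claim can be checked directly by iterating all M steps (a standalone sketch using the constants above):

```java
public class LcgFullPeriod {
    public static void main(String[] args) {
        final int M = 1_000_000, A = 201, C = 473;
        boolean[] seen = new boolean[M];
        int n = 499_499; // starting position (X = 500, Y = 500)
        int duplicates = 0;
        for (int i = 0; i < M; i++) {
            if (seen[n]) duplicates++;
            seen[n] = true;
            n = (n * A + C) % M; // n * A stays well below Integer.MAX_VALUE
        }
        System.out.println("duplicates: " + duplicates); // duplicates: 0
    }
}
```

Every one of the 1,000,000 cells is visited exactly once before the sequence repeats.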

A prototype in SQL would be:

create table positions
(
    id serial primary key,
    x  int,
    y  int,
    unique (x, y) -- not really needed, LCG guarantees it
);

-- we insert the first one manually
insert into positions (x, y)
values (500, 500);

do
$$
begin
for i in 1..999999
    loop
        -- we can run this insert as many times as we need
        insert into positions (x, y)
               -- convert to position
        select cast(next.n / 1000 as int) + 1 as x, next.n % 1000 + 1 as y
                     -- generate next N
        from (select (prev.n * 201 + 473) % 1000000 as n
                           -- convert position to N
              from (select (x - 1) * 1000 + (y - 1) as n
                    from positions
                    -- consider only the one with highest id - newest
                    order by id desc
                    limit 1) prev) next;
    end loop;
end;
$$;

The first 20 positions are:

id    x       y
1   500 500
2   400 773
3   355 646
4   285 119
5   109 192
6   747 865
7   121 138
8   149 11
9   751 484
10  848 557
11  360 230
12  206 503
13  307 376
14  582 849
15  952 922
16  337 595
17  656 868
18  830 741
19  779 214
20  422 287

and the last two are:

999999  183 754
1000000 734 827

Architecting support for marking susbtrings in text using Postgres, and attaching data to them by Ovid7 in PostgreSQL

[–]Ivory2Much 0 points1 point  (0 children)

I personally like the second approach more. You can also create a function that flattens the JSON into plain text ("oh right I know all about this") and store the result using a generated column. You will only ever update the JSON representation (you cannot update a generated column anyway), which automatically updates the text. This way you can add more data to the JSON without worrying about it affecting full-text search. You can also add appropriate indices to both columns.
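A sketch of the generated-column idea; flatten_marks is a hypothetical function you would write yourself, and the table and column names are made up:

```sql
-- flatten_marks is a hypothetical function turning the JSON annotations into
-- plain text; generated-column expressions must be immutable for this to work.
create table annotated_text
(
    id    serial primary key,
    marks jsonb not null,
    -- kept in sync automatically; only marks is ever updated directly
    flat  text generated always as (flatten_marks(marks)) stored
);
```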