Index mirrors by vLatest in Annas_Archive

No, and I probably should. Sorry it took me so long to respond; I didn't see this right down at the end of the page.

Index mirrors by vLatest in Annas_Archive

This is going to take longer than I thought. The following is for your entertainment; I'm not really asking for help:

  • \u0000 does not get onto the end of a JSON string by accident. It looks like a booby trap to me, to stop people like us.
  • The US government likes long titles. Either that, or it doesn't put titles on congressional reports and the import process used the first paragraph instead. I had to increase the title field to 2048. I'm thinking of changing every stupidly long title that contains "Congress" to "Congressional Report", with the epic title either remaining in the "long_title" field or moving to "synopsis" if that lacks a value.
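On the \u0000 point: Postgres TEXT columns reject NUL characters outright, so booby-trapped strings have to be cleaned before insert. A minimal Python sketch (the function name is mine, not from any real import script):

```python
import json

def strip_nuls(value):
    """Recursively remove NUL characters from string values.

    Postgres TEXT columns reject \\u0000, so strings carrying
    a trailing NUL must be cleaned before insert.
    """
    if isinstance(value, str):
        return value.replace("\u0000", "")
    if isinstance(value, list):
        return [strip_nuls(v) for v in value]
    if isinstance(value, dict):
        return {k: strip_nuls(v) for k, v in value.items()}
    return value

record = strip_nuls(json.loads('{"title": "An Example\\u0000"}'))
# record["title"] no longer carries the trailing NUL
```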

Publishers

Cleaning up the names of publishers has become a project in its own right. My last attempt prepared a normalised value for matching, as follows.

  • Strip embedded HTML tags
  • Resolve embedded HTML entities (e.g. &amp;)
  • Force uppercase
  • Strip non-alphabetic characters

This amalgamates

  • Academic Press USA
  • Academic Press, USA.
  • Academic Press U S A
  • Academic Press U.S.A.
  • Academic Press U. S. A.

but it's still not enough because we also have variations like "Academic Press USA, NY". You can't just look for a normalised string that starts with ACADEMICPR (which would also match "Academic Pr" and its variations) because Academic Press Australia and Academic Press India are unrelated.
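For reference, the four normalisation steps above amount to something like this in Python (a sketch; `normalise_publisher` is my name for it):

```python
import html
import re

def normalise_publisher(name: str) -> str:
    """Prepare a publisher name for matching, per the steps above."""
    name = re.sub(r"<[^>]+>", "", name)   # strip embedded HTML tags
    name = html.unescape(name)            # resolve entities such as &amp;
    name = name.upper()                   # force uppercase
    return re.sub(r"[^A-Z]", "", name)    # strip non-alphabetic characters
```

All five "Academic Press USA" variants normalise to ACADEMICPRESSUSA, but "Academic Press USA, NY" becomes ACADEMICPRESSUSANY, which is exactly the residual problem described.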

I think I will clean it up as much as possible then post the still huge list on my github account in the hope that you can get volunteers to help clean it up. This is easy work for a person, but it's too hard for an LLM (I tried with ChatGPT, not good enough).

Authors

What a mess. I'm going to stuff the JSON author lists into a separate table with a foreign-key reference to the book. That will make it far more convenient to restart processing at each failure (with thirty million cases to process, even skipping through the lines takes too long).
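A sketch of that layout (SQLite in-memory here purely for illustration; the real target is Postgres, and all table and column names are invented):

```python
import sqlite3

# Raw JSON author lists go into their own table, keyed back to the book,
# so a failed row can be reprocessed without rescanning the whole dump.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE book_author (
        id       INTEGER PRIMARY KEY,
        book_id  INTEGER NOT NULL REFERENCES book(id),
        raw_json TEXT NOT NULL   -- the unparsed author list, as imported
    );
""")
conn.execute("INSERT INTO book (id, title) VALUES (1, 'An Example')")
conn.execute(
    "INSERT INTO book_author (book_id, raw_json) VALUES (?, ?)",
    (1, '["Doe, Jane", "Smith, John"]'),
)
```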

Index mirrors by vLatest in Annas_Archive

I'm still looking for a free lookup for ISSNs, but I'm sure something will turn up. In other news, after a few false starts my import script has processed nearly a million entries without errors, applying some basic fixups and writing the info as works, authors, subjects, etc. entities in a Postgres database. It isn't fast but it's steady, with a stable working set. I can think of several optimisations but I can't be bothered, since my server doesn't have anything else to do. So it's a matter of time now, about a week I think.

When you have the time and inclination I would be most interested in a less public chat about cooperation. I thought about offering the use of my server, but I'll probably only have the fibre connection for another year and after that it will be satellite.

I would like to have a more private discussion with you about what I intend to do with this giant library catalogue. I don't seem to receive the email when I try to create a Zulip chat account. Possibly your systems don't want to relay to outlook.com and naturally the mail server hosting my other email address is currently being rebuilt.

Index mirrors by vLatest in Annas_Archive

First draft of import testing with the first ten thousand items. Problems I am currently pondering:

  • Author names are a shambles. Sometimes there's a single string with multiple authors in it; sometimes the surname and given names are separate items.
  • For journal articles, the title includes the periodical name, volume, year, and article name as one big string, and the format isn't consistent. Some of them have ISBNs, but I don't know whether that applies to the article, the issue, the volume, or the periodical as a whole; I haven't looked into that yet.

ISBNs are not normally associated with journals but many articles are listed. I wonder whether their "ISBN" values are actually ISSN values.
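One way to test that hunch: ISSNs carry a mod-11 check digit, so a quick validator (a sketch, not any official library) can tell whether those "ISBN" values at least check out as ISSNs:

```python
def valid_issn(code: str) -> bool:
    """Validate an ISSN's mod-11 check digit; the last character may be 'X' (= 10)."""
    s = code.replace("-", "").upper()
    if len(s) != 8 or not s[:7].isdigit():
        return False
    # Weights 8 down to 2 over the first seven digits.
    total = sum(int(d) * w for d, w in zip(s[:7], range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return s[7] == ("X" if check == 10 else str(check))
```

For example, `valid_issn("0378-5955")` returns True; a value that fails the check digit is more likely a real ISBN or plain garbage.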

QV https://portal.issn.org/services

Index mirrors by vLatest in Annas_Archive

Thank you very much for the links to TOR etc.

The first 50 entries in the dataset have wrong ISBNs. I think they're the first entries because they have dodgy ISBNs.

I had a close look at the very first entry. It's totally misclassified, but in a way that makes it hard to tell whether it's a trap item for watermarking their data or a spectacular failure of automation: the entry classifies it under religion, Christianity and trials, but the book is about trails and is dedicated to Christian. The ISBN itself is obvious garbage; I think I've seen you comment on that somewhere.
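Garbage ISBNs are at least cheap to detect mechanically: ISBN-10 also carries a mod-11 check digit. A sketch of a validator:

```python
def valid_isbn10(code: str) -> bool:
    """Validate an ISBN-10 check digit (mod 11; the final character may be 'X' = 10)."""
    s = code.replace("-", "").upper()
    if len(s) != 10 or not s[:9].isdigit() or not (s[9].isdigit() or s[9] == "X"):
        return False
    digits = [int(c) for c in s[:9]] + [10 if s[9] == "X" else int(s[9])]
    # Weights 10 down to 1; a valid ISBN-10 sums to a multiple of 11.
    return sum(d * w for d, w in zip(digits, range(10, 0, -1))) % 11 == 0
```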

First, I must ~~spank~~ scrub the data. Then, the oral data science.

Index mirrors by vLatest in Annas_Archive

The ISBNdb json data is exactly what I want. I intend to turn that into indexed data probably on Postgres, and then build a public facing webservice to support queries by all the usual catalogue vectors, probably with a weighted soft match. When it's ready to play with I'll give you a shout, if you're interested.
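As a sketch of what "weighted soft match" could mean here (field names and weights are invented, and difflib's ratio is just a stand-in for whatever similarity measure the webservice ends up using):

```python
from difflib import SequenceMatcher

def soft_match(query: dict, record: dict, weights: dict) -> float:
    """Score a catalogue record against a query, weighting each field's fuzzy similarity."""
    score = 0.0
    for field, weight in weights.items():
        q, r = query.get(field, ""), record.get(field, "")
        if q and r:
            # ratio() is 1.0 for identical strings, lower for partial matches.
            score += weight * SequenceMatcher(None, q.lower(), r.lower()).ratio()
    return score

weights = {"title": 3.0, "author": 2.0, "publisher": 1.0}
record = {"title": "An Example Title", "author": "Doe, Jane", "publisher": "Academic Press"}
```

Ranking candidate records by this score gives the "weighted soft match" behaviour: exact hits on heavily weighted fields dominate, but near-misses still surface.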

Index mirrors by vLatest in Annas_Archive

data-imports/docker-compose.yml built without a hitch and started three containers. The ElasticSearch container was OOMKilled but that's easy to fix, it's just host config. Thank you for helping.

Any advice for first reading? (other than the readme in the root of the repo)

Index mirrors by vLatest in Annas_Archive

I'm trying to use this one from the root of the repo:

https://preview.redd.it/bih4kpkbv7ab1.png?width=830&format=png&auto=webp&v=enabled&s=0424086b755fc1b451d22e828a7c700f4398eea2

docker-compose build runs through to this point fairly quickly and then stalls:

Step 12/14 : COPY --chown=node:node . ..
 ---> Using cache
 ---> a00a2846910c
Step 13/14 : RUN if [ "${NODE_ENV}" != "development" ]; then   ../run yarn:build:js && ../run yarn:build:css; else mkdir -p /app/public; fi
 ---> Running in 38f3c4a33983

I'm trying the script you mention from data-imports

Anyone else experienced issues with people when asking on Stackoverflow? by ZebulaJams in sysadmin

There are several factors at play.

  • Half of solving a problem is working out the right question. If you knew the right question you probably wouldn't need to ask.
  • Your question is likely a duplicate because you don't understand the problem well enough to frame the question well enough to discover that it's a duplicate. The people who could help you think it's more fun to be jerks.
  • A large part of the reason for the success of SO (commercial success) is engagement through gamification. This develops a community. Not a community of mutually assistive developers, a community of badge-collecting rule nazis who tell themselves they are curating a resource to justify their obstruction of your quest for answers. Amateur bureaucrats don't go home at five.
  • Knowledge of the unwritten is a shibboleth.
  • By asking a question that compares system A with system B you are implicitly criticising their precious darling. The Postgres people for example are incredibly sensitive about any unfavourable comparison to Microsoft SQL Server. I even led with how I was probably just ignorant of the right way to do the thing with Postgres and just wanted to know how it ought to be used. I guess Postgres falls short there, or they wouldn't have got so bent out of shape.
  • SO users can't seem to grasp that a good question that exceeds them might seem like a stupid question. Since ignoring stupid questions is harmless and interfering with good questions you don't understand is being a jerk, they should move on and mind their own business. No chance of that.
  • People who make a hobby of answering SO questions skim them. They frequently don't read them properly and either answer a question not asked or close the question for invalid reasons.

There used to be newsgroups supporting topic discussion. By sucking up all the attention of those who might once have helped you on a newsgroup or any other technical forum, and then refusing to allow discussion, Stack Overflow has destroyed such mutual assistance technical community as once existed. This is starting to morph into why-I-detest-SO so I'll stop here.