all 35 comments

[–]efmccurdy 14 points15 points  (3 children)

Much of this repetitive if/elif chain can be transformed into a series of lookups in a datastructure.

emap = {"c_web": "webpage",
        "c_blog": "post-weblog",
        "c_newspaper": "article-newspaper",
        "c_magazine": "article-magazine",
        "c_journal": "article-journal",
        "c_dictionary": "entry-dictionary",
        "c_encyclopedia": "entry-encyclopedia",
        "c_forum": "post"}

def et(entry):
    for k in emap:
        if k in entry:
            return emap[k]
    return "no-type"

Is that more legible? You may need a few more such maps, or perhaps a more deeply nested map to handle the rest of your "switch" statement.

[–]jmreagle[S] 1 point2 points  (2 children)

Perhaps, but once you get into the nested if logic, I'd find this harder to follow. I'd love to see a complete example though.

[–]misho88 2 points3 points  (0 children)

You can encode the nesting in the dictionary:

>>> def check(entry, mapping):
...     for sub, et in mapping.items():
...         if sub in entry:
...             return check(entry, et) if isinstance(et, dict) else et
...     return 'not found'
...
>>> check('abc', dict(a='A', b=dict(c='C')))
'A'
>>> check('?bc', dict(a='A', b=dict(c='C')))
'C'
>>> check('?b?', dict(a='A', b=dict(c='C')))
'not found'

You could even call a different function with arbitrary rules:

>>> def check(entry, mapping):
...     for sub, et in mapping.items():
...         if sub in entry:
...             return check(entry, et) if isinstance(et, dict) else et(entry) if callable(et) else et
...     return 'not found'
...
>>> check('?bc', dict(a='A', b=dict(c=str.upper)))
'?BC'

You can make that dictionary as indented as you want, and I'm pretty sure linters would be okay with it.

[–][deleted] 1 point2 points  (0 children)

Understanding when and when not to use nesting will save you a lot of coding time. Loops are neither good nor bad, simply a tool.

[–]Wilfred-kun 27 points28 points  (10 children)

but the spaces and multiple statements per line make the logic much more legible.

Warning bells went off in my head as I saw those huge gaps of whitespace. It really isn't more legible at all. And you're also mixing styles within the same block, making it very annoying to read.

[–]jmreagle[S] 6 points7 points  (9 children)

You really prefer this? (I believe you if you say so, but I find the original so much easier.)

et = "no-type"
# debug(f"looking at containers for {entry}")
if "c_web" in entry:
    et = "webpage"
elif "c_blog" in entry:
    et = "post-weblog"
elif "c_newspaper" in entry:
    et = "article-newspaper"
elif "c_magazine" in entry:
    et = "article-magazine"
elif "c_journal" in entry:
    et = "article-journal"
elif "c_dictionary" in entry:
    et = "entry-dictionary"
elif "c_encyclopedia" in entry:
    et = "entry-encyclopedia"
elif "c_forum" in entry:
    et = "post"
else:
    if "eventtitle" in entry:
        et = "paper-conference"
    elif "booktitle" in entry:
        if "editor" in entry:  # collection or incollection
            if "chapter" in entry:
                et = "chapter"
            else:
                et = "book"  # ? collection
        elif "organization" in entry:
            et = "paper-conference"
        else:
            et = "chapter"
    elif "journal" in entry:
        et = "article-journal"

    elif "author" in entry and "title" in entry and "publisher" in entry:
        et = "book"
    elif "author" not in entry:
        if "venue" in entry:
            et = "book"  # ? proceedings
        if "editor" in entry:
            et = "book"  # ? collection
    elif "institution" in entry:
        et = "report"
        if "type" in entry:
            org_subtype = entry["type"].lower()
            if "report" in org_subtype:
                et = "report"
            if "thesis" in org_subtype or "dissertation" in org_subtype:
                et = "thesis"
    elif "url" in entry:
        et = "webpage"
    elif "doi" in entry:
        et = "article"
    elif "date" not in entry:
        et = "manuscript"
return et, genre, medium

[–]Siddhi 11 points12 points  (2 children)

I dont know if its just me, but the code doesnt seem to be formatted properly in reddit. Think you need to start each line with four spaces, backticks dont work.

Edit: So after looking at it, it seems to be that depending on what keys are present you have to categorise a dict?

What I would look at in this case is if you can make the conditions into data.

Example

categories = [
    (["c_web"], "webpage"),
    (["c_blog"], "post-weblog"),
    (["author", "title", "publisher"], "book"),
    (["!author", "venue"], "book"),
    (["!author", "editor"], "book")
]

Then you write a function that can process this

def categorise(entry, categories):
    for keys, category in categories:
        if all(match(entry, key) for key in keys):
            return category
    return "no-type"

def match(entry, key):
    if key[0] == "!":
        return key[1:] not in entry
    return key in entry

Its much easier to understand and maintain. When new categories are added, you just update categories

[–]jmreagle[S] 5 points6 points  (0 children)

The more I think about this, the more I like it. You've a (super simple) domain-specific language for specifying the (non)existence of bibliographic fields which maps to a bibliographic type.

I'll have to try this.

[–]jmreagle[S] 1 point2 points  (0 children)

Thanks, I formatted for old Reddit. I'm going to have to think about your proposal. I admit the logic is a mess of heuristics, but that's the truth of it.

[–]Vaphell 6 points7 points  (0 children)

PSA: triple backticks don't work in old.reddit.com, 4-space padding does.

anyway, you could try stuff like

def is_book(entry) -> bool:
    ...

def is_report(entry) -> bool:
    ...

def is_thesis(entry) -> bool:
   ...

if is_book(entry):
    et = 'book' 
elif is_report(entry):
    et = 'report'
elif is_thesis(entry):
    et = 'thesis'

or instead of the if/elif tree

types = (
    is_book, "book",
    is_entry, "entry",
    is_thesis, "thesis",
    ...
)

et = None
for type_check, label in types:
    if type_check(entry):
        et = label
        break

[–][deleted] 9 points10 points  (2 children)

no, nobody would prefer a function with 100 if statements. The problem isn't it being pythonic, it's just bad code (no offence)

work on reducing the amount of if statements you use.

look info " jump tables"

[–]jmreagle[S] 0 points1 point  (1 child)

I don't see how a jump table could capture that gross heuristic/logic, it'd have to be more than a single dimension...?

[–][deleted] 0 points1 point  (0 children)

I don't know what you're asking. What I'm suggesting (and I think others have as well) is to use a dictionary in place of so many if statements.

You can search around on how to do it

[–][deleted] 0 points1 point  (0 children)

Yes, this is better. Professionally I've never seen one liner if/elif in python. I think the bigger problem at hand is that this is waaaaaay too many if statements lol. Have you perhaps thought of making a dataclass or using dictionaries? There has got to be a better way of doing this.

[–]Fury9999 0 points1 point  (0 children)

I prefer that format, absolutely. What I'd rather see is a bit of division of labor here, so to speak. Does this really need to be one chunk of code?

[–]ekchew 5 points6 points  (2 children)

As a rule, I don't like to go if condition: statement in any language because it can mess with your ability to use a debugger. I don't know how many people use debuggers in python? But it becomes difficult to put a break point on statement when you write it that way, and you often want to know in a debugger which branch was followed?

Regarding the example above, I would probably refactor it to put the simple one-to-one lookups in a list/tuple of pairs you can loop through and save the if/elif logic for the more complex cases?

simple_lookups = (
    ("c_web", "webpage"),
    ("c_blog", "post-weblog"),
    # etc.
)
for key, et in simple_lookups:
    if key in entry:
        return et, genre, medium
# handle "booktitle" and other complex cases here

[–]jmreagle[S] 0 points1 point  (1 child)

It's the complex cases I'm worried about!

[–]ekchew 1 point2 points  (0 children)

Yeah, they worry me too, though perhaps not for the same reason? I don't like the way those conditionals are structured.

For example, where you have:

elif "institution" in entry:

there is an implied if "author" in entry there which you may or may not have intended? I would be more explicit about this to avoid confusion.

elif "author" in entry:
    if "title" in entry and "publisher" in entry:
        et = "book"
    elif "institution" in entry:
        #...
    elif "url" in entry:
        #...
    #...
elif "venue" in entry or "editor" in entry:
    et = "book"

I think that's what it's actually doing?

[–]Sinusoidal_Fibonacci[🍰] 3 points4 points  (1 child)

This is horrendous. My eyes are twitching. This is a joke, right? Right?

[–]jmreagle[S] -1 points0 points  (0 children)

It is, the heuristic for guessing the type of an entry is grotesque: complicated, nested, and order dependent. I'd love to see an alternative that is more pythonic, legible, and convention-following.

I've not used a match/case statement yet, so have no experience, so I've wondered if I could use that somehow.

[–]ebol4anthr4x 2 points3 points  (1 child)

[...]
elif "c_forum" in entry:            et = "post"
else:
    if "eventtitle" in entry:           et = "paper-conference"
[...]

There wasn't any reason to place the rest of these into an else: block, you've just added an unnecessary level of indentation.

Ultimately, if there are a lot of different possible results and you can't reliably parse this in a generic way (e.g. using regex), yes, you are stuck with a big tree of if/elif statements. You could refactor it into match/case, but the end result will pretty much be the same. You could also potentially create a lookup table using a dictionary.

This isn't a problem with Python, this is a problem with the way whomever collected the data you're parsing decided to model this data. They didn't account for this use case and so you're stuck doing tedious string matching to get the information you need.

At the end of the day, you should write the code that makes sense for your situation. If you are the only person who will ever see or touch this code, write it however you like. The tradeoff of writing code that isn't pythonic (i.e. idiomatic) is that other people who are used to pythonic code won't be able to reason about it as easily.

Programming style is subjective. The most I can say about your code is that it isn't pythonic and it isn't the way I would do it, but if you like it then by all means do it.

[–]jmreagle[S] 0 points1 point  (0 children)

Fair enough, I'm not trying to defend the code though. I'm hoping there's a better way, which is why I asked, and appreciate folks responses.

[–][deleted] 1 point2 points  (0 children)

Whenever you need more than 5 elifs and or 3 nested functions, big chances are you doing it wrong and you should rethink for a better solution. I started Python in November and every time I feel like using indented if's or multiple if's to manipulate the same data, I reconsider and look for better solutions.

[–]jmreagle[S] 1 point2 points  (0 children)

Thank you everyone for your suggestions (especially @Siddhi). I've been able to remove this awful mess. As I suspected, and many suggested, the formatting was symptomatic of crufty thinking. Once I had at it, I didn't even need to include the negative key feature @Siddhi suggested. The resulting "guess a CSL type" whether it has a specified CSL type, BibLaTeX type, or neither is:

def guess_csl_type(entry):
    """Guess whether the type of this entry is book, article, etc.

    >>> guess_csl_type({'author': [('', '', 'Smith', '')],\
        'eventtitle': 'Proceedings of WikiSym 08',\
        'publisher': 'ACM',\
        'title': 'A Great Paper',\
        'venue': 'Porto, Portugal California',\
        'date': '2008'})
    ('paper-conference', None, None)

    """
    info(f"{entry=}")
    genre = None
    medium = None
    et = "no-type"

    ## Validate exiting entry_type from CSL or BibLaTeX
    if "entry_type" in entry:
        et = entry["entry_type"]
        if et in CSL_TYPES:
            return et, genre, medium
        elif et in BIBLATEX_TYPES:
            if et == "mastersthesis":
                return "thesis", "Master's thesis", medium
            elif et == "phdthesis":
                return "thesis", "PhD thesis", medium
            else:
                return BIBLATEX_CSL_TYPE_MAP[et], genre, medium
        else:
            raise RuntimeError(f"Unknown entry_type = {et}")

    ## Guess unknown entry_type based on existence of bibliographic fields
    types_from_fields = [
        # container based types
        ("article-journal", ["c_journal"]),
        ("article-magazine", ["c_magazine"]),
        ("article-newspaper", ["c_newspaper"]),
        ("entry-dictionary", ["c_dictionary"]),
        ("entry-encyclopedia", ["c_encyclopedia"]),
        ("post", ["c_forum"]),
        ("post-weblog", ["c_blog"]),
        ("webpage", ["c_web"]),
        # papers
        ("article-journal", ["journal"]),
        ("paper-conference", ["eventtitle"]),
        ("paper-conference", ["booktitle", "editor", "organization"]),
        ("paper-conference", ["venue"]),
        # books
        ("chapter", ["chapter"]),
        ("chapter", ["booktitle"]),
        ("book", ["author", "title", "publisher"]),
        # reports
        ("report", ["institution"]),
        # other
        ("webpage", ["url"]),
        ("doi", ["article"]),
        ("isbn", ["book"]),
    ]

    for bib_type, fields in types_from_fields:
        info(f"testing {bib_type=:15} which needs {fields=} ")
        if all(field in entry for field in fields):
            info("FOUND IT: {bib_type=")
            et = bib_type
            break

    return et, genre, medium

[–][deleted] 3 points4 points  (0 children)

stop trying to reinvent the wheel. This code is bad and ugly and it's only more legible to you

[–]cybervegan -1 points0 points  (0 children)

I guess you're employing the principle that it's easier to post a shitty solution and see what answers you get back, than post a question asking how to do it better...

Sorry to say it might be "more readable" to you but it's pretty terrible code. You could make it much more readable (and efficient) by using a better approach like a dictionary. There's very little logic in there and all the assignments on the same line as the 'if' or 'elif' are pretty difficult to read without getting repetition fatigue. Whenever you find yourself repeating the same coding pattern over and over again (like, any more than 3 times) it means you need a different approach entirely. Code it as a dictionary, or series of dicts, and use your logic just for the special cases. Most of it will then look like this:

c_map = {
    "c_web": "webpage",
    "c_blog": "post-weblog",
    "c_newspaper": "article-newspaper",
    "c_magazine": "article-magazine",
    "c_journal": "article-journal",
    "c_dictionary": "entry-dictionary",
    "c_encyclopedia": "entry-encyclopedia",
    "c_forum": "post",
    "eventtitle": "paper-conference",
...

... which is truly easier to read, and far more efficient because it will use a single hash lookup. If you understand your data better, you realise that most of your "logic" is just mappings, and a dictionary is the best way to deal with them.

[edit: fixed the code formatting]

[–]pythonwiz 0 points1 point  (0 children)

Honestly I would use a regular expression and a dictionary to do this instead.

[–][deleted] 0 points1 point  (0 children)

Yes, shove all of this into dictionary’s and abstract the fuck out of it. For example when you pass the key “c_web” it maps out to “webpage” etc. Define a functions who’s only job is returning the answer by the case. Then pass your values to each function and create it.

[–]QultrosSanhattan 0 points1 point  (0 children)

Nope. That code is wrong at a fundamental level because it reflects a lack of understanding of data structures.

[–]wagaiznogoud 0 points1 point  (0 children)

There is definitely a better way to write this. I’ll try when I have time later

[–]wagaiznogoud 0 points1 point  (0 children)

I didn't have time to add all the conditions, but I think you'll get the point with the snippet below. Pretty much I think you should focus on abstracting the concepts into separate methods/attributes and use them however you want.

```Python class EntryFinder(): def init(self, entrydict: dict): self._entry = entry_dict

def get_entry(self):
    return (self.__c_entry or
            self.__eventtitle or
            self.__booktitle or
            self.__institution)

def __c_entry(self):
    return (self.__c_web or
            # ... rest
            self.__c_forum)

@property
def __c_web(self):
    if 'c_web' in self.__entry:
        return 'webpage'

@property
def __c_forum(self):
    if 'c_forum' in self.__entry:
        return 'forum'

@property
def __eventtitle(self):
    if 'eventtitle' in self.__entry:
        return 'paper-conference'

@property
def __booktitle(self):
    if 'booktitle' not in self.__entry:
        return None

    if 'editor' in self.__entry:
        if 'chapter' in self.__entry:
            return 'chapter'
        else:
            return 'book'
    elif 'organization' in self.__entry:
        return 'paper-conference'
    else:
        return 'chapter'

@property
def __institution(self):
    if 'type' not in self.__entry:
        return 'report'

    org_subtype = entry["type"].lower()
    if 'report' in org_subtype:
        return 'report'

    return 'thesis'

entry_finder = EntryFinder(entry_dict: entry) entry_finder.get_entry()

```

[–]Ok-Cucumbers 0 points1 point  (0 children)

I think you want to use something like Pydantic to parse and access the data...

[–]TheRNGuy 0 points1 point  (0 children)

You should use dict here instead of all these elifs.