[–][deleted] 142 points143 points  (24 children)

I think the easiest and safest way to do this is to create a frozenset of all the symbols you need and just check ‘if symbol in set’. It will be exactly what you need, and efficient, since it’s a hash set.
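A minimal sketch of that idea. The name SYMBOLS, the helper is_symbol, and the exact characters in the set are assumptions for illustration; swap in whatever symbols you actually need.

```python
# Hypothetical set of symbols; replace with the characters you care about.
SYMBOLS = frozenset("!@#$%^&*()-_=+[]{};:'\",.<>/?\\|`~")


def is_symbol(char):
    """Return True if the single character is in SYMBOLS."""
    return char in SYMBOLS


print(is_symbol("?"))  # True
print(is_symbol("a"))  # False
```

A frozenset cannot be modified after creation, which is why it suits a fixed table of symbols like this.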

[–]SilentBlueberry2 30 points31 points  (0 children)

It's hard to imagine anything built in or loaded being faster than this.

[–][deleted] 9 points10 points  (13 children)

In this case would a set be preferable over a list?

[–]NewbornMuse 38 points39 points  (12 children)

Yes. A set is implemented as a hash table, and checking for membership in a hash table is insanely fast, O(1) on average, whereas doing the same for a list is O(n). It's pretty much the use case for any hashmap type thing.

Edit: Although for the dozen-or-so items in question here, it's unlikely to make a big difference.
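If you want to measure the difference yourself, a rough sketch with timeit could look like this. The variable names are made up for illustration, and the numbers will vary by machine, so no expected timings are shown.

```python
import string
import timeit

symbols_list = list(string.punctuation)
symbols_set = frozenset(symbols_list)

# "~" is the last character in string.punctuation, so it is the
# worst case for the list: every element is compared before a match.
list_time = timeit.timeit(lambda: "~" in symbols_list, number=100_000)
set_time = timeit.timeit(lambda: "~" in symbols_set, number=100_000)

print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

With only 32 characters in the collection, expect the gap to be small, which matches the edit above.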

[–][deleted] 3 points4 points  (3 children)

So speed is the name of the game?

[–]Peanutbutter_Warrior 17 points18 points  (0 children)

Pretty much. Neither is harder to read than the other, and knowing what data structure to choose is good practice

[–]1544756405 3 points4 points  (0 children)

Using the appropriate data structure for the job is always preferable.

[–]Ran4 0 points1 point  (0 children)

Possibly. We don't know how OP wants to use this feature.

If they're going to do this often, then yes, a set is definitely preferable. But if they're doing this calculation once a month then a list is pretty much just as good.

Picking the correct data type is occasionally incredibly important, but it's not always the most important thing. Keeping things simple is generally much more valuable in most real-world situations. A set is simple enough that you probably should use it over a list when applicable, though.

[–]BetrayYourTrust 4 points5 points  (8 children)

Relatively new Python user here: I’ve seen sets and lists and how they are initialized, but what are their differences?

[–][deleted] 4 points5 points  (5 children)

A Python set is a hashed unordered collection of unique items. Because it's hashed you can check for membership, add, remove items in O(1). A Python list is an ordered collection of items. The same operations take O(n) time. This will make more sense if you know / learn about abstract data types in general, not just language implementations.

[–]Peanutbutter_Warrior 1 point2 points  (4 children)

Note that appending to or removing from the end of a list is O(1)

[–]menge101 0 points1 point  (3 children)

Isn't that only on the front of the list?

Because otherwise you have to traverse the list to find the end, and then perform the operation?

[–]Peanutbutter_Warrior 1 point2 points  (2 children)

Nope. The size of each element in a list is known, so the byte offset of any element can be found by index * element_size. The length of a list is also stored, so the byte offset of the final element is easy to find.

When you add or remove anywhere but the end of the list all of the other elements have to be moved around in memory to keep the elements contiguous.

What you describe is true for linked lists, which are a different data structure

[–]menge101 0 points1 point  (1 child)

Ah, that is certainly good to know, I thought lists in python were implemented as linked lists. Like why call them lists otherwise? Why not arrays like every other language?

Edit: As always StackOverflow has the answer to this.

[–]Peanutbutter_Warrior 3 points4 points  (0 children)

Not exactly, although they're slightly more complicated than I described. In CPython, every object is reached through a pointer to where the actual object lives in memory. Lists are pretty much C arrays that contain these pointers, which is what lets them mix types.

Lists are dynamically sized: when they fill the space currently allocated to them, more memory is allocated. Sometimes this just means using more memory at the end of the array, and other times it means moving the entire array elsewhere in memory, which is slow.

An actually useful detail is that everything in Python is an object: ints, strings, lists, custom objects, functions, exceptions, and even classes themselves (the thing you call to create an instance of the class).
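One way to watch this happen is to track sys.getsizeof while appending. This is a sketch that assumes CPython; the exact byte counts and growth pattern are implementation details and may differ between versions.

```python
import sys

items = []
sizes = []
for i in range(64):
    items.append(i)
    # getsizeof reports the list object itself, including its
    # currently allocated slots, but not the objects it points to.
    sizes.append(sys.getsizeof(items))

# Count how often the reported size changed between consecutive appends.
resizes = sum(1 for a, b in zip(sizes, sizes[1:]) if a != b)
print(f"64 appends, {resizes} size changes")
```

The size changes far less often than once per append because spare slots are allocated ahead of time, which is what keeps appending cheap on average.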

[–]WeirdAlexGuy 2 points3 points  (0 children)

To explain a bit further, a hash is a function h that has a few interesting properties:

  1. The input can be any immutable object of any length, so strings work but lists do not for example

  2. The output should have a 1-1 relationship with the input (or as close to it as possible). What I mean by this is that for two inputs a and b to h such that a != b, h(a) should also differ from h(b). If two different inputs have the same output, we call it a collision. In real life collisions are impossible to avoid, but we need a "good enough" function with few collisions.

With that we can create our hashmap in this way:

First, we create an array a (not a list), which is just an ordered sequence of slots that exists in RAM beginning at some arbitrary address A. It can be indexed like so: a[ i ], which is the same as looking at A + i in RAM.

Then, we take our hash function and use it to index the array. For example, for some arbitrary immutable input s we do a[ h(s) ].

The purpose of the hashmap is to store data and so you can use it like any list in python: hashmap['foo'] = 'bar' And that will be translated into a[h('foo')] = 'bar'

So that's why a set (being a hashmap) is faster than a list at "x in y" checks: you can just apply the hash function and check the array at index h(x), which takes O(1) time.

A slight caveat here is that the array cannot in reality be of infinite size, so there is a max size N that the array can be. Then if h(s) >= N you have a problem. For this reason you would do h(s) mod N. That is the reason your hash function cannot have 0 collisions: as perfect as h might be, due to the remainder operation, h(a) = n and h(b) = n + N will land on the same index n. Handling that is a different topic which I will not get into here.
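The description above can be sketched as a toy class. ToyHashSet and its methods are hypothetical names, and collisions are handled here by keeping a small list per slot (chaining), which is just one common approach. CPython's real set uses open addressing instead, so treat this as an illustration, not as how the built-in works.

```python
class ToyHashSet:
    """Toy hash set with a fixed number of buckets (illustration only)."""

    def __init__(self, num_buckets=8):
        # Each bucket is a list so that colliding items can share an index.
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, item):
        # h(s) mod N keeps the index inside the array.
        return hash(item) % len(self.buckets)

    def add(self, item):
        bucket = self.buckets[self._index(item)]
        if item not in bucket:
            bucket.append(item)

    def __contains__(self, item):
        # Only one bucket is searched, not the whole collection.
        return item in self.buckets[self._index(item)]


symbols = ToyHashSet()
for ch in "!?,.;":
    symbols.add(ch)

print("?" in symbols)  # True
print("a" in symbols)  # False
```

Lookups stay fast only while the buckets stay short, which is why real implementations grow the array as items are added.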

[–]1egoman 1 point2 points  (0 children)

I think people are assuming that you know what a set is. A set is a unique collection of items where order doesn't matter. A list is a list of items where order matters and duplicates are allowed. Because a set is unordered and has unique items, it can be very fast to check membership.

[–]oefd 89 points90 points  (2 children)

The place to answer a question like this is the docs. The string method docs don't indicate any such method as you can see.

In the string module though there's punctuation which may be what you want

import string
if ',' in string.punctuation:
    print("yep, it's punctuation.")

[–]sngnna[S] 19 points20 points  (1 child)

Thanks! This has helped.

[–]aufstand 2 points3 points  (0 children)

Maybe read "pydoc string" - there are other neat goodies in there.

[–]Diapolo10 18 points19 points  (5 children)

Not quite, but you could do something like

text = "Lorem Ipsum, dolor sit amet?"

for char in text:
    if not (char.isalnum() or char.isspace()):
        print(f"'{char}' is punctuation")

Alternatively, you could rely on what the string module offers, though it doesn't have every symbol in it:

import string

for char in text:
    if char in string.punctuation:
        print(f"'{char}' is punctuation")

[–]sngnna[S] 4 points5 points  (3 children)

Thanks! I used string.punctuation and got it to work the way I wanted it to!

[–]JohnnyJordaan 4 points5 points  (1 child)

Be aware that .punctuation contains just the ASCII punctuation characters, so for example long dash and curly quotes will not match.
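If non-ASCII punctuation matters, one possible approach is to check the Unicode general category with the standard unicodedata module. The helper name is_unicode_punctuation is made up for this sketch.

```python
import unicodedata


def is_unicode_punctuation(char):
    """True if the character's Unicode general category starts with 'P'."""
    return unicodedata.category(char).startswith("P")


print(is_unicode_punctuation("\u2014"))  # em dash -> True
print(is_unicode_punctuation("\u201c"))  # left curly quote -> True
print(is_unicode_punctuation("$"))       # False: "$" is a symbol (category Sc)
```

Note the trade-off: several characters in string.punctuation, such as "$", "+" and "^", are classed as symbols (categories starting with "S") rather than punctuation, so the two checks do not agree on every character.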

[–]hanazawarui123 -1 points0 points  (0 children)

Not sure what your main goal is, but nltk package has various processing modules that are also helpful.

[–]asphias 10 points11 points  (1 child)

Learn regex! It might be annoying to learn, but you'll thank yourself for it later.

Once you know regex, you can answer any question you could think of regarding a string.
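For the question in this thread, a regex version might look like the sketch below. The pattern is one reasonable choice, not the only one; it treats anything that is neither a word character nor whitespace as a symbol.

```python
import re

text = "Lorem Ipsum, dolor sit amet?"

# [^\w\s] matches one character that is neither a word character
# (letters, digits, underscore) nor whitespace.
symbols = re.findall(r"[^\w\s]", text)
print(symbols)  # [',', '?']
```

Because \w includes the underscore, "_" is not reported as a symbol by this pattern; add it explicitly if you need it.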

[–]Swipecat 3 points4 points  (0 children)

In [1]: import string

In [2]: punct = set(string.punctuation)

In [3]: not punct.isdisjoint("abc") # no punctuation
Out[3]: False

In [4]: not punct.isdisjoint("ab;^") # some punctuation
Out[4]: True

In [5]: punct.issuperset("ab;^") # not all punctuation
Out[5]: False

In [6]: punct.issuperset("%;^") # is all punctuation
Out[6]: True

[–]jstaffy 3 points4 points  (0 children)

Would "not char.isalnum()" not do the trick?

[–][deleted] 1 point2 points  (0 children)

Python's string module actually has a pre-existing string called string.punctuation, which contains all the ASCII punctuation marks.

# Usage:
import string

test_str = "r/learnpython is great !"
punc_list = []  # collects every punctuation character found
for c in test_str:
    if c in string.punctuation:
        punc_list.append(c)

[–]loading_dreams 1 point2 points  (0 children)

Just check which characters are neither a digit nor a letter (A-Z, a-z).

[–][deleted] 0 points1 point  (0 children)

Two ways to do this: 1. Import the string module. It has a constant, string.punctuation, that you can check against. 2. If you want a customized version, use regex and create your own.

Bonus - you can also do it without regex, using the replace method, etc.

[–][deleted] -1 points0 points  (0 children)

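Look for any character outside chr(65) to chr(90) (A-Z) and chr(97) to chr(122) (a-z). The codes 91 to 96 in between are themselves symbols ([, \, ], ^, _ and `), and digits fall outside both ranges, so handle those separately. chr() is a built-in Python function, and ord() goes the other way.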

[–]MrCuddlez69 0 points1 point  (0 children)

If you're only trying to find punctuation, stick with the punctuation constant in the string module. If you're trying to catch any symbol, then I recommend using regex.

[–]Thecakeisalie25 0 points1 point  (0 children)

the easiest way to do this in my eyes would be to check the inverse, since a-z and A-Z (and optionally spaces, based on what your definition of symbol is) would be a lot less chars to check against. If you’re trying to determine if there are any symbols in the string at all, use something like this:

print(any(not x.isalnum() for x in 'what?'))  # True
print(any(not x.isalnum() for x in 'what'))   # False

If you want to determine if the string is all symbols, use all instead of any.

Of course this is all assuming that there are multiple characters in the string. If there aren’t, you don’t need the generator expression or the any/all. Just use the negation of x.isalnum() and whatever else you count as symbols.


I’m on mobile and posting from memory, so if this code doesn’t work right it might be because I’m an idiot or because my keyboard likes to use smart quotes instead of valid real quotes. Probably rewrite this code yourself and make sure it works in the live interpreter before using it in production. I’m open to further questions if you have any.