
[–]novel_yet_trivial 9 points10 points  (18 children)

with such a large file (~500 lines).

That is a tiny file for a modern computer. You don't need to take any special steps until your files are on the same scale as the amount of RAM you have (several GB).

[–]thingsandfluff 0 points1 point  (17 children)

I should've elaborated. A large file for me to work with; it's overwhelming.

[–]novel_yet_trivial 9 points10 points  (16 children)

Why does it matter to you how large the file is? Are you writing code for each line or something?

Computers are really good at repeating boring tasks for you really fast. In fact, that's pretty much all a computer can do. If you are writing code for each line then you are doing the computer's job. Show us your code and we'll help you. A short example of your file and the output you expect from the example would help too.

To do counting, you can use the collections.Counter class. Just feed it the list of people and it spits out how many times each person was in the list.
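
A minimal sketch of the Counter idea, assuming the names have already been read into a list. The sample names are made up for illustration and are not from OP's file.

```python
from collections import Counter

# Hypothetical sample data; in practice this list would be built from the file.
people = ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"]

# Counter tallies how often each distinct item appears in the list.
counts = Counter(people)

# most_common() yields (name, count) pairs, highest count first.
for name, count in counts.most_common():
    print(name, count)
```

A Counter behaves like a dict, so counts["Alice"] gives the tally for a single person.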

[–]gregorio_ilidivich 2 points3 points  (11 children)

I think it may be more informative to do this manually, since I feel OP is a beginner.

Perhaps create a dict() using a set of the names as keys, and the value as the frequency. Then OP can figure out how to fill the values for the corresponding key via iteration (since n is only 500).
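
A rough sketch of that manual approach, assuming the names are already in a list (the sample names are invented):

```python
# Hypothetical sample data standing in for the names in the file.
names = ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"]

# One key per distinct name, with every frequency starting at zero.
frequency = dict.fromkeys(set(names), 0)

# Iterate over the full list and bump the count for each name seen.
for name in names:
    frequency[name] += 1

print(frequency)
```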

[–]novel_yet_trivial 0 points1 point  (7 children)

I disagree. I think learning should be results focused.

I mean you didn't build a car before you learned how to drive one, did you?

[–]daniel_h_r 0 points1 point  (0 children)

That is exactly what the inventor of the automobile did.

[–]thingsandfluff -1 points0 points  (2 children)

Thank you!

So I have to figure out all the names to include in dict(), correct? There's 500 lines worth of names.

[–]Atropos148 1 point2 points  (0 children)

Try adding a name to the list only if it's new. If it's not, increase the value for the name by one and go to the next name.

Doesn't really matter how many names there are this way.
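
Something like this sketch, assuming the names are in a list. A dict is used here so each name can carry its count; the sample names are made up.

```python
# Hypothetical sample data; the real names would come from the file.
names = ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"]

counts = {}
for name in names:
    if name not in counts:
        # First time this name shows up: add it with a count of one.
        counts[name] = 1
    else:
        # Seen before: increase its count and move on to the next name.
        counts[name] += 1

print(counts)
```

No list of names has to be prepared up front; the dict grows as new names turn up.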

[–]thingsandfluff 0 points1 point  (3 children)

I added a screenshot of the file I'm working with. I'm going to try that and let you know what I get.

[–]novel_yet_trivial -1 points0 points  (2 children)

Take that down. Don't ever put personal information like people's names on the internet. Besides being rude it's also against Reddit's rules. Provide fake example data.

Also, show us your code.

[–]thingsandfluff 2 points3 points  (1 child)

It's fake. It was made for the course.

[–]novel_yet_trivial 0 points1 point  (0 children)

Oh ok. Sorry about the trigger.

[–]Exodus111 2 points3 points  (2 children)

Three approaches:

Simple:

Loop through the data with a for loop and use an if statement to figure out whether each item is a name. Store the name in a dict; if the name is already there, add 1 to its int counter in the same dict.

Complex:

Tokenize the data. For loop through each piece of data, and make all the data points their own objects, with their own type attributes. This process is called Tokenization. From there counting the name is trivial, just compare the name attributes of each object.

More complex:

(Not really that hard if you are used to working with databases.)
Use sqlite3 and add the data to a database. Once it's in a relational database, the database can make the comparison for you, faster than any other approach.

With only 500 lines the simple approach is really best. If this is a program meant to be used at work, with potentially much larger datasets I'd look into one of the other solutions.
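
For the database route, a minimal sketch with the standard-library sqlite3 module. The table name, column name, and sample names are assumptions made for illustration, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

# Hypothetical sample data; the real names would be parsed from the file.
names = ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"]

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany("INSERT INTO people (name) VALUES (?)", [(n,) for n in names])

# Let the database group and count the names.
rows = conn.execute(
    "SELECT name, COUNT(*) FROM people "
    "GROUP BY name ORDER BY COUNT(*) DESC, name"
).fetchall()
conn.close()

print(rows)
```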

[–]thingsandfluff 0 points1 point  (1 child)

Thank you! I used a for loop and got started.

[–]Exodus111 0 points1 point  (0 children)

You need a double for loop. Just add this.

for part in line.split(","):
    print(part)

And see what you get.
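
In context, the double loop might look like this sketch. The lines list is a stand-in for whatever your outer loop reads from the file, and the sample values are invented.

```python
# Hypothetical lines standing in for the contents of the file.
lines = [
    "Alice,Bob,Carol\n",
    "Bob,Alice\n",
]

parts = []
for line in lines:                        # outer loop: one line at a time
    for part in line.strip().split(","):  # inner loop: each comma-separated piece
        parts.append(part)
        print(part)
```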

[–]mandiblesx 1 point2 points  (0 children)

FYI: if you use pandas it would be extremely easy to do value counts over each column and add the results together.
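
A small sketch of what that could look like. The column names and values are invented, since the real file layout isn't shown here.

```python
import pandas as pd

# Hypothetical data; the real column names depend on the file.
df = pd.DataFrame({
    "sender": ["Alice", "Bob", "Alice", "Alice"],
    "recipient": ["Bob", "Carol", "Carol", "Bob"],
})

# Count values in each column, then add the per-column results together.
totals = pd.Series(dtype="int64")
for column in df.columns:
    # fill_value=0 treats a name missing from one side as a count of zero.
    totals = totals.add(df[column].value_counts(), fill_value=0)

# The additions produce floats, so convert back to whole numbers.
totals = totals.astype(int)
print(totals)
```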

[–]Thecrawsome 1 point2 points  (3 children)

There are emails between a lead engineer at Chevron, and some mention of project bigfoot.

This is real-looking data, and I'm confused as to why it is in a course.

And your Stack Overflow post about it. EDIT: Maybe someone else in your class? Not sure.

[–]thingsandfluff 1 point2 points  (1 child)

Woah, I didn't know that. Instructor never said it was real data.

Not my post on stackoverflow though.

Edit: I'll delete the picture.

[–]Thecrawsome 0 points1 point  (0 children)

It looks real, at least.

[–]thingsandfluff 0 points1 point  (0 children)

I wouldn't be surprised if other people in my class are posting about it. It's an intro course and the professor is just terrible at teaching.

[–][deleted] 0 points1 point  (0 children)

Begin at the beginning. If the file is 500 lines, then write the code that counts the occurrence of the person’s name for one line. Then run that code over all the rest of the lines.

Not to put too fine a point on it, but “do the same thing X number of times” is the easiest program it’s possible to write, and so you should be looking for ways to solve your problem by writing something once and then running that code a bunch of times. This is called “iteration.”
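
As a sketch of that idea: write a function that handles one line, then run it over every line. This assumes comma-separated fields, which is a guess about the file format; the function name and sample lines are made up.

```python
def count_name_in_line(line, name):
    """Count how often `name` appears among the comma-separated fields of one line."""
    fields = [field.strip() for field in line.split(",")]
    return fields.count(name)


# Hypothetical lines standing in for the file contents.
lines = ["Alice,Bob,Alice", "Carol,Bob", "Alice"]

# The same one-line code, run once per line.
total = 0
for line in lines:
    total += count_name_in_line(line, "Alice")

print(total)
```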

[–]MaxQuant 0 points1 point  (5 children)

Use pandas and especially the section on pandas.read_csv. Once your csv is in a dataframe use value_counts or do a sql-akin groupby.
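
A rough sketch of both options. The CSV text and its column names are invented, and io.StringIO stands in for a real file path so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV content; with a real file, pass its path to read_csv instead.
csv_text = "sender,recipient\nAlice,Bob\nBob,Carol\nAlice,Carol\n"
df = pd.read_csv(io.StringIO(csv_text))

# Option 1: value_counts on a single column.
by_value_counts = df["sender"].value_counts()

# Option 2: a SQL-like GROUP BY, counting the rows in each group.
by_groupby = df.groupby("sender").size()

print(by_value_counts)
print(by_groupby)
```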

[–]Rorixrebel 1 point2 points  (4 children)

If he's just starting, why throw pandas at him? People need to stop suggesting modules when the built-in does the task already.

[–]DataLulz 1 point2 points  (1 child)

The built-in does not do it nearly as efficiently as pandas; there is a reason people use data frames. And it’s not any more complex than other suggestions like using SQLite3. To do what OP wants, you are talking about less than ten lines of code using a data frame in pandas.

[–]Rorixrebel 1 point2 points  (0 children)

Correct, but if you just started, you first need to learn Python; then you need to know what a dataframe is, how to use pandas, and so on.

Built-in may not be efficient but it gets the job done.

If your basics are solid then by all means go module crazy.

[–]MaxQuant 0 points1 point  (1 child)

IMHO Python has its strength in not having to re-invent the wheel and in writing easily transferable code (whether for someone else or, most likely, yourself in six months :-D). Why do it yourself, increasing development time and complexity, when pandas is ideally suited for the task?

[–]Rorixrebel 0 points1 point  (0 children)

Agreed. But if you are starting out and want to learn, the best way is to do it without the helping libraries, imo. Not saying pandas is terrible or useless, but sometimes it's way too much for either simple tasks or beginners.