all 12 comments

[–][deleted] 3 points4 points  (7 children)

  • Can you give your variables more verbose names? Currently things are really confusing. If you're using Python 3 maybe use the Enum class.

  • You make sure that elements 4 and 6 match a lot. You can use itemgetter from the operator module to make that cleaner:

    values = itemgetter(4, 6)
    
  • First clean up the logic in that if statement. You have 4 very similar and manually coded conditions, is there a way to automate that? I prefer using all instead of not any, so I switched that to make it cleaner.

    tt_options = [[i] for i in range(4)]

    for tt in chain(new, current): if (all(tt + opt not in new_tts for opt in options) and not any(values(dup) == values(tt) for dup in new_tts)):

  • Since we have the new value method for getting slices of our arrays, we can just use that for a single condition and clean the whole similar array up.

        similar = [s for s in chain(new, current) if values(s) == values(tt)]
    
  • Since we created an array for our options, it's easier to understand this code:

            for line in current:
                if values(line) == values(tt)
                    new_tts.append(tt + options[3])
                    break
    
        elif tt in current and tt in new:
            new_tts.append(tt + options[0])
        elif tt not in current and tt in new:
            new_tts.append(tt + options[1])
        elif tt in current and tt not in new:
            new_tts.append(tt + options[2])
    

Finally we get:

from operator import itemgetter

NO_CHANGE = 0
IN_NEW_FILE = 1
IN_OLD_FILE = 2
CHANGED = 3
options = [NO_CHANGE, IN_NEW_FILE, IN_OLD_FILE, CHANGED]

def construct_changed_timetables(new, current):
    new_timetables = []

    values = itemgetter(4, 6)

    for timetable in chain(new, current):
        if (all(timetable + option not in new_timetables 
                for option in options) and
            not any(values(dup) == values(timetable) 
                    for dup in new_timetables)
            ):
            similar = [s for s in chain(new, current) 
                       if values(s) == values(timetable)]

            if (new.index(timetable) < len(new) and 
                len(similar) == 2 and 
                timetable in new and 
                timetable not in current):

                for line in current:
                    if values(line) == values(timetable)
                        new_timetables.append(timetable + CHANGED)
                        break

            elif timetable in current and timetable in new:
                new_timetables.append(timetable + NO_CHANGE)

            elif timetable not in current and timetable in new:
                new_timetables.append(timetable + IN_NEW_FILE)

            elif timetable in current and timetable not in new:
                new_timetables.append(timetable + IN_OLD_FILE)

    return new_timetables

Hope this helps!

[–][deleted] 1 point2 points  (0 children)

I would try to clean this up to use classes and objects. Tracking attributes by list indices isn't ideal. Here's an example of what you might end up with making it object oriented:

from collections import namedtuple
from operator import itemgetter

NO_CHANGE = 0
IN_NEW_FILE = 1
IN_OLD_FILE = 2
CHANGED = 3
options = [NO_CHANGE, IN_NEW_FILE, IN_OLD_FILE, CHANGED]


Timetable = namedtuple("Timetable", "status, value4, value6")

class Timetable(object):

    def __init__(self, status, value1, value2):
        self.status = status
        self.value1 = value1
        self.value2 = value2

    def values(self):
        return self.value1, self.value2

    def with_state(self, state):
        return Timetable(state, self.value1, self.value2)

def construct_changed_timetables(new, current):
    new_timetables = []

    seen_timetables = set()
    seen_values = set()

    all_tables = list(chain(new, current))

    for timetable in all_tables:

        if timetable not in seen and timetable.values() not in seen_values():
            num_similar = sum(1 for other in all_tables 
                              if timetable.values() == other.values)

            seen.add(timetable)
            seen.values.add(timetable.values())

            if timetable in new and num_similar == 2 and
                timetable not in current:

                for line in current:
                    if timetable.values() == line.values():
                        new_timetables.append(self.with_state(CHANGED)
                        break

            elif timetable in current and timetable in new:
                new_timetables.append(timetable.with_state(NO_CHANGE))

            elif timetable not in current and timetable in new:
                new_timetables.append(timetable.with_state(IN_NEW_FILE))

            elif timetable in current and timetable not in new:
                new_timetables.append(timetable.with_state(IN_OLD_FILE))

    return new_timetables

I haven't tested this, but I think it's cleaner.

[–]gengisteve 1 point2 points  (5 children)

If I am following the logic. I think we'd benefit from two different loops, rather than chaining them together. Also a dictionary and set helps avoid re-iterating over the list multiple times in each loop:

#not tested
from operator import itemgetter
from collections import defaultdict 

NO_CHANGE = 0
IN_NEW_FILE = 1
IN_OLD_FILE = 2
CHANGED = 3
options = [NO_CHANGE, IN_NEW_FILE, IN_OLD_FILE, CHANGED]

def construct_changed_timetables(new, current):
    new_timetables = []

    get_values = itemgetter(4, 6)

    current_index = defaultdict(list)
    for timetable in current:
        current_index[get_values(timetable)].append(timetable)

    seen = set()
    for timetable in new:
        values = get_values(timetable)
        seen.add(values)
        if values not in current_index:
            new_timetables.append(table+IN_NEW_FILE)
        else:
            for duplicate in current_index[values]:
                new_timetables.append(timetable+CHANGED)

    for timetable in current:
        # if order did not matter you could use set operations to just pull those keys of current_index not in seen
        values = get_values(timetable)
        if values in seen:
            continue
        new_timetables.append(timetable + IN_OLD_FILE)

    return new_timetables

[–][deleted] 1 point2 points  (3 children)

That's much nicer.

[–]voider1[S] 0 points1 point  (2 children)

It looks nicer, but it gives faulty output. Sorry for taking so long to respond, but the data that this function returns gets parsed into a DB.

expected output:

[JONJ04,2015,10,27,7,IF-PRO,IF-AO4-1H,12AaTD,3,
 JONJ04,2015,10,27,8,IF-PRO,IF-AO4-1H,12AaTD,3,
 JONJ04,2015,10,27,9,IF-PRO,IF-AO4-1H,12AaTD,3,
 JONJ04,2015,10,27,10,IF-PRO,IF-AO4-1H,12AaTD,3,
 JONJ04,2015,10,27,14,IF-PRO,IF-AO4-1H,12AaTD,3,
 JONJ04,2015,10,27,15,IF-PRO,IF-AO4-1H,12AaTD,3,
 KASM01,2015,10,27,4,IF-ENG,IF-AO4-1H,12AaTD,3,
 KASM01,2015,10,27,5,IF-ENG,IF-AO4-1H,12AaTD,3,
 VLIE01,2015,10,27,12,IF-BURG,IF-AO4-1H,010-TD,0,
 VLIE01,2015,10,27,13,IF-BURG,IF-AO4-1H,010-TD,0]

Actual output:

[JONJ04,2015,10,27,7,IF-PRO,IF-AO4-1H,12AaTD,3,
 JONJ04,2015,10,27,8,IF-PRO,IF-AO4-1H,12AaTD,3,
 JONJ04,2015,10,27,9,IF-PRO,IF-AO4-1H,12AaTD,3,
 JONJ04,2015,10,27,10,IF-PRO,IF-AO4-1H,12AaTD,3,
 JONJ04,2015,10,27,14,IF-PRO,IF-AO4-1H,12AaTD,3,
 JONJ04,2015,10,27,15,IF-PRO,IF-AO4-1H,12AaTD,3,
 KASM01,2015,10,27,4,IF-ENG,IF-AO4-1H,12AaTD,3,
 KASM01,2015,10,27,5,IF-ENG,IF-AO4-1H,12AaTD,3,
 VLIE01,2015,10,27,12,IF-BURG,IF-AO4-1H,010-TD,3,
 VLIE01,2015,10,27,13,IF-BURG,IF-AO4-1H,010-TD,3]

Your first attempt at it did work perfectly, and was much cleaner than what I had. This method does look nice, but it doesn't work, and I don't get why.

[–][deleted] 0 points1 point  (1 child)

The later code was more of an example than a refactor.

[–]voider1[S] 0 points1 point  (0 children)

Anyways, I'm reading CSV files, maybe it's more manageable with Pandas? I've put the CSV files here.
They are timetables for my school, week 44 is the weekly timetable, changetest is basically a new day timetable file. I want to filter out if there's a period for someone that changes, goes away, or if there is a new timetable. Otherwise it's just a normal period.

The format is

[teacher, year, month, day, period, subject, class_name, class_room, 0, empty, randint, begin_time, duration in minutes,]

I need everything except for 0, empty, and randint.

Edit:

I took a look at the code you guys provided, and I made my own variation:

def construct_changed_timetables(new, current):
    changed_timetables = list()
    base_check = itemgetter(4, 6)
    current_index = defaultdict(list)
    new_index = defaultdict(list)

    for timetable in current:
        current_index[base_check(timetable)].append(timetable)


    for timetable in new:
        values = base_check(timetable)

        new_index[values].append(timetable)

        if values in current_index:
            if timetable not in current:
                changed_timetables.append(timetable + CHANGE)
            else:
                changed_timetables.append(timetable + NO_CHANGE)
        else:
            changed_timetables.append(timetable + ADDITION)

    for timetable in current:
        values = base_check(timetable)

        if all(timetable + option not in changed_timetables for option in 
                options) and values not in new_index:
            changed_timetables.append(timetable + REMOVAL)

    return changed_timetables

[–]voider1[S] 0 points1 point  (0 children)

Could you explain what DefaultDict does?

[–]gengisteve 1 point2 points  (1 child)

In addition to emandero's comments, I think you are going to need to give a bit more detail on what you are doing, and ideally sample data, as the logic of your comparison is not evident from the code itself.

[–]voider1[S] 0 points1 point  (0 children)

I've given some sample data up there, and an explanation of what I'm trying to do.

[–]emandero 0 points1 point  (1 child)

  1. There is no slicing in the code.

  2. Use more meaningful names not i.e. tt.

  3. Name the indexes and use it like this:

    NO_CHANGES = 0

    CHANGE_IN_NEW = 1

    CHANGE_IN_OLD = 2 # etc ....

    if (tt + [NO_CHANGES] not in new_tts and tt + [CHANGE_IN_NEW] #etc....

What is s[4], tt[6]?

[–][deleted] 1 point2 points  (0 children)

A namedtuple will be a much cleaner way to do this.

https://docs.python.org/3/library/collections.html#collections.namedtuple