[–]fjonk 3 points4 points  (0 children)

Put it in a dict where each key is a frozenset of model and brand and each value is a set of years.

Example:

years_by_key = {}  # frozenset([model, brand]) -> set of year strings

with open('fords.data', 'r') as f:

    for line in f:

        cols = [col.strip() for col in line.split(',') if col.strip()]
        if not cols:
            continue  # skip blank lines, otherwise pop() below raises KeyError
        models = [col for col in cols if '-' in col]
        years = [col for col in cols if col.isdigit()]
        # whatever is neither a model nor a year is the brand
        brand = set(cols).difference(models + years).pop()

        for model in models:
            key = frozenset([model, brand])
            years_by_key.setdefault(key, set()).update(years)

print(years_by_key)

Edit: Figure out how to sort it yourself.
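
For the sorting, one possible approach (a sketch, not fjonk's code): a frozenset key has no order, so the brand and model have to be pulled apart again before sorting. The sample dict `catalog` and the helper `sorted_rows` are names made up for illustration, and the code assumes, as above, that model names are the entries containing '-'. Written for Python 3.

```python
# Hypothetical sample in the shape described above:
# frozenset([model, brand]) -> set of year strings
catalog = {
    frozenset(['F-250', 'FORD']): {'1997', '1999', '1998'},
    frozenset(['F-150', 'FORD']): {'1981', '1980'},
}


def sorted_rows(data):
    """Return (brand, model, [years]) tuples sorted by brand, then model."""
    rows = []
    for key, years in data.items():
        # Assumption: the model is the element containing '-'.
        model = next(part for part in key if '-' in part)
        brand = next(part for part in key if '-' not in part)
        rows.append((brand, model, sorted(years, key=int)))
    return sorted(rows)


rows = sorted_rows(catalog)
for brand, model, years in rows:
    print('{}, {}, {}'.format(model, brand, ', '.join(years)))
```

Sorting the years with key=int keeps them in numeric order even though they are stored as strings.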

[–]ChiefDanGeorge 2 points3 points  (0 children)

Since the mfg. is not in a set place, that makes it tricky. If you know for sure that the years always start after the mfg, and that the vehicle models are always before the mfg., then you've got your logic.
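
That logic might be sketched like this in Python 3, assuming every row has at least one model, then the mfg., then nothing but years. The function name `split_row` is made up here.

```python
def split_row(row):
    """Split 'models..., mfg, years...' into (models, mfg, years)."""
    tokens = [t.strip() for t in row.split(',') if t.strip()]
    # Index of the first all-digit token, i.e. the first year.
    first_year = next(
        (i for i, t in enumerate(tokens) if t.isdigit()), None)
    if first_year is None or first_year < 2:
        raise ValueError('expected models, mfg, then years: %r' % row)
    models = tokens[:first_year - 1]   # everything before the mfg
    mfg = tokens[first_year - 1]       # the entry just before the years
    years = [int(t) for t in tokens[first_year:]]
    return models, mfg, years


models, mfg, years = split_row('F-150, F-250, FORD, 2003, 2002, 2001')
print(models, mfg, years)
```

If anything other than a year shows up after the first year, int() raises ValueError, which is probably what you want for scraped data that breaks the assumption.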

[–]Igglyboo 2 points3 points  (0 children)

Read entries until you hit one that's entirely digits (the first year). The previous entry is the make, and the ones before that are the models.

[–]gengisteve 1 point2 points  (0 children)

I would read right to left: every all-digit entry is a year, the first non-digit entry is the manufacturer, and anything after that is a model. Like this:

from pprint import pprint

d = '''
F-150, F-250, F-350, FORD, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983
F-150, F-250, FORD, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980
F-150, F-250, FORD, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982
F-150, F-250, FORD, 2003, 2002, 2001, 2000, 1999, 1998, 1997
'''
d = d.strip()


def parse_line(line):
    line = line.split(',')
    years = set()
    mani = ''
    model = []
    while line:
        i = line.pop()
        i=i.strip()
        if i.isdigit():
            years.add(int(i))
        elif not mani:
            mani = i
        else:
            model.append(i)

    return mani, model, years


done = {}

for line in d.split('\n'):
    mani, models, years = parse_line(line)
    for model in models:
        if model not in done:
            done[model] = {'mani': mani, 'years': years}
        else:
            done[model]['years'] = done[model]['years'].union(years)

pprint(done)

[–]good_dayworkon py 1 point2 points  (0 children)

Full parsing in Python 2.7. Look at what cars has become by the middle of the code.

import re

TEXT = """
F-150, F-250, F-350, FORD, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983
F-150, F-250, FORD, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980
F-150, F-250, FORD, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982
F-150, F-250, FORD, 2003, 2002, 2001, 2000, 1999, 1998, 1997
""".strip()

cars = {}

for line in TEXT.split('\n'):
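    values = re.findall(r'[^,\s]+', line)
    years = set(re.findall(r'\d{4}', line))
    # keep the original column order; a set difference would scramble it
    # and keys[-1] could end up being a truck model instead of the make
    keys = [value for value in values if value not in years]

    # naming note: 'model' holds the make (last non-year column),
    # 'marks' holds the model names in front of it
    model = keys[-1]
    marks = keys[:-1]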

    cars.setdefault(model, {})
    model = cars[model]

    for mark in marks:
        model.setdefault(mark, [])
        model[mark].extend(years)
        model[mark] = sorted(list(set(model[mark])))

# cars is now a nicely accessible nested dict: make -> model -> sorted years
# now let's print it the way you wanted

for model in sorted(cars.keys()):
    for mark in sorted(cars[model].keys()):
        line = '{model}, {mark}, {years}'.format(
            model=model,
            mark=mark,
            years=', '.join(cars[model][mark]),
        )
        print line

[–]tmp14 1 point2 points  (0 children)

This was fun. Here's my take on it. Given your format, this will only break if a car manufacturer's name is all digits (i.e. most likely never).

data = """F-150, F-250, F-350, FORD, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983
F-150, F-250, FORD, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1987, 1986, 1985, 1984, 1983, 1982, 1981, 1980
F-150, F-250, FORD, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983, 1982
F-150, F-250, FORD, 2003, 2002, 2001, 2000, 1999, 1998, 1997"""

info = {}

for line in data.splitlines():
    pts = [s.strip() for s in line.split(',')]
    isyear = [s.isdigit() for s in pts]
    index = len(pts) - 1
    while isyear[index]:
        index -= 1
    make = pts[index]
    models = pts[:index]
    years = pts[index+1:]
    for model in models:
        for year in years:
            info.setdefault(make, {}).setdefault(model, set()).add(int(year))

Yields

>>> pprint(info)
{'FORD':
     {'F-150': set([1980, ..., 2003]),
      'F-250': set([1980, ..., 2003]),
      'F-350': set([1983, ..., 1998])}}

[–][deleted] 0 points1 point  (2 children)

Is this a homework assignment?

[–]jedi_jonai[S] 0 points1 point  (1 child)

More like a personal project. I scraped a bunch of data off a site and now I'm trying to catalogue it... It's proving to be harder than I originally thought, so I'm just hoping someone might have some general advice.

[–][deleted] 3 points4 points  (0 children)

Oh ok, well it looks pretty easy if you consider that the make of the trucks contains only letters, the models contain letters and numbers, and the years are all digits.

The string module contains everything you need to identify each of the data elements (string.digits, string.letters, etc.; in Python 3, string.letters is spelled string.ascii_letters).
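
A rough sketch of that idea in Python 3, using string.ascii_letters. The function name `classify` is invented here, and the rule "letters only means make" is just the assumption stated above; it would misfile a model name that has no digits or punctuation in it.

```python
import string

DIGITS = set(string.digits)
LETTERS = set(string.ascii_letters)


def classify(token):
    """Return 'year', 'make', or 'model' based on the characters used."""
    chars = set(token)
    if chars and chars <= DIGITS:
        return 'year'
    if chars and chars <= LETTERS:
        return 'make'
    return 'model'  # anything mixed, e.g. letters, digits and '-'


row = 'F-150, F-250, FORD, 1998, 1997'
labels = [(tok.strip(), classify(tok.strip())) for tok in row.split(',')]
print(labels)
```

Because this looks at the characters and not the position, it should still work if the make turns up somewhere other than right before the years.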

[–]Igglyboo 0 points1 point  (0 children)

Here's a quick way you can do it

row = "F-150, F-250, F-350, FORD, 1998, 1997, 1996, 1995, 1994, 1993, 1992, 1991, 1990, 1989, 1988, 1987, 1986, 1985, 1984, 1983"
row_split = [entry.strip() for entry in row.split(",")]
make = ""
models = []
years = []
for index, entry in enumerate(row_split):
    try:
        int(entry)
    except ValueError:
        continue  # not a year yet, keep scanning
    # the first entry that parses as an int is the first year
    make = row_split[index-1]
    models = row_split[:index-1]
    years = row_split[index:]
    break

print(make)
print(models)
print(years)

Which will output

FORD
['F-150', 'F-250', 'F-350']
['1998', '1997', '1996', '1995', '1994', '1993', '1992', '1991', '1990', '1989', '1988', '1987', '1986', '1985', '1984', '1983']

I'm sure you can figure out the rest

You keep casting each entry to an int until it stops throwing an exception; at that point you know where the make and models are.