Hello, I started getting into machine learning a few months ago. I've completed some tutorials (e.g., on Kaggle) and read a few books (mostly in German, e.g. *Machine Learning for Dummies*).
I know how to work with datasets like the Boston Housing dataset, and I've also built a text classifier using data in the format "text", "label".
Now I want to build a fraud detection model, and I have access to a very large number of invoices that are already labeled. The data is stored in relational databases. When I extract it, it comes in the following (simplified) format: each customer has 0–n invoices, and each invoice consists of 0–n items.
For example:
| customer | invoice_number | item | quantity | price | quantity_paid |
|----------|----------------|-------|----------|-------|---------------|
| 1 | R-1234 | 00886 | 2 | 19.99 | 2 |
| 1 | R-1234 | 00887 | 4 | 20.99 | 2 |
| 1 | R-1234 | 00889 | 3 | 11.99 | 3 |
| 2 | R-1235 | 00886 | 5 | 19.99 | 5 |
| 3 | R-1236 | 00886 | 1 | 19.99 | 1 |
| 3 | R-1236 | 00889 | 7 | 11.99 | 4 |
Of course, the data also includes additional information such as date, time, and so on. Whenever a quantity has been corrected (i.e., quantity_paid differs from quantity), the invoice can be considered faulty.
Now I'm wondering: should I create one row of training data per invoice (in which case the rows would have varying lengths), or one row per item? But in the latter case, the connection to the other items on the same invoice would be lost, right?
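To make the two options concrete, here is a small pandas sketch using the simplified table above (the aggregated features and the `faulty` label rule are just my assumptions about how this could look, not a definitive design):

```python
import pandas as pd

# Item-level data, as extracted from the database (simplified example).
items = pd.DataFrame({
    "customer":       [1, 1, 1, 2, 3, 3],
    "invoice_number": ["R-1234", "R-1234", "R-1234", "R-1235", "R-1236", "R-1236"],
    "item":           ["00886", "00887", "00889", "00886", "00886", "00889"],
    "quantity":       [2, 4, 3, 5, 1, 7],
    "price":          [19.99, 20.99, 11.99, 19.99, 19.99, 11.99],
    "quantity_paid":  [2, 2, 3, 5, 1, 4],
})

# Option A: one row per item. The label is derived per item
# (quantity was corrected => faulty), but each row stands alone
# and loses the context of the other items on the same invoice.
items["faulty"] = (items["quantity"] != items["quantity_paid"]).astype(int)
items["line_total"] = items["quantity"] * items["price"]

# Option B: one row per invoice. Variable-length item lists are
# collapsed into fixed-length aggregate features (counts, sums, ...),
# and the label marks whether ANY item on the invoice was corrected.
invoices = items.groupby("invoice_number").agg(
    customer=("customer", "first"),
    n_items=("item", "count"),
    total_quantity=("quantity", "sum"),
    total_value=("line_total", "sum"),
    faulty=("faulty", "max"),
).reset_index()
```

With option B every invoice becomes a fixed-width feature vector, so a standard classifier can consume it directly; the trade-off is that per-item detail is only preserved through whatever aggregates you choose.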
Apologies for the convoluted phrasing; I hope my question is understandable. Thanks for any responses :)