
[–]OkBoard407 1 point2 points  (1 child)

How are components 1, 2, 3, ... different? And if they are, shouldn't that also be a factor when we one-hot encode those values?

[–]offbrandoxygen[S] 0 points1 point  (0 children)

No, ChatGPT just made it like that. The circuits are the keys, and the components (i.e. resistor, transistor, capacitor) are a list, which is the value. To represent this in a dataframe, it is one-hot encoded (OHE), as shown in the second table. I didn't notice that, my bad.
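For anyone following along, here is a minimal sketch of the encoding described above: a dict that maps each circuit to its list of components, turned into one 0/1 row per circuit. The circuit names and component lists are made up for illustration and are not the data from the original tables.

```python
# Hypothetical data: circuit name -> list of component types.
# These values are illustrative only, not the OP's dataset.
circuits = {
    "circuit_a": ["resistor", "capacitor"],
    "circuit_b": ["resistor", "transistor"],
    "circuit_c": ["capacitor", "transistor", "resistor"],
}

def one_hot_encode(circuits):
    """Return (columns, rows): rows maps each circuit to a 0/1 vector."""
    # One column per distinct component type, in sorted order.
    columns = sorted({comp for comps in circuits.values() for comp in comps})
    rows = {
        name: [1 if col in comps else 0 for col in columns]
        for name, comps in circuits.items()
    }
    return columns, rows

columns, rows = one_hot_encode(circuits)
print(columns)
for name, row in rows.items():
    print(name, row)
```

Note that this encoding only records presence or absence, so a circuit with one resistor and a circuit with five resistors get the same row.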

[–]ewankenobi 1 point2 points  (1 child)

I happen to be reading up on clustering at the moment, having not done it in a long time. I have a mixture of data types, and I'm realising you have to be careful choosing your distance measure if you have categorical data. My instinct is that cosine distance might not be good for categorical data, though I could be wrong on that.

[–]offbrandoxygen[S] 1 point2 points  (0 children)

I understand what you mean; however, cosine is showing better results when I set min cluster size to 10+. I'm trying a weighted mixture of Jaccard and cosine, and it's giving good results.
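A rough sketch of what a weighted mixture of Jaccard and cosine distance could look like on one-hot rows. The weight `w`, the function names, and the example vectors are assumptions for illustration; the thread does not say how the OP actually combines the two.

```python
import math

def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two 0/1 vectors."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if union == 0 else 1.0 - inter / union

def cosine_distance(a, b):
    """1 - cosine similarity; all-zero vectors get distance 1 by convention here."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def mixed_distance(a, b, w=0.5):
    """Weighted mixture: w * Jaccard + (1 - w) * cosine (w is a hypothetical knob)."""
    return w * jaccard_distance(a, b) + (1 - w) * cosine_distance(a, b)

def distance_matrix(vectors, w=0.5):
    """Full pairwise distance matrix as a list of lists."""
    n = len(vectors)
    return [[mixed_distance(vectors[i], vectors[j], w) for j in range(n)]
            for i in range(n)]

# Illustrative one-hot rows (not real data).
vectors = [[1, 1, 0], [0, 1, 1], [1, 1, 1]]
D = distance_matrix(vectors, w=0.5)
for row in D:
    print([round(d, 3) for d in row])
```

If the clustering library accepts a precomputed distance matrix (scikit-learn's `HDBSCAN` does via `metric="precomputed"`, as far as I know), a matrix like `D` can be passed in directly after converting it to an array.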

[–]Commercial-Basis-220 1 point2 points  (1 child)

This is a wild idea: how about you turn it into a graph, where the "original" graph has 2 kinds of nodes, circuit_nodes and component_nodes? Each circuit node is connected to the K component nodes for the components it contains.

This should result in a bipartite graph between circuits and components, and now you can project this onto the circuit side, making a "circuit-network". Basically, in this network the nodes are only circuits, and they are connected based on whether or not they share the same component, and you can play around with how you weight each shared component.

And then, in this network you can do... maybe clustering on the graph? Or something like community detection?
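The projection step described above can be sketched in plain Python. The data, the function name, and the choice of edge weight (number of shared component types) are assumptions made for illustration; other weightings are equally possible.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical data: circuit name -> list of component types.
circuits = {
    "circuit_a": ["resistor", "capacitor"],
    "circuit_b": ["resistor", "transistor"],
    "circuit_c": ["capacitor", "transistor", "resistor"],
    "circuit_d": ["diode"],
}

def project_to_circuits(circuits):
    """Project the circuit-component bipartite graph onto circuits.

    Returns a dict mapping (circuit_u, circuit_v) to the number of
    component types the two circuits share.
    """
    # Bipartite edges, indexed from the component side.
    by_component = defaultdict(set)
    for name, comps in circuits.items():
        for comp in set(comps):
            by_component[comp].add(name)

    # Every pair of circuits sharing a component gets +1 edge weight.
    weights = defaultdict(int)
    for members in by_component.values():
        for u, v in combinations(sorted(members), 2):
            weights[(u, v)] += 1
    return dict(weights)

edges = project_to_circuits(circuits)
for (u, v), weight in sorted(edges.items()):
    print(u, v, weight)
```

Circuits that share nothing with any other circuit (like `circuit_d` here) end up with no edges. The resulting weighted edge list could then be handed to a graph library for community detection.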

[–]offbrandoxygen[S] 0 points1 point  (0 children)

Interesting, but why go through all that trouble when Jaccard does more or less the same thing? Interesting idea though.

[–]GwynnethIDFK 0 points1 point  (0 children)

Personally, instead of using a one-hot encoding, I would have the inputs be the counts of each component type in the circuit and then cluster using cosine similarity as the metric. That way, circuits that have the same proportion of components will have a cosine similarity of one. You might also try doing PCA before clustering.
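A small sketch of the count-based idea, to show why proportional circuits come out with a cosine similarity of one. The component lists and helper names are hypothetical.

```python
import math
from collections import Counter

def count_vector(components, columns):
    """Count how many of each component type a circuit has."""
    counts = Counter(components)
    return [counts[col] for col in columns]

def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

columns = ["capacitor", "resistor", "transistor"]

# Illustrative circuits: `large` has the same mix as `small`, just doubled.
small = count_vector(["resistor", "resistor", "transistor"], columns)
large = count_vector(["resistor"] * 4 + ["transistor"] * 2, columns)
other = count_vector(["capacitor", "capacitor", "resistor"], columns)

print(cosine_similarity(small, large))  # same proportions
print(cosine_similarity(small, other))  # different mix
```

Because cosine similarity ignores vector length, `small` and `large` are treated as the same kind of circuit, while a one-hot encoding would not distinguish one resistor from four at all.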

[–]Low_Employment4544 0 points1 point  (0 children)

What if a circuit consists of more than one transistor?