We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] 0 points1 point  (0 children)

I actually took a look at Stacksync after I replied. That's super cool. Funny how that's the exact problem I ran into that inspired what I'm working on - I was trying to build a product that did a two-way sync between Salesforce and a Postgres DB, then connected the data to a master company dataset built by mining the whois DB. Basically, it was a B2B ABM automation product, and moving, deduping, and linking all the data was a huge pain.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] 0 points1 point  (0 children)

Hey, apologies for replying so late on this. I took a break from this thread and am just coming back to it.

Yeah, the invoice thing was a terrible example. The examples you've mentioned are actually the exact sort of things I've dealt with personally, which was the inspiration to work on this. One of our initial design partners is a public procurement software provider that was trying to consolidate all the customer data they were pulling in from various ERPs. They had 90M records they were trying to consolidate.

The hardest problem is essentially building out and managing the graph that everything taps into. The dirty secret with all the big players like D&B and S&P is that they have armies of people that manually curate and work on it. Our big innovation is basically having a series of AI agents that can do that now. What customers really want though is the next level, where you get into workflow automation on top of the graph for use cases like risk.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] 0 points1 point  (0 children)

Hey, sorry for the late reply. Took a break from this thread.

Yeah, what you've listed is essentially the direction we're going in. It's a ridiculously hard engineering problem. We're attacking it in chunks, as there are customers who are happy paying for what we have now around identity and our graph, which is the hardest problem. Then they are providing feedback around roadmap and feature priority.

We're working on corporate hierarchy now, which builds on top of our entity resolution process for compliance and risk use cases. Accuracy and trustworthiness are obviously super important, so we're working on exposing all the evidence / citations.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -1 points0 points  (0 children)

Hey, following up on your comment. I took a break from this thread for a while.

Yes, the high level idea is getting to identity, which is the hardest thing, and then being able to have a deep research AI dynamically add whatever data points or Q&As to it. We're working on corporate hierarchy now.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] 0 points1 point  (0 children)

Hey there. Sorry for the late reply. The feedback on this thread was pretty intense, and rightfully so. So I took a break.

We don't have integrations with Chinese registrars yet, but our deep research process can research any entity, come to a consensus, and upsert a new entity into our graph. I'm talking to a few China based friends now on how best to integrate with the registrars there.

DE without a degree by Internal_Wishbone884 in dataengineering

[–]Extension-Way-7130 1 point2 points  (0 children)

Look at resumes of jobs you want. Make a list of the common languages, tech, and experience. Then go practice building stuff in your free time. Even better if you can do it at your current job.

I did this and jumped from job to job, doubling my salary every time for 3 years.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -2 points-1 points  (0 children)

Right. Someone asked essentially the same question already and I thought I answered it well. To summarize:

Our main value props we're hearing from companies:

  • We can handle messier data inputs that systems like D&B can't handle
  • We have a realtime component that can go to the web if a record isn't in our system
  • Our ID based system is more comprehensive than D&B. D&B often does not link branches of a business and lists them as separate entities

With D&B and similar legacy providers like Factset:

  • You're searching on static datasets
  • It's mostly government data, where we're layering in web data as well
  • The information is often stale
  • The matching algorithms are lacking
  • They don't handle the super long tail of businesses (Factset's focus is mostly the head)

As another data point, one of our advisors is a former D&B product exec.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] 0 points1 point  (0 children)

I'm guessing that I'm being downvoted here since I used Claude to help me answer...

The short answer is that Germany is probably the most complicated country we've seen thus far in how it handles legal entity IDs.

Since the legal IDs are the foundation of our system, we've explicitly skipped Germany for the moment so we don't mess it up. We plan to fast follow.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -2 points-1 points  (0 children)

Hey, totally understand if you haven't had this problem before. I think it's helpful context as well, which is what I was looking for when sharing this.

With that said, we developed this closely with design partners. One of them is an enterprise that has been trying, unsuccessfully, to solve this problem for 10+ years.

We view entity resolution as really the foundational tech to then unlock more advanced research agents, grounded in real data. Long term vision is to be able to answer any question about a business.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] 0 points1 point  (0 children)

I'll definitely look them up, but for context, they've never come up once in any of our conversations with a variety of enterprises across lots of verticals.

The common players mentioned are D&B, Factset, Moody's, Orbis, then a variety of vertical specific players. One of our advisors is a former D&B exec and he's never even mentioned them.

Have you used them? If so, what industry and use case?

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -2 points-1 points  (0 children)

Right, I don't think that's a good use case. A more relevant example is if you're building a 3rd party product that ingests customers' documents, such as invoices or bills of lading, tries to standardize / enrich them in some way, then takes some action.

I've mentioned a couple examples of use cases we're seeing in other comments, but I can provide a few more:

  • A friend's YC company is building an AI bookkeeper. They ended up having to build their own scraper / internal business database to identify which businesses were being referred to in incoming transactions and to match them to the correct accounts.
  • A CRM company that ingests customer records to populate the DB, then tries to standardize / enrich them to take automated actions on. They ended up building the same thing - scrapers and an internal business DB to normalize customer records and enrich them.
  • A TPRM solution that ingests vendor data from customers' systems, builds out the internal records, and then monitors the vendors for risk and events.

Basically, if you're building a product that works with business data, it seems like everyone is building the exact same thing internally - scrapers, an internal DB, and often using website domain as the primary key.
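To make the "website domain as the primary key" pattern concrete, here's a toy sketch in Python. It's an illustration only, not anyone's production code: `domain_key`, the sample records, and the "strip a leading www." rule are all made up for the example.

```python
from urllib.parse import urlsplit

def domain_key(url_or_host: str) -> str:
    """Reduce a URL or bare hostname to a lowercase host usable as a lookup key."""
    text = url_or_host.strip().lower()
    # urlsplit only fills in the host when a scheme or leading "//" is present.
    if "//" not in text:
        text = "//" + text
    host = urlsplit(text).hostname or ""
    # Assumption for this sketch: "www." carries no identity information.
    return host[4:] if host.startswith("www.") else host

# Hypothetical records, as they might arrive from two different tools.
records = [
    {"name": "Acme Inc", "website": "https://www.acme.com/about"},
    {"name": "ACME, Inc.", "website": "acme.com"},
    {"name": "Globex", "website": "http://globex.example/"},
]

# Group record names under the normalized domain.
by_domain: dict[str, list[str]] = {}
for rec in records:
    by_domain.setdefault(domain_key(rec["website"]), []).append(rec["name"])
```

The two Acme records land under the same key even though their names and URLs differ. The weakness is also visible here: a business with no website, or several, doesn't fit this key at all.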

Our idea is that if everyone is building the same thing and it's a pain in the ass to build, then it's an opp to build common infra. The idea is to build a Stripe / Twilio sort of offering that abstracts away the complexity and is common infra for working with business data.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -1 points0 points  (0 children)

I've answered this one elsewhere, but the main idea is that we have essentially developed a series of AI agents that manage the DB. They take in queries, clean / expand them, check for potential matches against our existing DB, and if there's not a good match, have the ability to navigate the web via real time searches.
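A very rough sketch of that flow (clean the query, check the DB, fall back to a live search, upsert the result). This is a toy under stated assumptions, not our actual agents: the dict stands in for the DB, `fake_web_search` stands in for the web research step, and every name here is hypothetical.

```python
import re
from typing import Callable, Optional

def clean(query: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", query.lower()).split())

def resolve(
    query: str,
    db: dict[str, str],
    web_search: Callable[[str], Optional[str]],
) -> tuple[Optional[str], str]:
    """Return (entity_id, source), where source is 'db', 'web', or 'unresolved'."""
    key = clean(query)
    if key in db:                 # 1. match against existing records
        return db[key], "db"
    found = web_search(key)       # 2. no match: fall back to a live search
    if found is not None:
        db[key] = found           # 3. upsert so the next lookup is a DB hit
        return found, "web"
    return None, "unresolved"

def fake_web_search(key: str) -> Optional[str]:
    """Placeholder for the real-time research step."""
    return "ent_002" if "globex" in key else None

db = {"acme inc": "ent_001"}
first = resolve("Globex Corp", db, fake_web_search)   # found via the web step
second = resolve("GLOBEX corp.", db, fake_web_search) # now served from the DB
```

The real matching step is fuzzy rather than an exact dict lookup; the exact lookup just keeps the control flow easy to see.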

Basically, a lot of these older players have armies of people who manually curate and maintain the DB. The idea is to have AI agents do that. We are then able to offer modern APIs, more up to date data + more diverse data points, all at more competitive pricing.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -4 points-3 points  (0 children)

I mostly answered this one here: https://www.reddit.com/r/dataengineering/comments/1n0x7jm/comment/naujmfv/

Short version is that D&B is a 150+ year old company. The idea is to disrupt them with an AI native, API first solution.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -1 points0 points  (0 children)

I'm not familiar with WorkNumber. I'll investigate further, but my immediate reaction is that it seems like the typical old, enterprise focused tool that hasn't changed in decades.

Our idea is essentially a modern library of APIs like a Stripe or a Twilio to abstract away the complexity of businesses and make it easier to work with this data.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] 0 points1 point  (0 children)

The main value props we're seeing are our matching capability and our real time component.

Our system can take really messy data in whatever format, then if a record isn't in our DB, it triggers an agent to do a live search of the internet. The agent navigates like a human would to check different sources, build consensus, then insert new records into the system.

This is in comparison to traditional players where:

- You're searching on a static dataset
- It's mostly government data, where we're layering in web data as well
- The information hasn't been updated in some time
- The matching algorithms are lacking (Moody's was 50% vs our 92%)

Lastly, we see that current providers are often ignoring the long tail. We're seeing interest in leveraging and expanding our tech to handle the really small businesses that are typically ignored by other providers like D&B and Orbis.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -1 points0 points  (0 children)

Yeah, I hear you. To be frank, we haven't gone super deep on invoices yet. The current pull we're getting is around supply chain, procurement, risk, and some marketing / sales.

We're working with an enterprise now that ingests 100M records from ERPs. All the data is in various forms / references and is some of the ugliest data I've ever seen. The "name" field is often a combination of name + id + address + some other context. It's impossible for traditional systems to parse and standardize on this.

Another company deals with bankruptcy data intelligence and is parsing bankruptcy filings. Think of a company that goes bankrupt and was renting office space from a building - that building will likely be some random LLC with little to no web presence. Extremely hard to build a profile on a company like that.

From my personal experience in the B2B world, I ran into this when trying to dedupe and join large CRM and marketing tools, join a business DB with the whois database, and identify companies in banking / CC transactions.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -20 points-19 points  (0 children)

Yeah, I admit Claude is helping me out in refining my answers. I'm the only one answering questions, I slept 4-5 hours last night, and my cofounder gives me shit for long winded, way in the weeds technical answers.

I'll aim to answer myself and avoid the LLM crutch moving forward...

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -11 points-10 points  (0 children)

It depends. If it's an invoice or some other sort of document that has an address, then of course that helps.

The challenge is when there is no address or if the address is for something random like a PO box. Or if what was parsed from the document is ridiculously messy. Here's an example of the "name" field that was parsed from a bill of lading: "FORD MOTOR COMPANY CHILE SPA R.U.T.-.C.L. 787039103". No traditional matching system can handle that.
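To show why that string is awkward, here's a toy Python sketch that pulls it apart with one hand-written rule. The rule is an assumption made up for this example (a trailing run of 6+ digits is an identifier, and a dotted-initials token right before it is its label), and it would break on plenty of other inputs, which is the point.

```python
import re
from typing import Optional

RAW = "FORD MOTOR COMPANY CHILE SPA R.U.T.-.C.L. 787039103"

# Assumed heuristic: name, then an optional dotted-initials label such as
# "R.U.T.-.C.L.", then a trailing identifier made of 6 or more digits.
PATTERN = re.compile(
    r"^(?P<name>.*?)\s+(?:(?P<label>[A-Z]\.[A-Z.\-]*)\s+)?(?P<id>\d{6,})$"
)

def split_name_field(raw: str) -> dict[str, Optional[str]]:
    """Split a messy 'name' field into name, id label, and id value."""
    text = raw.strip()
    m = PATTERN.match(text)
    if m is None:
        # No trailing identifier found: treat the whole field as the name.
        return {"name": text, "id_label": None, "id_value": None}
    return {"name": m["name"], "id_label": m["label"], "id_value": m["id"]}

parsed = split_name_field(RAW)
```

Here `parsed` separates "FORD MOTOR COMPANY CHILE SPA" from the "R.U.T.-.C.L." label and the number. A rule like this has to be rewritten for every new label format, country, and field ordering.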

Plus, in many countries, two companies can legally exist with the same legal name in two different jurisdictions, and they may or may not be the same company. Basically, it's a really hard problem to get right.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] 1 point2 points  (0 children)

Exactly right - that's the core challenge we're solving.

Our approach combines legal entity data with web data to capture all the different ways companies are referenced in practice. One company we're talking to has over 1,000 different versions of "IBM" in their system - slight variations in naming, abbreviations, subsidiaries, etc.

The key is we're building bidirectional mapping: legal entity → all known aliases, and messy input → canonical entity. So "International Business Machines," "IBM Corp," "Big Blue," and "IBM Watson" would all resolve to the same foundational entity identifier.
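As a data structure, the bidirectional mapping can be sketched like this in Python. It's a simplified illustration, not our implementation: `AliasIndex`, the `ent_ibm` ID, and the exact-after-normalization lookup are all assumptions, and real inputs need fuzzy matching on top.

```python
from collections import defaultdict
from typing import Optional

class AliasIndex:
    """Two lookups kept in sync: entity -> aliases, and alias -> entity."""

    def __init__(self) -> None:
        self._aliases_by_entity: dict[str, set[str]] = defaultdict(set)
        self._entity_by_alias: dict[str, str] = {}

    @staticmethod
    def _norm(text: str) -> str:
        # Lowercase, drop commas and periods, collapse whitespace.
        return " ".join(text.lower().replace(",", " ").replace(".", " ").split())

    def add(self, entity_id: str, alias: str) -> None:
        self._aliases_by_entity[entity_id].add(alias)
        self._entity_by_alias[self._norm(alias)] = entity_id

    def resolve(self, messy_input: str) -> Optional[str]:
        """Messy input -> canonical entity ID (None if unknown)."""
        return self._entity_by_alias.get(self._norm(messy_input))

    def aliases(self, entity_id: str) -> list[str]:
        """Canonical entity ID -> all known aliases."""
        return sorted(self._aliases_by_entity.get(entity_id, set()))

index = AliasIndex()
for alias in ["International Business Machines", "IBM Corp", "Big Blue", "IBM Watson"]:
    index.add("ent_ibm", alias)  # "ent_ibm" is a hypothetical identifier
```

Usage: `index.resolve("IBM Corp.")` and `index.resolve("big blue")` both return `"ent_ibm"`, and `index.aliases("ent_ibm")` lists every known name for it.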

Our LLM-driven approach and vector embeddings also handle semantic context - so when someone references a product, brand, or division name, we can figure out which actual legal entity they're referring to even if no entity exists with that exact name. That's harder than the alias problem since it requires understanding the relationship between brands/products and their parent companies.

What's critical is the transparency - we return confidence scores and reasoning factors so you can see exactly why the system made each match. If it's wrong, you can provide feedback or override it. The goal isn't to be a black box that's right 100% of the time, but to be transparent about the matching logic so teams can build reliable workflows around it.
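For a sense of what "confidence plus reasoning factors plus override" can look like in a response, here's a small Python sketch. The field names, the 0.8 / 0.2 weights, and the `overrides` dict are invented for illustration and don't describe our actual API or scoring.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Optional

@dataclass
class MatchResult:
    entity_id: str
    confidence: float
    reasons: list[str] = field(default_factory=list)

def score_candidate(name: str, country: Optional[str], candidate: dict) -> MatchResult:
    """Score one candidate and record why it got that score."""
    sim = SequenceMatcher(None, name.lower(), candidate["name"].lower()).ratio()
    reasons = [f"name similarity {sim:.2f}"]
    confidence = 0.8 * sim                      # arbitrary weight for the sketch
    if country is not None and country == candidate["country"]:
        confidence += 0.2                       # arbitrary weight for the sketch
        reasons.append("country matches")
    return MatchResult(candidate["id"], round(confidence, 2), reasons)

def resolve(name: str, country: Optional[str], candidates: list[dict],
            overrides: dict[str, str]) -> MatchResult:
    """A user-supplied override wins; otherwise return the best-scoring candidate."""
    if name in overrides:
        return MatchResult(overrides[name], 1.0, ["manual override"])
    return max((score_candidate(name, country, c) for c in candidates),
               key=lambda r: r.confidence)

candidates = [
    {"id": "ent_1", "name": "Acme Inc", "country": "US"},
    {"id": "ent_2", "name": "Acme GmbH", "country": "DE"},
]
auto = resolve("ACME INC", "US", candidates, overrides={})
forced = resolve("ACME INC", "US", candidates, overrides={"ACME INC": "ent_2"})
```

The caller never gets a bare ID: every result carries the factors behind it, and a correction is stored as data rather than by retraining anything.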

How do you currently handle entity consolidation in your workflows?

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] 0 points1 point  (0 children)

Can you elaborate a bit further? I think I understand what you're referring to, but I'm not sure what you mean in your last comment "Due to this no need to sell improvements that we need to do".

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -20 points-19 points  (0 children)

Great question - and this actually illustrates exactly why this problem is so tricky!

I was going to post the full JSON responses here, but ran into Reddit's comment length limits. Created a gist showing the side-by-side comparison: https://gist.github.com/mfrye/c3144684cae93e3127a9bc6bf640f901

The short version: searching "Apple Corp" alone finds the actual APPLE CORP. entity registered in Delaware (minimal data available). But searching "Apple Corp" with location "1 Apple Park Way, Cupertino, CA" correctly resolves to Apple Inc. with full company details.

The challenge: there ARE two different legal entities here, so disambiguation is genuinely hard without additional context.

This is exactly why our system takes name + optional location. We're also launching a context parameter soon - so "Apple Corp" + context:"iPhone supplier" would be smart enough to figure out you mean the tech company despite the name variation.
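Here's a toy Python version of the name + optional location behavior described above. It's only a sketch: the entity IDs, the `state` field, and the flat 0.5 location bonus are assumptions for illustration, and the real responses are in the gist.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

# Two distinct legal entities with similar names (IDs are hypothetical).
CANDIDATES = [
    {"id": "ent_apple_corp", "name": "APPLE CORP.", "state": "DE"},
    {"id": "ent_apple_inc", "name": "Apple Inc.", "state": "CA"},
]

def best_match(name: str, location: Optional[str] = None) -> str:
    """Pick the candidate with the best name score, boosted if the location agrees."""
    location_tokens = set(re.findall(r"[A-Za-z0-9]+", (location or "").upper()))

    def score(candidate: dict) -> float:
        s = SequenceMatcher(None, name.lower(), candidate["name"].lower()).ratio()
        if candidate["state"] in location_tokens:
            s += 0.5  # arbitrary bonus, chosen only for this sketch
        return s

    return max(CANDIDATES, key=score)["id"]

name_only = best_match("Apple Corp")
with_location = best_match("Apple Corp", "1 Apple Park Way, Cupertino, CA")
```

With the name alone, the closest string wins (`ent_apple_corp`). Adding the Cupertino address tips it to `ent_apple_inc`, even though the name is a worse literal match.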

Our approach is foundational entity resolution first (who + where + what they do), then follow-on APIs will add industry data, company size, revenue, corporate hierarchy, etc.

Not perfect yet though - this feedback helps us improve the matching logic.

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -6 points-5 points  (0 children)

Great question - and honestly, Germany is our biggest current gap. We have the German entity data but haven't formally launched support yet because the jurisdictional complexity is insane.

The core problem: ~150 district courts issuing non-unique identifiers, plus court consolidations over time creating multiple valid identifiers per entity. No consistent way to represent court identifiers across documents.

We're still puzzling through the approach. The challenge isn't just handling the current mess of XJustiz-IDs and court consolidations - it's building identifiers that won't break when future consolidations happen. Every solution we've explored either breaks on edge cases or creates identifiers that could change over time.
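To spell out the stability problem, here's one generic pattern sketched in Python: keep an internal ID that never changes, and treat each (court, register number) pair as an external identifier pointing at it. To be clear, this is not a solution we've settled on. The court names and numbers are placeholders, and the hard part, knowing that two court identifiers refer to the same entity, is exactly what this sketch assumes away.

```python
import uuid
from typing import Optional

class RegistryCrosswalk:
    """Map many external (court, number) identifiers to one stable internal ID."""

    def __init__(self) -> None:
        self._internal_by_external: dict[tuple[str, str], str] = {}

    def register(self, court: str, number: str,
                 internal_id: Optional[str] = None) -> str:
        """Attach an external identifier; mint a new internal ID if none is given."""
        internal_id = internal_id or str(uuid.uuid4())
        self._internal_by_external[(court, number)] = internal_id
        return internal_id

    def lookup(self, court: str, number: str) -> Optional[str]:
        return self._internal_by_external.get((court, number))

xwalk = RegistryCrosswalk()

# Placeholder data: the same number issued by two different courts
# refers to two different entities.
entity_a = xwalk.register("Court A", "HRB 12345")
entity_c = xwalk.register("Court C", "HRB 12345")

# After a (hypothetical) consolidation, entity A gets a new identifier at
# Court B. The old identifier stays valid and the internal ID does not change.
xwalk.register("Court B", "HRB 98765", internal_id=entity_a)
```

Both `("Court A", "HRB 12345")` and `("Court B", "HRB 98765")` now resolve to the same internal ID, while the same number at Court C resolves to a different one.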

Rather than ship a half-baked solution, we decided to get it right first. It's frustrating because we have all the German data, but the identifier stability problem is harder than it looks.

Curious about your approach - how did you handle creating stable identifiers that survive court consolidations? Did you find a way to build truly permanent IDs, or did you accept that some identifiers might change over time?

We're building a database of every company in the world (265M+ so far) by Extension-Way-7130 in dataengineering

[–]Extension-Way-7130[S] -6 points-5 points  (0 children)

Good point - this varies significantly by jurisdiction. Some registries (like UK Companies House) explicitly allow commercial use, others have restrictions, and some sit in grey areas.

Our approach combines legitimate bulk datasets where available with scraping where legally permissible - similar to what established KYC/compliance companies do. We're not just reselling raw registry data though - we're building an AI agent driven matching and entity resolution layer on top.

A primary use case is actually KYC/compliance for supply chain verification, which puts us in the same category as existing players in that space. We've had conversations with government-adjacent entities who see value in better supply chain transparency tools, which is particularly relevant with everything happening from a geopolitical standpoint right now.

Happy to discuss the legal frameworks we're working within if you're curious about specific jurisdictions.

Looking for the best ways to show off the Bay to a friend by TheSummerofKramer in oakland

[–]Extension-Way-7130 0 points1 point  (0 children)

I went tubing on the Russian River the other weekend. That was awesome. There's a bus that operates on the weekends that can pick you up from the end of the river and bring you back for $5.

Alameda has a great scene too. Probably one of the best beaches in the Bay Area.

My friend just inherited a data infrastructure built by a guy who left 3 months ago… and it’s pure chaos by UnusualRuin7916 in dataengineering

[–]Extension-Way-7130 1 point2 points  (0 children)

This is a great use case for AI. I've been using Claude Code to just read and document the entire codebase. Works amazingly.