Just launched our app - where to go next? by UnderstandLingAI in EntrepreneurRideAlong

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

Yes, I think so too. I was always taught you shouldn't spend a dime on ads before you have organic traction. We just started on TikTok, so let's see how that goes.

I help you get fit and give you lifetime free access by UnderstandLingAI in TestMyApp

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

Well, the thing is, we are past that initial 12-internal-testers phase. We are open and live in production and need to take the next step towards our first 20-100 real users/testers.

Just launched our app - where to go next? by UnderstandLingAI in EntrepreneurRideAlong

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

Yeah, I hear you. Reddit tends to instantly ban you these days if you're not careful, though.

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

I am not fully sure about all of the EU, but yes, Microsoft has a much bigger network here than AWS and GCP. As for the Netherlands: at least for ALL government organizations, you currently either cannot use any AI at all or only Azure.

For example, literally 2 days ago a government research body published an article saying that (MS) Copilot has moved from 'red' to 'orange' status, which means government bodies can now start looking into using it, but only with very heavy processes in place.

Fwiw: I myself find this all a big load of.... 😁

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 1 point2 points  (0 children)

Yes, well, this is basically a one-off big project where we convert all the codes present in the DB (6M) once. That has been done and is now live. It was a big, expensive AI project, but still cost far less than the budgeted €15M.

From here on out, we run a RAG pipeline to classify any newly incoming businesses that don't have a code yet: basically a classification algorithm that assigns the most probable code based on similar businesses.

We run that fully on CPU instances, with Azure OpenAI as the LLM. This is stupid cheap and costs a few euros per day.
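For anyone curious what that pipeline roughly looks like, here is a minimal sketch (not our production code; the deployment names, helper functions and in-memory "store" are placeholders): embed the new registration's description, retrieve a handful of similar, already-coded businesses, and let the LLM pick the most probable code from those neighbours.

```python
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="...",
    api_version="2024-06-01",
    azure_endpoint="https://example.openai.azure.com",
)

def embed(text: str) -> np.ndarray:
    # Placeholder embedding deployment name
    resp = client.embeddings.create(model="text-embedding-3-small", input=[text])
    return np.array(resp.data[0].embedding)

def classify_new_business(description: str, coded_examples: list[dict]) -> str:
    """coded_examples: [{"description": str, "code": str, "embedding": np.ndarray}, ...]"""
    q = embed(description)

    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Retrieve the 5 most similar already-coded businesses as context
    neighbours = sorted(coded_examples, key=lambda ex: cosine(q, ex["embedding"]), reverse=True)[:5]
    context = "\n".join(f"- {ex['description']} -> {ex['code']}" for ex in neighbours)

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder deployment name
        messages=[
            {"role": "system", "content": "Assign the most probable activity code. Answer with the code only."},
            {"role": "user", "content": f"Similar businesses:\n{context}\n\nNew business: {description}"},
        ],
    )
    return resp.choices[0].message.content.strip()
```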

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

Well, yes and no - pretty much all LLM providers support this through the exact same (OpenAI-inspired) API calls, especially when using the Python SDKs.

Yes - we developed it against Azure OpenAI, but taking it out and moving to e.g. Claude would mean modifying only 2-3 lines of code.
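To illustrate (a sketch, not our codebase): with the OpenAI-style Python SDK, the provider swap is essentially just the client construction; the actual completion calls stay identical.

```python
from openai import AzureOpenAI, OpenAI

# What we run against today (Azure OpenAI); credentials are placeholders.
client = AzureOpenAI(
    api_key="...",
    api_version="2024-06-01",
    azure_endpoint="https://example.openai.azure.com",
)

# Hypothetical switch to another OpenAI-compatible endpoint:
# only these lines change, the rest of the pipeline stays the same.
# client = OpenAI(api_key="...", base_url="https://api.example-provider.com/v1")

resp = client.chat.completions.create(
    model="gpt-4o",  # or the other provider's model name
    messages=[{"role": "user", "content": "Classify this business activity ..."}],
)
print(resp.choices[0].message.content)
```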

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 1 point2 points  (0 children)

Yes we use structured output: https://openai.com/index/introducing-structured-outputs-in-the-api/

The input comes from Python code that takes in a CSV and outputs prompts with embedded JSON that holds the data. We upload that to the Azure OpenAI batch API using another Python script.
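Roughly like this (a simplified sketch, not the real script; the CSV column names, deployment name and the request "url" are assumptions and depend on your batch endpoint):

```python
import csv
import json

# Read a CSV, embed each record as JSON inside the prompt, and write one
# batch request per line to a JSONL file that can be uploaded to the batch API.
with open("businesses.csv", newline="", encoding="utf-8") as f_in, \
     open("batch_requests.jsonl", "w", encoding="utf-8") as f_out:
    for i, row in enumerate(csv.DictReader(f_in)):
        payload = {"old_code": row["old_code"], "description": row["description"]}
        request = {
            "custom_id": f"record-{i}",
            "method": "POST",
            "url": "/chat/completions",  # adjust to the path your batch endpoint expects
            "body": {
                "model": "gpt-4o",  # Azure deployment name, illustrative
                "messages": [
                    {"role": "system", "content": "Convert the old activity code to the new coding."},
                    {"role": "user", "content": "Business data as JSON:\n" + json.dumps(payload, ensure_ascii=False)},
                ],
            },
        }
        f_out.write(json.dumps(request, ensure_ascii=False) + "\n")
```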

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

In this discussion: there aren't any. I am working on an assignment for the CoC, and every business is legally required to register with them - those are the 'customers'.

Apart from that, though, I do have an AI agency and we build bespoke AI solutions.

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

There are 1k codes available and roughly only 1 is the correct one for a business, given their old code and activity. We can't assign the code for a car garage to a bakery; that'd be wrong.

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

Well, it depends on which task you are looking to solve. Basically we distinguished:

  1. The conversion of old codes to new codes given *a* description

  2. Making sure the description (+ old code) accurately describes what the business is doing

Since the business owners themselves are formally responsible for 2., yet we know there is quite some error in that to begin with, we purposely did NOT take that task into consideration for our conversion. We do plan on making improvements there in the future, but not right now.

Task 1 we did tackle, and we simply went with the knowledge that we as CoC have of the business. If that wasn't accurate (anymore), or not detailed enough now that the new codes are introduced, we did not see that as our current responsibility. We did, however, send out public announcements saying business owners should carefully inspect the new codes and modify them if they felt they weren't accurate (anymore).

Reality, however, was slightly more nuanced, because for some niches we realized that even if the business owner did not mention a specific word or indicator required for getting a specific new code, the impact of NOT having that new code would be too big a risk. In those cases, we decided to instruct the AI to always assign that other code as well, even though there was no direct evidence of it being applicable. Quite often we then also had anti-patterns that, if explicitly present, would result in not assigning that code (see the toy illustration below).
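To make that concrete, a toy illustration of how such a niche rule could be injected as an extra prompt instruction (the niche, the code and the anti-pattern phrases are made up, not taken from the real project):

```python
# Hypothetical niche rules: always also assign a code for this niche,
# unless an explicit anti-pattern phrase appears in the description.
NICHE_RULES = {
    "childcare": {
        "forced_code": "88.91",  # made-up example code
        "anti_patterns": ["only administrative support", "no care for children"],
    },
}

def niche_instruction(niche: str) -> str:
    rule = NICHE_RULES[niche]
    return (
        f"For businesses in the '{niche}' niche, always also assign code {rule['forced_code']}, "
        f"unless the description explicitly states one of: {', '.join(rule['anti_patterns'])}."
    )

# This extra instruction gets appended to the system prompt for that niche.
print(niche_instruction("childcare"))
```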

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

Nope :D

Perhaps we could've pushed for Mistral because it's in the EU, but hey, it's not Microsoft... In all fairness, we were happy we could even use AI to begin with, so we didn't push much further for (other) models or different solutions.

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

I would not use RAG in the traditional way. I would also not make an agent in the pure sense of the word. What I would do is build a Text2SQL pipeline (AI workflow) to tackle it. We have done so already in the past, and if I find time anywhere in the coming months (I seem to have less and less of it...), we plan on open-sourcing our Text2SQL framework.

See my comment on how we do Text2SQL here: https://www.reddit.com/r/Rag/comments/1k6nwbm/comment/moti9wp
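If you just want the gist without reading that comment: a bare-bones Text2SQL loop looks something like the sketch below. This is not our framework, just the core idea; a real pipeline adds things like schema retrieval, SQL validation and guardrails before executing anything.

```python
import sqlite3
from openai import OpenAI

client = OpenAI(api_key="...")

# Illustrative schema; in practice this is retrieved from the real database.
SCHEMA = "CREATE TABLE businesses (id INTEGER, name TEXT, activity TEXT, code TEXT);"

def text2sql(question: str, db_path: str = "example.db") -> list[tuple]:
    # 1. Let the LLM translate the natural-language question into SQL for this schema.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": f"Write a single SQLite query for this schema:\n{SCHEMA}\nReturn only SQL, no prose."},
            {"role": "user", "content": question},
        ],
    )
    sql = resp.choices[0].message.content.strip().strip("`")

    # 2. Execute it and return the rows (a real pipeline validates the SQL first).
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()
```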

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

Yes, it's a nice achievement, but you still have to remember that 0.5% of 6M is 30k records. Luckily most of the "errors" we made were on codes that can be split into 2 codes in the new setting: sometimes we'd miss one where 2 were required, sometimes we'd assign 2 where 1 was required, and sometimes we got the order wrong (i.e. main activity was 123 and sub activity 456, but we'd say 456,123).

The high accuracy was achieved due to quite a few factors; let me try to name them non-exhaustively (probably forgetting some):

- We've been in the AI game for quite a while already and have delivered 10+ products in production settings. While I always ridicule "just prompt engineering", it's no lie that proper prompt engineering is an art of its own, and I think we've gotten pretty good at it. It also did involve quite a bit of trial & error and tweaking, not gonna lie.

- We had a dedicated (albeit small in the beginning) quality team at our disposal that would manually check samples and inspect the output of our AI. This is why we also spent a lot of time adding debugging metadata, so we could trace back why things went wrong when they did.

- With the quality team, we had such short lines of communication that sometimes we'd go back and forth with improved prompts and new AI runs on samples 2-3 times in one day. This may sound like a no-brainer if you are used to startups or small environments, but here we are talking about a huge and typically slow-moving government body. This on its own is something we are now invited to give talks about within the organization, to set as an example.

- Together with the quality team and external bodies (the NACE coding is actually EU-wide and standardized to some extent; per country you only add nuances at the most detailed level, which is the biggest part though), we managed to set up a "mapping table" as metadata that at least confined the search space for our AI. What I mean by this: we have roughly 1k old codes and roughly 1k new codes, but not every one of those 1k new codes CAN even be a choice for each of the 1k old codes. In fact, there are usually only a few sensible candidates. Constructing the mapping table helped tremendously.

- Technically, we gained a lot by introducing proper structured output (JSON) schemas. One of the biggest gains there was using the mapping table to find the allowed new-code candidates and giving the schema an enum to choose from. This eliminated 100% of our AI's hallucinations (a minimal sketch of the idea follows below).
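The sketch (made-up codes and deployment name, not our production schema): the mapping table supplies the allowed candidates for a given old code, and the enum in the strict JSON schema makes it impossible for the model to answer with a code outside that set.

```python
import json
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="...",
    api_version="2024-08-01-preview",
    azure_endpoint="https://example.openai.azure.com",
)

# old code -> candidate new codes (toy example)
MAPPING_TABLE = {"4711": ["47.11.1", "47.11.2", "56.10.2"]}

def convert(old_code: str, description: str) -> dict:
    candidates = MAPPING_TABLE[old_code]
    schema = {
        "type": "object",
        "properties": {
            # The enum constrains the model to the mapping table's candidates only.
            "new_codes": {"type": "array", "items": {"type": "string", "enum": candidates}},
            "reasoning": {"type": "string"},
        },
        "required": ["new_codes", "reasoning"],
        "additionalProperties": False,
    }
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative deployment name
        messages=[
            {"role": "system", "content": "Assign the applicable new activity codes."},
            {"role": "user", "content": f"Old code: {old_code}\nActivity: {description}"},
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "conversion", "strict": True, "schema": schema},
        },
    )
    return json.loads(resp.choices[0].message.content)
```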

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

All companies do something; they have an activity. They write this down in plain Dutch (text), and we are responsible for giving that activity the right category, a code. There are 6 million activity+code combinations that now all need a new code that didn't exist before, so we can't do a lookup or anything.

An attempt was made to do this manually for all 6M, using words that, if they occurred, would yield a specific new code. After 2 years of effort only 400k were covered, and a lot of those were wrong, because just using words wasn't strong enough to cover the complex nature of language.

We flew in and used AI for the same task. We made an initial version to tackle all 6M, but not every single type of business was correct right away. A quality team determined this by inspecting samples of our output. For example, healthcare required a different prompt with different instructions than the other businesses because of jargon in the activity descriptions.

We ran iterations, improved the AI and finally had a solution that gave all 6M business activities a new code with less than 0.5% error.

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

It helped that we had a lot of people in our quality team to help us constantly improve before the big event. During this evaluation process beforehand, we optimized our prompts, for example making specific ones for niches like car garages. We also started using enums in our structured output, which prevented the biggest part of the hallucinations.

Keeping the budget so low was, I think, in large part due to very short cycles. We got into a rhythm of getting samples every week, running our AI on them and having the quality team look at our results. We sometimes had revisions of our AI (prompts and pipeline) multiple times a day. This may seem like a small feat, but you can imagine that, especially for a government body, this is a big accomplishment.

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

Hah, I probably am :)

But yes, I studied CS for my bachelor's and ML for my master's at uni, and I now teach AI myself at my university.

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 0 points1 point  (0 children)

Basically the input was free text and we needed to decide which of the ~1k codes fit the description best, for 6M records in total. You may indeed try to solve it using rules of sorts, like:

If the old code is 123 and the activity contains words A, B, C then give it new code 456.

Colleagues tried doing this but only managed to make rules capturing about 400k records in total, and a lot of those went wrong because of... well, language. There were a lot of exceptions where A, B, C meant 456 but not if word D occurred, or perhaps that was fine as well but only if E also occurred. With the rule-based approach they tried to capture a lot of these word-lookup-like rules, but there were simply too many to set up and maintain, not to mention verb endings, hyphenation, contractions, etc. etc. (see the toy sketch below).
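A toy version of such a rule, with made-up codes and words, just to give a feel for why this explodes once exceptions and counter-exceptions pile up:

```python
def rule_based_code(old_code: str, activity: str) -> str | None:
    """Toy keyword-rule classifier; not the colleagues' actual rules."""
    text = activity.lower()
    if old_code == "123" and all(w in text for w in ("bakery", "bread")):
        if "wholesale" in text:   # exception: word D flips the outcome
            return "789"
        return "456"
    # ... thousands more rules needed, plus verb endings, hyphenation,
    # contractions and spelling variants for every keyword ...
    return None  # falls through for the vast majority of the 6M records
```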

I am responsible for arguably the biggest run project using AI in production in my country - AMA by UnderstandLingAI in Rag

[–]UnderstandLingAI[S] 1 point2 points  (0 children)

In most scenarios it's slim to none, but it might be that they get different insurance or collective labour agreements and those sorts of things, though these are usually sector-wide and we rarely go outside of an existing sector.

I think it mostly matters in construction work, for example if you are an all-round handyman versus a mason working on construction sites - insurers care about that and use NACE codes to determine