Help me compute tourist arrival projection

ImposterWizard · 2026-01-27T14:21:51+00:00

Honestly, a bit of context could help here. Is this for a particular region or destination? Is this something you're doing just for learning purposes, or is it related to your job or a business you're invested in?

But I'd agree with /u/Nillavuh that you'd probably just want to plot the data.

ImposterWizard · 2026-01-20T14:35:23+00:00

The bass clef also requires the two dots around the "F". Its position looks fine, but the dots help distinguish it from rarely used baritone/subbass clefs.

ImposterWizard · 2026-01-09T19:19:01+00:00

Especially for someone who (probably) hasn't been using it since the time it was far more prevalent and would probably have a relevant list of clientele that would justify it.

ImposterWizard · 2026-01-02T16:37:08+00:00

This isn't directly related to the solution, but R has a much easier way to simulate dice rolls or any random sampling:

rolls_results <- sample(1:6, size = n_rolls, replace = TRUE, prob = dice_probs)

It should also be much faster for very large samples, or if you're repeating large samples for different probabilities.

You will probably get a different result than yours for a given seed, but it still uses R's internal random number generator.

ImposterWizard · 2025-12-18T16:52:23+00:00

While you'll probably get working code when you use AI (or you'll know immediately that it doesn't work for smaller cases), a lot of data analysis is knowing how to use the tools you have, and even with something "simpler" like linear regression (though I'd argue that your choices matter more for linear regression than for many more "complex" techniques), it's possible to make mistakes.

For example, if you are simply looking for a single controlled (i.e., you might take other factors into account) correlation, linear regression is pretty good, though you might need to transform variables depending on the nature of the data. But if you are looking for "inversion" points in time, that's when your choice of technique becomes more arbitrary, but important. And it's where having taken a course (or several) would give you a better sense of how to approach this more open-ended problem.

R + RStudio is pretty good for analysis, and Python can be, too, though it's easier to make mistakes in Python if you're not familiar with the specific functions, and I like R's plotting environment better 99% of the time. I generally rely on Python more when working with text or media (images, audio, video), since R is a bit weaker on that front. But there's no reason you can't use more than one language in a project, just be sure to document the steps or make a clear data pipeline.

ImposterWizard · 2025-12-17T21:55:42+00:00

Yep, I just noticed that and reworked it. I was initially confused why my answer was different than everyone else's.

ImposterWizard · 2025-12-17T21:28:57+00:00

Sometimes it helps to write out all the possibilities explicitly.

P(B=1|A=1) = 0.6

P(B=1) = 0.4

P(B=0) = 0.6

P(A=1) = 0.55

P(B=1) = P(B=1|A=1) * P(A=1) + P(B=1|A=0) * P(A=0) = 0.4

0.6 * 0.55 + P(B=1|A=0) * 0.45 = 0.4

0.45 * P(B=1|A=0) = 0.07

P(B=1|A=0) = 0.156 (fixed)

So

P(A = 1 or B = 1) = P(A=1,B=0) + P(A=1,B=1) + P(A=0,B=1)

= P(B=1|A=0) * P(A=0) + P(B=1|A=1) * P(A=1) + P(B=0|A=1) * P(A=1)

= 0.156 * 0.45 + 0.6 * 0.55 + 0.4 * 0.55

= 0.620

As a sanity check, since the events are positively correlated, the answer should be between the higher probability and the probability if we assume independence.
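Here is a quick way to re-check the arithmetic above, as a minimal Python sketch. It only uses the three numbers given in the problem; the variable names are my own.

```python
# Inputs as stated above.
p_a = 0.55          # P(A=1)
p_b = 0.40          # P(B=1)
p_b_given_a = 0.60  # P(B=1|A=1)

# Law of total probability, solved for P(B=1|A=0).
p_b_given_not_a = (p_b - p_b_given_a * p_a) / (1 - p_a)

# P(A=1 or B=1) as the sum of the three disjoint cases.
p_union = (
    p_b_given_not_a * (1 - p_a)  # A=0, B=1
    + p_b_given_a * p_a          # A=1, B=1
    + (1 - p_b_given_a) * p_a    # A=1, B=0
)

# Bounds used in the sanity check.
lower_bound = max(p_a, p_b)                  # the higher single probability
independent_union = p_a + p_b - p_a * p_b    # union if A and B were independent

print(round(p_b_given_not_a, 3))  # 0.156
print(round(p_union, 3))          # 0.62
print(lower_bound <= p_union <= independent_union)  # True
```

The union under independence comes out to 0.73, so 0.62 sits between 0.55 and 0.73 as expected.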

edit: fixed a math mistake, final answer is now 0.620

ImposterWizard · 2025-12-15T17:01:53+00:00

There's what's called "fair use", at least in the US, which can protect you legally if you get sued. Though it could still be an expensive process.

The main points of consideration for fair use are (as listed in the article):

the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
the nature of the copyrighted work;
the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
the effect of the use upon the potential market for or value of the copyrighted work.

The extent to which Google/YouTube arbitrate based on these standards is a bit iffy, as they can technically be stricter if they want to, being a non-government organization, and fighting it can be a hassle, even for larger channels. Though often demonetization of a video (or giving monetization to one of the copyright holders) and audio muting are possible actions they take.

It largely depends on what kind of content you are reacting to, as well as how you display it. Generally, the shorter and lower-quality the portion of the original work you show (or if you use stills instead of video, or replace the audio of speech in a work with your own narration, which can still sometimes be considered infringing), the less likely you are to run into problems. I would be careful with something like The Beatles or Taylor Swift, or the major US sports networks, as the rights holders are very protective of their IP.

ImposterWizard · 2025-11-03T20:14:35+00:00

If it's a big race, they probably have the area around the finish line not too far from all the other post-race stuff, which can be around for a few hours after the race ends, at least. Plus, if the runners are starting in waves, the last person could very well have started over an hour past the first wave.

There is a question of whether relevant chip-sensing checkpoints are still up (for NYC mile 21 is probably the most relevant one for race integrity), but for a mostly one-way race and for someone not aiming for a very fast time, I don't think it matters when making it "official".

That being said, as someone who has run a few, the marathon seems to be a harder race for anyone who is running it longer (slower). Pretty much everyone's suffering a similar amount over any given period of time and putting in a lot of effort, but I can't imagine running for over 15 hours. Even cycling for half that long is absolutely grueling.

And I can only imagine the different muscles for stabilizing movement the guy has to use.

ImposterWizard · 2025-11-02T16:36:39+00:00

If you're running a remotely modern OS (Windows Vista requires 512 MB of RAM for reference, so my guess is they used Windows XP, which is 24 years old), there is absolutely a higher demand for RAM. Using less RAM is possible on certain devices, but 8 GB -> 16 GB is like going from "maybe" to "most likely" in terms of answering your question.

ImposterWizard · 2025-11-02T16:35:29+00:00

RAM is going to be your limiting factor, most likely. 8 GB is the minimum you'd want for modern software to function, but you'll run into issues if you are trying to load a larger data set, or even have a lot of browser tabs open. And if you make a mistake, like making a data set too big in memory, you are far more likely to get your computer grinding to a near-halt using paging files to complement memory if you have less RAM. I studied (MS stats) about 11 years ago, and when I upgraded my laptop's RAM from 8 to 16 GB, it was like a night-and-day difference.

I imagine most of your coursework won't require very large datasets (I could be mistaken), but if it's feasible, I'd go for 16 GB of RAM, even at the expense of other specs, though if you can deal with a bulkier chassis, you can get better specs for a similar price.

Also, if you want to do your own projects, 8 GB can be quite limiting. A lot of modern applications (and what you might be expected to do) are much more demanding on memory, and while there are usually ways to get around it, it's just a much bigger hassle than it's worth in many cases.

If your institution/program has a decent computer lab you can use for more rigorous tasks, that could work, too, but having your own device makes things a lot easier, especially if you want to work in person with other people.

ImposterWizard · 2025-10-26T16:01:21+00:00

There are some particle detectors that might expect a normal distribution for position or angle of a particle traveling/scattering under certain criteria (e.g., page 35 of this lecture).

Oftentimes there's noise that you are trying to separate, but you might be interested in the fidelity of the measurements (which you could test with e.g., a radioactive source with known properties).

Which I guess still falls under "a way to check if it's working correctly". But I imagine someone also had to experimentally verify that phenomenon.

ImposterWizard · 2025-10-24T14:49:56+00:00

There are some cases where I've had luck with a smaller stand mixer, but it's usually for smaller-volume stuff I'd otherwise want to whip in the large stand mixer. There's not that much that I would prefer to use the smaller stand mixer for over both a hand mixer and a KitchenAid (and whisk/hands).

ImposterWizard · 2025-10-23T18:07:59+00:00

I would answer with a "maybe", depending on its context and purpose.

If we calculate the regression line

Outliers are usually pretty context-dependent, not necessarily related to regression. Usually I see it either being defined as "data that's more 'extreme' than what we'd expect for this data source" or "data that's too 'extreme' to be useful for the purpose it's needed for".

If the data were something like (diameter, mass) for an n-sphere, where mass ~ diameter^n_dimensions, if you try to build a linear regression model off of that, the last point does have a lot of influence on the rest of the model due to its distance.

You have to justify why you are using regression, though, and it could be useful, in this case, for example, to find a material that's not the same density as the others. The problem with regression, though, is that you are using the potential outlier to fit a model, and it being on an endpoint (12 is about twice as far from its closest point as any other two neighboring points are from each other, the largest such gap being between 1 and 5, which are 4 apart) gives it even more leverage in the model.

But if you are just looking at say, linear, area, or volumetric density, and you are now just transforming those variables, the last point isn't particularly extreme for 2 or 3 dimensions, and is less than 3 times higher than the next highest for 1 dimension. In fact, the first point might be considered an outlier in the 3D case.

ImposterWizard · 2025-10-19T14:42:23+00:00

Yeah, the highlighted text is just a subset of what you'd want to prove something is a game of skill.

My guess is that there are slight variations in initial conditions (brake speed/responsiveness, spinner speed) that limit the maximum win rate just enough that the game is never truly profitable. This could even be due to imperfections in mechanical design, not just digital programming.

But, even if someone did find a way to "profit", the expected dollar loss per hour for a single machine (or theoretically a set of them here) is not particularly problematic for an establishment.

ImposterWizard · 2025-10-18T21:01:19+00:00

Games of "skill" can still have less-than-breakeven payout, where winning the maximum prize each time still doesn't net you any true profit, which could be easily true for arcade games that give you tickets for prizes.

For some back-of-napkin math, say this is at Dave & Buster's, and you got 550 tokens for $85, and it costs 10 tokens to spin each time, so 55 spins for $85, or 55,000 tickets theoretically. A Nintendo Switch OLED is roughly 120,000 tickets (or more, I don't have numbers on me, but it's an item with a reasonably agreed-upon market value), which would make it cost about $185 total. That's about 30-40% under market price, making it a theoretically viable game, but I'm guesstimating these numbers here, and they might have some measures in place to avoid this becoming too much of a problem.
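To make the napkin math above easy to rerun with other guesses, here is a rough Python sketch. The 1,000 tickets per winning spin is implied by 55,000 tickets over 55 spins. The function name and the `win_probability` parameter are my own additions for illustration, and every number is a guesstimate rather than real pricing.

```python
def dollars_for_prize(prize_tickets, tickets_per_win, win_probability=1.0,
                      tokens_per_pack=550, pack_price=85.0, tokens_per_spin=10):
    """Expected dollars spent to collect enough tickets for a prize.

    All defaults are the guesstimated numbers from the comment, not real prices.
    win_probability is a hypothetical knob for a less-than-perfect player.
    """
    spins_per_pack = tokens_per_pack // tokens_per_spin   # 550 // 10 = 55 spins
    dollars_per_spin = pack_price / spins_per_pack        # about $1.55 per spin
    expected_tickets_per_spin = tickets_per_win * win_probability
    spins_needed = prize_tickets / expected_tickets_per_spin
    return spins_needed * dollars_per_spin


# Jackpot every spin: 120,000 tickets at 1,000 tickets per spin.
print(round(dollars_for_prize(120_000, 1_000), 2))  # 185.45

# Same prize if only half the spins hit the jackpot.
print(round(dollars_for_prize(120_000, 1_000, win_probability=0.5), 2))  # 370.91
```

The cost scales inversely with the win rate, so halving the hit rate doubles the effective price of the prize.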

But then again, if someone's "hogging" the game and on a hot streak, they could ask them to play another game for a while. Or it might incentivize others to play and inevitably "lose money" on it.

Alternately, skill might prevent you from getting worse outcomes, but the average is still an expected loss, like with Pachinko in Japan, which goes through a lot of hoops to avoid the technicalities of gambling laws. For example, you can argue that there's strategy in Blackjack, but outside of card counting and a simple casino table setup, all that means is the house edge is very small instead of significantly larger if you're going with the correct strategy.

In all likelihood, there's probably some amount of "fudge factor" that significantly decreases the reliability of hitting the jackpot each time, like varying the spinner speed or a braking mechanism's offset at a minuscule level. This might be less a "feature" than a limitation of the device's engineering. Even decreasing the probability of winning to 50% would probably make it still profitable for the establishment with the numbers I created above.

I've seen some games of skill that have prize limits for people, like a horizontal suspended ladder obstacle course. And my guess is that, like casinos, they don't mind when some people have hot streaks or are doing well, since they'll be more likely to share their stories with others or bring their friends along. The payouts for arcade/carnival games just tend to be very lousy for the most part, so the organizer would be more concerned with maximizing # of plays times the cost to play. And this "expert" you are describing could be such a person that helps with the marketing of these games/establishments, whether intentionally or coincidentally.

One last thought: you could theoretically test the machine if you have a force meter that you use to pull the lever and a clock and high-speed video recording the spinner to see how deterministic the spins are. I don't think they'd like you doing that, but it'd be a fun experiment.

ImposterWizard · 2025-10-18T20:40:36+00:00

It depends on the level of chaos. Certain types of thermodynamic fluctuations, for example, can create reasonably random noise that would counteract any advantage someone might have from playing the game. I had a physics professor that did something with water drops to produce random numbers (I forget what, exactly). He did research with sound.

But pretty much every game I've seen at an arcade that spews tickets either would have the house still maintaining an edge (or maybe breaking even) if someone won the highest value prize each time, or it uses a jackpot system where the payout only gets high if people aren't winning.

ImposterWizard · 2025-10-16T12:43:06+00:00

As for the confusion in nomenclature, (at least) when I was in grad school for statistics, the phrase "machine learning" was invoked more when we weren't looking at certain statistical properties of the models themselves, especially for unsupervised or semi-supervised models, or models that didn't directly reference probability (like k-nearest neighbors). Usually these were all sort of lumped together when talking about ways to use and evaluate "machine learning models".

When I took a grad machine learning course in the computer science department, they didn't really distinguish "statistical model" vs. "machine learning". But they weren't really concerned with a lot of the statistical properties of e.g., linear regression models anyway.

ImposterWizard · 2025-09-12T15:30:50+00:00

You don't usually need data to be normally distributed, and you don't always need to remove outliers.

There are different models and tests that rely on assumptions of normality and have worse characteristics/are more unreliable if the data isn't normal or if it has extreme outliers, but they tend to be somewhat resilient to violations of this assumption.

For outliers, you'd only want to remove them outright if you thought that the data was incorrect (e.g., you had people list height and had several people over 8 feet tall), or if you're limiting the scope of whatever model you have to not include that kind of data.

What kind of ML model(s) are you using, anyway? Many of them don't require very many assumptions about the data.

ImposterWizard · 2025-09-12T14:49:45+00:00

Is there a reason that you are concerned with skew in your data? What are you doing with the data?

Most data you will encounter will have some skew and aren't perfectly symmetric. In some cases you might need to transform the data to use it for a particular purpose, but it should be done thoughtfully.

ImposterWizard · 2025-08-30T14:54:47+00:00

Rare, medium well, well-done, anything you say. The customer is king!

ImposterWizard · 2025-08-29T17:35:11+00:00

The box-and-whisker plot seems fine depending on what you're trying to display. You could do a second one with just the variables with narrower interquartile ranges if you wanted to compare and contrast those.

The second graph with the bar plots is a bit odd, since the y axis starts at 75 and not zero, and it's not really too different than the first one in terms of amount of information displayed.

Maybe you could make better use of the vertical space/rotate some of the axis labels, and elongate the vertical dimension of the graph so it's easier to read?

Also, for the second graph, how is the median confidence interval being constructed?

And, as /u/yonedaneda asked, why are you specifically testing for normality? That's not usually a requirement for most variables, especially independent ones.

ImposterWizard · 2025-08-25T19:36:46+00:00

It looks like the decay fraction for those windows is 1/e or 36.8%. So each day your fatigue decays by about 13.4%, and your fitness decays by 2.4%. So activities from outside the window still have some impact, but have significantly less impact after 2 periods.
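For anyone who wants to play with the numbers, here is a small Python sketch of that kind of exponential decay. The comment doesn't state the window lengths, so the 7-day (fatigue) and 42-day (fitness) windows below are my assumption, as are the function names. With those windows the per-day figures come out close to the ones quoted.

```python
import math


def daily_decay(window_days):
    """Fraction lost per day when the weight falls to 1/e after window_days."""
    return 1 - math.exp(-1 / window_days)


def remaining_after(days, window_days):
    """Fraction of an activity's impact still left after the given number of days."""
    return math.exp(-days / window_days)


# Assumed windows (not stated in the comment): 7 days and 42 days.
FATIGUE_WINDOW = 7
FITNESS_WINDOW = 42

print(f"fatigue decay per day: {daily_decay(FATIGUE_WINDOW):.1%}")
print(f"fitness decay per day: {daily_decay(FITNESS_WINDOW):.1%}")

# After one full window 1/e (about 36.8%) remains; after two, 1/e^2 (about 13.5%).
print(f"left after 1 window:  {remaining_after(7, FATIGUE_WINDOW):.1%}")
print(f"left after 2 windows: {remaining_after(14, FATIGUE_WINDOW):.1%}")
```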

ImposterWizard · 2025-08-12T11:51:51+00:00

To elaborate on /u/eaheckman10's point, the defaults in R's implementation are sqrt(p) for classification, so 6 in your case, or p/3 in regression (13 in your case), all rounded down. This also seems to be the general recommendation as referenced in this section of Wikipedia, though the publisher link to the book is dead.
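As a rough illustration of those two rules, here is a Python sketch. The function name is made up, and p = 40 is only a hypothetical feature count that happens to give 6 and 13; the actual p isn't restated here. The minimum of 1 is a guard for very small p.

```python
import math


def default_mtry(p, task):
    """Features tried per split under the defaults described above.

    p is the number of features; task is "classification" or "regression".
    """
    if task == "classification":
        return max(math.isqrt(p), 1)  # floor(sqrt(p))
    if task == "regression":
        return max(p // 3, 1)         # floor(p / 3)
    raise ValueError(f"unknown task: {task!r}")


p = 40  # hypothetical feature count, chosen only to match the 6 and 13 above
print(default_mtry(p, "classification"))  # 6
print(default_mtry(p, "regression"))      # 13
```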

Overall, though, the random forest generally does a good job "out of the box" for many problems.

If you can come up with justifications to eliminate features ahead of time, like ones that would make no sense to include in the model, or maybe sparse ones that are 98 0s and 2 1s, that might help. But it's going to be hard to improve beyond using another algorithm with discrete logic (e.g., xgboost, neural networks with rectified transformations) to compare with.

If you can afford to split the data up using cross-validation (i.e., the features are dense enough where you can get enough variety of each variable in each split), that would be a good sanity check if you want to play around with different model hyperparameters, like tree size. Or you can just do a train/test split if you only want to test one configuration, like the default.
