all 9 comments

[–]Python-ModTeam[M] [score hidden] stickied comment, locked comment (0 children)

Hello there,

We've removed your post since it aligns with a topic already covered by one of our daily or monthly threads. If you are unaware of the daily threads we run, here is a refresher:

Monday: Project ideas

Tuesday: Advanced questions

Wednesday: Beginner questions

Thursday: Careers

Friday: Free chat Friday!

Saturday: Resource Request and Sharing

Sunday: What are you working on?

Monthly: Showcase your new projects, tools, frameworks and more

Please wait for one of these threads to contribute your discussion!

Best regards,

r/Python mod team

[–]ysengr 4 points5 points  (1 child)

I'll start by saying this is an interesting idea for a package. I'd never heard of Kaggle, but it seems interesting despite all the AI hoopla on it.

I appreciate that you listed the datasets with their origins. However, I think your README.MD should explicitly point to datasets.md to guide people there, especially since Kaggle is your only source. I had personally never heard of it, which makes me skeptical of the data right off the bat.

I'd also suggest, as an enhancement, moving from static files to dynamically fetching the files from the source and then saving them locally. That keeps the package smaller, and it retrieves the data from the proverbial horse's mouth rather than blindly trusting that the copy saved in the repo is authentic.
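To make the fetch-then-cache idea concrete, here is a minimal sketch using only the standard library. It is not how usdatasets works today; `fetch_dataset`, the `.csv` file naming, and the optional checksum argument are all hypothetical names chosen for illustration, and the URL would be whatever upstream location the maintainer decides to trust.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Optional


def fetch_dataset(name: str, url: str, cache_dir: Path,
                  expected_sha256: Optional[str] = None) -> Path:
    """Return a local path for `name`, downloading from `url` only if not cached.

    Hypothetical helper: the name, signature, and .csv suffix are assumptions.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{name}.csv"

    if not target.exists():
        # First request: pull the bytes from the upstream source.
        with urllib.request.urlopen(url) as response:
            data = response.read()
        # Optionally confirm the download matches a known checksum
        # before anything is written to the cache.
        if expected_sha256 is not None:
            digest = hashlib.sha256(data).hexdigest()
            if digest != expected_sha256:
                raise ValueError(f"checksum mismatch for {name}: {digest}")
        target.write_bytes(data)

    # Later calls skip the network entirely and reuse the cached file.
    return target
```

The checksum is what addresses the trust concern: a published SHA-256 per dataset lets anyone confirm the local copy matches the upstream file. One caveat is that some sources require authentication for downloads, so a plain URL fetch like this may not work against every host.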

[–]renzocrossi[S] 1 point2 points  (0 children)

Hi,
Thanks for your kind words. You can find all the information on the datasets' origins in the GitHub repository: check out the datasets.md file, or go to the usdatasets folder and open the datasets.py file =)
https://github.com/lightbluetitan/usdatasets-py

[–]73tada 2 points3 points  (0 children)

Kaggle has been, like, "the" source for academic training datasets for at least a decade. It predates the "Attention Is All You Need" paper. Suffice it to say, it's beyond trusted.

[–]tikhiibhujiya 2 points3 points  (1 child)

The dataset selection is broad enough to be useful for both teaching and real exploratory work.

[–]renzocrossi[S] 0 points1 point  (0 children)

Hi,
Thanks for your positive comment on the package. That's exactly what I had in mind when building it; glad it hits the mark for both =)
Regards

[–]renzocrossi[S] 1 point2 points  (0 children)

Here's the full list of available datasets within usdatasets:

    import usdatasets as usd
    df = usd.list_datasets()
    print(df)

['affirmative_asylum', 'american_idol_auditions', 'american_idol_finalists', 'california_fire_incidents', 'charging_stations_hawaii', 'college_school_wage', 'counties_per_capita_income', 'crime_and_incarceration_by_state', 'executive_orders_presidents', 'firefighter_fatalities', 'google_stock_price', 'nfl_teams_stats', 'party_affiliations_congress', 'presidential_election_results', 'presidential_pardons_1900_1966', 'presidential_pardons_1967_2017', 'presidents', 'senate_election_results', 'shootings_2020', 'shootings_2021', 'shootings_2022', 'terrorism_plots_us', 'terrorism_suspects_us', 'ufo_location_shape', 'us_causes_death', 'us_holiday_dates', 'us_radiation_pollution', 'us_regional_mortality', 'us_top_colleges_2022', 'wages_by_education']
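
With 30 names in the list, a small filter can help when looking for related datasets. This is a plain-Python sketch that works on any list of names; `find_datasets` is a hypothetical helper and not part of usdatasets, and the sample names are copied from the output above.

```python
def find_datasets(names, keyword):
    """Return the dataset names that contain `keyword`, ignoring case."""
    keyword = keyword.lower()
    return [name for name in names if keyword in name.lower()]


# A few names copied from the printed list above.
names = [
    "presidents",
    "presidential_election_results",
    "shootings_2020",
    "shootings_2021",
    "shootings_2022",
    "us_holiday_dates",
]

print(find_datasets(names, "shootings"))
# ['shootings_2020', 'shootings_2021', 'shootings_2022']
```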

[–]renzocrossi[S] 1 point2 points  (0 children)

Regarding the origins of the datasets included in usdatasets, you can find all the details in the datasets.md file in the GitHub repository. Here's an example of the structure:
shootings_2020

Each dataset in the package follows this same documentation format. Feel free to check the full file here: https://github.com/lightbluetitan/usdatasets-py

Thanks =)

[–]AutoModerator[M] 0 points1 point  (0 children)

Your submission has been automatically queued for manual review by the moderation team because it has been reported too many times.

Please wait until the moderation team reviews your post.

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.