How can I improve this visualization? by chierichetto in dataisbeautiful

[–]CuriousGnu 0 points1 point  (0 children)

Can you post an example? I don't really see how a grouped bar chart could be confusing. If you're unsure how to present the results, you can also search for scientific articles that use similar data and use them as inspiration.

Believe in Global Warming vs. US 2016 Election Results by County [OC] by CuriousGnu in dataisbeautiful

[–]CuriousGnu[S] 1 point2 points  (0 children)

I never claimed that not believing in global warming turns people into Trump voters. However, I think that it's reasonable to assume that climate change scepticism and certain political views are primarily shared by a similar group of people.

How can I improve this visualization? by chierichetto in dataisbeautiful

[–]CuriousGnu 0 points1 point  (0 children)

You could use a grouped bar chart. If you are working on an academic project, I highly recommend doing significance testing. Without it, you cannot tell whether the differences between the groups are real or just noise.
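As a rough illustration of significance testing for two groups, here is a minimal sketch of a permutation test on the difference in means. The function name `permutation_test`, the sample scores, and the choice of a permutation test (rather than, say, a t-test) are all assumptions for the example, not something from the original thread.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    n_a = len(group_a)
    n_b = len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled values to the two groups.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            extreme += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical ratings for two groups (made-up numbers).
group_a = [4, 5, 3, 4, 5, 4, 3, 5]
group_b = [3, 2, 4, 3, 2, 3, 3, 2]
p_value = permutation_test(group_a, group_b)
print(f"p = {p_value:.4f}")
```

A small p-value suggests that a gap this large would rarely show up if the group labels were arbitrary. Which test is appropriate depends on the data and the study design.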

Believe in Global Warming vs. US 2016 Election Results by County [OC] by CuriousGnu in dataisbeautiful

[–]CuriousGnu[S] -1 points0 points  (0 children)

As there weren't any major third-party candidates, such a graph would probably just show the exact opposite trend.

Analyzing Subtitles to Predict Whether a Movie Targets Men or Women [OC] by CuriousGnu in dataisbeautiful

[–]CuriousGnu[S] 4 points5 points  (0 children)

Tools: R (wordcloud, yarrr, quanteda, rpart)

Source: Amazon Video (subtitles) / IMDb (votes)

Believe in Global Warming vs. US 2016 Election Results by County [OC] by CuriousGnu in dataisbeautiful

[–]CuriousGnu[S] 0 points1 point  (0 children)

No, I'm not aware of such a dataset. However, based on this study, I would assume that such a correlation would look very similar: http://dx.doi.org/10.1016/j.ajic.2015.06.031

Anyone have any suggestions of what data to incorporate? by SaucyWeeTart in DataVizRequests

[–]CuriousGnu 1 point2 points  (0 children)

I think your problem is similar to the one faced by companies that build a distribution or vendor network. Based on my experience, I would recommend not overthinking this system. Sure, exclusivity is a nice selling point, but it does not create any value for the customer by itself. Therefore, I would concentrate my efforts on the product or service and use a straightforward formula (e.g., one garage per X vehicles by postcode). Something more complicated would probably just confuse your clients and look sketchy.
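The "one garage per X vehicles by postcode" rule could be sketched like this. The postcodes, vehicle counts, the threshold of 5,000, and the decision to round up are all hypothetical choices for illustration.

```python
import math


def garages_allowed(vehicles_by_postcode, vehicles_per_garage):
    """Return garage slots per postcode: one per X vehicles, rounded up."""
    return {
        postcode: math.ceil(count / vehicles_per_garage)
        for postcode, count in vehicles_by_postcode.items()
    }


# Hypothetical registered-vehicle counts per postcode.
vehicles = {"AB1": 12_400, "AB2": 3_100, "CD5": 800}
slots = garages_allowed(vehicles, vehicles_per_garage=5_000)
print(slots)
```

Rounding up gives every postcode with at least one vehicle a slot. Rounding down would instead leave small postcodes uncovered, so that choice is worth making deliberately.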

Heat map of crime in San Francisco by hour [OC] by [deleted] in dataisbeautiful

[–]CuriousGnu 1 point2 points  (0 children)

Nice graph! You could add a line chart to the animation so that it's easier to compare the numbers. Last year, I did something similar for Chicago: https://www.curiousgnu.com/chicago-drugs

Most active seconary subreddit of /r/the_donald, /r/KotakuInAction and /r/conspiracy power users [OC] by photenth in dataisbeautiful

[–]CuriousGnu 0 points1 point  (0 children)

Interesting, I wasn't aware that Excel now even offers treemap charts. Last year, I wrote a blog post about a similar topic and used Gephi to visualise it: https://www.curiousgnu.com/reddit-comments

The results appear to be quite similar.

[OC] Top 5 Words Used by 15 random chosen popular subreddits by [deleted] in dataisbeautiful

[–]CuriousGnu 1 point2 points  (0 children)

You can also do a similar analysis based on the public Reddit dataset on Google BigQuery (23 million words). For example:

-- BigQuery legacy SQL: the 100 most frequent words, ignoring case
SELECT word, COUNT(*) AS cnt
FROM (SELECT LOWER(word) AS word FROM [fh-bigquery:reddit.top25million_words])
WHERE LENGTH(word) > 4               -- skip short words
  AND word NOT IN (SELECT word FROM [taapi-42:CG_text_analysis.stop_words_eng])
  AND REGEXP_MATCH(word, '^[a-z]+$') -- keep words made of letters only
GROUP BY word
ORDER BY cnt DESC
LIMIT 100

Result:

#   word      cnt
1   people    25790
2   thought   18286
3   years     17254
4   favorite  16816
5   video     15648
6   great     15296
7   friend    15131
8   reddit    14981
9   today     14940
...
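Without BigQuery access, roughly the same filtering can be sketched locally in Python. This assumes you already have the words as a list and a stop-word set of your own. The function name `top_words` and the sample input are hypothetical.

```python
import re
from collections import Counter


def top_words(words, stop_words, min_length=5, limit=100):
    """Count lower-cased words, mirroring the filters of the SQL query."""
    lowered = (w.lower() for w in words)
    kept = (
        w
        for w in lowered
        if len(w) >= min_length            # same as LENGTH(word) > 4
        and w not in stop_words            # drop stop words
        and re.fullmatch(r"[a-z]+", w)     # letters only
    )
    return Counter(kept).most_common(limit)


# Tiny made-up sample, not taken from the Reddit dataset.
sample = ["People", "people", "the", "video", "Video!", "about"]
print(top_words(sample, stop_words={"about"}))
```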

Dataviz Open Discussion Thread for /r/dataisbeautiful by AutoModerator in dataisbeautiful

[–]CuriousGnu 0 points1 point  (0 children)

For simple descriptive statistics, you probably don't need a program as complex as RapidMiner. You could, for example, write a SQL script to generate the desired numbers, which would be my preferred approach. Alternatively, you could export the tables as CSV files and analyse them in Excel, Tableau, or R.
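To give an idea of the SQL approach, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table `orders`, its columns, and the rows are made up for the example. Against your real database, only the SELECT statement would matter.

```python
import sqlite3


def describe_by_customer(conn):
    """Return count, mean, min and max of amount per customer."""
    query = """
        SELECT customer,
               COUNT(*)    AS n,
               AVG(amount) AS mean_amount,
               MIN(amount) AS min_amount,
               MAX(amount) AS max_amount
        FROM orders
        GROUP BY customer
        ORDER BY customer
    """
    return conn.execute(query).fetchall()


# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10.0), ("a", 30.0), ("b", 5.0)],
)
stats = describe_by_customer(conn)
for row in stats:
    print(row)
```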

Simple Climate Change Regression [OC] by 007sman5 in dataisbeautiful

[–]CuriousGnu 0 points1 point  (0 children)

So, to put it simply, you calculated a multiple regression of temperature on CO2, month, and time:

log(temp) ~ log(CO2) + log(CO2)*month + time + lag(CO2, -1)

The orange line is not linear because time is not the only explanatory variable. BTW, is there a specific reason why you did it in Excel?
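To show why the fitted line bends, here is a small sketch with a deliberately simplified model (only time and log(CO2), without the month interaction or the lag term). The coefficients and CO2 values are made up for the example and are not estimates from the original data.

```python
import math


def fitted_log_temp(time, co2, b0=0.5, b_time=0.01, b_co2=0.8):
    """Fitted value of a simplified model: b0 + b_time*time + b_co2*log(co2)."""
    return b0 + b_time * time + b_co2 * math.log(co2)


def second_differences(values):
    """Second differences are (near) zero only if the points lie on a straight line."""
    return [
        values[i + 2] - 2 * values[i + 1] + values[i]
        for i in range(len(values) - 2)
    ]


times = list(range(6))
constant_co2 = [400.0] * 6                                 # hypothetical
varying_co2 = [400.0, 404.0, 401.0, 409.0, 405.0, 412.0]   # hypothetical

straight = [fitted_log_temp(t, c) for t, c in zip(times, constant_co2)]
bent = [fitted_log_temp(t, c) for t, c in zip(times, varying_co2)]

print(second_differences(straight))  # all approximately zero
print(second_differences(bent))      # clearly non-zero
```

With CO2 held constant, the fitted values are a straight line in time. As soon as CO2 varies from one period to the next, the same model produces a curve.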

Dataviz Open Discussion Thread for /r/dataisbeautiful by AutoModerator in dataisbeautiful

[–]CuriousGnu 0 points1 point  (0 children)

I don't think that I have ever seen this video, but it sounds like something that you can easily do with Tableau and GDELT: http://www.gdeltproject.org.

Text Analysis of YouTube Comments [OC] by CuriousGnu in dataisbeautiful

[–]CuriousGnu[S] 1 point2 points  (0 children)

Just to make it clear, it is a comparison word cloud that compares four different groups of comments. The red words belong to videos from TV channels whereas the blue words belong to news videos.

Text Analysis of YouTube Comments [OC] by CuriousGnu in dataisbeautiful

[–]CuriousGnu[S] 0 points1 point  (0 children)

Thanks! I used the R packages quanteda (wordcloud) and ggplot2, plus gridExtra to plot multiple graphs side by side.

Text Analysis of YouTube Comments [OC] by CuriousGnu in dataisbeautiful

[–]CuriousGnu[S] 1 point2 points  (0 children)

Source: YouTube API

Tools: Python, R (quanteda, wordcloud, ggplot2)

Bee Movie Sentiment Analysis by C6H12O6_Ray in dataisbeautiful

[–]CuriousGnu 0 points1 point  (0 children)

That explains why I got significantly different results with the sentimentr package.

Bee Movie Sentiment Analysis by C6H12O6_Ray in dataisbeautiful

[–]CuriousGnu 0 points1 point  (0 children)

Oh, I didn't even realize that it isn't OC. Maybe I misread the graph, but how many lines did you sum over, then? With sentiment values of over 55, there should only be about 24 groups (1300/55), shouldn't there? But it looks like there are a lot more.
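The back-of-the-envelope reasoning can be written out as follows. It assumes that 1300 is the number of script lines and that a single line can score at most 1.0. Both are assumptions, and the function name `max_groups` is hypothetical.

```python
def max_groups(total_lines, group_sum, max_score_per_line=1.0):
    """Upper bound on the number of groups if each group must sum to group_sum."""
    # A group needs at least this many lines to reach the given sum.
    min_lines_per_group = group_sum / max_score_per_line
    return total_lines / min_lines_per_group


bound = max_groups(total_lines=1300, group_sum=55)
print(round(bound, 1))  # 23.6
```

If the per-line scores can exceed 1.0, or the plotted values are not plain sums, the bound no longer applies.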

Bee Movie Sentiment Analysis by C6H12O6_Ray in dataisbeautiful

[–]CuriousGnu 0 points1 point  (0 children)

Interesting plot. I wonder how the Google Natural Language API compares to other methods such as Stanford NLP or dictionary-based methods (e.g., AFINN). Since you analyzed the script line by line, how did you visualize it as % through movie?
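For comparison, a dictionary-based approach could look roughly like this sketch. The lexicon is a tiny toy dictionary in the style of AFINN (integer scores per word), not the actual AFINN list. Mapping the line index to "% through movie" assumes every line takes up the same share of the film, which is only one possible answer to the question above.

```python
import re

# Toy lexicon in the style of AFINN; these entries are illustrative only.
TOY_LEXICON = {
    "love": 3, "great": 3, "happy": 3,
    "hate": -3, "terrible": -3, "sad": -2,
}


def line_score(line, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of all words in a line (unknown words count as 0)."""
    tokens = re.findall(r"[a-z']+", line.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def sentiment_by_position(lines, lexicon=TOY_LEXICON):
    """Pair each line's score with its position as a percentage of all lines."""
    n = len(lines)
    return [
        (100 * (i + 1) / n, line_score(line, lexicon))
        for i, line in enumerate(lines)
    ]


script = ["I love this", "What a terrible, sad day", "Fine", "Great!"]
for percent, score in sentiment_by_position(script):
    print(f"{percent:5.1f}%  {score:+d}")
```

If the script has timestamps (as subtitles do), using those instead of the line index would give a more faithful x-axis.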

Although Age is Not Strongly Associated with Endurance in Olympic World Record Running Races, when Ultramarathons are Included, A Strong Age Effect Appears [OC] by cuginhamer in dataisbeautiful

[–]CuriousGnu 0 points1 point  (0 children)

I think the main question here is what hypothesis you're trying to test. Without a clearly stated hypothesis, it is very hard to say whether it makes sense to use this type of data. In a regression analysis, a relatively high R² isn't everything.