By aoriole19
#93627
I'm taking a data analysis class at MIT, and we got a project assignment that was really open-ended, so of course I chose roller coasters. I'm going to be doing web scraping in R to get data from rcdb.com and do some kind of interesting analysis, but I haven't decided exactly what I'm going to do.

I considered comparing coasters made by B&M and by Intamin to determine which company made more "thrilling" coasters (comparing measures of height, speed, inversions, and G-forces). I also thought I could compare different states or countries to see which had the best coasters, or even compare the east coast to the west coast.

I'm not sure how profound these findings would be though, because it would take into account the averages, including kiddie coasters, etc. So I'm not sure what kind of test/comparisons would be the most interesting.
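
One way around the kiddie-coaster problem is to drop rides below some height cutoff before averaging. Here is a minimal Python sketch of that idea. The rows, the "Maker A"/"Maker B" names, the field names, and the 50 ft cutoff are all made up for illustration and are not rcdb data.

```python
from statistics import mean

# Hypothetical rows; real values would come from the scraped data.
coasters = [
    {"maker": "Maker A", "height_ft": 200, "speed_mph": 70},
    {"maker": "Maker A", "height_ft": 150, "speed_mph": 60},
    {"maker": "Maker A", "height_ft": 20, "speed_mph": 15},   # kiddie coaster
    {"maker": "Maker B", "height_ft": 180, "speed_mph": 66},
    {"maker": "Maker B", "height_ft": 30, "speed_mph": 20},   # kiddie coaster
]

def mean_by_maker(rows, field, min_height_ft=0):
    """Average `field` per maker, skipping rides shorter than min_height_ft."""
    groups = {}
    for row in rows:
        if row["height_ft"] >= min_height_ft:
            groups.setdefault(row["maker"], []).append(row[field])
    return {maker: mean(values) for maker, values in groups.items()}

all_rides = mean_by_maker(coasters, "speed_mph")
big_rides = mean_by_maker(coasters, "speed_mph", min_height_ft=50)
```

In this toy data the ranking flips once the small rides are filtered out, which is exactly the kind of effect the averages could hide. Comparing medians or only each maker's tallest few rides would be other reasonable choices.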

Basically, I'm posting this here mostly because I thought you guys would think it was interesting, and also if you have any suggestions or ideas I'd love to hear them! I'll post some updates if I do anything interesting.

PS: I am definitely a beginner (I've never used R before this class, and we haven't even used it all that much), so don't expect anything too crazy. Just getting the code to gather the data will be a huge accomplishment for me!
By arby
#93637
Sorry, I don't know if I'll be much help. I've never used R (and actually just had to Google it). My programming is primarily in C#, although I do some work with VB.Net and have used C++ and COBOL within the past decade. I also write a lot of scripts in PowerShell, although I don't consider that a 'language' per se.

If there is anything I can do to help, just let me know.
By aoriole19
#93738
I'm having a lot of problems actually getting the data, which is really frustrating because the project is due soon.

Can anyone help? The issue is that rcdb doesn't have all the data for every ride. For example, 1066 (https://rcdb.com/3111.htm) has all the stats like length, height, drop, etc., whereas 3 Ring Roller Coaster (https://rcdb.com/3896.htm) only has inversions and elements.

The problem, from what I understand (coming from a place of not knowing computer science), is that the website's source code marks up whatever statistic comes first in the same way for every ride. If I try to write a program in R that extracts lengths of roller coasters, it can correctly identify the length of 1066 (it's the first statistic), but it will incorrectly pick out "2" for 3 Ring Roller Coaster (since number of inversions is the first statistic listed).

Does anyone have any ideas how to get around this? You don't need to know how to code it, theoretical ideas are also welcome!!

(PS I can maybe try to do it in Python instead of R if anyone has specific suggestions for Python)
By Chris
#93740
I don't have any idea how to code, but is there a way that you can tell it to give you the number after it locates the word "Length:"?
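
For what it's worth, that idea can be sketched in Python with a regular expression. This is only a rough sketch: the sample strings below are made up, and it assumes the page text really does show each statistic as a label, a colon, and then a number, which may not match what rcdb actually serves.

```python
import re

# Made-up page text for two rides; the real rcdb text may be laid out differently.
ride_a = "Length: 1,000 ft Height: 80 ft Inversions: 0"
ride_b = "Inversions: 2 Elements: Loop"

def stat_after_label(text, label):
    """Return the first number that follows 'label:' in text, or None if the label is missing."""
    pattern = re.escape(label) + r":\s*(\d[\d,]*(?:\.\d+)?)"
    match = re.search(pattern, text)
    if match is None:
        return None  # this ride has no such statistic
    return float(match.group(1).replace(",", ""))

length_a = stat_after_label(ride_a, "Length")
length_b = stat_after_label(ride_b, "Length")
inversions_b = stat_after_label(ride_b, "Inversions")
```

Because the lookup is keyed on the label rather than on position, a ride with no length comes back as None instead of picking up its inversion count.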
By aoriole19
#93741
There might be a way to do something like that, but the only way I actually know how to get elements off a page is by using something called Selector Gadget, and my only options are to basically select the numbers, the words, or both.

I did try selecting both the words and the numbers so I could sort it myself, but then it makes each word and number a separate entry (not matched with each other) and it also picks up on weird stuff in the page like "language" "date format" "econo mode on/off" and ends up making too much clutter for me to sort it /:
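
If the words and numbers do come out as one flat list, a possible way to sort them is to keep only the labels you care about and pair each with the entry right after it. A minimal Python sketch, assuming the entries arrive in page order so that a value directly follows its label (the list and the label names here are made up for illustration):

```python
# Labels worth keeping; anything else ("Language", "Date format", ...) is ignored.
WANTED = {"Length", "Height", "Drop", "Speed", "Inversions"}

def pair_stats(entries, wanted=WANTED):
    """Pair each wanted label with the entry that immediately follows it."""
    stats = {}
    for label, value in zip(entries, entries[1:]):
        name = label.rstrip(":").strip()
        # Skip clutter, and skip a label directly followed by another label.
        if name in wanted and value.rstrip(":").strip() not in wanted:
            stats[name] = value
    return stats

# Hypothetical scraped entries, clutter included.
scraped = ["Language", "Date format", "Length", "1000 ft",
           "Height", "80 ft", "Inversions", "0"]
stats = pair_stats(scraped)
```

A ride that lacks a statistic simply has no key for it in the resulting dictionary, so missing values stay missing instead of shifting into the wrong column.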
By arby
#93744
Sorry, I don't scrape web page content, so I'm not familiar with that. Any time I reference external databases, I use their API and get the data most often through a web service or WCF. For example, I have servers in three locations and I make calls between them using WCF.

Since coast2coaster.com uses RCDB data, I'm assuming that RCDB has an API although it doesn't appear to be publicly published. If that is allowed, I would recommend contacting RCDB to see if they have an API that you can use. That would help ensure you are getting the correct data.
By aoriole19
#93759
Thanks for the suggestion. I might use rcdb data in a future class or project, but because I was in a time crunch I decided to switch my topic to comparing Duke and UNC statistics; those are much easier to find online.

If anyone else ever does any kind of analysis with rcdb data, though, definitely let me know. I'd be interested to see what can be done.
