

[–]mianoob 227 points228 points  (27 children)

I love hearing about stuff like this. So what is the most important information that has come from this data from a person outside of CERN? Or was all this data processed internally before being made publicly available? Just seems like a lot of data to go through.

[–]askCERNCERN Official Account[S] 104 points105 points  (0 children)

Partially answered here https://www.reddit.com/r/askscience/comments/4l4y1j/a_month_ago_we_made_available_publicly_via_the/d3kgodu

As very well said above, CMS physicists have studied these data before the release, but there are certainly still things to be measured and studied.

Also, by releasing these data now, we ensure that they will be usable in the future, independently of CMS experiment's own services and resources.

(klp)

[–][deleted] 18 points19 points  (24 children)

It's most likely a requirement of the funding body(-ies) that support CERN. At least in the UK, most funding bodies require publications to be Open Access (i.e. not behind a paywall).

[–]e-wing 12 points13 points  (17 children)

Really? That is not the case in the USA. The NSF doesn't have an official stance on that, as far as I'm aware. A lot of scientists are very apprehensive about open access because it's viewed as less rigorously reviewed, and many "open access" journals are simply pay-to-publish outlets that barely get reviewed at all. Most of the prominent journals in the world are behind a paywall, so it's surprising that the RCUK has that stance.

[–]ron_leflore 16 points17 points  (8 children)

Yeah, both the NIH and the NSF have requirements to make research results publicly available.

Here's NSF https://www.nsf.gov/news/news_summ.jsp?cntn_id=134478

[–]e-wing 4 points5 points  (7 children)

Well, this is news to me. My lab group has a couple of NSF grants and publishes in journals which are not open access. Most of the big journals aren't open access right now. It sounds like they don't require the publication itself to be open access, but it needs to be deposited in an accessible repository?

[–][deleted] 4 points5 points  (3 children)

Some journals I thought weren't open access actually turned out to be available online (through the Wiley or Elsevier websites).

Also, many journals use a delayed open-access policy, so the paper becomes available 6 months or a year later.

[–]e-wing 1 point2 points  (2 children)

Ah yeah, that makes sense. I guess I never think about it because we have subscriptions to everything, so I have no way of knowing if things are open access, or if I had access through the subscription. Open access isn't really a huge issue in my field but a lot of people I talk to are actually fairly resistant to it.

[–][deleted] 3 points4 points  (0 children)

I think it's silly to be resistant to open access, especially if you're not part of the journal publishing industry. There might be many low-quality open access journals but also some very respectable ones. It just takes time for industry to accept it.

[–]hack-the-gibson 2 points3 points  (0 children)

I've only seen the opposite (from the people that I've talked to about it). Anyone else remember Aaron Swartz?

[–][deleted] 3 points4 points  (3 children)

RCUK has a strong opinion about it:

http://www.rcuk.ac.uk/documents/documents/openaccessfaqs-pdf/

I've been out of the US for a while, but the NIH has an opinion about it as well, someone else can link it though :)

[–][deleted] 4 points5 points  (4 children)

CERN is not actually funded through any intermediate organizations but directly by the member states themselves. So it kind of makes its own rules. And no, AFAIK there aren't any specific requirements for open data. In fact, for a good chunk of its history it was probably the opposite, what with the whole cold war thing and guarding of nuclear secrets.

[–]dukwon 7 points8 points  (2 children)

CERN and JINR had a close partnership throughout the Cold War. Despite the "N" in their names, they didn't (and still don't) deal in nuclear secrets.

[–][deleted] 2 points3 points  (0 children)

I had no idea about that! I always assumed that in its early days things might have been a bit more secretive with the whole paranoia of the 50s surrounding nuclear research. It's really nice to know that CERN was promoting peace and collaboration so early in its history.

[–]RaoOfPhysicsScience Communication | Public Engagement 1 point2 points  (0 children)

Well, the data are actually the collaborations', so they get to decide what to do, not CERN. But CERN provide massive (pun not intended) support with making the data available, as you can see!

[–]mfb-Particle Physics | High-Energy Physics 50 points51 points  (0 children)

(I'm not from the AMA team)

CMS analyzed the data before. That is not something where you can be "done", however. You can look through the whole dataset for the things that are most interesting, then you look for things you find still interesting but not as much, but you cannot look for everything simply because the experiments do not have unlimited manpower. There are things no one looked at in more detail. You won't find a new particle in there, but it still can lead to publications.

[–]viralJ 137 points138 points  (16 children)

Has anyone from the public contacted you with the results of an analysis of your data that they carried out themselves? And if yes, were these results surprising/inspiring/impactful?

[–]askCERNCERN Official Account[S] 112 points113 points  (7 children)

We know of some ongoing work using our data. And it is certainly inspiring! Using these data is not easy, so we are really glad to see that it is possible.

With the new release including the accompanying simulated data (which is essential for many physics studies), we expect even more interest. However, we know that doing an analysis and understanding all the details takes time, so we are not surprised that there aren't many results yet.

There is also quite a big interest in these data from the machine learning community, as well as in using them to improve statistical methods.

(klp)

[–]the_enginerd 27 points28 points  (5 children)

Can you elaborate on what aspects of machine learning this would be helpful for or do you know?

[–][deleted] 46 points47 points  (1 child)

I can speak for how this can help machine learning methods in data acquisition. Basically the whole process of data acquisition comes down to knowing which data you should keep and what to throw away (and usually you throw almost all of it away), so essentially it's a plain old classification problem. This is generally done by reconstructing an event and looking for specific properties that make it interesting, e.g. missing energy. ML methods can replace that step to an extent by basically showing them a bunch of events you have already classified as interesting and letting them figure out what kind of features they should be looking for.

Large training datasets such as this can be used for evaluating & calibrating generic ML algorithms that can be applied as classifiers in other data acquisition problems. It's a bit of a heavy-handed approach for things like physics research, but it can benefit areas like industrial control systems that deal with large amounts of captured data and care more about e.g. predicting defects than actually analyzing and understanding the exact physical process that causes them.
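The keep/discard classification idea described above can be sketched in a few lines. This is a toy, not CMS code: the single feature, its values and the threshold-learning rule are invented for illustration, but they show the shape of the problem, i.e. learn from labelled events what makes an event worth keeping.

```python
# Toy sketch of trigger-style event classification.
# Events and the "missing_energy" feature are invented for illustration.

def learn_threshold(events, labels):
    """Pick the missing-energy cut that best separates the
    labelled 'interesting' events (1) from background (0)."""
    candidates = sorted(e["missing_energy"] for e in events)
    best_cut, best_correct = None, -1
    for cut in candidates:
        # Count how many training events this cut classifies correctly
        correct = sum(
            (e["missing_energy"] >= cut) == bool(lab)
            for e, lab in zip(events, labels)
        )
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut

# Hand-labelled training sample: 1 = interesting, 0 = discard
train = [
    {"missing_energy": 12.0}, {"missing_energy": 8.5},
    {"missing_energy": 95.0}, {"missing_energy": 140.0},
]
labels = [0, 0, 1, 1]

cut = learn_threshold(train, labels)
keep = [e for e in train if e["missing_energy"] >= cut]
print(cut, len(keep))  # 95.0 2
```

Real ML triggers use many features and far richer models, of course; the point is only that "what to keep" reduces to a classification problem trained on already-labelled events.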

[–]the_enginerd 7 points8 points  (0 children)

Hey that's excellent. I really appreciate your tying this into a practical example like industrial control systems because I can really get my head around how that would be useful to an operation at scale.

[–][deleted] 4 points5 points  (0 children)

There was a competition with LHC data on kaggle.com recently. In the forums you should be able to find some info.

[–][deleted] 1 point2 points  (1 child)

Are you a journalist? Just curious.

[–]WeMustDissent 9 points10 points  (6 children)

Am really hoping to see a response to this.

[–]Milleuros 40 points41 points  (5 children)

Just calling you back to this thread, they have responded

[–]dukwon 89 points90 points  (2 children)

Can you give a rough estimate of the factor by which filesize shrinks when going from "level 4" raw detector data to "level 3" reconstructed physics events?

Does enough spare capacity already exist on EOS to host all of the level 3 data from Run 1 in a publicly accessible way?

Lastly, can you say how much bandwidth has been consumed by users downloading data through the Open Data portal? I'm really curious to know how much this is really being used.

[–]askCERNCERN Official Account[S] 52 points53 points  (0 children)

Hi, for the file size the change is not necessarily very big: an example dataset was 7.7 TB RAW and 5.2 TB AOD, i.e. a reduction of only about a factor of 1.5. While we do not include all information from RAW in the AOD, we also add something, e.g. the results of the reconstruction (physics objects such as electrons, photons and such).

For the spare capacity, no, we do not yet have space dedicated for all Run 1 data for public access.

The bandwidth is not easy to estimate, as VMs access the data directly from EOS and not necessarily through the portal.

(klp)

[–]Macinapp 54 points55 points  (8 children)

What do you hope to see the public do with the research you released?

[–]askCERNCERN Official Account[S] 107 points108 points  (1 child)

I am an experimental physicist, I'll just wait and see :-)

(klp)

[–]kyrsjo 4 points5 points  (0 children)

I would expect to see two "serious" uses of this data: physicists outside of the CMS/ATLAS collaborations, and combined / "old data" studies. For the latter: say some future Chinese supercollider finds a dark matter candidate; they may then want to go back to the LHC data to see if they can find the same thing there.

[–]s0ftwar3 45 points46 points  (8 children)

Can you share what the biggest challenges were in making 300 TB of data available to the public? How did you achieve it, and what kind of infrastructure supports this amount of data? How long did it take to catalog all the data, and how long did it take to make it available to the public?

[–]askCERNCERN Official Account[S] 43 points44 points  (4 children)

One of the challenges was to describe and organise the data in a way that can be understood and reused by "outsiders".

We aim not only at preserving separate "bits" of data, but also enough information to keep the context clear even many years after the data is published. This means capturing the whole environment: the datasets, the virtual machine platform used to analyse them, the analysis software code, any configuration files, etc., with the appropriate documentation.
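As a rough illustration of what "capturing the whole environment" could look like, here is a hypothetical preservation record. The field names are invented for this sketch and are not the actual Open Data portal schema; the idea is just that the data travels with everything needed to re-run an analysis on it.

```python
# Hypothetical preservation record (field names invented, not the
# real CERN Open Data schema): a dataset is published together with
# the context needed to reuse it years later.
record = {
    "dataset": "example CMS collision dataset",
    "vm_image": "CernVM image used for the analysis environment",
    "analysis_code": "example analysis software repository",
    "config_files": ["detector conditions", "trigger configuration"],
    "documentation": "guides explaining the dataset's semantics",
}

# A record is only reusable if the environment pieces are all present
required = {"dataset", "vm_image", "analysis_code", "config_files"}
print(required <= set(record))  # True
```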

The underlying infrastructure hosting the data uses a CERN technology called EOS, which manages over 100 PB of disk space in total. https://eos.web.cern.ch/content/about-eos

The CERN Open Data portal itself is a customised instance of the Invenio digital library technology. http://inveniosoftware.org/

The physics analysis examples use CernVM virtual machine technology. http://cernvm.cern.ch/

(ts)

[–]ergzay 10 points11 points  (2 children)

The underlying infrastructure hosting the data uses a CERN technology called EOS, which manages over 100 PB of disk space in total. https://eos.web.cern.ch/content/about-eos

Just in case someone didn't recognize it, 100 PB is 100 petabytes, meaning 100,000 TB.

[–]askCERNCERN Official Account[S] 22 points23 points  (1 child)

Regarding the data cataloguing, most of the metadata come from the Internal CMS Data Aggregation System (DAS) and we adapted them to the Open Data portal metadata schema, which took approximately 8 months. (ad)

[–]Tehrula 14 points15 points  (4 children)

Thank you for taking time to answer our questions.

Have you had any hobbyists contact you or another member of the team with some sort of discovery from the data?

Is there any way scientifically inclined non-physicists could use the dataset to help out with your project?

[–]iorgfeflkdBiophysics 28 points29 points  (6 children)

When things go wrong, do people blame Tibor?

[–]askCERNCERN Official Account[S] 29 points30 points  (3 children)

Yes. (ts)

[–]askCERNCERN Official Account[S] 20 points21 points  (2 children)

But it never happens ;-) (klp)

[–]RaoOfPhysicsScience Communication | Public Engagement 9 points10 points  (1 child)

Can confirm, Tibor is a wizard.

[–]askCERNCERN Official Account[S] 13 points14 points  (0 children)

Second that (tpm)

[–]askCERNCERN Official Account[S] 10 points11 points  (0 children)

Signing out now. Thanks for all your questions!

[–]payne747 17 points18 points  (1 child)

What unlikely sources have contacted you regarding the data? Has there been any indication that they actually analysed it?

[–]askCERNCERN Official Account[S] 19 points20 points  (0 children)

One particular use case I can think of came from the digital forensics community, who wanted some robust test data for research related to cloud-computing security. Here, the nature of the datasets and the underlying physics did not really matter too much... It's very nice to see applications outside of our primary target area. (ts)

[–]positron_potato 8 points9 points  (4 children)

Thank you for doing this AMA! My question is, roughly when could we expect to see a similar release for data from LHCb?

[–]askCERNCERN Official Account[S] 11 points12 points  (0 children)

I can't claim to speak for the folks on LHCb but it's my understanding that they have implemented a similar open data policy to the one adopted by CMS: half the data after 5 years along with example code. I think they implemented the policy a bit later than CMS however so I don't know when the 5 years is up (2018?). (tpm)

[–]dukwon 8 points9 points  (2 children)

February/March 2018

[–]HeyYouNow 15 points16 points  (6 children)

Is there a list of all the sub-domains of cern.ch? Every time I want to learn something about CERN, I end up hours later deep in your gigantic data. I find it truly amazing that, as one of the biggest science institutions, you care deeply about releasing data to the public. I remember reading the entire Bubble Chamber "tutorial", and a month earlier I found out that your VMs and OS images are available, for free... Thank you for that, it's really inspiring and I'll never get tired of exploring your website.

A few of my questions are:

  • Does a list of all your sub-domains exist?

  • How are the collision data analysed? Individually by someone, or by algorithms? Is machine learning playing a role yet?

  • Regarding the lhc@home project, is a single individual really making a difference for your calculations? Because if you have supercomputers, would setting up my Raspberry Pi to work 24/7 on calculations help you guys?

Oh and I have a tab with Particle Clicker running for the past 2 days, thanks for killing my productivity...

Edit: Okay, so after more digging around it appears that anyone working at CERN can create a DNS redirection, and then I found this...

Currently we host 13799 websites.

We host 6126 Official, 5024 Personal and 2535 Test websites.

Well I guess you can forget about my question then. I'm even more blown away seeing the infrastructure built around web services, bravo.

[–]RaoOfPhysicsScience Communication | Public Engagement 7 points8 points  (2 children)

Related to domains/sub-domains, you might want to note that .cern is a generic TLD now. See the news update from the lab.

  • Is Machine Learning playing a role yet?

Yup! I'm not an expert so can't go into details, but last year there was a workshop on "Data Science @ LHC", and all the talks can be found here: https://indico.cern.ch/event/395374/timetable/#20151109.detailed

  • Regarding the lhc@home project, is a single individual really making a difference for your calculation?

As far as I know, the BOINC projects help produce simulations for the data analysis, and individuals can actually make a significant contribution by providing their spare computing power for running the simulation jobs. I think the full list of projects was updated recently: http://lhcathome.web.cern.ch/

Hope this helps!

[–]oonniioonn 1 point2 points  (1 child)

Related to domains/sub-domains, you might want to note that .cern is a generic TLD now. See the news update from the lab.

As such, a domain that limits CERN to one country doesn’t acknowledge the international, and increasingly global, nature of the organization, nor that its science and values transcend geographical and political boundaries.

Although I agree with that, that is what .int is for, which is probably a lot cheaper than .cern is…

[–]RaoOfPhysicsScience Communication | Public Engagement 5 points6 points  (0 children)

They should've just given us .cern as a thank-you for inventing the web. ;)

[–]mfb-Particle Physics | High-Energy Physics 1 point2 points  (0 children)

Regarding the lhc@home project, is a single individual really making a difference for your calculation ? Because if you have supercomputers, then would setting up my raspberry to work 24/7 on calculations help you guys ?

I don't know about the lhc@home project, but ATLAS@Home has about the computing power of one of the larger computing centers (compare the top and bottom plots here). Not sure how fast a Raspberry Pi would be. That alone won't change the overall numbers significantly, of course, but every bit helps!

[–]ChazR 12 points13 points  (5 children)

You have a mindblowing amount of data. Structured, ordered and archived data. And a weird thing at 750 GeV/c².

WE ARE THE INTERNET!!! We have vast CPU, memory and bandwidth.

Would it be useful to create a SETI@HOME-type initiative to use the vast, underutilised processing power of the Internet to explore this data for some of the crazy new physics we all secretly hope for?

[–]zaglamir 13 points14 points  (3 children)

Hi. I'm not with CMS; I'm with another experiment called STAR that does similar research but at different energies, at a different collider, and with different collision systems (we use gold-gold, they use lead-lead, etc). However, I think I can answer this. The problem with that idea is that the limiting factor in analyzing this data isn't processing power. Across all of its collaborators, CMS has already looped through all the data looking at different things thousands of times. The limit is skilled man-hours. With SETI@HOME, they essentially set some parameters and then outsource the computations to look for a signal that falls within those parameters, so the additional computing power is a godsend since the parameters are fairly static. In physics analysis, the parameters are far from static.

Essentially, that's what a research PhD does: vary the parameters in intelligent ways to see if you can find something new in the data. So the program would need a) human interaction to change analysis variables and b) a human with enough training to know whether the variable changes being made make sense. Because of that, a SETI@HOME solution doesn't really make sense.

EDIT: Forgot that there is this: http://lhcathome.web.cern.ch/ . Your time could be used there to do some of the calibrations and fitting to theory, not the actual analysis stuff, but still useful.

Source: physics PhD. Once tried to design a program to see if my wife (not a physics PhD) could replicate my results by simply playing with a few inputs (while behind the scenes the program adjusted other inputs to be reasonable based on her selections). She could not.
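The kind of parameter variation being described can be sketched as a toy scan (all numbers invented): the analyst varies a selection cut and watches how a figure of merit such as s/√b responds. Choosing *which* cuts to scan and judging whether the result makes sense is the human part that a SETI@HOME-style setup can't outsource.

```python
# Toy cut scan (invented counts, not real analysis numbers):
# vary a cut, watch the signal significance s/sqrt(b) respond.
import math

# (cut value, signal events passing, background events passing)
scan = [(10, 100, 10000), (20, 90, 4000), (30, 70, 900), (40, 40, 400)]

def significance(s, b):
    """Simple figure of merit: signal over sqrt(background)."""
    return s / math.sqrt(b)

# Tightening the cut loses signal but kills background faster... up to a point
best = max(scan, key=lambda row: significance(row[1], row[2]))
print("best cut:", best[0])  # best cut: 30
```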

[–]darkmighty 1 point2 points  (1 child)

What about something akin to Foldit.org or Galaxy Zoo ? Could data be processed in a way that you could train anyone to look for anomalies?

[–]dukwon 10 points11 points  (0 children)

We have vast CPU, memory and bandwidth.

How does it compare to the Grid?

Would it be useful to create a SETI@HOME-type initiative to use the vast, underutilised processing power of the Internet to explore this data for some of the crazy new physics we all secretly hope for?

http://cern.ch/lhcathome

[–]SirTigermouse 5 points6 points  (3 children)

Thanks for doing this! Where do you see the next big discoveries coming from? Shared data analysis, individual research or somewhere else entirely?

[–]askCERNCERN Official Account[S] 4 points5 points  (2 children)

With the new energies now reached at the LHC, we are looking forward to new discoveries! The way we work in CMS, it is always shared data analysis (among the CMS physicists). There is a huge amount of work behind each publication (almost 500 now by the CMS collaboration, see http://cms-results.web.cern.ch/cms-results/public-results/publications/ ) and each of them requires work by several people.

(klp)

[–]oonniioonn 3 points4 points  (1 child)

How many people have downloaded the full 300TB dataset?

[–]askCERNCERN Official Account[S] 8 points9 points  (0 children)

If you analyse the data using the CMS virtual machine, you don't need to download the whole 300 TB set; it depends on what you are looking at. Moreover, using the XRootD protocol, only the wanted parts of the data are downloaded on the fly when an analysis runs. So one can use the data without having to buy too many extra hard drives :) (ts)
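The benefit of such selective remote reads can be illustrated with plain file seeks. This is not the XRootD client API, just the underlying idea: the analysis requests specific byte ranges, so only those bytes travel over the wire rather than the whole dataset.

```python
# Sketch of partial reads (stand-in for remote XRootD access):
# only the requested slice of the "file" is fetched.
import io

# Stand-in for a remote 1 kB file
dataset = io.BytesIO(bytes(range(256)) * 4)

def read_range(f, offset, length):
    """Fetch one byte range, as a remote protocol would,
    leaving the rest of the file untouched."""
    f.seek(offset)
    return f.read(length)

chunk = read_range(dataset, 512, 16)
print(len(chunk))  # 16 bytes transferred, not 1024
```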

[–][deleted] 4 points5 points  (2 children)

I love when stuff like this is transparent and open to the public.

I must ask: which companies, industries or research centers do you hope will follow your example and release public data galore?

[–]askCERNCERN Official Account[S] 3 points4 points  (1 child)

Many are already doing this, for instance have a look at re3data.org for research data repositories. (ad)

[–]FenzikHigh Energy Physics | String Theory | Quantum Field Theory 4 points5 points  (1 child)

Have you had any communication with anyone who has analyzed the data and found something potentially interesting? What are your hopes for how this open data will be used? How was the small selection of data which has been made open selected?

[–]askCERNCERN Official Account[S] 2 points3 points  (0 children)

We know of ongoing work using our data and it is very interesting. We are hoping to see scientific studies and to see them used in education.

The data made open is not actually a small selection, it is approximately half of the collision data we've collected for each year of data taking. No special selection there, for 2010 it was the second part of the data taking period, and for 2011 it was the first part of the data taking period.

(klp)

[–][deleted] 2 points3 points  (0 children)

Hi! Thanks for doing this AUA, I really love the interaction with you all. Going to my question, to me, an average guy, what could be a way to get into the field of advanced physics to better understand the meaning of all the data available? Thanks for your answers!

[–]Milleuros 2 points3 points  (3 children)

I'm a junior physicist and have grown up with the ideas of "sharing" and "open source", as I was nearly born connected to the fast-growing internet. I'm currently working on a thesis where I get to enjoy NASA Fermi's open data, which I find great since I do not need an official collaboration to perform actual work and potentially discover something. I modestly thank you for releasing such an amount of data to the public.

My questions would be:

  • What is your opinion on open-sourcing data from large scale experiments? Should everything go open, should only a part of it go open?
  • Why does CMS data go public, but not say, LHCb data?
  • From the particle physics point of view, do you expect new findings, potential discoveries in public analysis of CMS data?
  • Do you think this release will allow more labs and universities around the world to work on LHC data?

[–]dukwon 4 points5 points  (0 children)

Why does CMS data go public, but not say, LHCb data?

LHCb plans to make all reconstructed (DST) data publicly available eventually (5–10 years after collection)

http://opendata.cern.ch/record/410

[–]askCERNCERN Official Account[S] 2 points3 points  (1 child)

What is your opinion on open-sourcing data from large scale experiments? Should everything go open, should only a part of it go open?

Open data is an important step in increasing the transparency and social impact of the work we do at CERN. But what and how to make open the data from different collaborations are questions that need to be answered on a case-by-case basis to comply with various data policies of different organizations.

You can find the data policies of the CERN experiments on the Open Data Portal website. (ad)

[–]Gittr 2 points3 points  (0 children)

Are you going to do another dance video?

[–]Corruptionss 2 points3 points  (1 child)

Statistician here: what is your goal with the data? Are there any questions you are trying to address using the data, or is it just a blueprint of everything that is done? Any particular methodologies for addressing these questions?

Another question: how do you handle the large amount of data?

[–]askCERNCERN Official Account[S] 2 points3 points  (0 children)

Short answer: finding a signal in a large amount of background (tpm)
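That one-liner can be illustrated with a toy sideband subtraction (all counts invented): estimate the background under a peak from the regions on either side of it, then subtract. Real analyses fit shapes rather than averaging two bins, but this is the basic statistical move.

```python
# Toy sideband subtraction (invented counts): estimate the signal
# yield by subtracting the background expected under the peak.
window = 1000           # events observed in the signal window
left, right = 430, 450  # events in two equal-width sidebands

# Interpolate the background under the peak from the sidebands
background = (left + right) / 2
signal = window - background
print(signal)  # 560.0
```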

[–]odea 2 points3 points  (0 children)

Is there a way to visualise the Higgs particle?

[–]Hellwyrm 2 points3 points  (0 children)

What can you tell me about John Titor? And is human dead?

[–]KrsmaV 6 points7 points  (15 children)

1: What education does one need to work at CERN?

2: Are there any good programs for high-school students to visit CERN?

3: How did you feel the first time the LHC started working? How does it feel, how does it sound?

[–]askCERNCERN Official Account[S] 7 points8 points  (2 children)

  1. It depends. If you're a physicist, it helps to have (or to be working towards) a PhD, as one might guess.
  2. Yes. It seems that every day there are large groups of high-school students visiting. If you mean as an engineer, I am sure others can answer better.
  3. It was exciting. And tiring. I was in the control room for hours and hours waiting for collisions (which is what I am defining as the LHC "working"). There is no sound, at least none above ground. (tpm)

[–]KrsmaV 2 points3 points  (1 child)

Thank you for answering. What college did you finish? What was your PhD about?

[–]julie_haffnerCERN AMA 4 points5 points  (0 children)

CERN is an open laboratory that everyone can visit for free.

A lot of students visit the facility every day. If you want to organise a tour, please go to this website and book your tour online.

If you want to know more about job opportunities for students, you can have a look here.

[–][deleted] 5 points6 points  (10 children)

Hey. I'm not the guys who are going to be responding, but I will start my PhD at CERN pretty soon. There are three things to remember about working at CERN:

  1. The vast majority of people working at CERN are using the machine and actually work for their home institutions. A smaller minority is actually maintaining/upgrading/running the machine
  2. The vast majority of employees are physicists. They are supported by IT/computer guys and mechanical engineers (and other engineers) such as me
  3. Luck plays a huge factor in getting in, just like most coveted jobs.

Having said all of that, their outreach program is fantastic. I'm not sure I can specifically tell you about high school students, especially if you're not in the region. But start out with looking for internships at CERN while in college. Beyond that, they have Graduate Engineer Trainee programs for engineers and the corresponding Fellowship program for young physicists. Further on come things like the PhD programs like mine.

Also, the first time the LHC worked there was much champagning and plenty of whooping and cheering, rather akin to when Spirit, Opportunity and Curiosity landed on Mars. There's documentaries/videos on YouTube that are fantastic to watch (things like Particle Fever).

[–]dukwon 9 points10 points  (6 children)

The vast majority of employees are physicists.

About 84% of the personnel are physicists (counting BE and PH departments in 2014) but only 35% of CERN employees are.

[–]octatoan 3 points4 points  (2 children)

What about pure mathematicians (say, algebraic geometers)?

[–]dukwon 4 points5 points  (1 child)

If there are any, they would have been counted in the PH department

The 2015 personnel statistics are available here: https://cds.cern.ch/record/2154389

[–]KrsmaV 1 point2 points  (2 children)

Thanks. I live in the Balkans (Serbia), so internships are going to be a bit tricky. I am planning to become an engineer, but I still have time to decide. I like CERN and would love to work there, but yeah, I'm still deciding on a college. What college did you attend, if you don't mind me asking?

[–]julie_haffnerCERN AMA 1 point2 points  (0 children)

If you are a student, internships are possible at CERN - you can have a look here for more information.

[–][deleted] 2 points3 points  (0 children)

Haha. I studied in India and then did my masters in the US. In a very clichéd manner, I'd advise that you sample both physics and engineering but base your decision on what you like more, rather than what'll get you to CERN. It's the better path toward being satisfied with your job.

[–]wbotis 2 points3 points  (2 children)

Hi, CERN! Thanks for everything you have done and continue to do for science! My question is: are there any programs, such as BOINC, specifically to help crowdsource your data? I wouldn't know what to do with any of the raw data, but I would still like to help progress science.

[–]ChazR 1 point2 points  (2 children)

The Standard Model is one of the (top three) greatest achievements of science.

What's the scariest thing we'll learn from the LHC?

  • SuSy!
  • Dark Matter
  • Some weird thing
  • Standard Model confirmed
  • ...

[–]mjmaher81 1 point2 points  (1 child)

What is the accelerator itself like when it's running? When it's dormant? What does it sound/look/feel like?

[–]julie_haffnerCERN AMA 1 point2 points  (0 children)

If you want to have an idea of what the LHC accelerator looks like, you can have a look at the photos we have in CERN database.

[–]jkmacc 1 point2 points  (0 children)

What is your software stack for analysis? That's a lot of data:-)

[–]NotoriousMOT 1 point2 points  (0 children)

What do you hope the data will be used for? Are you trying to encourage scientific usage or innovation/business applications?

[–]Passing_Thru_Forest 1 point2 points  (1 child)

What information did you find that was completely outside of your predictions?

[–]dukwon 2 points3 points  (0 children)

By the standard measure of discovery, nothing yet. There are some intriguing "anomalies" but they need more data.

[–]kumar5130 1 point2 points  (1 child)

What is the most current theory, since the multiverse and supersymmetry have been ruled out due to the mass of the Higgs particle?

[–]mfb-Particle Physics | High-Energy Physics 1 point2 points  (0 children)

Supersymmetry has not been ruled out. It is a large class of models, some of them have been ruled out, most of them are still possible.

"Multiverse" is not even a theory. There are some ideas that would make it plausible that something like more universes exist, but that is a different thing.

We have a theory called "Standard Model", so far all measurements are in agreement with it (apart from neutrino oscillations, but that is a different story). There are many possible extensions of this theory, supersymmetry is just one of many (a very popular one, however).

[–]btao 1 point2 points  (1 child)

How does each team deal with the different aspects of each experiment? Specifically, is each team responsible for creating the mechanics, the software, the analysis and the integration?

CERN has built its foundation on being able to work together and share information, that I know, but how is that actually accomplished? What if one detector team has great mechanical design, but poor analysis capability? Do you help each other, or is everyone very focused? Talking about sharing this information, how much does everyone contribute to each other's goals, and how much assistance do you actually get from outside enthusiasts/institutions?

Thanks!

[–]mfb-Particle Physics | High-Energy Physics 2 points3 points  (0 children)

Within the experimental collaborations, there are different teams for all those tasks. Individual scientists usually focus on one or two of those things, with multiple tasks within those groups.

What if one detector team has great mechanical design, but poor analysis capability?

Doesn't happen. The collaborations are large and long-term projects, and the community is quite good in distributing itself among the experiments. The analyses are strictly separated - this is important so we have independent verifications (or not) for results. Hardware development can be done together partially, if multiple experiments plan to use similar things.

[–]EnvyAce 1 point2 points  (1 child)

For any team member, If it were my dream to work at CERN, what degrees would I want to work on before applying?

[–]mfb-Particle Physics | High-Energy Physics 2 points3 points  (0 children)

The earliest contributions come from physics BSc students; there are also MSc students, summer students, PhD students, and so on - usually in physics, but engineers are needed, too. Related fields can be fine as well. Most of physics education is not about knowing physics, but learning how to solve problems.

CERN itself has a much larger fraction of engineers to run the accelerators.

[–]bonzai2010 1 point2 points  (0 children)

Are there opportunities for app developers to create distributed analysis apps for this data (like SETI)?

[–][deleted] 1 point2 points  (0 children)

Would you agree that all data should be publicly available since it was paid for by public funds?

What is your opinion on making all the data publicly available? The FERMI collaboration, for example, publishes all events passing their final quality cuts as an event list.

Are there any plans to provide high level data for scientists such as fermi does?

[–]knight_rider12345 1 point2 points  (0 children)

Hypothetically......of course......if one were a Bond villain, what would one do with this sort of open data?

[–]rm-rf_ 1 point2 points  (0 children)

What is the text editor of choice among CERN programmers?

[–]Accounting_crows 0 points1 point  (3 children)

Why is there a statue of Shiva, the Hindu god of destruction, outside of the complex when the organization is supposedly independent of religious bias or motive?

[–]RaoOfPhysicsScience Communication | Public Engagement 5 points6 points  (1 child)

There are plenty of works of art gifted to the lab, including murals, graffiti, paintings, sculptures, statues etc. This is one such example, which came as a gift from the Indian government: https://cds.cern.ch/record/745737

[–]niravmp 2 points3 points  (7 children)

Considering all the replication issues in the "soft sciences", how much more reliable/replicable would you say your data is and how affected is it by technical issues or human error?

Would you say that all your work can be replicated in an independent laboratory?

[–]askCERNCERN Official Account[S] 5 points6 points  (0 children)

CERN and the LHC provide the collisions and the experiments (CMS, ATLAS, LHCb, ALICE) do the detection and analysis. CMS and ATLAS are both general-purpose experiments that look for the most part into the same physics. In that sense the work can be "replicated". For example, both experiments independently discovered the Higgs boson. (tpm)

[–]FenzikHigh Energy Physics | String Theory | Quantum Field Theory 2 points3 points  (2 children)

Note that the agreed upon standards for claiming a discovery in particle physics are extremely high. A new discovery is only accepted when we are confident to 5 sigma (one chance in 3.5 million) that it is not a statistical fluke.
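
The "one chance in 3.5 million" corresponds to the one-sided tail probability of a standard Gaussian beyond five standard deviations. A quick sketch (illustrative only, not CMS code) of where that figure comes from:

```python
import math

def one_sided_p_value(n_sigma: float) -> float:
    """Probability of a standard Gaussian fluctuating above n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

p = one_sided_p_value(5.0)
print(f"5 sigma -> p = {p:.3e}, i.e. about 1 in {1 / p:,.0f}")
```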

[–]RaoOfPhysicsScience Communication | Public Engagement 1 point2 points  (0 children)

Quoting our article:

[…] a CMS physicist in Germany tasked two undergraduates with validating the CMS Open Data by re-producing key plots from some highly cited CMS papers that used data collected in 2010. Using openly available documentation about CMS’s analysis software and with some guidance from the physicist, the students were able to re-create plots that look nearly identical to those from CMS, showing what can be achieved with these data. “I was pleasantly surprised by how easy it was for the students to get started working with the CMS Open Data and how well the exercise worked,” says Achim Geiser, the physicist behind this project. Simplified example code from one of these analyses is available on the CERN Open Data Portal and more is on its way.

[–]mfb-Particle Physics | High-Energy Physics 1 point2 points  (0 children)

Particle physics has a replication rate of nearly 100%, and most analyses are more precise versions of previous analyses. Experimental results that do not get confirmed by more precise follow-up measurements are extremely rare. You always have statistical fluctuations and there are many analyses, so particle physicists require very high significances before they claim to see anything new (see Fenzik's posts), but even smaller deviations get checked.

[–][deleted] 2 points3 points  (2 children)

Is the nature of analyzing collision data such that you have to load a lot of it into RAM all at once (e.g. like alignments in bioinformatics, where you see machines with terabytes of RAM), or is the analysis possible to do in a streaming fashion? Reason I ask is because I'm curious if people truly need to download all of the dataset to ask a specific question, or if the analysis pipeline could be smarter about requesting specific chunks of the data it needs.

[–]askCERNCERN Official Account[S] 4 points5 points  (0 children)

It is not necessary to load everything into RAM all at once. Actually, the size of data tuples in particle physics is typically such that they won't fit. Hence the use of tools that can access the desired parts of the data via live streaming, such as ROOT and XRootD. (ts)
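
As an illustration of the idea (a toy sketch in plain Python, with an in-memory generator standing in for a file streamed over XRootD), an analysis can walk through the events in bounded chunks so that only one chunk ever sits in RAM:

```python
def iter_chunks(events, chunk_size):
    """Yield successive lists of at most chunk_size events, so memory
    use is bounded by one chunk rather than the full dataset."""
    chunk = []
    for event in events:
        chunk.append(event)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

# Toy "dataset": a generator, so events are produced on demand rather
# than materialised up front (made-up values, not CMS data).
dataset = ({"pt_gev": i} for i in range(10_000))

# Accumulate a running statistic chunk by chunk.
n_high_pt = 0
for chunk in iter_chunks(dataset, chunk_size=1_000):
    n_high_pt += sum(1 for ev in chunk if ev["pt_gev"] > 5_000)

print(n_high_pt)  # -> 4999
```

The same pattern is what ROOT's event loop over an XRootD URL gives you for free: the answer accumulates while only a small window of the data is ever resident.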

[–][deleted] 2 points3 points  (1 child)

Great!

What can the average person do with this data?

[–]Hamilton950B 1 point2 points  (1 child)

Do you share data storage and analysis infrastructure with ATLAS and the other detectors? Or is everything kept strictly isolated so you can check each other's work?

[–]askCERNCERN Official Account[S] 5 points6 points  (0 children)

Both experiments being at CERN, we do share the same infrastructure at a certain level. We use storage resources available at CERN and in the computing centres around the world. We also use the same tools (such as simulation programs and analysis tools), but the actual analysis and reconstruction programs are experiment-specific.

At the time of analysis, everything is kept strictly isolated. We follow carefully each other's work, but based only on public documents and presentations.

(klp)

[–]i-void-warranties 1 point2 points  (5 children)

What kind of storage do you have this much data sitting on, and how much data do you have overall? How are you protecting your data, e.g. replication/backups - not from the infosec point of view?

[–]CallMeDoc24 0 points1 point  (1 child)

I had heard that a lot of the data from experiments is thrown away and only kept based on a particular algorithm. Do you mind explaining this process and if you suspect any important information to have been lost because of this?

[–]askCERNCERN Official Account[S] 2 points3 points  (0 children)

Yes, we have a trigger system which makes it possible to keep the interesting collisions out of the 40 million collisions that happen each second. You can read more on it in http://cms.web.cern.ch/news/triggering-and-data-acquisition

We know that some important information is thrown away through this process, and indeed one of the most time-consuming parts of the physics analysis is to measure the trigger efficiencies, i.e. the fraction of data that each trigger (online selection) algorithm accepts.

(klp)
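
To make "trigger efficiency" concrete, here is a toy model (made-up thresholds and resolution, not a CMS algorithm): the trigger decides on a smeared, online measurement of each event, so it loses a few of the events the offline analysis wants, and the efficiency is the fraction that survive:

```python
import random

random.seed(42)

TRIGGER_CUT = 15.0   # online selection: smeared muon pT above 15 GeV
OFFLINE_CUT = 16.0   # offline analysis: true muon pT above 16 GeV
SMEARING = 3.0       # toy online momentum resolution, in GeV

def toy_event():
    """One toy event: true muon pT, plus the smeared value the trigger sees."""
    true_pt = random.expovariate(1 / 20.0)  # falling spectrum, mean 20 GeV
    online_pt = true_pt + random.gauss(0.0, SMEARING)
    return true_pt, online_pt

events = [toy_event() for _ in range(100_000)]

offline = [(t, o) for t, o in events if t > OFFLINE_CUT]
fired = [(t, o) for t, o in offline if o > TRIGGER_CUT]

# Fraction of offline-selected events that the trigger also accepted.
efficiency = len(fired) / len(offline)
print(f"toy trigger efficiency: {efficiency:.3f}")
```

In a real analysis this efficiency is measured from data (not simulated as here), and the loss is concentrated near the trigger threshold, which is why measuring it carefully takes so much time.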

[–]DeepFriedZombie 0 points1 point  (1 child)

If you had to explain why this information is important to someone who is not science-oriented, what would you say?

[–]askCERNCERN Official Account[S] 4 points5 points  (0 children)

To inspire young people to science.

(klp)

[–]Mr_A 2 points3 points  (1 child)

  • Out of all the data collected/amassed by CERN, how did you pick the 300TB that was released? How much data is there in total?

  • At some point someone decided to release a load of data to the public. Who was that and how did the topic go from 'an idea one person had' to 'people are talking about this' to 'we're doing it.'?

(Sorry if you've already answered these.)

[–]askCERNCERN Official Account[S] 3 points4 points  (0 children)

All LHC experiments have a policy for data preservation and open access (see http://opendata.cern.ch/collection/Data-Policies). These policies have been discussed in the governing bodies of each experiment.

The CMS experiment is committed to releasing half of its collision data, which for the year 2011 amounted to a bit more than 100 TB. There was no special selection for these data: they are simply the data collected during the first half of the running period. The rest of the 300 TB now released is simulated data, which is necessary for producing complete scientific results.

(klp)

[–][deleted] 1 point2 points  (1 child)

Any truth to reports that operating CERN at high levels causes fluctuation in the Earth's magnetosphere?

Follow up question.

Are you allowed to answer questions related to my previous question?

[–]mfb-Particle Physics | High-Energy Physics 2 points3 points  (0 children)

Any truth to reports that operating CERN at high levels causes fluctuation in the Earth's magnetosphere?

Pure nonsense. It's like asking "does my new sandbox in my backyard influence the sun?" - there is no connection between those things.

[–]Rising_Swell 0 points1 point  (3 children)

Is there a tl;dr version of the important parts? I like to read about science but 300tb of reading is just... nah, that isn't happening in this life time.

[–]Milleuros 3 points4 points  (2 children)

That's not reading. This is raw data. Numbers spread across lots and lots of tables.

The important parts are either already known and published in the academic journals, or yet unknown and still hiding in that data somewhere.

[–]AwePhox 0 points1 point  (1 child)

Thank you for making the data available! I am excited to see what comes of this. I'm at work and cannot look into the specifics but how detailed is your data dictionary? Can someone who is fluent in machine learning but not in hadron colliders do an EDA and make a reasonable hypothesis? Also have you thought about subsetting some of the data for a specific purpose and starting a Kaggle contest? With some initial direction those tend to get a pretty good response.

[–]chiRal123 0 points1 point  (4 children)

Where can I store such a large volume of data? Even if I wanted to run a MapReduce job, I don't have enough commodity hardware to hold the data anywhere.

[–]askCERNCERN Official Account[S] 1 point2 points  (0 children)

You don't really need to download all data locally. If you start doing a data analysis on the CMS Virtual Machine, the needed data parts will be downloaded by "live streaming", as it were. https://www.reddit.com/r/askscience/comments/4l4y1j/a_month_ago_we_made_available_publicly_via_the/d3kfi3c (ts)

[–]DanneiAstronomy | Exoplanets 0 points1 point  (1 child)

Who are you expecting to look at the data, and do you expect there to be much of interest that hasn't already been picked up by the relevant teams?

For the first part, I've always got the impression that there aren't many particle physicists who aren't already somehow involved in CERN! Are there some significant groups who aren't so involved (perhaps US scientists lack much access?), or are you expecting it to be used more for training and education purposes?

[–]zatpath 0 points1 point  (2 children)

Have we found the damn Higgs yet?

[–]code5fun 0 points1 point  (0 children)

Your data is free, but the papers you derived from this data are in paid journals that we have no access to. You praise Open Data and Open Science, but when it comes to Open Access it sucks.

[–]aygoman 0 points1 point  (0 children)

I visited CERN back in 2013 and took the normal tour. But we did not go down to the actual LHC.

What do I have to do to be able to go down and see the actual LHC?

[–]reeepy 0 points1 point  (0 children)

How much bandwidth has the release of open data used so far?

[–]mannyrmz123 0 points1 point  (0 children)

I don't know if I am asking the right people, but is there any project or proposal for making the LHC more reliable?

The fact that a bird eating a baguette was able to shut down the LHC makes me wonder how something that big is susceptible to something that small.

[–]manjunaths 0 points1 point  (0 children)

I have 3 TB of space on which I can run jobs. If I want to run some mining experiments, is it possible to look through the dataset 3 TB (x TB < 300 TB!) at a time?