comment by kleinbl00
kleinbl00  ·  2899 days ago  ·  post: How to kill a supercomputer

My dad's been bitching about ASCI Q for over 10 years. The article is oddly present-tense; experiences with ASCI Q and whatever the Livermore machine was basically pushed the labs out of supercomputers and into distributed computing. Not mentioned in the article is LANL's solution: kill any processor that errors out. Throughout its history, ASCI Q usually ran with about 20% of its chips offline. Effectively, they turned their giant box into a network.

It has also been suggested that LANL had to be more pragmatic than Livermore because, at 7200 feet, it had a lot more cosmic rays to deal with.
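
For anyone who wants to see the "kill any processor that errors out" idea in miniature, here is a rough sketch. It is not LANL's actual scheduler, which I know nothing about; `run_with_retirement`, `demo_run_task`, and the worker names are all made up for illustration. The only point is that a worker that throws is retired for good and its task is retried elsewhere.

```python
def run_with_retirement(tasks, workers, run_task):
    """Run tasks round-robin; permanently retire any worker that raises."""
    online = list(workers)   # workers still trusted to do work
    retired = []             # workers taken offline after an error
    results = {}
    pending = list(tasks)
    turn = 0
    while pending:
        if not online:
            raise RuntimeError("every worker has been retired")
        worker = online[turn % len(online)]
        task = pending[0]
        try:
            results[task] = run_task(worker, task)
            pending.pop(0)
            turn += 1
        except Exception:
            # Fail-fast policy: no second chances, the task is retried
            # on whichever worker is next in line.
            online.remove(worker)
            retired.append(worker)
    return results, online, retired


def demo_run_task(worker, task):
    # Hypothetical workload: "cpu1" stands in for a chip that errors out.
    if worker == "cpu1":
        raise RuntimeError(f"{worker} reported a hardware error")
    return task * 2


results, online, retired = run_with_retirement(
    range(6), ["cpu0", "cpu1", "cpu2", "cpu3", "cpu4"], demo_run_task
)
print(results, online, retired)
```

In this toy run one of five workers ends up offline and the job still finishes, which is the "giant box turned into a network" effect on a very small scale.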





alpha0  ·  2899 days ago

Fail-fast is great, but it depends on detecting failures in a timely manner. :) Some errors will affect the output (e.g. the result is off) even though the process itself hasn't failed.
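
One common way to catch that kind of silent error is to run the same computation on several replicas and vote. The sketch below is only an illustration under my own assumptions: `run_with_vote` and `demo_compute` are hypothetical names, and the flipped bit is simulated, not anything measured on a real machine.

```python
from collections import Counter


def run_with_vote(compute, x, replicas=3):
    """Run compute on several replicas and accept only a majority answer."""
    outputs = [compute(x, replica) for replica in range(replicas)]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= replicas // 2:
        raise RuntimeError("no majority; cannot trust any replica")
    # Replicas that disagree with the majority are the ones to take offline.
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects


def demo_compute(x, replica):
    # Hypothetical workload: replica 2 returns a wrong answer without
    # raising, which is exactly the case fail-fast alone would miss.
    result = x * x
    if replica == 2:
        result ^= 1 << 4  # simulate a single flipped bit
    return result


value, suspects = run_with_vote(demo_compute, 12)
print(value, suspects)
```

The cost is obvious: every answer is computed several times, so detection is paid for in throughput.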

Re. distributed computing: it's a bit fuzzy. Your multi-core laptop is in practical fact a "distributed system" but comes in a nice monolithic envelope.

kleinbl00  ·  2899 days ago

Hey, my dad does networking. The supercomputing side of things is someone else's problem.

Someone else's problem that he gleefully laughs and points at.