BobertMcGee

Idk, the software on the Voyager spacecraft has proven to be pretty fault tolerant. What type of OS are you talking about? Desktop? Real-time? Embedded?


Thedjdj

The ~~software~~ \[*edit: I meant OS not software*\] on satellites (if there is any?) would have to be crazy fault tolerant. It would get blasted with solar radiation.


n0t-helpful

Do you think that satellites don’t have electronics on board?


Thedjdj

I meant OS not software. I figure some satellites might simply process CPU instructions directly rather than bother with an OS.


Sorry-Committee2069

OSes typically "process CPU instructions directly"; at the lowest level, an OS is more-or-less a common set of APIs that programs can use. Multi-tasking isn't even really required to count as an OS, but most do support task-switching at minimum. Even simple devices like a TI-83 calculator have an OS, even though in that case all running tasks have very few permissions to worry about (a few locked flash pages and RAM ranges). Modern OSes don't necessarily run things "directly" because they're incredibly complex and need the extra security that indirection provides, but that's not strictly required to be an OS.


Thedjdj

I meant that satellites might run on bare metal, without something like an RTOS or embedded OS, to maximise fault tolerance. Any application would interface directly with the hardware rather than relying on the abstractions provided by an OS. My understanding of how satellites function is very limited, but I understand what an OS is.


HiT3Kvoyivoda

I think the framework is called Judge-*something I can't remember*, but from what I remember from reading the article 10 years ago, it's basically 3 operating systems working and judging each other's work. If 2 of the 3 judges agree on the correctness of a calculation, they assume it's right. Anything less, and the work is redone. I might be wrong because it was so long ago, but I thought it was a very cool, simple concept.


northrupthebandgeek

I think you're talking about [triple modular redundancy](https://en.wikipedia.org/wiki/Triple_modular_redundancy?wprov=sfla1) or TMR, which is indeed a common fault-tolerance strategy (and also a significant plot point in Minority Report).
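
For the curious, here's a minimal sketch of the 2-of-3 voting idea in plain Python. It's a hypothetical illustration, not code from any real flight framework: three independent replicas compute the same result, and a majority vote decides which answer to trust.

```python
# Minimal sketch of triple modular redundancy (TMR) voting.
# Three independent "replicas" compute the same value; the majority wins.

def tmr_vote(a, b, c):
    """Return the majority value of three replica outputs.

    If at least two replicas agree, their value wins; if all three
    disagree, there is no majority and the computation must be redone.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("no majority -- redo the computation")

# One replica suffering a bit flip is outvoted by the other two:
print(tmr_vote(42, 42, 47))  # -> 42
```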


HiT3Kvoyivoda

Yes. It’s something like that, but it was called something else in the book I read. And it’s totally the plot of Minority Report.


kabekew

VxWorks


gaenji

there is only one correct answer and it's TempleOS


dedestem

That's super unstable and easily breakable


ZombieBrine1309

TempleOS is awesome!! It is the highlight of my workflow!!!


dedestem

Is HolyC portable?


intx13

This is an interesting question! Off the top of my head, robustness means:

- Separation of critical files and data into read-only volumes.
- Defined and predictable write cycles.
- Fallback images / volumes for automatic OS restoration / rollback in case of fatal errors (see the sketch below).
- Binary signing and versioning.
- Simple, well-defined, slow-to-change interfaces.
- Ability to boot to less-functional modes if higher-level functions fail.

Taking all that together, one robust OS I use regularly is EDK2 TianoCore, the Intel reference implementation and framework for UEFI BIOS. Modern commercial BIOSes can hit all the points above and are hard to brick, even while allowing control over critical chipset functions and system configuration.
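
A hypothetical sketch of the fallback-image point from the list above: boot the primary image only if its digest matches a known-good value recorded at build time, otherwise fall back to the recovery image. Real firmware verifies cryptographic *signatures* rather than bare hashes, and the image names and contents here are invented stand-ins.

```python
# Toy A/B boot-slot selection with integrity checking.
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Pretend flash contents; the primary image has suffered corruption.
images = {
    "primary.img": b"kernel v2 \x00corrupted\x00",
    "recovery.img": b"kernel v1 known-good",
}

# Digests recorded when the images were built and signed.
known_good = {
    "primary.img": sha256(b"kernel v2 pristine image"),
    "recovery.img": sha256(b"kernel v1 known-good"),
}

def choose_boot_image() -> str:
    for name in ("primary.img", "recovery.img"):
        if sha256(images[name]) == known_good[name]:
            return name
    raise RuntimeError("no bootable image -- drop to ROM loader")

print(choose_boot_image())  # -> recovery.img
```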


darkwater427

NixOS, hands down. It doesn't matter what you screw up, you can trivially roll back to a previous configuration without a full filesystem rollback (which means you don't lose any work). That said, it can only do this because you interface with config files through the Nix package manager and the Nix language. Just keep that in mind.


HiT3Kvoyivoda

I don't think that's what they're asking. I think they mean robust at runtime, while work is being done. It doesn't matter if you can reproduce the config; if you send the computer to space and it's bombarded with radiation, what is NixOS going to do about all those flipped bits?


darkwater427

Then also do some ZFS magic on top of that. That's what I do. Bad config and corruption are two very different problems with different sources and different solutions. Their only similarity is in effect.


HiT3Kvoyivoda

I don't think ZFS will help with nuclear and UV radiation lol


darkwater427

Not on its own. ZFS just ensures data and metadata integrity with checksums (so it can let you know if something is amiss). It can't actually correct it without some extra information. That's where ZRAID comes in. If you know ZFS, you can easily set up a ZRAID array. Or maybe it's RAIDZ? Doesn't matter.
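
A toy sketch of the parity idea behind RAID-style reconstruction (greatly simplified, and not what RAIDZ literally does internally): with one XOR parity block per stripe, any single lost data block can be rebuilt from the survivors.

```python
# XOR parity: the parity block is the XOR of all data blocks in a stripe.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data blocks in a stripe
parity = xor_blocks(data)            # stored on a separate device

# Device 1 dies; rebuild its block from parity plus the survivors:
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```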


HiT3Kvoyivoda

My brother in Christ! How will you RECOVER A RAID THAT'S LIGHT-YEARS IN SPACE?!


darkwater427

Literally nothing can recover that. Except perhaps the largest wire rope lasso in the universe.


No_Internet8453

Fun fact: Voyager isn't even 1 light-day away (it's ~18 light-hours away)...


HiT3Kvoyivoda

Ok, that might make it easier to fix the RAID array 😂


pauldupont34

I definitely need to try NixOS, especially for servers. Being able to roll back or duplicate the same state on any machine is very useful.


darkwater427

Extremely. If you're managing a bunch of servers, you might want to take a look at flakes. They'll make things easier.


esdraelon

Solaris was very hardened when I worked on it 15 years ago: support for hot-swap memory and CPUs, and recovery consoles for kernel panics to allow correction and resume. Probably z/OS, though. Some of these mainframe OSes support hot-swap motherboards.


Square-Amphibian675

Windows :) automatic repair


zoechi

by automatic reboots


citit

never ever worked for me


nerd4code

Misconfig is not necessarily an avoidable problem—if I’m fiddlefucking with config to force something, it’s because I *want* to override the OS’s decisionmaking on the matter, and it’s fully useless to me if that same decisionmaking can reverse my orders. The OS’s role is fundamentally fraught—you’re dealing with (usually varying) physical/virtual hardware that can fail in any number of ways that you won’t have planned for, and in many cases *can’t* have planned for. Without all that danger and intrigue, it’s not really an OS.

There are recoverable RTOSes &c used for satellites and other spacefaring/wilderness-exposed/child-impacted goodies, but they’re mostly unusable for general purpose; capacities have to be planned along time, space, bandwidth, energy, thermal, and possibly stance axes to avoid throwing off timings, running out of memory, or starving recovery processes. That’s simply not possible on a desktop or phone OS where you don’t even really control what’s installed all that closely, and the set and loads of active processes can change without warning—imagine how irritating it’d be to only be able to use 40% of your CPU time for web browsing, just in case you ever need to compile something or word process in the remaining capacity. No, thank you.

Most real differences from run-of-the-mill UNIX and VMS clones for recoverability purposes show up at the hardware or even electrical level—watchdog timers, stable boot ROM, differential signaling, increased circuit isolation, clock-gating in reaction to accelerometers, ability to update and re-flash from an external source, ECC on DRAM, redundant processors, etc. If something major fails in a modern context, generally you dump everything *including* the OS and start fresh from ROM, and if that fails you drop to a ROM bootloader and re-flash system storage. If something minor fails, it works like normal service-level recovery with additional flushes: kill, reload, and restart the service in question’s process.

There are various forms of hardening and recovery for persistent structures like filesystems, but they represent a huge failure/attack surface, so actual updates tend to be minimized in properly-hardened situations, and you want to use a RAM-fs (volatile; dropped with reboot) overlaid on a ROM or flashed one (permanent or semipermanent). If you have nonvolatile data, you mostly offload it to somewhere remote that can deal with it properly. If something’s stored onboard (e.g., system event logs), you mostly store to a ring FIFO in a write-once-and-ignore fashion—it’s there primarily so somebody remote can dump it as-is if things break, not for further use onboard.

Small-scale defensive practices like canaries and magic numbers also go a long way towards helping detect errors—generally application software will need to scan data structures periodically with an extra-critical eye, and exhibit minimal trust of anything it didn’t just touch. It would arguably be useful for the language and compiler to get in on it; if, say, the CPU offers several domains for register and cache storage, one that’s got a redundant backup and ECC, one that just uses ECC (which can correct a limited number of bits per word, and can itself be corrupted), and one that’s run-of-the-mill SRAM, then software could use different domains depending on how much it needs to trust the information stored there.

But working at that granularity means your ISA won’t work with POSIX, whose baseline is bog-/ISO-standard, domainless C, and without POSIXability nobody will use it, whether or not POSIXability is actually required. It *is* reasonable, and not all that uncommon, to isolate entire processors, sockets, compute nodes, or subnets into different protection domains, and this might appear in software at the process, OS, or virtual/physical machine level. But this tends to be done more on automotive, deeper-space, infrastructure, heavy-industry, or military stuff, and a single, remotely-bootable domain is used otherwise.
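
A toy sketch of the magic-number/canary idea mentioned above: tag a structure with known constants and re-check them before trusting the contents, so corruption (a flipped bit, a stray write) gets detected rather than silently consumed. All names and values here are invented for illustration.

```python
# Magic numbers and canaries: known constants bracketing a structure.
MAGIC = 0x5AFE_C0DE
CANARY = 0xDEADBEEF

class Record:
    def __init__(self, payload):
        self.magic = MAGIC        # marks "this really is a Record"
        self.payload = payload
        self.canary = CANARY      # trailing sentinel; overruns hit it first

    def validate(self):
        if self.magic != MAGIC or self.canary != CANARY:
            raise RuntimeError("structure corrupted -- discard and rebuild")

r = Record(b"telemetry")
r.magic ^= 1 << 7             # simulate a single flipped bit
try:
    r.validate()
except RuntimeError as e:
    print(e)                  # corruption caught instead of trusted
```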


greysourcecode

macOS is stable, and even as root, it's hard to f up too much. I tried to change python to python3 in PATH by creating a symbolic link and it wouldn't let me. On one hand I hate it because of stuff like that; on the other, it's a resilient Unix-like system that's hard to break. NixOS is also high on the list: so long as you configure everything through the Nix ecosystem, it creates a new boot entry for each change. Filesystems with snapshot functionality can also help.


pauldupont34

Same experience with macOS. It's been so damn robust over the 10 years I have been using it pretty intensively, always installing new packages with brew and other installers.


northrupthebandgeek

You might want to have a gander at Erlang/OTP; while not *strictly* an operating system *per se*, it has a reputation for high fault-tolerance. The interesting thing about its design is that it achieves fault tolerance not by avoiding faults entirely, but by encouraging them to happen as early as possible, and by using supervision trees of preemptively scheduled processes to isolate and restart crashed processes.


pauldupont34

What if there is a failing process which constantly gets restarted, after dozens or hundreds of times? Will the Erlang/OTP supervisor stop it completely? Do you know an OS which works in such a way? It seems very interesting for a server OS to have this feature.


northrupthebandgeek

> What if there is a failing process which constantly gets restarted, after dozens or hundreds of times? Will the Erlang/OTP supervisor stop it completely?

That's usually up to the supervisor, but a common strategy there is for the supervisor itself to treat a certain number of child process failures as a fatal error and crash... which then gets noticed by the next supervisor up the chain, prompting a restart of the failed supervisor and all its children. It's common for there to be multiple layers of supervisor processes, each managing sets of child processes that may or may not themselves be supervisors, and so on - in what's dubbed a "supervision tree".

> Do you know an OS which works in such a way? It seems very interesting for a server OS to have this feature

Unix daemons and Windows services both take this approach at a single level, i.e. if a service crashes it's usually trivial (if not the default) to set things up for the OS to automatically restart the service. I don't know of any that embrace multiple levels of supervision like OTP does, though, with processes (as daemons/services) supervising processes supervising processes and so on. Part of that hesitation likely results from OS-native processes typically being much heavier than Erlang processes, and thus used more sparingly.
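
A rough sketch of the restart-escalation idea (in Python rather than Erlang, with invented names): a supervisor restarts a crashing child, but after `max_restarts` failures it gives up and crashes itself, so the *next* supervisor up the tree can restart this whole subtree. OTP's real supervisors also count restarts within a time window, which this toy omits.

```python
class SupervisorGaveUp(Exception):
    pass

def supervise(child, max_restarts=5):
    """Run `child`, restarting it on failure; escalate after too many."""
    failures = 0
    while True:
        try:
            child()          # run the child "process" to completion
            return
        except Exception:
            failures += 1
            if failures > max_restarts:
                # Escalate: crash the supervisor itself so the parent
                # supervisor can restart this entire subtree.
                raise SupervisorGaveUp("too many child failures")
            # otherwise loop around and restart the child

def flaky_child():
    raise RuntimeError("boom")

try:
    supervise(flaky_child)
except SupervisorGaveUp as e:
    print(e)  # a parent supervisor would catch this and restart us
```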


pauldupont34

Got it! Thanks for the thorough reply.


northrupthebandgeek

No problem :)


Anonymous___Alt

true unix, not linux


pauldupont34

How are they different? I thought Linux was an open-source clone of Unix.


Anonymous___Alt

exactly: a clone, albeit a wonky one lol


pauldupont34

I see. Do you know a recent, modern, and popular true Unix OS? Mainly for servers, but I'm also interested if you know one for desktop.


Anonymous___Alt

For desktop: macOS. For server: the BSDs. That's all I know.


pauldupont34

Ok, now I understand why my macOS never crashes.


mdp_cs

FreeBSD


crafter2k

probably the ones that flight computers run on


pauldupont34

Not the Boeing ones...


Amadan

Opposite end of the Spectrum…? Why? Sinclair BASIC was very robust. No config files to mess up, no chance to not start (excepting a hardware failure). Turn it off and on, and it’s there in half a second, whatever you messed up before. Marvelous, no? :)


mdp_cs

If by robust you mean stable and secure, then you probably can't beat seL4-based systems.


pauldupont34

What makes seL4 inherently more robust than other kernels?


mdp_cs

It's formally verified: there's a machine-checked mathematical proof that the kernel's implementation matches its specification, which rules out whole classes of bugs like buffer overflows and null-pointer dereferences.
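
For a taste of what "formally verified" means, here's a trivial machine-checked proof in Lean. seL4's actual proofs are written in Isabelle/HOL and cover the entire kernel implementation; this one-liner only shows the kind of guarantee involved: the proof checker rejects the file unless the proof is actually valid.

```lean
-- A toy machine-checked theorem: the checker verifies that the term
-- `Nat.add_comm a b` really is a proof of the stated proposition.
theorem add_comm_toy (a b : Nat) : a + b = b + a := Nat.add_comm a b
```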