

  • Roscoe is one of my professors at ETH, and he gave a keynote at VISCon a few months ago where he discussed this stuff and what his department is working on. Apparently a lot of the systems department’s current work is about formally modeling which parts of a system can access which other parts, figuring out which of those permissions are actually needed, and then deriving the strictest possible MPU configuration that still yields a working system. The advantage of this approach over an entirely new kernel is that, well, it doesn’t require an entirely new kernel: it can be retrofitted onto an existing system, while still eliminating basically the entire class of vulnerabilities they’re targeting.


  • This guy (Roscoe) is one of my professors and I’ve heard him give a few talks related to this before, so I’ll try to summarize the problem:

    Basically, modern systems no longer match the classic model of “there’s some memory and peripheral devices attached to a bus, and they’re all driven by the CPU running a kernel which is responsible for controlling everything”. Practically every component has its own memory and processor(s), each running its own software independently of the main kernel (sometimes even with its own separate kernel!); there are separate buses, completely inaccessible to the CPU, used solely for communication between components; virtually every component is often attached directly to the memory bus, bypassing the CPU’s memory protection mechanisms; and a lot of these hidden coprocessors are completely undocumented. A modern smartphone SoC can have tens of separate processors, all running their own software independently of each other.

    This is bad for a lot of reasons, most importantly that it becomes basically impossible to reason about the correctness or security of the system when the “OS kernel” is actually just one of many equally privileged devices sharing the same bus. An example of what this allows: it is (or was) possible to send malformed WiFi packets and trigger a buffer overrun in certain mobile WiFi modems, allowing an attacker to get arbitrary code execution on the modem and then use that to overwrite the Linux kernel in main memory, thus achieving full kernel-level RCE with no user interaction required. You can have the most security-hardened Linux kernel you want, but that doesn’t mean a damn thing if any one of dozens of other processors can just… overwrite your code or read sensitive data directly out of applications!

    As I understand it, the goal of these projects is basically to make the kernel truly control all the hardware again, by having it also provide the firmware/control software for every component in the system. Obviously this requires a very different approach than conventional kernel designs, which basically just assume they rule the machine.


  • This is specific to page reclamation, which only occurs when the kernel is removing a block of memory from a process. VMs in particular pretty much never trigger it; they pin a whole ton of memory and leave it entirely up to the guest OS to manage. The JVM also rarely returns heap memory to the kernel - only a few garbage collectors even support doing so (and that support is relatively recent), and even when configured correctly it’ll only release memory when the Java application is relatively idle (so the performance hit isn’t noticeable).
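    For reference, a rough sketch of the flags involved in getting the JVM to give heap back (version-dependent; G1PeriodicGCInterval is Java 12+ per JEP 346, and app.jar is just a stand-in for your actual launch command):

        # G1: run periodic concurrent cycles when idle and uncommit unused heap
        java -XX:+UseG1GC -XX:G1PeriodicGCInterval=60000 -jar app.jar

        # ZGC: uncommit unused heap after a delay (in seconds)
        java -XX:+UseZGC -XX:+ZUncommit -XX:ZUncommitDelay=300 -jar app.jar

        # Shenandoah: uncommits by default after a delay (in milliseconds)
        java -XX:+UseShenandoahGC -XX:ShenandoahUncommitDelay=60000 -jar app.jar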



  • This probably won’t make much difference unless your application is frequently adding and removing large numbers of page mappings (either because it’s explicitly unmapping memory segments, or because pages are constantly being swapped in and out due to low system memory). I would suspect that the only things which would really benefit from this under “normal” circumstances are some particularly I/O intensive applications with lots of large memory mappings (e.g. some webservers, some BitTorrent clients), or applications which are frequently allocating and deallocating huge slabs of memory.

    There might be some improvement during application startup, as all the code+data pages are mapped in and the memory allocator’s arenas fill up, but as far as I know anonymous mappings are typically filled in one page at a time on first write, so I don’t know how relevant this sort of batching might be.



  • Traditional graphics code works by having the CPU generate a sequence of commands which are packed together and sent to the GPU to run. This extension lets you write code which runs on the GPU to generate commands, and then execute those same commands on the GPU without involving the CPU at all.

    This is a super powerful feature which makes it possible to do things that simply weren’t feasible in the traditional model. Vulkan improved on OpenGL by allowing people to build command buffers on multiple threads and to re-use existing command buffers, but GPU pipelines are getting so wide that scenes containing many objects with different render settings are bottlenecked by the rate at which the CPU can prepare commands, not by GPU throughput. Letting the GPU generate its own commands means you can leverage the GPU’s massive parallelism for the entire render process, and it also makes render state changes much cheaper.

    (For anyone familiar, this is basically a more fleshed out version of NVIDIA’s proprietary NV_command_list extension for OpenGL, except that it’s in Vulkan and standardized across all GPU drivers)







  • It’s not that obscure - I had a use case a while back where I had multiple rocksdb instances running on the same machine and wanted each of them to store its WAL only on SSD storage with compression, and to have the main tables stored uncompressed on an HDD array behind a write-through SSD cache (ideally using the same set of SSDs for cost). I eventually did it, but it required partitioning the SSDs in half, using one half for a bcache (not bcachefs) in front of the HDDs, and using the other half of the SSDs to create a compressed filesystem, on which I then created subdirectories and bind mounted each one into the corresponding rocksdb database.

    Yes, it works, but it’s also ugly as sin, and the SSD allocation between the cache and the WAL storage is fixed (I’d like to use as much space as possible for caching). With bcachefs this would be just a few simple commands, and it would also be completely transparent once configured (no messing around with dozens of fstab entries or bind mounts).
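    Roughly something like this (device names, labels, and paths made up, and I haven’t tested this exact invocation):

        # two SSDs as the foreground/cache tier, two HDDs as the background tier
        bcachefs format \
            --label=ssd.ssd1 /dev/nvme0n1 \
            --label=ssd.ssd2 /dev/nvme1n1 \
            --label=hdd.hdd1 /dev/sda \
            --label=hdd.hdd2 /dev/sdb \
            --foreground_target=ssd \
            --promote_target=ssd \
            --background_target=hdd
        mount -t bcachefs /dev/nvme0n1:/dev/nvme1n1:/dev/sda:/dev/sdb /mnt

        # pin the WAL directory entirely to SSD, with compression
        setfattr -n bcachefs.foreground_target -v ssd /mnt/db/wal
        setfattr -n bcachefs.background_target -v ssd /mnt/db/wal
        setfattr -n bcachefs.compression -v lz4 /mnt/db/wal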


  • ext4 aims not to lose data under the assumption that the single underlying drive is reliable. btrfs/bcachefs/ZFS assume that any one (or several) of perhaps dozens of underlying drives could fail entirely or start returning garbage at any time, and try to ensure that the bad drive can be kicked out and replaced without losing any data or interrupting the system. Both are aiming for stability, but the stability requirements at scale are very different from anything a “dumb” filesystem can offer, because once you have enough drives, one of them WILL fail, and ext4 cannot save you in that situation.

    Complaining that datacenter-grade filesystems are unreliable when using them in your home computer is like removing all but one of the engines from a 747 and then complaining that it’s prone to crashing. Of course it is, because it was designed under the assumption that there would be redundancy.


  • XFS still isn’t a multi-device filesystem, though… of course you can run it on top of mdraid/LVM, but that still doesn’t come close to the flexibility of what these specialized filesystems can do. Being able to simply run btrfs device add /dev/sdx1 / and immediately have the new space available is far less hassle than adding a device to an md array, then resizing the partition, then resizing the filesystem (and removing a device is even worse) - see the comparison below. Snapshots are a similar deal: sure, LVM can snapshot your entire virtual block device, but your snapshots are themselves block devices which need to be explicitly mounted, while in btrfs/bcachefs a snapshot is just a directory, and it can be isolated to a specific subvolume rather than covering the entire block device.
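    For comparison (device and volume names assumed):

        # btrfs: one command, space is usable immediately
        btrfs device add /dev/sdx1 /

        # md + LVM + ext4: grow every layer separately
        mdadm --add /dev/md0 /dev/sdx1
        mdadm --grow /dev/md0 --raid-devices=5
        pvresize /dev/md0
        lvextend -l +100%FREE /dev/vg0/root
        resize2fs /dev/vg0/root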

    Data checksums are also substantially less useful when the filesystem can’t address the underlying devices individually, because that makes repairing the data from a replica impossible. If you have a file on an md RAID1 device and one of the replicas has a bad block, you might be able to detect the bitrot by verifying the checksum, but you can’t actually fix it: even though there is a second copy of the data on another drive, mdadm exposes only a simple block device and provides no way to read from “the other copy”. mdraid can recover from total drive failure, but not from data corruption.



  • bcachefs is way more flexible than btrfs on multi-device filesystems. You can group storage devices together by performance/capacity/whatever else, and then do funky things like assigning a group of SSDs as a write-through/write-back cache for a bigger array of HDDs. You can also configure a ton of properties on individual files or directories, including the cache and main storage groups, the number of data replicas, the compression type, and quite a bit more.

    So you could have two files in the same folder, one of them stored compressed on an array of HDDs in RAID10 and the other stored uncompressed on a different array of HDDs in RAID5 with a write-back SSD cache, and you wouldn’t have to fiddle around with multiple filesystems and bind mounts - everything can be configured simply by setting xattr values, roughly as sketched below. You could even have a third file striped across both groups of HDDs, without having to partition them up.
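    As a sketch (assuming storage groups named ssd, hdd1, and hdd2 were defined at format time; paths made up, and the exact RAID10/RAID5-style layouts involve replica and erasure-coding settings beyond what I’m showing here):

        setfattr -n bcachefs.data_replicas -v 2 data/file_a        # mirrored
        setfattr -n bcachefs.compression -v zstd data/file_a
        setfattr -n bcachefs.background_target -v hdd1 data/file_a

        setfattr -n bcachefs.foreground_target -v ssd data/file_b  # SSD write-back cache
        setfattr -n bcachefs.background_target -v hdd2 data/file_b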


  • DaPorkchop_@lemmy.ml to Linux@lemmy.ml · “Java uses double ram”

    There are still plenty of native libraries, plus the JVM itself. For instance, the networking library (Netty) uses off-heap memory, which it preallocates in fairly large blocks. The server will spawn quite a few threads, both for networking and for handling async chunk loading+generation, and each of those will likely add multiple megabytes of off-heap memory for stack space, thread-locals, GC state, system memory allocator state, and I/O buffers. And none of this accounts for the memory used by the JVM itself, which includes: up to a few hundred megabytes of space for JIT-compiled code; JIT compiler state such as code profiling information (in practice a good chunk of opcodes need to track this); method signatures, field layouts, and superclass+superinterface information for every single loaded class (for modern Minecraft, that’s well into the tens of thousands of classes); and the full uncompressed bytecode for every single method in every single loaded class. If you’re using G1 or Shenandoah (you almost certainly are), add the GC card table on top, which IIRC is one byte per 512-byte card of heap space (and I don’t think it’s bitpacked, for performance reasons). I could go on, but you get the picture.
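    If you want to actually see all of this, the JVM’s Native Memory Tracking gives a per-category breakdown (server.jar stands in for the real launch command):

        java -XX:NativeMemoryTracking=summary -Xmx4G -jar server.jar
        # then, from another terminal:
        jcmd <pid> VM.native_memory summary
        # reports committed bytes per category: Class, Thread, Code (JIT),
        # GC (card tables etc.), Compiler, Symbol, Internal, ...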