[[!meta copyright="Copyright © 2011 Free Software Foundation, Inc."]]

[[!meta license="""[[!toggle id="license" text="GFDL 1.2+"]][[!toggleable id="license" text="Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled [[GNU Free Documentation License|/fdl]]."]]"""]]

[[!toc]]

# General

Some [[tschwinge]] comments regarding your proposal. Which is very good, if I may say so again! :-)

Of course, everyone is invited to contribute here!

I want to give the following methodology a try, instead of only having email/IRC discussions -- for the latter are again and again showing a tendency to be dumped and deposited into their respective archives, and be forgotten there. Of course, email/IRC discussions have their usefulness too, so we're not going to replace them totally. For example, for conducting discussions with a bunch of people (who may not even be following these pages here), email (or, as applicable, the even more interactive IRC) will still be the medium of choice. (And then, the executive summary should be posted here, or incorporated into your proposal.) Also, if you disagree with this suggested procedure right away, or at some later point begin to feel that this thing doesn't work out, or simply takes too much time (I don't think so: writing emails takes time, too), just say so, and we can reconsider.

Of course, as this wiki is a passive medium rather than an active one as IRC and email are, it is fine to send notices like: *I have updated the wiki page, please have a look*.

One idea is that your proposal evolves alongside with the ongoing work, and represents (in more or less detail) what has been done and what will be done.
Also, we can hopefully use parts of it for documentation purposes, or as recipes for similar work (enabling other programming languages on the Hurd, for example).

For this, I suggest the following procedure: as applicable, you can either address any comments in here (for example, if they're wrong :-), or if they require further discussion; think: *email discussion*), or you can address them directly in your proposal and remove the comments from here at the same time (think: *bug fix*).

Generally, you can assume that for things I didn't comment on (within some reasonable timeframe/upon asking me again) that I'm fine with them. Otherwise, I might say: *I don't like this as is, but I'll need more time to think about it.*

There is also a possibility that parts of your proposal will be split off; in cases where we think they're valuable to follow, but not at this time. (As you know, your proposal is not really a trivial one, so it may just be too much for one person's summer.) Such bits could be moved to [[open_issues]] pages, either new ones or existing ones, as applicable.

# GSoC Site Discussion

  * Discussion items from should be copied here:

      * technical bits (obviously);

      * also the *why do we want Java bindings* reasoning;

  * CLISP findings should also be documented somewhere permanently.

  * We should probably open up a *languages for Hurd* section on the web pages ([[!taglink open_issue_documentation]]).

# Java Native Interface (JNI)

  *
  *
  *
  *

## Java Native Access (JNA)

  *
  *

This is a different approach, and *while some attention is paid to performance, correctness and ease of use take priority*. As we plan on only having a few native methods (for invoking `mach_msg`, essentially), JNA is probably the wrong approach: portability and ease of use are not important, but performance is.

## Compiled Native Interface (CNI)

  *
  *

Probably faster than JNI, but only usable with GCJ.
> Given that we have very few JNI calls,
> it might be interesting to take a "dual" approach
> if CNI actually improves performance
> when compiling to native code.
> --[[jkoenig]] 2011-07-20

# IRC, freenode, #hurd, 2011-07-13

[[!tag open_issue_documentation]]

Yes, I guess so. Maybe start investigating mig because it may have repercussions on what the best approach would be for some aspects of the Mach bindings. I still think that making MIG emit Java code is not too difficult, once you have the required Java infrastructure (like what you're writing at the moment). On the other hand, if there's another approach that you'd like to use, I'm not trying to force using MIG. i still have a problem understanding your approach at which level are your bindings located ? I expect mig it will be the easiest route, but of course possibly it won't. jkoenig: Yeah, maybe give some high-level to low-level overview? ok, so at the very core, low-level, we have a very thin amount of JNI code to access (proper) system calls. by "proper" I mean things like mach_task_self, mach_msg and mach_reply_port, which are actually system calls rather than RPCs to the kernel. right at this level, we manipulate port names as integers, and the message buffers for mach_msg are raw ByteBuffers (from the java.nio package) actually, so-called /direct/ ByteBuffers, which are backed by memory allocated outside of the Java heap, rather than as a byte[] array we can retrieve the pointer from the JNI code and use the buffer directly. (so, good for performance and it's also portable.) ok i'm more interested in the higher level bindings :) ok so, higher up.
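A minimal sketch of the low-level buffer handling described above, assuming a made-up three-word header (remote port name, local port name, message id) rather than the real `mach_msg_header_t` layout. The class and method names are hypothetical and are not taken from hurd-java. On the native side, JNI's `GetDirectBufferAddress` is the call that gives C code the pointer behind such a buffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: the three-word header below is an assumption,
// not the real mach_msg_header_t layout.
final class DirectBufferSketch {

    // Allocate a buffer outside the Java heap, using the platform's byte order.
    static ByteBuffer newMessageBuffer(int capacityBytes) {
        return ByteBuffer.allocateDirect(capacityBytes)
                         .order(ByteOrder.nativeOrder());
    }

    // Write the hypothetical header: remote port name, local port name, message id.
    static void putHeader(ByteBuffer buf, int remotePort, int localPort, int msgId) {
        buf.clear();
        buf.putInt(remotePort).putInt(localPort).putInt(msgId);
    }

    public static void main(String[] args) {
        ByteBuffer buf = newMessageBuffer(64);
        putHeader(buf, 17, 42, 1000);
        // Three 32-bit words have been written, so the position is 12.
        System.out.println("direct=" + buf.isDirect() + " position=" + buf.position());
    }
}
```

Port names are plain `int` values at this level, which is exactly what makes them forgeable and motivates the safe wrappers discussed next.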
design goal from my proposal: "the memory safety of Java should be maintained and extended to Mach primitives such as port names and out-of-line memory regions" so integer port names are not "safe" in the sense that they can be forged and misused in all kinds of ways which is why I have a layer of Java code whose job is to wrap this kind of low-level Mach stuff into safe abstractions and ideally the user should only use these safe abstractions. (Not to restrict the programmer, but to help him write correct code.) right. so you can't use mach RPCs directly tschwinge, also to actually restrict them, in a Joe-E / object-capability context, but that's not the primary concern right now ;-) or you force your wrappers to have these abstractions as input braunr, well, actually at this level you still have Mach RPC but for instance, port names are encapsulated into "MachPort" objects which ensure they are handled correctly As I understand it, you use these abstractions to prepare a usual mach_msg message, and then invoke mach_msg. ok and message buffers are wrapped into "MachMsg" objects which both help you write the messages into the ByteBuffer and prevent you from doing funky stuff and ensure the ports which you send/receive/pseudo-receive after an error/... are deallocated as required, etc. what's the interface to use IPC ? Is MIG doing that, too, I think? (And antrik once found some error there, which is still to be reviewed...) braunr, so basically as a user you would be free to use either one of these layers, or to use MIG-generated classes which would construct and exchange messages for you using the second (safe) layer. ok, let's just finish with the low level layer before going further please tschwinge, MIG does some type checking on the received message and saves you the trouble of constructing/parsing them yourself, but I'm not sure about how mach_msg errors are handled what are the main methods of MachMsg for example ?
braunr, you may want to have a look at http://jk.fr.eu.org/hurd-java/doc/html/classorg_1_1gnu_1_1mach_1_1MachMsg.html right, sorry grabbed the code at work and forgot here and also https://github.com/jeremie-koenig/hurd-java/blob/master/HelloMach.java which uses it but roughly, you'd use setRemotePort, setLocalPort, setId to write your message's header then use one of the putFoo() methods to add data items to the message ok, the mapping with the low level C interface is very clear that's good for me the putFoo() methods would write the appropriate type descriptors, then the actual data. we can go on with the MiG part if you want :) right, so here you may want to look at the UML class diagram from http://www.bddebian.com/~hurd-web/user/jkoenig/java/proposal/ [[proposal]]. so in the C case, mig generates 3 files a header file which has the prototypes of the mig-generated stubs, a *User.c which has their actual implementation and a *Server.c which handles demultiplexing the incoming messages and helps with implementing servers. so we would do something along these lines, more or less: mig would generate the code for a Java interface in lieu of the *.h file. a generated FooUser class would implement this interface by doing RPC (so basically you would pass a MachPort object to the constructor, and then you could use the resulting object to do RPC with whatever is on the other end) and the generated FooServer class would do the opposite, ok issues with threads ? you would pass an object implementing the Foo interface to the constructor, i'm guessing the demux part may have to create threads, right ? and the resulting object would handle messages by using the object you passed. braunr, right, so that would be more a libports kind of code, the libports-like library, i see to which you could pass Server objects (for instance the FooServer above), and it would handle incoming messages. how is message content mapped to a java interface ? 
this would be determined from the .defs files and MIG would generate the appropriate code, hopefully. so the demux part would handle rpc integer identifiers ? right. but hm also mapping .defs files to Java interfaces might prove to be tricky. data types conversion and all tschwinge: my memory is rather hazy. IIRC the issue was that the MIG-generated stubs deallocate out-of-line port arrays after the implementation returns, before returning to the dispatcher i'll just overlook this specific implementation detail but we could use some annotation-based system if we need to provide more information to generate the java code. but the Hurd (or rather glibc) RPC handling also automatically deallocates everything if an error occurs so I changed the MIG code to deallocate only when no error occurs jkoenig: ok, we'll talk about that when there is more progress and you have a better view of the problem at that time I was pretty sure that this is a correctly working solution, but it always seemed questionable conceptually... however, I wasn't able to come up with a better one, and nobody else commented on it antrik: shouldn't the hurd be changed not to deallocate something it didn't allocate in the first place ? braunr: no, the server has to deallocate stuff before returning to the client. the request message is destroyed before returning the reply. jkoenig, braunr: That's what I had in mind where MIG might be a bit awkward. Then we can indeed either add annotations to the .defs files, or reproduce them in some other format. That's some work, but it's mostly a one-time work. After all, the RPC interface is a binary one, and there may be more than one API for creating these messages, etc.
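The generated-code layout discussed above (a Java interface in lieu of the C header, a `FooUser` class doing the RPC, a `FooServer` class demultiplexing on the RPC identifier) could look roughly like the sketch below. Everything here is an assumption made for illustration: the `Transport` interface stands in for a `MachPort` plus `mach_msg`, messages are reduced to `int[]` arrays, and the routine `twice` and its identifier 1000 are invented. It is not what MIG emits today.

```java
// Hypothetical stand-in for "send a request on a port and wait for the reply".
interface Transport {
    int[] call(int msgId, int[] args);
}

// What a Java-emitting MIG might generate in lieu of the C header:
// one method per routine of the (made-up) Foo interface.
interface Foo {
    int twice(int x);
}

// Client side: implements Foo by building a request and performing the RPC.
final class FooUser implements Foo {
    static final int MSG_TWICE = 1000;   // made-up RPC identifier
    private final Transport port;

    FooUser(Transport port) { this.port = port; }

    @Override public int twice(int x) {
        return port.call(MSG_TWICE, new int[] { x })[0];
    }
}

// Server side: demultiplexes on the RPC identifier and calls the implementation
// object that was passed to the constructor.
final class FooServer {
    private final Foo impl;

    FooServer(Foo impl) { this.impl = impl; }

    int[] demux(int msgId, int[] args) {
        switch (msgId) {
            case FooUser.MSG_TWICE:
                return new int[] { impl.twice(args[0]) };
            default:
                throw new IllegalArgumentException("unknown RPC id " + msgId);
        }
    }
}

final class MigSketchDemo {
    public static void main(String[] args) {
        FooServer server = new FooServer(x -> 2 * x);   // the "real" implementation
        Foo proxy = new FooUser(server::demux);         // in-process loopback transport
        System.out.println(proxy.twice(21));            // prints 42
    }
}
```

Because client and server share the `Foo` interface, the same calling code works against a local implementation or a remote one, which is the property the C stubs provide through the shared header.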
jkoenig: actually, at least in the Hurd, server-side and client-side headers are separate -- so MIG actually creates four files tschwinge, wrt to annotations I was more thinking about Java ones, such as: @MIGDefsFile("mach/task.defs") @MIGCType("task_t") public interface Task { } antrik, oh, ok, it makes sense. jkoenig: anything else ? braunr, nothing that I can think of ok tschwinge: I think it would be a *very* bad idea to introduce redundancy regarding RPC definitions thanks for the tour :) (the _request.defs/_reply.defs mess is bad enough...) did I speak about the "Unsafe" pseudo-exception? that's interesting :-) jkoenig: Also, virtual memory abstractions? jkoenig: you didn't antrik: Well, then we could create some other super-format. But that's just a detail IMO. ok, so wrt virtual memory, a page we received can be wrapped with some JNI help into a (direct) ByteBuffer object. deallocating sent pages will be tricky, though. antrik: To put it this way: for me the .defs files are just one way of expressing the RPC interfaces' contracts. (At the same time, they happen to be the actual reference for these, too. But the specification itself could just as well be a textual one.) one approach I've been thinking about would be to "wrap" the ByteBuffer object into an object which has the sole reference to it, so that when it's deallocated the reference can be replaced with "null", and further attempts to access the buffer would throw exceptions. sounds reasonable but that's still in flux in my head, we may end up needing our own implementation of ByteBuffer-like objects. The problem being that there is no mechanism to ``revoke'' an object once a reference to it has been shared. right. A wrapper is one possibility indeed. tschwinge: they are called interface *definitions* for a reason :-) This is a very similar problem as with capabilities when there is no revoke operation for these, too. antrik: Yes, because they define MIG's input.
:-P Isn't that what is called a membrane in the capability world? I do not say that we have to consider the format of the .defs to be set in stone; but I do insist on using a canonical machine-parsable source for all language bindings attenuation tschwinge, you mean the revokable proxy construct ? (It's the same principle indeed) A common design pattern in object-capability systems: given one reference of an object, create another reference for a proxy object with certain security restrictions, such as only permitting read-only access or allowing revocation. The proxy object performs security checks on messages that it receives and passes on any that are allowed. Deep attenuation refers to the case where the same attenuation is applied transitively to any objects obtained via the original attenuated object, typically by use of a "membrane". http://en.wikipedia.org/wiki/Object-capability_model Yes. Good. I understood something. ;-) antrik: OKAY! :-P jkoenig: And hopefully the JVM will optimize away all the additional indirection... :-D jkoenig: Is there anything more to say about the VM layer? tschwinge, "hopefully", yes :-) Like, the data that I'm sharing -- is it untyped, isn't it? tschwinge, you mean that within the received/sent pages ? Yes. But that's how it is, indeed. well actually the type descriptor should indicate what they contain. I cannot trust anything I receive from externally. it's most often used for MACH_MSG_TYPE_CHAR items I guess, and it will be type checked when retrieved Yeah, and that then just *is* arbitrary data, like a block read from a disk file. you would have something like: ByteBuffer MachMsg.getBuffer(MachMsg.Type expected), and MachMsg would check the type descriptor against that which you specified Or a packet transmitted over the network. OK, yes. jkoenig: in theory ints should be used quite often too.
the whole purpose of the type descriptors is to allow byte order swapping when messages are passed between hosts with different architecture... tschwinge, right, except for out-of-line port arrays, which need to be handled differently obviously. (which is totally irrelevant for our purposes -- especially since the actual network IPC code doesn't exist anymore ;-) ) antrik, oh, interesting Yes, that was one original idea. actually my litmus test for what the bindings should be, is you should be able to implement such a proxy in Java :-) antrik: And hey, you now have processors that can switch between different modes during runtime... :-) (although arguably that's a little bit ambitious) tschwinge: there should be bits in page tables to indicate the endianness to use on a page .. :) Hehe! jkoenig: Don't worry -- you're already known for ambitious projects. One more can't hurt. Also, actually the word size is not something that I've been able to abstract so far, so I'll be hardcoding little-endian 32 bits for now. why is that ? some of the Hurd RPC break the idea anyways BTW the org.vmmagic package (from Jikes RVM and JNode) could help with that, but GCJ does not support it unfortunately (not sure about OpenJDK) braunr, Java does not allow us to define new unboxed types jkoenig: does it have its own definition of the word size ? braunr, nope. (although, maybe, and also we could use JNI to query it) even if virtual, i'd expect a machine to have such a definition braunr, maybe it has, but basically in Java nothing depends on the word size 'int' is 32 bits, 'long' is 64 and that's it. oh right, i remember most types are fixed size, right ? right.
if not all now Jikes RVM's "org.vmmagic" provides an interface to define new unboxed types which can depend on the actual word size, but Jikes RVM is its own JVM so obviously they can use and provide whatever extensions they need :-) (but maybe they've implemented them in OpenJDK for bootstrap purposes, I'm not sure) I'm missing this detail: where does the word size come into play here? anyway, I _could_ indiscriminately use 'long' for port names, and sprinkle the code with word size tests but that would be very clumsy jkoenig: port names are actually ints :/ tschwinge, the actual format of the message header and type descriptors, for instance. jkoenig: ok, got your point braunr, by 'long' I mean 64-bits integers (which they are on 64-bits machines I think?) :) jkoenig: port names are as large as the word size but in C at least, they're int, not long it doesn't change many things, but you get lots of warnings if you try with a long :) What is the reason that port names are an architecture-dependent word size's width, and not simply 32 bit? "4 billions of port names should be enough for everyone" :-) tschwinge: an optimization is to use them as pointers in the kernel tschwinge: the machine's native word size is what it can process most efficiently, and what should be used for most normal operations... it makes sense to define stuff as int, except for network communication jkoenig: Well, yeah, but if you want to communicate with a peer, you have to agree on the maximum number anyway (not for port names, though, which are local).
antrik: int isn't the word size everywhere antrik: the most common type matching the word size is long, at least on ILP32/LP64 data models braunr: that's just because some idiots assumed int would always be 32 bits, and consequently when 64 architectures came up the compiler guys chickened out ;-) without int, you wouldn't have a 32 bits type that's not true for all architectures and/or operating systems though AFAIK or a 16 bits one antrik: windows guys got even more scared, so windows 64 is LLP64 BTW, I haven't checked, but it's quite possible that 32 bit numbers are actually preferable even on AMD64... jkoenig: So, back on track. :-) jkoenig: You didn't find anything yet in Mach's VM interfaces as well a MemoryObject, etc., that can't be used/implemented in the Java world? antrik: they consume less memory, but don't have much effect on performance tschwinge, once we have the basic system calls and the corresponding abstractions in place, I don't think anything else fundamentally problematic could possibly show up braunr: if you really *need* a type of a certain bit size, you should use stdint types. so not having a 16 or 32 bit type in the short/int/long canon is *not* an excuse jkoenig: That speaks for the Mach designers! antrik: right tschwinge, one trick is that for instance, mach_task_self would still be unsafe even if it returned a nicely wrapped Task object, because you could still wreck your own address space and threads with it. So we would need the "attenuation" pattern mentioned above to provide a safe one. (which would disallow things such as the port/thread/vm calls) jkoenig: you mentioned the unsafe pseudo exception earlier braunr, right, so the issue is with distinguishing safe from unsafe methods braunr: BTW, the Windows guys actually broke a lot of stuff by fixing long at 32 bits -- this way long doesn't match size_t and pointer types anymore, which was an assumption that was true for pretty much any system so far... jkoenig: Yes.
(And again hope for the JVM to optim...) antrik: that's right :) antrik: that's LLP64 antrik: long long and pointers braunr, so basically the idea is that unsafe methods are declared as "throws Unsafe" the effect is that if you use such a method you must either "throw Unsafe" yourself, or if you're building a safe abstraction on top of Unsafe methods, you'll "catch" the "exception" in question to tell the compiler that it's okay. it's more or less inspired from the "semantic regimes" idea from the org.vmmagic paper which is referenced in my original proposal, only implemented by hijacking the exception checking machinery, which has a behaviour similar to what we want. ok but hmm this seems pretty normal, what's the tricky part ? :) braunr: The idea is that the programmer explicitly has to acknowledge if he's using an unsafe interface. tschwinge: sounds pretty normal too braunr, the trick is that you would not usually declare exceptions which are never actually thrown (and actually since the compiler does not know it's never thrown, I need to work around it in a few places) oh, ok jkoenig: that's interesting indeed braunr, the org.vmmagic paper provides an example which uses some annotations called @UncheckedMemoryAccess and @AssertSafe to the same effect (which is kind of cleaner), but it would be a headache to implement without help from the compiler I think (as far as I can tell the annotation processor would have to inspect the bytecode) but hm what's the true problem about this ? (the paper advocates "high-level low-level programming" and is a very interesting read I think, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.151.5253&rep=rep1&type=pdf, for what it's worth) what's wrong if you just declare your methods unsafe and don't alter anything else ? Yes, I read it and it is interesting. Unfortunately, it seems I forgot most of it again... braunr, declare? alter? you mean just tag them with an annotation?
just stating a method "throws Unsafe" braunr, well some compiler will output a warning because they can tell there's no way the method is going to throw such an exception. and then some other compiler will complain that my @SuppressWarnings("unused") does not serve any purpose to them :-) also, when initializing final fields, I need to work around the fact that the compiler thinks "Unsafe" might be thrown. see for instance MachPort.DEAD jkoenig: ok braunr, but I'm more than willing to accept this in exchange for a clear, compiler-enforced materialization of the border between safe and unsafe code. actually another question I have is the amount of static typing I should add to the safe version, for instance should I subclass MachPort into MachSendRight, MachReceiveRight and so on. I don't want to depart from the C interface too much but it could be useful. jkoenig: can't answer that :) jkoenig: keep them in mind for later i think jkoenig: What's the safety concern w.r.t. having MachPort (not) final? tschwinge, actually I'm partly wrong in that we only need name() and a couple other methods to be final jkoenig: That's what I was thinking. :-) I thought I'm missing something here. tschwinge, the idea is that the user (ie., the adversary :-) could extend MachPort and inject their own fake port name into messages by overriding name() or clear() Yeah, but if these are final, that's not possible. right. And that *should* be enough, I think. Unless I'm missing something. I don't think so. Also I hope it is, because as mentioned above there might be some value in subclassing MachPort. Yep. incidentally, declaring the class or the method final will allow the JVM to inline them I think. It will help the JVM, yes. It can also figure that out without final, though. (And may have to de-optimize the code again in case there are additional classes loaded during run-time.) jkoenig: The reference counting in MachPort. I think I'm beginning to understand this.
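The "throws Unsafe" idea discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the hurd-java code: the class names `RawPorts` and `SafePorts`, the method `rawName`, and the constant `NULL_NAME` are invented here. Only the technique (a checked exception that is declared but never thrown) comes from the discussion.

```java
// Checked "pseudo-exception": declared by unsafe methods, never actually thrown.
final class Unsafe extends Exception {
    private Unsafe() {}   // no instances can be created, so it cannot be thrown
}

final class RawPorts {
    // Hypothetical unsafe operation: accepts any integer as a port name.
    // Callers must either declare "throws Unsafe" themselves or catch it.
    static int rawName(int forged) throws Unsafe {
        return forged;
    }
}

final class SafePorts {
    // A static field initializer may not throw a checked exception, so the
    // unsafe call has to go through a wrapper that catches Unsafe.
    static final int NULL_NAME = nullPortName();

    // Safe abstraction built on an unsafe call: the catch block is the
    // explicit, compiler-visible acknowledgement that unsafe code is used.
    static int nullPortName() {
        try {
            return RawPorts.rawName(0);
        } catch (Unsafe e) {
            throw new AssertionError("Unsafe is never thrown", e);
        }
    }
}
```

Removing the `try`/`catch` from `nullPortName()` without adding `throws Unsafe` makes the class fail to compile, which is the compiler-enforced border between safe and unsafe code that the discussion is after.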
oh ok tschwinge, yes the javadoc is maybe a bit obscure so far. but basically you don't want the port name you acquire to become invalid before you're done using it. But how is this different from the C world? here my goal is to provide some guarantees if you use only safe methods like, you can't forge a port name and things like that so basically it should never be possible to include an invalid port name in a message if you use only safe methods. Ah, I see! Now that does make sense. but the mechanism in itself is similar to the Hurd port cells and user_link structures It's again ``only'' helping the programmer. right, no object-capability ulterior motives :-) another assumption which the javadoc does not state yet is that basically there should be exactly one MachPort object for each mach-level port name reference (in the sense of mach_port_mod_refs) Yes, I figured out that bit.
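A minimal sketch of the two mechanisms discussed above: `final` accessors so that a subclass cannot substitute a forged port name, and a use count so that a name cannot be invalidated while someone is still using it. The names `PortSketch`, `acquireName`, `releaseName`, `tryClear` and `SendRightSketch` are hypothetical, and treating 0 as the invalid name is an assumption of this sketch. The real `MachPort` class may differ in every detail.

```java
// Hypothetical sketch, not the hurd-java MachPort class.
class PortSketch {
    private int name;        // port name; 0 once cleared (assumption of this sketch)
    private int users = 0;   // outstanding acquisitions of the name

    PortSketch(int name) { this.name = name; }

    // final: a subclass cannot override this to inject a forged name.
    public final synchronized int acquireName() {
        if (name == 0) throw new IllegalStateException("port already deallocated");
        users++;
        return name;
    }

    // final: every acquisition must be matched by exactly one release.
    public final synchronized void releaseName() {
        if (users == 0) throw new IllegalStateException("no outstanding acquisition");
        users--;
    }

    // Invalidate the name, but only when nobody is in the middle of using it.
    public final synchronized boolean tryClear() {
        if (users > 0) return false;
        name = 0;
        return true;
    }
}

// Subclassing for extra static typing stays possible, because only the
// security-relevant methods are final rather than the whole class.
final class SendRightSketch extends PortSketch {
    SendRightSketch(int name) { super(name); }
}
```

A caller would bracket message construction with `acquireName()` and `releaseName()`, so the name written into the message stays valid until the message has been sent.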