Unsafe abstractions / Habrahabr

January 9, 2018
The unsafe keyword is an integral part of the design of the Rust language. For those who are not familiar with it: unsafe is a keyword that, in simple terms, provides a way to bypass Rust's type checking.

The existence of the unsafe keyword comes as a surprise to many people at first. After all, isn't it the whole point of Rust that programs do not crash from memory errors? If so, why is there such an easy way around the type system? It can look like a flaw in the language.

Yet, in my opinion, unsafe is not a flaw. In fact, it is an important part of the language. unsafe acts as a kind of escape valve: it lets us keep the type system fairly simple while still allowing you to use whatever clever tricks you want in your code. We only require that you hide these tricks (the unsafe code) behind safe external abstractions.

This post introduces the unsafe keyword and the idea of bounded unsafety. It is in fact a precursor to a post I hope to write a little later, which will discuss the Rust memory model: the rules for what may and may not be done in unsafe code.

Unsafe code as a plugin

I think the way interpreted languages such as Ruby (or Python) use C code is a good analogy for how unsafe works in Rust. Take, for example, the JSON module in Ruby. It includes both an implementation in Ruby (JSON::Pure) and an alternative implementation in C (JSON::Ext). Usually, when you use the JSON module, you are running C code, but your Ruby code interacts with it exactly as it would with ordinary Ruby code. From the outside, this code looks like any other Ruby module, but inside it can use all sorts of tricks and perform optimizations that could not be written in Ruby alone. (You can read this excellent article on Helix to find out more; it also shows how to write Ruby plugins in Rust.)

Well, the same can happen in Rust, though on a slightly different scale. For example, you can write an efficient hash table implementation in "pure" Rust. Adding some unsafe code can make it even faster. If the data structure is used by many people, or its performance is very important for your program, then it may be worth the effort (which is why we use unsafe code in the implementation of the standard library). In any case, calling Rust code treats the unsafe code as if it were safe: the layers of abstraction on top of it provide a uniform interface.

Of course, the fact that unsafe code can make a program faster does not mean you should use it often. Just as most Ruby code is written in Ruby, most Rust code is written in safe Rust. This is especially true because safe Rust code is already very efficient, so the performance gain from switching to unsafe code is rarely worth the effort.

Probably the most frequent use of unsafe code in Rust is calling libraries written in other languages via FFI (Foreign Function Interface). Every call to a C function from Rust is unsafe, because the compiler has no way to judge the safety of the C code.
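As a rough illustration of the FFI case, here is a minimal sketch that calls the C standard library's abs from Rust. The wrapper name c_abs is hypothetical, and the declaration uses the `unsafe extern` spelling accepted by recent compilers (older toolchains write plain `extern "C"`). It assumes the platform's C library is linked, which is the default for ordinary Rust programs.

```rust
use std::os::raw::c_int;

// Declaration of C's `abs`. The compiler cannot verify that this signature
// matches the real C function; getting it right is our responsibility.
unsafe extern "C" {
    fn abs(input: c_int) -> c_int;
}

/// Hypothetical safe wrapper that hides the unsafe call.
/// `c_int::MIN` is rejected up front because `abs(INT_MIN)` is undefined in C.
fn c_abs(x: c_int) -> Option<c_int> {
    if x == c_int::MIN {
        return None;
    }
    // SAFETY: `abs` is defined for every other `int` value and touches no memory.
    Some(unsafe { abs(x) })
}
```

Callers of c_abs never write unsafe themselves: the one condition the C function needs is checked inside the wrapper.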

Extending the language with unsafe code

I think the most interesting reason to write unsafe code in Rust (or a C module in Ruby) is to extend the capabilities of the language. Probably the most frequently cited example is the Vec type in the standard library, which uses unsafe code to manipulate uninitialized memory. Rc and Arc, the reference-counted pointers, are also good examples. There are far more interesting examples, though: Crossbeam and deque use unsafe code to implement non-blocking (lock-free) data structures, and Jobsteal and Rayon use unsafe code to implement thread pools.

In this post, we'll look at one simple example: the split_at_mut method, which is available in the standard library. This method works on mutable slices. It takes an index (mid) and divides the slice into two parts at that index. It then returns two smaller slices: one covering the range 0..mid, the other covering mid.. .
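To make the behaviour concrete, here is a small sketch that uses the standard library's split_at_mut to hold two mutable slices into the same buffer at once. The helper name swap_heads is made up for this illustration.

```rust
/// Swaps the first element of the left part with the first element of the
/// right part. Returns false if either part would be empty.
fn swap_heads(data: &mut [i32], mid: usize) -> bool {
    if mid == 0 || mid >= data.len() {
        return false;
    }
    // `left` covers 0..mid and `right` covers mid.. ; both are mutable
    // and both are usable at the same time.
    let (left, right) = data.split_at_mut(mid);
    std::mem::swap(&mut left[0], &mut right[0]);
    true
}
```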

For convenience, one can imagine split_at_mut being implemented as follows:

    impl<T> [T] {
        pub fn split_at_mut(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            (&mut self[0..mid], &mut self[mid..])
        }
    }

This code will not compile, for two reasons:

  • In general, the compiler does not look at indices very closely, separately from the array they index into. When it sees an indexing expression such as foo[i], it ignores the index and treats the array as a single entity (foo[_]). As a result, it cannot tell that &mut self[0..mid] refers to a different region of memory than &mut self[mid..]. Performing that kind of analysis would require a much more complex type system.
  • In fact, the [] operator is not even part of the language when applied to a range such as 0..mid: it is implemented entirely in the standard library. So even if the compiler knew that 0..mid and mid.. do not overlap, it would not follow that these ranges refer to non-overlapping regions of memory.
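The first point can be seen in a small sketch (the names Pair, bump_fields and bump_elements are invented for this example). The borrow checker tracks struct fields separately, but it does not distinguish between array indices, so the commented-out lines are rejected with error E0499 even though indices 0 and 1 obviously never overlap.

```rust
struct Pair {
    x: i32,
    y: i32,
}

/// Compiles: two fields can be mutably borrowed at the same time.
fn bump_fields(p: &mut Pair) {
    let a = &mut p.x;
    let b = &mut p.y;
    *a += 1;
    *b += 1;
}

/// Two simultaneous mutable borrows of different elements do NOT compile.
fn bump_elements(arr: &mut [i32; 2]) {
    // let a = &mut arr[0];
    // let b = &mut arr[1]; // error[E0499]: second mutable borrow
    // *a += 1;
    // *b += 1;

    // Borrowing one element at a time is accepted.
    arr[0] += 1;
    arr[1] += 1;
}
```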

One can imagine changing the compiler so that the code above compiles, and perhaps one day we will. For now, though, we prefer to implement methods like split_at_mut with unsafe code. This lets us keep the type system simple while still being able to offer APIs like split_at_mut.

The boundaries of abstraction

Viewing unsafe code as plugin code makes it possible to state the idea of "abstraction boundaries" clearly. When you write a plugin for Ruby, you expect that when Ruby code calls your functions, it will hand you ordinary Ruby values. Internally you can do whatever you want, for example use a C array instead of a Ruby vector. But when you return to the Ruby code, you have to convert the values you return back into standard Ruby values.

The same is true for unsafe code in Rust. Client code works on the assumption that your code is safe. This means you can assume that callers pass valid values as input. It also means that every value you return must satisfy the requirements of the Rust type system. Inside the unsafe boundary, you may bend the rules as you see fit (exactly how far you may go is a topic for discussion, which I hope to cover in a later post).

Let's look at the split_at_mut method we saw in the previous section. To keep things simple, we will consider only the external interface of the function, as given by its signature:

    impl<T> [T] {
        pub fn split_at_mut(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            // The body of the function is omitted so that we can focus
            // on the public interface. In any case, safe code should not
            // care about what is in here.
        }
    }

What can we learn from this signature?
To begin with, split_at_mut may rely on all of its inputs being valid (in safe code, the compiler checks that this is indeed the case). For unsafe code, what this means for split_at_mut can be expressed as the following rules:

  • The self argument is of type &mut [T]. It follows that we receive a reference pointing to some number (N) of elements of type T. It is a mutable reference, so we know that nobody else can access the memory behind self (until the mutable reference ceases to exist). We also know that the memory is initialized.
  • The mid argument is of type usize. All we know is that it is a non-negative integer.

There is one thing not mentioned here. Nothing guarantees that mid is a valid index into self. This means that the unsafe code we write will have to check it.

When split_at_mut returns, it must ensure that its return value matches the signature. Simply put, the function must return two valid slices (that is, slices pointing to allocated memory). It is also important that these slices do not intersect: they must refer to two non-overlapping regions of memory.
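The contract on the return value can be spelled out as a check against the standard library's split_at_mut. This is only a sketch with a made-up helper name, halves_are_disjoint; it compares addresses and lengths and never dereferences the raw pointers.

```rust
/// Returns true if the two parts produced by `split_at_mut` cover the
/// original slice exactly, with the right part starting where the left ends.
/// Panics (inside `split_at_mut`) if `mid > data.len()`.
fn halves_are_disjoint(data: &mut [u32], mid: usize) -> bool {
    let len = data.len();
    let start = data.as_ptr();
    let (left, right) = data.split_at_mut(mid);
    // The left part starts at the original start and has `mid` elements.
    let left_ok = left.as_ptr() == start && left.len() == mid;
    // The right part starts exactly `mid` elements later and holds the rest.
    let right_ok = right.as_ptr() == start.wrapping_add(mid) && right.len() == len - mid;
    left_ok && right_ok
}
```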

Possible implementations

Let's look at several possible implementations of split_at_mut and decide whether or not they are valid. We have already seen that an implementation written in "pure" Rust does not work (it does not compile). Let's try to implement the function using raw pointers:

    impl<T> [T] {
        pub fn split_at_mut(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            use std::slice::from_raw_parts_mut;

            // The `unsafe` block gives access to operations on *raw* pointers.
            // By using an `unsafe` block, we assert that none of our actions
            // will cause UB (undefined behaviour).
            unsafe {
                // get a *raw* pointer to the first element
                let p: *mut T = &mut self[0];
                // get a pointer to the element at `mid`
                let q: *mut T = p.offset(mid as isize);
                // number of elements from `mid` onwards
                let remainder = self.len() - mid;
                // assemble the slice for the range `0..mid`
                let left: &mut [T] = from_raw_parts_mut(p, mid);
                // assemble the slice for the range `mid..`
                let right: &mut [T] = from_raw_parts_mut(q, remainder);
                (left, right)
            }
        }
    }

This version is the closest to what is implemented in the standard library. However, the code relies on an assumption that the inputs do not justify: it assumes that mid is within the bounds of the array. Nowhere is it checked that mid <= len. This means that q may point outside the array, and that the computation of remainder may overflow and wrap around. The implementation is therefore flawed, because it demands more guarantees than callers are required to provide.
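The wrap-around is easy to reproduce in isolation. In the sketch below (both function names are invented), wrapping_sub makes explicit what `len - mid` does in a build without overflow checks, while checked_sub reports an out-of-range mid instead. In a debug build a plain `len - mid` would panic rather than wrap.

```rust
/// Length of the `mid..` part computed with explicit wrap-around;
/// yields a huge bogus length when `mid > len`.
fn remainder_wrapping(len: usize, mid: usize) -> usize {
    len.wrapping_sub(mid)
}

/// The same computation, but an out-of-range `mid` is reported as `None`.
fn remainder_checked(len: usize, mid: usize) -> Option<usize> {
    len.checked_sub(mid)
}
```

Passing such a wrapped length to from_raw_parts_mut would describe a slice reaching far beyond the original allocation.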

We can fix this implementation by adding an assert that mid is within the bounds of the array (note that assert in Rust is always executed, even in optimized code):

    impl<T> [T] {
        pub fn split_at_mut(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            use std::slice::from_raw_parts_mut;
            // check that `mid` is within the bounds of the array:
            assert!(mid <= self.len());

            // as before, but without comments
            unsafe {
                let p: *mut T = &mut self[0];
                let q: *mut T = p.offset(mid as isize);
                let remainder = self.len() - mid;
                let left: &mut [T] = from_raw_parts_mut(p, mid);
                let right: &mut [T] = from_raw_parts_mut(q, remainder);
                (left, right)
            }
        }
    }

Here we have practically reproduced the standard library's implementation of this function (the standard library uses a few other helper tools, but the idea is the same).
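Since an inherent impl on [T] can only be written inside the standard library, a version you can compile yourself has to be a free function. The following is a sketch under that assumption; the names split_checked and split_panics are hypothetical, and as_mut_ptr is used in place of `&mut self[0]` so that an empty slice is also accepted.

```rust
use std::slice::from_raw_parts_mut;

/// Free-function sketch of the checked implementation.
fn split_checked<T>(slice: &mut [T], mid: usize) -> (&mut [T], &mut [T]) {
    let len = slice.len();
    // `assert!` is kept in optimized builds, unlike `debug_assert!`.
    assert!(mid <= len, "mid out of bounds");
    let p = slice.as_mut_ptr();
    // SAFETY: the assert guarantees mid <= len, so both ranges lie inside
    // the original slice and do not overlap.
    unsafe { (from_raw_parts_mut(p, mid), from_raw_parts_mut(p.add(mid), len - mid)) }
}

/// Returns true if splitting a three-element array at `mid` panics.
fn split_panics(mid: usize) -> bool {
    std::panic::catch_unwind(move || {
        let mut data = [1, 2, 3];
        let _ = split_checked(&mut data, mid);
    })
    .is_err()
}
```

An out-of-range mid now ends in a panic instead of undefined behaviour, which is exactly what the safe signature promises.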

Extending the boundaries of abstraction

Of course, we might really want to assume that mid is within bounds and do without this check. We cannot do that in split_at_mut itself, because it is part of the standard library. However, one can imagine a helper method for callers that can vouch for this assumption themselves, sparing them the cost of a bounds check at run time. In that case, split_at_mut relies on the caller to guarantee that mid is within the bounds of the array. This means that split_at_mut is no longer safe to call, because it places additional requirements on its inputs that must hold in order to guarantee memory safety.

Rust lets you express that an entire function is unsafe by putting the unsafe keyword in the function's signature. Once you do that, the unsafety is no longer an internal implementation detail of the function; it is part of the function's interface. So we can write a variant of split_at_mut, called split_at_mut_unchecked, which does not check that mid is within bounds:

    impl<T> [T] {
        // Here the function itself is declared `unsafe`. Calling it
        // is an `unsafe` action for the calling code, because the caller
        // must guarantee the invariant `mid <= self.len()`.
        pub unsafe fn split_at_mut_unchecked(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            use std::slice::from_raw_parts_mut;
            let p: *mut T = &mut self[0];
            let q: *mut T = p.offset(mid as isize);
            let remainder = self.len() - mid;
            let left: &mut [T] = from_raw_parts_mut(p, mid);
            let right: &mut [T] = from_raw_parts_mut(q, remainder);
            (left, right)
        }
    }

When a fn is declared unsafe, as above, calling it also becomes an unsafe action. This means that whoever writes the calling code must read the function's documentation and make sure that all of its conditions are met. In this particular case, the calling code must make sure that mid <= self.len().
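As a sketch of a caller that can vouch for the condition without a run-time check, consider splitting a slice in the middle: len / 2 can never exceed len. Both function names below are hypothetical, and the unchecked function is written as a free function so that the example compiles outside the standard library.

```rust
use std::slice::from_raw_parts_mut;

/// # Safety
/// The caller must guarantee `mid <= slice.len()`.
unsafe fn split_unchecked<T>(slice: &mut [T], mid: usize) -> (&mut [T], &mut [T]) {
    let len = slice.len();
    let p = slice.as_mut_ptr();
    // SAFETY: relies entirely on the caller's guarantee that mid <= len.
    unsafe { (from_raw_parts_mut(p, mid), from_raw_parts_mut(p.add(mid), len - mid)) }
}

/// Safe helper that needs no bounds check at run time.
fn split_in_half<T>(slice: &mut [T]) -> (&mut [T], &mut [T]) {
    let mid = slice.len() / 2;
    // SAFETY: mid = len / 2 <= len holds for every possible length.
    unsafe { split_unchecked(slice, mid) }
}
```

The `// SAFETY:` comment at the call site records why the documented condition holds, which is the part the compiler cannot check.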

In terms of abstraction boundaries, declaring a function unsafe means that it is not part of the "safe" area of Rust, where the compiler itself catches errors through static analysis at compile time. Instead, the function becomes part of the unsafe abstraction of the code that calls it.

Using split_at_mut_unchecked, we can change the implementation of split_at_mut so that it performs the necessary check and then calls split_at_mut_unchecked:

    impl<T> [T] {
        pub fn split_at_mut(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            assert!(mid <= self.len());

            // By placing the `unsafe` block in the function, we declare that we
            // know the additional conditions imposed by `split_at_mut_unchecked`
            // are satisfied, and therefore calling this function is a safe action.
            unsafe {
                self.split_at_mut_unchecked(mid)
            }
        }

        // **NB:** requires that `mid <= self.len()`.
        pub unsafe fn split_at_mut_unchecked(&mut self, mid: usize) -> (&mut [T], &mut [T]) {
            ... // as before.
        }
    }

Unsafe abstractions and privacy

Although nothing in the language explicitly links the privacy rules to the boundaries of unsafe abstractions, the two are naturally connected. This is because privacy lets you control which code is able to modify the fields of your data, and that is the basic building block for constructing unsafe abstractions.

We noted earlier that the Vec type in the standard library is implemented using unsafe code. This would not be possible without privacy. If you look at the definition of Vec, you will see that it looks something like this:

    pub struct Vec<T> {
        pointer: *mut T, // pointer to the start of the allocated memory
        capacity: usize, // the amount of allocated memory
        length: usize,   // the amount of initialized memory
    }

The implementation of Vec carefully maintains the invariant that the pointer, and the first length elements it refers to, are always valid. If length were a public (pub) field, this invariant could not be upheld: any external caller could change the length of the Vec to an arbitrary value.
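The same pattern can be shown on a much smaller scale than Vec. The module below is a hypothetical sketch (small_buf and SmallBuf are invented names): the unsafe get_unchecked call is sound only because len is private, so no code outside the module can break the invariant len <= 8.

```rust
mod small_buf {
    /// Invariant: `len <= 8`. Only the methods below can change `len`.
    pub struct SmallBuf {
        data: [u32; 8],
        len: usize, // private on purpose
    }

    impl SmallBuf {
        pub fn new() -> Self {
            SmallBuf { data: [0; 8], len: 0 }
        }

        /// Appends a value; returns false when the buffer is full.
        pub fn push(&mut self, value: u32) -> bool {
            if self.len == self.data.len() {
                return false;
            }
            self.data[self.len] = value;
            self.len += 1;
            true
        }

        /// Skips the bounds check, relying on the invariant `len <= 8`.
        pub fn as_slice(&self) -> &[u32] {
            // SAFETY: `len` is private and `push` never lets it exceed 8.
            unsafe { self.data.get_unchecked(..self.len) }
        }
    }
}
```

If len were declared pub, any caller could set it to 100, and as_slice would then hand out a slice reaching past the end of data.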

For this reason, the boundaries of unsafety tend to fall into one of two categories:
