ruby, copy-on-write

Can I change the encoding of a frozen String without copying it?


Can a String and its duplicate share the same underlying memory? Is there copy-on-write in Ruby?

I have a large, frozen String and I want to change its encoding, but I don't want to copy the whole String just to do that. For context, I need to pass the value to a Google Protocol Buffers message whose field has the bytes type, which only accepts strings encoded as Encoding::ASCII_8BIT.

big_string.freeze

MyProtobuf::SomeMessage.new(
  # I would prefer not to have to copy the whole string just to
  # change the encoding.
  value: big_string.dup.force_encoding(Encoding::ASCII_8BIT)
)

Solution

  • It seems to work just fine for me on MRI/YARV (1.9, 2.x, 3.x); dup does not appear to copy the underlying bytes:

    require 'objspace'
    
    big_string = Random.bytes(1_000_000).force_encoding(Encoding::UTF_8)
    
    big_string.encoding #=> #<Encoding:UTF-8>
    big_string.bytesize #=> 1000000
    ObjectSpace.memsize_of(big_string) #=> 1000041
    
    
    dup_string = big_string.dup.force_encoding(Encoding::ASCII_8BIT)
    
    dup_string.encoding #=> #<Encoding:ASCII-8BIT>
    dup_string.bytesize #=> 1000000
    ObjectSpace.memsize_of(dup_string) #=> 40
    

    Those 40 bytes are just the size of the Ruby object slot (RVALUE) itself on 64-bit MRI; the duplicate does not hold its own copy of the 1,000,000 bytes.

    Note that instead of dup followed by force_encoding(Encoding::ASCII_8BIT), you can call String#b, which returns a copy in binary (ASCII-8BIT) encoding in one step.
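
    A minimal sketch of the String#b variant, assuming MRI. The 1,000,000-byte string and the variable names are only illustrative. Whether the bytes are shared rather than copied is an implementation detail, so the sketch prints the measured size instead of assuming a value.

    ```ruby
    require 'objspace'

    # Illustrative data: a large, frozen UTF-8 string.
    big_string = ("x" * 1_000_000).force_encoding(Encoding::UTF_8).freeze

    # String#b returns a new String tagged as binary (ASCII-8BIT).
    binary = big_string.b

    binary.encoding      # binary (ASCII-8BIT)
    binary.frozen?       # false: the result is a fresh, unfrozen object
    big_string.encoding  # still UTF-8: the frozen original is untouched

    # Size attributed to the new object; a small value suggests a shared buffer.
    puts ObjectSpace.memsize_of(binary)
    ```

    Because the result is not frozen, it can be passed to code that expects a mutable binary string while the original stays frozen.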

    For more in-depth information, here's a blog post from 2012 (Ruby 1.9) about copy-on-write / shared strings in Ruby:

    From the author's book Ruby Under a Microscope: (p. 265)

    Internally, both JRuby and MRI use an optimization called copy-on-write for strings and other data. This trick allows two identical string values to share the same data buffer, which saves both memory and time because Ruby avoids making separate copies of the same string data unnecessarily.
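
    The "write" half of copy-on-write can be observed in a similar way. This is a rough sketch, assuming MRI and illustrative variable names: the duplicate stays small until its first mutation, at which point Ruby gives it a private buffer and the frozen original is left unchanged. Exact sizes differ between Ruby versions, so they are printed rather than hard-coded.

    ```ruby
    require 'objspace'

    original = Random.bytes(1_000_000).freeze
    copy = original.dup # unfrozen duplicate of the frozen original

    # Size attributed to the duplicate before any write.
    shared_size = ObjectSpace.memsize_of(copy)

    # First write: flip the bits of byte 0, which forces a private buffer.
    copy.setbyte(0, copy.getbyte(0) ^ 0xFF)

    # Size attributed to the duplicate after the write.
    private_size = ObjectSpace.memsize_of(copy)

    puts "before write: #{shared_size} bytes"
    puts "after write:  #{private_size} bytes"

    # The original is unaffected by the write to the duplicate.
    puts original.getbyte(0) == copy.getbyte(0) # false
    ```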