Can a String and its duplicate share the same underlying memory? Is there copy-on-write in Ruby?
I have a large, frozen String and I want to change its encoding, but I don't want to copy the whole String just to do that. For context, I need to pass the value to a Google Protocol Buffer field that has the bytes type and only accepts Encoding::ASCII_8BIT.
big_string.freeze
MyProtobuf::SomeMessage.new(
  # I would prefer not to have to copy the whole string just to
  # change the encoding.
  value: big_string.dup.force_encoding(Encoding::ASCII_8BIT)
)
It seems to work just fine for me (using MRI/YARV 1.9, 2.x, 3.x); the duplicate shares the original's buffer instead of copying it:
require 'objspace'
big_string = Random.bytes(1_000_000).force_encoding(Encoding::UTF_8)
big_string.encoding #=> #<Encoding:UTF-8>
big_string.bytesize #=> 1000000
ObjectSpace.memsize_of(big_string) #=> 1000041
dup_string = big_string.dup.force_encoding(Encoding::ASCII_8BIT)
dup_string.encoding #=> #<Encoding:ASCII-8BIT>
dup_string.bytesize #=> 1000000
ObjectSpace.memsize_of(dup_string) #=> 40
Those 40 bytes are just the size of the object slot (RVALUE) itself in Ruby; the duplicate does not hold its own copy of the 1,000,000 bytes of string data.
Note that instead of dup / force_encoding(Encoding::ASCII_8BIT) there's also String#b, which returns a copy in binary encoding right away.
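As a minimal sketch of the String#b variant: the helper name to_bytes_field and the sample data are hypothetical and only for illustration; String#b and ObjectSpace.memsize_of are the real calls. The memsize figure depends on the Ruby version and platform, so it is printed rather than stated.

```ruby
require 'objspace'

# Hypothetical helper (not part of any protobuf API): returns a
# binary-encoded (ASCII-8BIT) view of str without touching str itself.
def to_bytes_field(str)
  str.b
end

# Illustrative data; any large frozen String should behave the same way.
big_string = ("a" * 1_000_000).force_encoding(Encoding::UTF_8).freeze
binary = to_bytes_field(big_string)

puts binary.encoding == Encoding::ASCII_8BIT  # true
puts binary.bytesize                          # 1000000
puts binary.frozen?                           # false, the copy is a new unfrozen object
puts big_string.encoding == Encoding::UTF_8   # true, the original keeps its encoding

# In MRI this is expected to be far smaller than bytesize, because the
# buffer is shared; the exact number varies by version and platform.
puts ObjectSpace.memsize_of(binary)
```

The result could then be passed as value: in the protobuf constructor in place of the dup / force_encoding chain.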
For more in-depth information, here's a blog post from 2012 (Ruby 1.9) about copy-on-write / shared strings in Ruby:
From the author's book Ruby Under a Microscope (p. 265):
Internally, both JRuby and MRI use an optimization called copy-on-write for strings and other data. This trick allows two identical string values to share the same data buffer, which saves both memory and time because Ruby avoids making separate copies of the same string data unnecessarily.
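To see the "write" half of copy-on-write, here is a small sketch; the helper name append_to_duplicate is hypothetical. The duplicate can be modified freely while the frozen original stays untouched. In MRI the duplicate is expected to get its own buffer only at the first modification, but that is an implementation detail that this code does not observe directly.

```ruby
# Hypothetical helper: duplicates a string and appends to the duplicate.
def append_to_duplicate(original, suffix)
  copy = original.dup  # cheap in MRI: expected to share the original's buffer
  copy << suffix       # first write: the copy now needs its own data
  copy
end

original = ("x" * 1_000_000).freeze
copy = append_to_duplicate(original, "!")

puts original.bytesize    # 1000000, the original is unchanged
puts copy.bytesize        # 1000001
puts copy.end_with?("!")  # true
puts original.frozen?     # true
puts copy.frozen?         # false, dup does not carry over the frozen state
```

So sharing is safe even for a frozen original: no write to the duplicate can leak back into it.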