tcl

A string that represents an array: Using list commands versus iterating through the string?


This might be a rather stupid question but I'd like to know anyway.

I've been storing, in SQLite, a desired row order in a JSON array of integers such as [5,42,14,9]; and wanted to insert some new elements and it appears that SQLite's json_insert permits appending only, unlike MySQL's json_array_insert. So, I figured I'd just make a user-defined function to do this; and, because I don't need to do anything to the JSON array (which is just a string apart from the methods used on it) other than insert some integers separated by commas and/or remove a few, I figured why take the string range from 1 to end-1, split it into a list, insert some items, replace some items, and join into a string again. That seemed like a lot of work for a simple string operation.

So, I thought I'd just step through the string similar to as done in C and count the commas as indications of the array index and then append ranges of the string, excluding the elements to be removed and including the new elements. I even tracked the last known index and comma location to sort of "optimize" the search rather than always starting at either end. It works and I thought it was fairly fast, even on a list of 100,000 integers, especially when the new activity took place around the last edited index.

After that, I decided to compare it to using the list commands and don't quite understand.

In the first two blocks of results at the bottom of this post, the information on the last edited element is

set orderKeysLastIdx 24999
set orderKeysLastChar 138884

but in the third block, this information is not used and the best that can be done is to search down from the top, since 57,000 is closer to 100,000 than 0.

The only time my litte method is quicker than the list commands is when the new edit is close to the last edit: 1408 versus 10220 microseconds. After the distance increases or the information is not known, the list commands are far quicker.

I realize that the developers of Tcl are experts and I'm a hack, and native methods are quicker than script methods, but that's not my question. My question is, Why is doing what appears to be a lot of work in splitting the string on the comma, manipulating it a bit, and joining again, so quick and consistent (8-10ms) regardless of the location in the list, relative to iterating once through the string until the desired index/indices are located? Am I doing/coding it wrong or just wrong to try it at all in that, regardless of how much work the list commands appear to do, they'll always be quicker?

Thank you for considering my question.

The string looping I attempted is as below. "Optimize" is a bit of an overstatement and just uses the last edited index if closer than 0 or the array length.

ArraySliceOptimize $addStart
if { $dir > 0 } {
  while {
         [incr i] < $len
      && [incr comma [expr {[string index $arrString $i] eq {,}}]] < $addStart } {}
  set as $i
} else {
  if { $comma == 0 } {
    set as $len
    return
  } else {
    while {
           [incr i -1] > 0
        && [incr comma -[expr {[string index $arrString $i] eq {,}}]] > 0 } {}
    set as $i
  }
}

And based on the values found, the new string is built similar to:

append newArrayString [string range $arrString $de+1 $as-2],
append newArrayString [join $elements {,}]
append newArrayString [string range $arrString $as-1 end]

Results:

set starttime [clock microseconds]
puts [ArraySplice orderKeys 25042 5 25500 {abc efg hij klm}]
puts "elapsed: [expr {[clock microseconds] - $starttime}]"

set starttime [clock microseconds]
set strList [split [string range $orderKeys 1 end-1] {,}]
puts [join [lrange $strList 25042 25046] {,}]
set orderKeys \[[join [linsert [lreplace $strList 25042 25046] 25500 abc def hij klm] {,}]\]
puts "elapsed: [expr {[clock microseconds] - $starttime}]"

# Search up from orderKeysLastIdx.
# 25042,25043,25044,25045,25046
# elapsed: 1498
# 25042,25043,25044,25045,25046
# elapsed: 10220

set starttime [clock microseconds]
puts [ArraySplice orderKeys 57000 5 65500 {abc efg hij klm}]
puts "elapsed: [expr {[clock microseconds] - $starttime}]"

set starttime [clock microseconds]
set strList [split [string range $orderKeys 1 end-1] {,}]
puts [join [lrange $strList 57000 57004] {,}]
set orderKeys \[[join [linsert [lreplace $strList 57000 57004] 65500 abc def hij klm] {,}]\]
puts "elapsed: [expr {[clock microseconds] - $starttime}]"

# Search up from orderKeysLastIdx.
# 57000,57001,57002,57003,57004
# elapsed: 31961
# 57000,57001,57002,57003,57004
# elapsed: 8761

set starttime [clock microseconds]
set strList [split [string range $orderKeys 1 end-1] {,}]
puts [join [lrange $strList 57000 57004] {,}]
set orderKeys \[[join [linsert [lreplace $strList 57000 57004] 65500 abc def hij klm] {,}]\]
puts "elapsed: [expr {[clock microseconds] - $starttime}]"

# No info. : Search down from top.
# 57000,57001,57002,57003,57004
# elapsed: 90081
# 57000,57001,57002,57003,57004
# elapsed: 8977



Solution

  • Above a certain point of complexity, it's faster to perform manipulations on the information model of a value rather than the actual string representation of it. The way Tcl handles values internally reflects that. The runtime remembers the interpretation of a value that it last used, as a string or an integer or a list or ... These interpretations are very similar to types in other languages, but the difference with Tcl is that they don't represent the final word on what a value is. Interpretations can change, and this can be useful.

    Strings have three different interpretations that can be present. One is the UTF-8 representation, which is useful as a lingua franca. The other two used a fixed width character model that permits efficient access and modification: one is as an array of Tcl_UniChar (usually unsigned short; corresponds to UCS-2 in host endianness), and the other is as an array of unsigned char (only useful for binary data, but works great for that). Those array forms work really well and permit efficient indexing, necessary for how many high-speed string manipulation algorithms work, but don't interoperate quite so well, whereas UTF-8 strings can usually be fed into all sorts of APIs and work without modification. The downside of UTF-8 is exactly that its characters occupy a variable number of bytes; indexing into it is expensive.

    You normally don't need to care about the differences; Tcl uses the ones that make most sense for the API operations you call.

    Inserts into the middle of strings are not a heavily optimized case; they're comparatively rare (appends are different, and deeply optimized). Inserts into lists are a bit more common, and are a bit faster (mostly because they can be done with less data movement). The downside is that values need to be converted into a list and back; the database itself doesn't help. Do enough manipulation and the overhead becomes worth it.

    My thoughts at this point on what you're really doing though are that you should either look at the rl_json package for a more efficient way to handle JSON in Tcl (though I don't know how good it is at mid-array insertion), or look at changing the data model in SQLite so you store the list not as JSON but rather as values in their own table.