
Parse Gedcom to SQLite-Database


I am a hobby Xojo user. I want to import a Gedcom file into my program, specifically into a SQLite database.

Structure of the Database

Tables

Persons

 - ID: Integer
 - Gender: Varchar // M, F or U
 - Surname: Varchar
 - Givenname: Varchar

Relationships

 - ID: Integer
 - Husband: Integer
 - Wife: Integer

Children

 - ID: Integer
 - PersonID: Integer
 - FamilyID: Integer
 - Order: Integer

PersonEvents

 - ID: Integer
 - PersonID: Integer
 - EventType: Varchar // e.g. BIRT, DEAT, BURI, CHR
 - Date: Varchar
 - Description: Varchar
 - Order: Integer

RelationshipEvents

 - ID: Integer
 - RelationshipID: Integer
 - EventType: Varchar // e.g. MARR, DIV, DIVF
 - Date: Varchar
 - Description: Varchar
 - Order: Integer
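For reference, the structure above could be created roughly as follows. This is a minimal sketch using Python's built-in sqlite3 module rather than Xojo. It assumes that ID is the primary key of every table and that Description holds text in both event tables; neither is stated above. Note that Order is a reserved word in SQL, so that column name has to be quoted.

```python
import sqlite3

# Assumption: ID is the primary key everywhere; Description is text.
SCHEMA = """
CREATE TABLE Persons (
    ID INTEGER PRIMARY KEY,
    Gender VARCHAR,          -- M, F or U
    Surname VARCHAR,
    Givenname VARCHAR
);
CREATE TABLE Relationships (
    ID INTEGER PRIMARY KEY,
    Husband INTEGER,
    Wife INTEGER
);
CREATE TABLE Children (
    ID INTEGER PRIMARY KEY,
    PersonID INTEGER,
    FamilyID INTEGER,
    "Order" INTEGER          -- quoted because ORDER is an SQL keyword
);
CREATE TABLE PersonEvents (
    ID INTEGER PRIMARY KEY,
    PersonID INTEGER,
    EventType VARCHAR,       -- e.g. BIRT, DEAT, BURI, CHR
    Date VARCHAR,
    Description VARCHAR,
    "Order" INTEGER
);
CREATE TABLE RelationshipEvents (
    ID INTEGER PRIMARY KEY,
    RelationshipID INTEGER,
    EventType VARCHAR,       -- e.g. MARR, DIV, DIVF
    Date VARCHAR,
    Description VARCHAR,
    "Order" INTEGER
);
"""

def create_schema(conn):
    """Create all five tables on the given connection."""
    conn.executescript(SCHEMA)

conn = sqlite3.connect(":memory:")  # in-memory database, just for the sketch
create_schema(conn)
```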

I wrote a working Gedcom line parser. It splits a single Gedcom line into:

 - Level As Integer
 - Reference As String // optional
 - Tag As String
 - Value As String // optional

I load the Gedcom file via TextInputStream (this works fine). Now I need to parse every line.

Gedcom-Individual-Sample

0 @I1@ INDI
1 NAME George /Clooney/
2 GIVN George
2 SURN Clooney
1 BIRT
2 DATE 6 MAY 1961
2 PLAC Lexington, Fayette County, Kentucky, USA

As you can see, the level numbers describe a tree structure. So I thought the best and simplest way would be to parse the file into separate objects (PersonObj, RelationshipObj, EventObj etc.) inside a JSONItem, because there it is easy to get the children of a node. Later on, I can simply read the nodes and their child nodes to create the database entries. But I don't know how to write such an algorithm.

Can anyone help me, please?


Solution

  • To parse the Gedcom lines quickly, try these ideas:

    Read the entire file into a String and split the lines up:

    dim f as FolderItem = ...
    dim fileContent as String = TextInputStream.Open(f).ReadAll
    fileContent = fileContent.DefineEncoding (Encodings.WindowsLatin1)
    dim lines() as String = ReplaceLineEndings(fileContent,EndOfLine).Split(EndOfLine)
    

    Parse every line using a RegEx to extract its three columns:

    dim re as new RegEx
    re.SearchPattern = "^(\d+) ([^ ]+)(.*)$"
    for each line as String in lines
      dim rm as RegExMatch = re.Search (line)
      if rm = nil then
        // No match: the line is blank or malformed.
        break    // pauses in the debugger when run from the IDE; ignored in a built app
        continue // -> onward with next line
      end
      dim level as Integer = rm.SubExpressionString(1).Val
      dim code as String = rm.SubExpressionString(2)
      dim value as String = rm.SubExpressionString(3).Trim
      // ... process the level, code and value
    next
    

    The RegEx search pattern means that it looks for the start of the line ("^"), then for one or more digits ("\d+"), a blank, one or more non-blank chars ("[^ ]+"), and finally any remaining chars, possibly none (".*"), before the end of the string ("$"). The parentheses around each of these groups are for extracting their results with SubExpressionString() afterwards.
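To see what the pattern returns for typical lines, here is a small sketch that applies the same search pattern with Python's re module instead of Xojo's RegEx. The function name split_line is made up for this illustration.

```python
import re

# Same search pattern as in the Xojo code above.
PATTERN = re.compile(r"^(\d+) ([^ ]+)(.*)$")

def split_line(line):
    """Return (level, code, value), or None if the line does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    return int(m.group(1)), m.group(2), m.group(3).strip()

for line in ["0 @I1@ INDI", "1 NAME George /Clooney/", "1 BIRT", ""]:
    print(repr(line), "->", split_line(line))
```

Note that for a record line such as "0 @I1@ INDI" the second column holds the reference (@I1@) and the tag (INDI) ends up in the value, so record lines need one extra step to separate the two.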

    The check for rm = nil is true whenever the line does not contain at least a number, a blank and at least one more character. This can happen if the Gedcom file is malformed or contains blank lines.
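Once level, code and value are available, the level number alone is enough to build the tree the question asks about: keep a stack of open nodes, and for each new line pop the stack until its top has a lower level than the line. The following is a minimal sketch of that idea in Python, with nested dicts standing in for JSONItem. The names build_tree and child_value are invented here, and the pattern is a slightly extended variant (an assumption, not the one above) that also captures the optional @...@ reference.

```python
import re

# Assumed variant of the pattern: level, optional @ref@, tag, optional value.
LINE = re.compile(r"^(\d+) (?:(@[^@ ]+@) )?([^ ]+)(?: (.*))?$")

def build_tree(lines):
    """Turn Gedcom lines into nested dicts, using the level numbers."""
    root = {"level": -1, "tag": "ROOT", "children": []}
    stack = [root]  # stack[-1] is the most recently opened node
    for line in lines:
        m = LINE.match(line.strip())
        if m is None:
            continue  # blank or malformed line
        node = {
            "level": int(m.group(1)),
            "ref": m.group(2) or "",
            "tag": m.group(3),
            "value": m.group(4) or "",
            "children": [],
        }
        # Close every open node that is not an ancestor of this line.
        while stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root

def child_value(node, tag):
    """Value of the first direct child with the given tag, or ''."""
    for child in node["children"]:
        if child["tag"] == tag:
            return child["value"]
    return ""

sample = """0 @I1@ INDI
1 NAME George /Clooney/
2 GIVN George
2 SURN Clooney
1 BIRT
2 DATE 6 MAY 1961
2 PLAC Lexington, Fayette County, Kentucky, USA""".splitlines()

tree = build_tree(sample)
person = tree["children"][0]
name = next(c for c in person["children"] if c["tag"] == "NAME")
print(person["ref"], child_value(name, "GIVN"), child_value(name, "SURN"))
```

Each level-0 node under the root is then one record (INDI, FAM and so on), and its children can be read to fill the database tables. The same approach should carry over to Xojo with an array of JSONItem used as the stack.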

    Hope this helps.