Tags: android, xml-parsing, garbage-collection, android-xmlpullparser, objectbox-android

Parsing a large XML file with XmlPullParser or a SAX parser on Android causes lag


I have the following problem: in my Android TV app, users can add (and later update) EPG sources as .xml, .gz or .xz files (.gz and .xz are decompressed to .xml). The user adds a URL for a file, which is downloaded, parsed and saved to the ObjectBox database. I tried both XmlPullParser and a SAX parser, and everything worked fine. For an XML file of about 50 MB and 700,000 lines (350 channels and about 80,000 programs) it took:

  • XmlPullParser -> 50 seconds on the emulator, 1 min 30 sec directly on my TV
  • SAX parser -> 55 seconds on the emulator, 1 min 50 sec directly on my TV

I would have preferred it to be a bit faster, but it was OK. Then I realized that if I update the EPG source (download the XML again, parse it, and add the new EPG data to the ObjectBox database) and navigate in my app in the meantime,

  1. it takes much longer (several minutes for both XmlPullParser and the SAX parser), and

  2. the app begins to lag while I use it, and on my TV it also crashed after some time, probably for memory reasons. If I updated the EPG source without doing anything else in my app, that didn't happen.

I noticed two things when investigating with the Profiler:

  1. While parsing (especially the programs), the garbage collector runs very often, between 20 and 40 times in 5 seconds.
  2. When the process is finished, the Java part in the memory profiler jumps up to 200 MB, and it takes some time before it is garbage-collected.

I am not sure, but I read that constantly triggering the garbage collector could cause the lag in my app. So I tried to minimize object creation, but it didn't change anything (or maybe I didn't do it correctly). I also tested the process without creating the EpgDataOB database objects, so no EPG data was added to the database. I could still see the many garbage collector runs in the Profiler, so my parsing code seems to be the problem.
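
One allocation that is easy to remove from the per-program path is the date formatter: both versions below create two or three SimpleDateFormat instances for every program. A minimal sketch of reusing a single instance follows. The class name XmltvTimeParser is hypothetical, Locale.US is my assumption for a purely numeric pattern, and whether this noticeably reduces the GC activity would have to be measured.

```kotlin
import java.text.ParseException
import java.text.SimpleDateFormat
import java.util.Locale

// Hypothetical helper: one formatter instance is reused for every program.
// SimpleDateFormat is not thread-safe, so use one instance per parsing thread.
class XmltvTimeParser {
    private val format = SimpleDateFormat("yyyyMMddHHmmss Z", Locale.US)

    /** Returns epoch milliseconds, or -1 if the value is missing or malformed. */
    fun parse(value: String?): Long {
        if (value == null) return -1L
        return try {
            format.parse(value)?.time ?: -1L
        } catch (e: ParseException) {
            -1L
        }
    }
}
```

The parser would be created once before the parse loop and then called for both the start and stop attributes of each program.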

The only things that helped were adding a delay of 100 ms after each parsed program (obviously not a viable solution, as it increases the processing time to hours) or reducing the batch size (which also increases the processing time; for example, a batch size of 500 gives a processing time of 2 min 10 sec on the emulator with the garbage collector running about 6-10 times in 5 seconds, and a batch size of 100 gives nearly 3 min on the emulator with 4-5 GC runs in 5 seconds).

I'll post both my versions.

XmlPullParser

Repository code:

 var currentChannel: Channel? = null
    var epgDataBatch = mutableListOf<EpgDataOB>()
    val batchSize = 10000

    suspend fun parseXmlStream(
        inputStream: InputStream,
        epgSourceId: Long,
        maxDays: Int,
        minDays: Int,
        sourceUrl: String
    ): Resource<String> = withContext(Dispatchers.Default) {
        try {
            val thisEpgSource = epgSourceBox.get(epgSourceId)
            val factory = XmlPullParserFactory.newInstance()
            val parser = factory.newPullParser()
            parser.setInput(inputStream, null)
            var eventType = parser.eventType
          
            while (eventType != XmlPullParser.END_DOCUMENT) {
                when (eventType) {
                    XmlPullParser.START_TAG -> {
                        when (parser.name) {
                            "channel" -> {
                                parseChannel(parser, thisEpgSource)
                            }
                            "programme" -> {
                                parseProgram(parser, thisEpgSource)
                            }
                        }
                    }
                }
                eventType = parser.next()
            }
            if (epgDataBatch.isNotEmpty()) {
                epgDataBox.put(epgDataBatch)
            }

            assignEpgDataToChannels(thisEpgSource)

            _epgProcessState.value = ExternEpgProcessState.Success
            Resource.Success("OK")
        } catch (e: Exception) {
            Log.d("ERROR PARSING", "Error parsing XML: ${e.message}")
            _epgProcessState.value = ExternEpgProcessState.Error("Error parsing XML: ${e.message}")
            Resource.Error("Error parsing XML: ${e.message}")
        } finally {
            withContext(Dispatchers.IO) {
                inputStream.close()
            }
        }
    }

    private fun resetChannel() {
        currentChannel = Channel("", mutableListOf(), mutableListOf(), "")
    }

    private fun parseChannel(parser: XmlPullParser, thisEpgSource: EpgSource) {
        resetChannel()
        currentChannel?.id = parser.getAttributeValue(null, "id")

        while (parser.next() != XmlPullParser.END_TAG) {
            if (parser.eventType == XmlPullParser.START_TAG) {
                when (parser.name) {
                    "display-name" -> currentChannel?.displayName = mutableListOf(parser.nextText())
                    "icon" -> currentChannel?.icon = mutableListOf(parser.getAttributeValue(null, "src"))
                    "url" -> currentChannel?.url = parser.nextText()
                }
            }
        }

        val channelInDB = epgChannelBox.query(EpgSourceChannel_.chEpgId.equal("${thisEpgSource.id}_${currentChannel?.id}")).build().findUnique()
        if (channelInDB == null) {
            val epgChannelToAdd = EpgSourceChannel(
                0,
                "${thisEpgSource.id}_${currentChannel?.id}",
                currentChannel?.id ?: "",
                currentChannel?.icon,
                currentChannel?.displayName?.firstOrNull() ?: "",
                thisEpgSource.id,
                currentChannel?.displayName ?: mutableListOf(),
                true
            )
            epgChannelBox.put(epgChannelToAdd)
        } else {
            channelInDB.display_name = currentChannel?.displayName ?: channelInDB.display_name
            channelInDB.icon = currentChannel?.icon
            channelInDB.name = currentChannel?.displayName?.firstOrNull() ?: channelInDB.name
            epgChannelBox.put(channelInDB)
        }
    }

    private fun parseProgram(parser: XmlPullParser, thisEpgSource: EpgSource) {

        val start = SimpleDateFormat("yyyyMMddHHmmss Z", Locale.getDefault())
            .parse(parser.getAttributeValue(null, "start"))?.time ?: -1

        val stop = SimpleDateFormat("yyyyMMddHHmmss Z", Locale.getDefault())
            .parse(parser.getAttributeValue(null, "stop"))?.time ?: -1

        val channel = parser.getAttributeValue(null, "channel")

        val isAnUpdate = if (isUpdating) {
            epgDataBox.query(EpgDataOB_.idByAccountData.equal("${channel}_${start}_${thisEpgSource.id}")).build().findUnique() != null
        } else {
            false
        }

        if (!isAnUpdate) {
            val newEpgData = EpgDataOB(
                id = 0, 
                idByAccountData = "${channel}_${start}_${thisEpgSource.id}",
                epgId = channel ?: "",
                chId = channel ?: "",
                datum = SimpleDateFormat("yyyy-MM-dd", Locale.getDefault()).format(start),
                name = "",
                sub_title = "",
                descr = "",
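                // Lists start empty rather than null; with null, the `?.add(...)` calls below are silently skipped.
                category = mutableListOf(),
                director = mutableListOf(),
                actor = mutableListOf(),
                date = "",
                country = mutableListOf(),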
                showIcon = "",
                episode_num = "",
                rating = "",
                startTimestamp = start,
                stopTimestamp = stop,
                mark_archive = null,
                accountData = thisEpgSource.url,
                epgSourceId = thisEpgSource.id.toInt(),
                epChId = "${thisEpgSource.id}_${channel}"
            )
     
            while (parser.next() != XmlPullParser.END_TAG) {
                if (parser.eventType == XmlPullParser.START_TAG) {
                    when (parser.name) {
                        "title" -> newEpgData.name = parser.nextText()
                        "sub-title" -> newEpgData.sub_title = parser.nextText()
                        "desc" -> newEpgData.descr = parser.nextText()
                        "director" -> newEpgData.director?.add(parser.nextText())
                        "actor" -> newEpgData.actor?.add(parser.nextText())
                        "date" -> newEpgData.date = parser.nextText()
                        "category" -> newEpgData.category?.add(parser.nextText())
                        "country" -> newEpgData.country?.add(parser.nextText())
                        "episode-num" -> newEpgData.episode_num = parser.nextText()
                        "value" -> newEpgData.rating = parser.nextText()
                        "icon" -> newEpgData.showIcon = parser.getAttributeValue(null, "src") ?: ""
                    }
                }
            }

            epgDataBatch.add(newEpgData)
            if (epgDataBatch.size >= batchSize) {
                epgDataBox.put(epgDataBatch)
                epgDataBatch.clear()
            }
        }
    }

    private fun assignEpgDataToChannels(thisEpgSource: EpgSource) {
        epgChannelBox.query(EpgSourceChannel_.epgSourceId.equal(thisEpgSource.id)).build().find().forEach { epgChannel ->
            epgChannel.epgSource.target = thisEpgSource
            epgChannel.epgDataList.addAll(epgDataBox.query(EpgDataOB_.epChId.equal(epgChannel.chEpgId)).build().find())
            epgChannelBox.put(epgChannel)
        }
        epgDataBatch.clear()
    }

Sax Parser

Repository code:

suspend fun parseXmlStream(
        inputStream: InputStream,
        epgSourceId: Long,
        maxDays: Int,
        minDays: Int,
        sourceUrl: String
    ): Resource<String> = withContext(Dispatchers.Default) {
        try {
            val thisEpgSource = epgSourceBox.get(epgSourceId)
            inputStream.use { input ->
                val saxParserFactory = SAXParserFactory.newInstance()
                val saxParser = saxParserFactory.newSAXParser()
                val handler = EpgSaxHandler(thisEpgSource.id, maxDays, minDays, thisEpgSource.url, isUpdating)
                saxParser.parse(input, handler)
                if (handler.epgDataBatch.isNotEmpty()) {
                    epgDataBox.put(handler.epgDataBatch)
                    handler.epgDataBatch.clear()
                }
                _epgProcessState.value = ExternEpgProcessState.Success
                return@withContext Resource.Success("OK")
            }
        } catch (e: Exception) {
            Log.e("ERROR PARSING", "${e.message}")
            _epgProcessState.value = ExternEpgProcessState.Error("Error parsing XML: ${e.message}")
            return@withContext Resource.Error("Error parsing XML: ${e.message}")
        }
    }

Handler:

class EpgSaxHandler(
    private val epgSourceId: Long,
    private val maxDays: Int,
    private val minDays: Int,
    private val sourceUrl: String,
    private val isUpdating: Boolean
) : DefaultHandler() {

    private val epgSourceBox: Box<EpgSource>
    private val epgChannelBox: Box<EpgSourceChannel>
    private val epgDataBox: Box<EpgDataOB>


    init {
        val store = ObjectBox.store
        epgSourceBox = store.boxFor(EpgSource::class.java)
        epgChannelBox = store.boxFor(EpgSourceChannel::class.java)
        epgDataBox = store.boxFor(EpgDataOB::class.java)
    }

    var epgDataBatch = mutableListOf<EpgDataOB>()
    private val batchSize = 10000
    private var currentElement = ""
    private var currentChannel: Channel? = null
    private var currentProgram: EpgDataOB? = null
    private var stringBuilder = StringBuilder()


    override fun startElement(uri: String?, localName: String?, qName: String?, attributes: Attributes?) {
        currentElement = qName ?: ""
        when (qName) {
            "channel" -> {
                val id = attributes?.getValue("id") ?: ""
                currentChannel = Channel(id, mutableListOf(), mutableListOf(), "")
            }
            "programme" -> {

                val start = SimpleDateFormat("yyyyMMddHHmmss Z", Locale.getDefault())
                    .parse(attributes?.getValue("start") ?: "")?.time ?: -1

                val stop = SimpleDateFormat("yyyyMMddHHmmss Z", Locale.getDefault())
                    .parse(attributes?.getValue("stop") ?: "")?.time ?: -1

                val channel = attributes?.getValue("channel") ?: ""

                if (isUpdating) {
                    val existingProgram = epgDataBox.query(EpgDataOB_.idByAccountData.equal("${channel}_${start}_${epgSourceId}",)).build().findUnique()
                    if (existingProgram != null) {
                        currentProgram = null
                        return
                    }
                }
                currentProgram = EpgDataOB(
                    id = 0,
                    idByAccountData = "${channel}_${start}_${epgSourceId}",
                    epgId = channel,
                    chId = channel,
                    datum = SimpleDateFormat("yyyy-MM-dd", Locale.getDefault()).format(start),
                    name = "",
                    sub_title = "",
                    descr = "",
                    category = mutableListOf(),
                    director = mutableListOf(),
                    actor = mutableListOf(),
                    date = "",
                    country = mutableListOf(),
                    showIcon = "",
                    episode_num = "",
                    rating = "",
                    startTimestamp = start,
                    stopTimestamp = stop,
                    mark_archive = null,
                    accountData = sourceUrl,
                    epgSourceId = epgSourceId.toInt(),
                    epChId = "${epgSourceId}_$channel"
                )
            }
            "icon" -> {
                val src = attributes?.getValue("src") ?: ""
                currentChannel?.icon?.add(src)
                currentProgram?.showIcon = src
            }
            "desc", "title", "sub-title", "episode-num", "rating", "country", "director", "actor", "date", "display-name" -> {
                stringBuilder = StringBuilder()
            }
        }
    }

    override fun characters(ch: CharArray?, start: Int, length: Int) {
        ch?.let {
            stringBuilder.append(it, start, length)
        }
    }

    override fun endElement(uri: String?, localName: String?, qName: String?) {
        when (qName) {
            "channel" -> {
                currentChannel?.let { channel ->
                    val channelInDB = epgChannelBox.query(EpgSourceChannel_.chEpgId.equal("${epgSourceId}_${channel.id}")).build().findUnique()
                    if (channelInDB == null) {
                        val newChannel = EpgSourceChannel(
                            id = 0,
                            chEpgId = "${epgSourceId}_${channel.id}",
                            chId = channel.id,
                            icon = channel.icon,
                            display_name = channel.displayName,
                            name = channel.displayName.firstOrNull() ?: "",
                            epgSourceId = epgSourceId,
                            isExternalEpg = true
                        )
                        epgChannelBox.put(newChannel)
                    } else {
                        channelInDB.display_name = channel.displayName
                        channelInDB.icon = channel.icon
                        channelInDB.name = channel.displayName.firstOrNull() ?: channelInDB.name
                        epgChannelBox.put(channelInDB)
                    }
                }
                currentChannel = null
            }
            "programme" -> {
                currentProgram?.let { program ->
                    addEpgDataToBatch(program)
                }
                currentProgram = null
            }
            "desc" -> {
                currentProgram?.descr = stringBuilder.toString()
            }
            "title" -> {
                currentProgram?.name = stringBuilder.toString()
            }
            "sub-title" -> {
                currentProgram?.sub_title = stringBuilder.toString()
            }
            "episode-num" -> {
                currentProgram?.episode_num = stringBuilder.toString()
            }
            "rating" -> {
                currentProgram?.rating = stringBuilder.toString()
            }
            "country" -> {
                currentProgram?.country?.add(stringBuilder.toString())
            }
            "director" -> {
                currentProgram?.director?.add(stringBuilder.toString())
            }
            "actor" -> {
                currentProgram?.actor?.add(stringBuilder.toString())
            }
            "date" -> {
                currentProgram?.date = stringBuilder.toString()
            }
            "display-name" -> {
                currentChannel?.displayName?.add(stringBuilder.toString())
            }
        }
        currentElement = ""
    }



    private fun addEpgDataToBatch(epgData: EpgDataOB) {
        epgDataBatch.add(epgData)
        if (epgDataBatch.size >= batchSize) {
            processEpgDataBatch()
        }
    }

    private fun processEpgDataBatch() {
        if (epgDataBatch.isNotEmpty()) {
            epgDataBox.put(epgDataBatch)
            epgDataBatch.clear()
        }
    }
}
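
As a side note on the handler above: it allocates a new StringBuilder for every text element. A minimal sketch of reusing one builder instead is shown here. The TitleHandler class and the parseTitles function are hypothetical and only collect title elements; this is one way to cut a small allocation per element, not a measured fix for the GC pressure.

```kotlin
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler
import javax.xml.parsers.SAXParserFactory

// Hypothetical handler: collects the text of <title> elements, reusing one StringBuilder.
class TitleHandler : DefaultHandler() {
    val titles = mutableListOf<String>()
    private val text = StringBuilder()
    private var capturing = false

    override fun startElement(uri: String?, localName: String?, qName: String?, attributes: Attributes?) {
        if (qName == "title") {
            text.setLength(0) // reset the existing builder instead of allocating a new one
            capturing = true
        }
    }

    override fun characters(ch: CharArray?, start: Int, length: Int) {
        // Only buffer text while inside an element we care about.
        if (capturing && ch != null) text.append(ch, start, length)
    }

    override fun endElement(uri: String?, localName: String?, qName: String?) {
        if (qName == "title") {
            titles.add(text.toString())
            capturing = false
        }
    }
}

fun parseTitles(xml: String): List<String> {
    val handler = TitleHandler()
    SAXParserFactory.newInstance().newSAXParser().parse(xml.byteInputStream(), handler)
    return handler.titles
}
```

The capturing flag also keeps the whitespace between elements out of the buffer, which the original handler appends and then discards.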

So I am looking for a fast way to parse the XML data and insert it into the database without lag or crashes in my app. Is there something wrong in my code that causes the lag? Or is it simply not possible without slowing down the parsing and database inserts?

If any other code is needed, I can post it. Here is what the Memory Profiler looks like while parsing the programs with XmlPullParser: [Profiler screenshot: Parsing with XmlPullParser]

UPDATE:

Memory usage & GC -> only parsing, no database usage. I used data classes Channel & Programme to hold the parsed data, and always reused the same channel/programme: [Profiler screenshot: only parsing, no database usage]

Memory usage & GC -> parsing and creating EpgDataOB objects (no DB inserts): [Profiler screenshot: parsing and creating EpgDataOB objects]

Memory usage & GC -> parsing and adding data to the database (DB = last 10 seconds): [Profiler screenshot: parsing and adding data to the database]

Memory usage & gc -> parsing, adding data to db & manage relation epg-channel with list of EpgData with:

 private fun addEpgDataToDatabase() {
        GlobalScope.launch {
            withContext(Dispatchers.IO) {
                epgDataBatch.chunked(15000).forEach { batch ->
                    epgDataBox.put(batch)
                    epgChannelBatch.forEach { epgChannel ->
                        epgChannel.epgDataList.addAll(batch.filter { it.epChId == epgChannel.chEpgId })
                    }
                    Log.d("EPGPARSING ADD TO DB", "OK")
                    delay(500)
                }
                epgDataBatch.clear()
            }
        }
    }

[Profiler screenshot: parsing, adding data to DB & managing the relation between EPG channel and its list of EpgData]

New code for putting the parsed data into the database (also tested 3 times on the TV; it runs much better than with the code in my question). Adding the whole epgDataBatch (a mutableListOf) to the database with one put is even a little faster.

 private fun addEpgDataToDatabase() {
        epgDataBatch.chunked(30000).forEach { batch ->
            epgDataBox.store.runInTx {
                epgDataBox.put(batch)
                epgDataBox.closeThreadResources()
            }
        }
        addEpgDataToChannel()
    }

    private fun addEpgDataToChannel() {
        epgChannelBox.store.runInTx {
            for (epgCh in epgChannelBatch) {
                epgCh.epgDataList.addAll(epgDataBatch.filter { it.epChId == epgCh.chEpgId })
            }
            epgChannelBox.put(epgChannelBatch)
            epgChannelBox.closeThreadResources()
        }
        epgChannelBatch.clear()
        epgDataBatch.clear()
    }
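
One thing that stands out in addEpgDataToChannel is that epgDataBatch.filter runs once per channel, so with 350 channels and about 80,000 programs the whole list is scanned 350 times. A minimal sketch of grouping once and looking up per channel is below. Programme and EpgChannel are hypothetical stand-ins with only the fields used here, not the real ObjectBox entities.

```kotlin
// Hypothetical stand-ins for the question's entities; only the fields used here.
data class Programme(val epChId: String, val title: String)
data class EpgChannel(val chEpgId: String, val programmes: MutableList<Programme> = mutableListOf())

fun assignProgrammes(channels: List<EpgChannel>, programmes: List<Programme>) {
    // One pass over all programmes builds an index keyed by channel id...
    val byChannel: Map<String, List<Programme>> = programmes.groupBy { it.epChId }
    // ...so each channel needs a single map lookup instead of a full scan.
    for (channel in channels) {
        byChannel[channel.chEpgId]?.let { channel.programmes.addAll(it) }
    }
}
```

With the real entities, the lookup result would be passed to epgCh.epgDataList.addAll(...) in place of the filter call.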

Solution

  • Database inserts can be costly if you do a lot of them, inserting your parsed XML data one data object after another. From the ObjectBox docs:

    This is because it uses blocking I/O and file locks to write the database to disk, as each put runs in an implicit transaction.

    Thus you can speed up the whole process by speeding up the database inserts.

    You can batch the data into a list and put (insert) it all in one go, so that it is written in only one transaction. This costs more memory but is faster.

    Alternatively, ObjectBox has BoxStore.runInTx(), which takes a Runnable and performs multiple puts in a single transaction.

    ObjectBox seems to want you to avoid simply beginning a transaction at the start of the XML parsing and ending it when parsing has finished, although it does have an internal low-level method to do this.

    Note that this also applies to other file-based databases such as SQLite.
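
The batching idea above can be sketched independently of the database. The Batcher class below is a hypothetical helper, not part of ObjectBox: it buffers items and hands each full batch to a flush callback, so the number of write transactions is roughly the item count divided by the batch size.

```kotlin
// Hypothetical helper: buffers items and passes them to `flush` in fixed-size batches.
class Batcher<T>(private val batchSize: Int, private val flush: (List<T>) -> Unit) {
    private val buffer = ArrayList<T>(batchSize)

    fun add(item: T) {
        buffer.add(item)
        if (buffer.size >= batchSize) flushNow()
    }

    /** Call once after parsing to write whatever is left in the buffer. */
    fun finish() {
        if (buffer.isNotEmpty()) flushNow()
    }

    private fun flushNow() {
        flush(buffer.toList()) // copy, so the callback may keep the list after the buffer is cleared
        buffer.clear()
    }
}
```

In the question's code the callback would be something like { batch -> epgDataBox.put(batch) }. The batch size is the trade-off mentioned above: larger batches mean fewer transactions but more parsed objects held in memory at once.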