mysql, scala, docker, apache-spark, tidb

Dropped rows in Spark when modifying database in MySQL


I've been following the 5-minute how-to for setting up an HTAP database with tidb_tispark, and everything works until I get to the section Launch TiSpark. My first issue occurs when executing the line:

docker-compose exec tispark-master  /opt/spark-2.1.1-bin-hadoop2.7/bin/spark-shell

But I got around that by changing the Spark version in the path to the one I found inside the container:

docker-compose exec tispark-master  /opt/spark-2.3.3-bin-hadoop2.7/bin/spark-shell

My second issue occurs when executing the three line block:

import org.apache.spark.sql.TiContext
val ti = new TiContext(spark)
ti.tidbMapDatabase("TPCH_001")

When I run the last statement, I get the following output:

scala> ti.tidbMapDatabase("TPCH_001")
2019-07-11 16:14:32 WARN  General:96 - Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/spark/jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/spark-2.3.3-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar."
2019-07-11 16:14:32 WARN  General:96 - Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/spark/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/spark-2.3.3-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar."
2019-07-11 16:14:32 WARN  General:96 - Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/opt/spark/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/opt/spark-2.3.3-bin-hadoop2.7/jars/datanucleus-rdbms-3.2.9.jar."
2019-07-11 16:14:36 WARN  ObjectStore:568 - Failed to get database global_temp, returning NoSuchObjectException

This doesn't prevent me from running the query:

spark.sql("select * from nation").show(30);

But when I follow the further steps of the tutorial to modify the db from MySQL, the changes are not reflected immediately in Spark. Furthermore, at some point in the future (I believe > 5 minutes later), the row that was modified stops showing up in Spark SQL queries.

I'm rather new to this kind of setup and don't really know how to debug this issue. Searches for the warnings I received weren't illuminating.

I don't know if it's helpful, but when I connect with the MySQL client, this is the server version I get:

Server version: 5.7.25-TiDB-v3.0.0-rc.1-309-g8c20289c7 MySQL Community Server (Apache License 2.0)

Solution

  • I'm one of the main developers of TiSpark. Sorry for the bad experience you've had with it.

    Due to a Docker problem on my side, I cannot directly reproduce your issue, but it seems you hit one of the bugs fixed recently: https://github.com/pingcap/tispark/pull/862/files

    1. The tutorial document is not up to date and points to an older Spark version. That's why the spark-shell path with Spark 2.1.1 from the tutorial didn't work. We will update it ASAP.
    2. Newer versions of TiSpark no longer use tidbMapDatabase; they hook into the Spark catalog directly instead. The method tidbMapDatabase remains only for backward compatibility. Unfortunately, when it was ported from the older version, tidbMapDatabase picked up a bug: it retrieves the read timestamp only once, at the moment you call the function. As a result, TiSpark keeps using that old timestamp for snapshot reads, and newer data is never seen by it.
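
    The staleness this causes can be illustrated with a minimal, self-contained Scala sketch (hypothetical names, not TiSpark internals): a reader that captures its read timestamp once will keep returning the same old snapshot no matter what is committed later, while a reader that fetches a fresh timestamp per query sees new data.

```scala
// Plain-Scala sketch (hypothetical names, not TiSpark internals) of the
// snapshot-timestamp bug described above: the stale reader fetches its
// read timestamp once, so later commits stay invisible to it forever.
object SnapshotDemo {
  // A toy MVCC store: commit timestamp -> rows visible at that timestamp.
  val store = Map(
    1L -> Seq("ALGERIA", "ARGENTINA"),
    2L -> Seq("ALGERIA", "ARGENTINA", "BRAZIL") // a later commit adds a row
  )
  var latestTs = 1L // advanced by writers as commits happen

  // The buggy pattern: the timestamp is captured once, at construction.
  class StaleReader {
    private val ts = latestTs
    def query(): Seq[String] = store(ts)
  }

  // The fixed pattern: a fresh timestamp is fetched for every query.
  class FreshReader {
    def query(): Seq[String] = store(latestTs)
  }
}
```

    Creating a StaleReader, then bumping latestTs to simulate a commit, shows the stale reader still returning the old snapshot while a FreshReader sees the new row.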

    In newer versions of TiSpark (TiSpark 2.0+ with Spark 2.3+), databases and tables are hooked directly into the catalog services, so you can simply call:

    spark.sql("use TPCH_001").show
    spark.sql("select * from nation").show
    

    This should give you fresh data. So restart your Spark driver, run just the two lines of code above, and see if it works.

    Let me know if this fixes your problem. Meanwhile, we will check our Docker image to make sure it already contains the fix.

    If things still go wrong, please run the code below and let us know your TiSpark version:

    spark.sql("select ti_version()").show
    

    Again, sorry for causing you trouble and thanks for trying.

    EDIT

    To address your comment: the warning appears because Spark first tries to locate the database in its own native catalog, which produces the Failed to get database warning. The failover process then delegates the search to TiSpark, which behaves correctly, so the warning can safely be ignored. To silence it, add the line below to log4j.properties in the conf folder of your Spark:

     log4j.logger.org.apache.hadoop.hive.metastore.ObjectStore=ERROR
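
    The failover lookup described above can be sketched in plain Scala (hypothetical names, not Spark's actual catalog API): the native catalog is consulted first, a miss emits the harmless warning, and only then is the search delegated to the TiSpark-backed catalog.

```scala
// Sketch (hypothetical names) of the catalog failover: the native catalog
// is tried first; a miss prints the harmless warning, then the lookup is
// delegated to the TiSpark-backed catalog, which resolves the database.
trait Catalog { def lookupDatabase(name: String): Option[String] }

class NativeCatalog extends Catalog {
  private val dbs = Map("default" -> "hive-metastore")
  def lookupDatabase(name: String): Option[String] = {
    val hit = dbs.get(name)
    if (hit.isEmpty) // the warning seen in the spark-shell output
      println(s"WARN ObjectStore - Failed to get database $name, returning NoSuchObjectException")
    hit
  }
}

class TiSparkCatalog extends Catalog {
  private val dbs = Map("tpch_001" -> "tikv")
  def lookupDatabase(name: String): Option[String] = dbs.get(name.toLowerCase)
}

// Failover: the fallback is consulted only when the primary misses.
class ChainedCatalog(primary: Catalog, fallback: Catalog) extends Catalog {
  def lookupDatabase(name: String): Option[String] =
    primary.lookupDatabase(name).orElse(fallback.lookupDatabase(name))
}
```

    A lookup of TPCH_001 through the chain misses the native catalog (logging the warning) and then resolves through the TiSpark catalog, which is why the query still succeeds.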
    

    We will polish the Docker tutorial image soon. Thank you so much for trying it.