Tags: object-storage, lakefs

How to hard delete objects older than n days in lakeFS?


How can I find and hard delete objects older than n days in lakeFS? Later this will run as a scheduled job.


Solution

  • To do this, use the Garbage Collection (GC) feature in lakeFS.

    Note: GC removes objects from the underlying storage only after they have been deleted from your lakeFS branches.

    You will need to:

    1. Define GC rules to set your desired retention period.

      From the lakeFS UI, open the repository you want to hard delete objects from -> Settings -> Retention, and define a GC rule for each branch in the repository. For example:

      {
          "default_retention_days": 21,
          "branches": [
              {"branch_id": "main", "retention_days": 28},
              {"branch_id": "dev", "retention_days": 7}
          ]
      }
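
      Conceptually, a rule set like the one above maps each branch to a cutoff date: objects deleted from that branch before the cutoff become GC candidates. A minimal sketch of that calculation (this illustrates the retention semantics only, not the actual GC implementation, which walks commit metadata):

      ```python
      from datetime import datetime, timedelta, timezone

      # GC rules from the example above
      rules = {
          "default_retention_days": 21,
          "branches": [
              {"branch_id": "main", "retention_days": 28},
              {"branch_id": "dev", "retention_days": 7},
          ],
      }

      def retention_days(rules, branch_id):
          """Retention period for a branch, falling back to the repository default."""
          for b in rules["branches"]:
              if b["branch_id"] == branch_id:
                  return b["retention_days"]
          return rules["default_retention_days"]

      def cutoff(rules, branch_id, now=None):
          """Objects deleted from `branch_id` before this instant are GC candidates."""
          now = now or datetime.now(timezone.utc)
          return now - timedelta(days=retention_days(rules, branch_id))

      now = datetime(2024, 1, 29, tzinfo=timezone.utc)
      print(cutoff(rules, "dev", now))        # 7 days back from `now`
      print(cutoff(rules, "feature-x", now))  # no rule: falls back to the 21-day default
      ```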
      
    2. Run the GC Spark job, which performs the actual cleanup:

      spark-submit --class io.treeverse.clients.GarbageCollector \
        -c spark.hadoop.lakefs.api.url=https://lakefs.example.com:8000/api/v1 \
        -c spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY> \
        -c spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_KEY> \
        -c spark.hadoop.fs.s3a.access.key=<S3_ACCESS_KEY> \
        -c spark.hadoop.fs.s3a.secret.key=<S3_SECRET_KEY> \
        --packages io.lakefs:lakefs-spark-client-301_2.12:0.5.0 \
        example-repo us-east-1
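
      Since the question mentions running this as a scheduled job, one common approach is to wrap the spark-submit command above in a script and trigger it from cron. The paths, user, and schedule below are placeholders, not lakeFS requirements:

      ```shell
      # /etc/cron.d/lakefs-gc -- run the GC job every Sunday at 02:00
      # Assumes the spark-submit command above is saved as /opt/lakefs/run-gc.sh
      0 2 * * 0  sparkuser  /opt/lakefs/run-gc.sh >> /var/log/lakefs-gc.log 2>&1
      ```

      Any scheduler that can launch a Spark job (Airflow, Kubernetes CronJob, EMR steps, etc.) works equally well; the GC job itself is stateless between runs.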