Edgewall Software
Modify

Opened 15 years ago

Closed 13 years ago

Last modified 12 years ago

#462 closed defect (fixed)

Upgrade procedure needs to be "idempotent"

Reported by: sam.hendley@… Owned by: hodgestar
Priority: major Milestone: 0.6
Component: Infrastructure Version: dev
Keywords: Cc: thomas.moschny@…, hodgestar, osimons
Operating System: Linux

Description

I recently updated our bitten installation to the head revision from 0.6. The process went relativley smoothly, but there was an issue with the upgrade. It got through the first 7 steps without an issue but then choked on adding the index to the build_info table (can't remember real name) beacuse there was multiple entries for a revision. It printed a helpful error message telling me which records conflicted and how to delete them (upgrades.py#385). This is well done and more than you get from most systems. Two problems though, you can't use the method indicated since the "Build" model expects the table to have certain fields and the delete function fails if the table isn't the correct shape (which by definition it isn't since this is an upgrade.) The only way I found to delete the entries was to do:

sqlite3 <db> "delete from bitten_build where id = <build_id>"

Once I deleted the offending rows I then re-ran the upgrade script, it bombs on the "drop table" commands. I tried effing with the upgrades.py file to have it more tolerant of failure but there were too many steps and I wasn't entirely sure of the ramifications of what I was doing. Luckily I was doing this from a backup and was able to just revert to the backup, delete the rows then upgrade.

Either the upgrade should be idempotent or the check for a consitent database should be done before attempting the upgrade.

Attachments (1)

check-logs-dir-r911.diff (2.2 KB) - added by hodgestar 13 years ago.
Patch against r911 to sanity check logs folder on Bitten master start-up.

Download all attachments as: .zip

Change History (19)

comment:1 follow-ups: Changed 15 years ago by osimons

  • Milestone changed from 0.7 to 0.6
  • Owner set to hodgestar

Oh. Interesting. Ties in well with the ongoing work to get the upgrades in better shape - see #461 and #452.

The essence of your problem is then that you try to run our 'clean' code, but that fails because the now up-to-date code can't work with the build table in the old state that you have. Catch-22...

Thinking about the upgrades, there may also be several more issues not yet handled when this step fails: We transfer the all content of the logs from db => files. If an error occurs in subsequent upgrade, all the files on disk remain but the table is not dropped as expected. Not sure what happens on re-run, but really it should be moved out of the way before upgrade if the logs/bitten folder exists.

I wonder if we should somehow see if we could change the strategy for upgrades - instead of running through all steps needed and then commit at the end, we should perform each step isolated + bump version and commit for each successful upgrade. That way the upgrade continues from the method where it last stopped. That would likely mean we should ignore the shared db passed in, and instead use our own in isolation. It may not be 100% best-practice according to Trac code, but I'm quite sure it would make a more robust upgrade infrastructure. (We can't commit on the db connection passed in, as that may also contain changes from other plugins or Trac itself.)

Needs fixing before we exit beta for 0.6.

comment:2 in reply to: ↑ 1 Changed 15 years ago by cboos

I'm afraid the suggested approach can't work, let me explain:

Replying to osimons:

... That would likely mean we should ignore the shared db passed in, and instead use our own in isolation.

If you would use something like db = env.get_db_cnx(), which I assume is what you had in mind, you will still get the same connection, because of the pool.

If you manage to bypass the pool and create a separate connection to the same db, then you'll simply get into a lock, as you couldn't commit in case another transaction would still be in progress in the same thread.

(We can't commit on the db connection passed in, as that may also contain changes from other plugins or Trac itself.)

At least with SQLite, each DDL such as create/drop table will perform an implicit commit. PostgreSQL seems to be ahead of the pack and has transactional DDL, but that's about it, MySQL seems to behave like SQLite here.

So I think it doesn't really matter if you decide to do extra commits when you see fit, as upgrade routines are supposed to do DDL operations and those will anyway perform implicit commits. Think about the final "commit" as a safeguard done in case the last operation was not yet subjected to a previous commit.

comment:3 Changed 15 years ago by osimons

Thanks for input, cboos. I'm just thinking aloud and passing the buck here, as I can safely say that databases is not my forte - no doubt emphasized by my points above :-)

However, I still think each upgrade method/step/version should be its own entity and commit and finalize its work indepenently of the potential failure of other upgrades. It may be that the Trac API is flawed, and that we abuse it by also doing non-database work that cannot be rolled back if some other upgrade mathod fail (some Trac upgrades I believe also does this - writing config values and more, like workflow).

So, as I interpret you (cboos) as somewhat agreeing with this, I think we should expand each upgrade to update the version in the system table and commit the work done - and if anything fails the upgrade method should also know how to deal with its own non-database cleanup (if needed).

comment:4 Changed 14 years ago by thomas.moschny@…

  • Cc thomas.moschny@… added

Same problem here. After migrating half-way, it told me Duplicate builds found., and a second try to migrate the database failed with table old_log already exists, which was to be expected.

After

  1. reverting trac.db to the backup db
  2. moving log/bitten out of the way
  3. reverting to the previous version of bitten
  4. removing duplicates (where did these come from? I had 491 of them)

upgrade worked.

comment:5 Changed 14 years ago by osimons

From IRC chat with Thomas, it was also noted that the directory we create may have wrong permissions (from umask and owner:group from the person running the upgrades).

We could perhaps copy the stat information from the logs directory and apply it to the newly created logs/bitten directory? Better than nothing, but I suppose Bitten should also check permissions on first information from new builds from slaves and make sure the process can write to the directory. Need to make sure we don't get inconsistencies by db pointing to files that don't exist.

comment:6 Changed 14 years ago by thomas.moschny@…

Furthermore having all logs in a single directory is really not optimal. I have about 33000 files there now for about 3000 builds.

So, there should be sub directories. (For example named after the build number - because at it is now, there's no easy connection between a build and its logs.)

comment:7 in reply to: ↑ 1 Changed 14 years ago by osimons

Replying to osimons:

I wonder if we should somehow see if we could change the strategy for upgrades - instead of running through all steps needed and then commit at the end, we should perform each step isolated + bump version and commit for each successful upgrade. That way the upgrade continues from the method where it last stopped.

I've looked at this as a simple reorganization seems to work fine with me on SQLite at least:

  • bitten/main.py

    a b  
    7171                for function in upgrades.map.get(version):
    7272                    print textwrap.fill(inspect.getdoc(function))
    7373                    function(self.env, db)
    74                     print 'Done.'
    75             cursor.execute("UPDATE system SET value=%s WHERE "
    76                            "name='bitten_version'", (schema_version,))
    77             self.log.info('Upgraded Bitten tables from version %d to %d',
    78                           current_version, schema_version)
     74                    cursor.execute("UPDATE system SET value=%s WHERE "
     75                           "name='bitten_version'", (version,))
     76                    db.commit()
     77                    print "Bitten upgrade to version %d done." % version
     78                    self.log.info('Upgraded Bitten tables to version %d',
     79                                    version)
    7980
    8081class BuildSystem(Component):
    8182

Needs to be tested on other databases, and integrated into main upgrade-improvements patch.

comment:8 Changed 14 years ago by hodgestar

I've updated my patch in ticket #452 to include osimon's changes above and to include a test that runs an upgrade using the code in bitten/main.py that osimon's patch changes. Test passes on MySQL, PostGreSQL and SQLite.

comment:9 Changed 14 years ago by hodgestar

I committed a tweaked version of osimon's change (with the commit and version update happening after each version change and not each upgrade function call) in r786 (and to 0.6.x in r787).

I'm leaving this ticket open for the moment so we can discuss whether anything else is needed before closing it or whether what we have now is good enough.

comment:10 Changed 14 years ago by hodgestar

After looking through the ticket discussion again, I think the only items outstanding are:

  • Setting the bitten log files directory permissions correctly.
  • Splitting the log files out into folders. One level probably won't be enough here if we're intending to handle more than about 3000 or so builds. Maybe we need folders like: log/bitten/3/3100, log/300/300100, log/1000/1000100. That would make us good up to about a million builds assuming 1000 files per directory.

Maybe these two items should be pulled out into a separate ticket since they're unrelated to the original issue?

comment:11 follow-up: Changed 14 years ago by osimons

As for permissions in the logs directory, I think that the Master component should check that on initialization - not as part of upgrade. Permissions can change for so many reasons other than upgrades. So, we should have a small __init__() method that just creates and deletes a temporary file in the logs directory - any exceptions on component loading will log the error and disable the component (in theory). Which is basically what we want.

Redoing how files are organized is a new ticket, imho. I haven't looked at the details of the structure myself, but redoing it is a new upgrade + various internal fixes. I'd also like to reconsider the separate log-level file for that rework. I suggest leaving it for now though.

comment:12 in reply to: ↑ 11 Changed 14 years ago by osimons

Replying to osimons:

Redoing how files are organized is a new ticket, imho. I haven't looked at the details of the structure myself, but redoing it is a new upgrade + various internal fixes. I'd also like to reconsider the separate log-level file for that rework. I suggest leaving it for now though.

A new bug report for an upgrade issue that is strongly related to how logs files are written: #517

comment:13 Changed 14 years ago by hodgestar

  • Cc hodgestar added

comment:14 Changed 13 years ago by hodgestar

I've added #627 to track reorganizing log files. That just leaves sorting out the bitten log folder permissions for this ticket.

Changed 13 years ago by hodgestar

Patch against r911 to sanity check logs folder on Bitten master start-up.

comment:15 Changed 13 years ago by hodgestar

  • Cc osimons added
  • Status changed from new to assigned

@osimons: Would you mind looking over the log folder patch and check that it's in line with what you had in mind?

Currently the patch lets the raw exception raised flow back up and out of Build Master?.init. Should I try wrap it in a custom exception with a message explaining that the log folder sanity check failed?

comment:16 Changed 13 years ago by osimons

Suggestions / questions / comments:

  • The master already has logs_dir = Option() defined. Should the master that should check and verify the directory, and not the model? The model code should usually not worry to much about its surroundings, but for our unit tests for instance it should also be self contained. Is it best that all is kept in model? If so, perhaps master init isn't best place to trigger this? I'm not sure.
  • If location somehow is 'off', forcing a file write of fixed 'test' name may break things? Perhaps use tempfile.mkstemp(dir=<logdir>) instead to have Python find a suitable name?

comment:17 Changed 13 years ago by osimons

Having reviewed the current code and the patch again, I don't think the explicit and recurring check for directory existence and permissions is needed. In running code, nothing will get committed to database unless the files gets written without error, so no real problem with things going out of sync. Errors will propogate and get logged further up in the chain, for admins to look into like any other runtime error.

Re-reading the ticket, what we do not get correctly is the check for existence of logs/bitten. When running the upgrade procedure we should not "create if it doesn't exist" - we should in fact raise an error if it exists and request that the user removes/renames the current folder and gets it out of the way before running again. If not, each failure will produce the full set of files but nothing committed to database. Which is why Thomas got ~33000 files for just 3000 builds. With all the files inter-mingled there is no easy way to say which is which... We need to improve the upgrade script for this.

comment:18 Changed 13 years ago by hodgestar

  • Resolution set to fixed
  • Status changed from assigned to closed

Thanks for looking over the code and the patch and coming up with a much cleaner understanding of what was happening. Patch to raise an error if the logs folder already exists when the upgrade is performed committed in r938 and backported in r939.

I think that resolves this ticket. :)

Add Comment

Modify Ticket

Change Properties
Set your email in Preferences
Action
as closed The owner will remain hodgestar.
The resolution will be deleted. Next status will be 'reopened'.
Author


E-mail address and user name can be saved in the Preferences.

 
Note: See TracTickets for help on using tickets.