Summary of thoughts on data and migrations 17 Oct 2009

I shared a summary of my thoughts today with my project team (we’re distributed, multi-national team)… and I thought what the heck why not share with the rest of the world to. Here goes…

I’m not sure what everyone’s conventions or views are on handling the removal of data in the app that is no longer used, so here are some of the guidelines I follow to help promote sharing and consistency.

Seed data

To load data that has to exist for the application to run, I like to use seed data and I am a big fan of Michael Bleigh’s seed-fu plugin. Seed data should be setup so that it can be loaded at any point during the lifetime of an application. It should not destroy data that leaves orphaned data and it should not duplicate data.

Updating data

I like to use migrations to update existing data. If the migration and data update is complex and irreversible I will add a unit test around the migration because due to the high risk of causing irreversible damage (unless a full DB backup is loaded).

Updating application-required data

When the data that the application depends on changes (ie; the seed data) I tend to update both the seed files as well as add a migration to properly handle any existing data, whether this means removing the data, marking it as delete or inactive, etc.

Updating data

In development modifying data directly through MySQL or ruby scripts is much more forgivable and often very fast and efficient. We have no risk of causing users of the application harm or loss of data. Once something is figured out I like to either add a seed-file and/or a migration if it’s a change that needs to get made on staging and production.

In staging modifying data directly is less appealing because staging is supposed to represent what happens when we go to production. If we have to modify data directly in staging then we’ll probably have to modify it in production, so a migration and/or seed-file is a better place to track the change. I find it tempting sometimes to script/console and modify data directly.

A problem with this is that we risk screwing up staging for folks who are using it to preview features, check out bug fixes, etc. I find it better to dump the database and pull it over to my local development machine. Once I have it I can load it up locally, recreate the issue, and do whatever I need to do to find the source of the issue. This removes the risk that you fill further break staging for any users using it. Once I find the issue if I need to I will add/update a migration and/or seed file.

In production modifying data directly is super dangerous. It’s a change that occurs outside of version control and potentially without being run in development and/or staging which could catch bugs in the updates. The risk of causing irrevocable damage is so high I think doing this should be avoided.

I find it’s always been better to take an extra few minutes to make sure the fix is right and then to deploy, then it is to make a quick fix directly in production and find the change caused new issues. Production should never turn into a debugging sandbox. Pulling over a backup if necessary is safer and allows us developers more freedom to freely change the data to find the appropriate fix.

If you have to use script/console not MySQL

If you find you need to update data directly it’s better to use script/console, then by touching the database directly using MySQL. For example, Scott once asked me to change his user name. Changing his user name was a very low risk change, so I loaded up script/console and made it.

The benefit of script/console is that you let all of the rules and validations inside the application be run when you make the change. If I had made Scott’s username too short, too long, include inadmissable characters I would not find out if I did it directly through MySQL, unless I had duplicated all of the logic in MySQL stored procedures and triggers (which I cringe at the thought of).

Conclusion

Well, that’s pretty much a summary of my personal guidelines when it comes to dealing with data updates, migrations, seed-files, etc. If you have any other guidelines you follow that you find helpful please share. I think being on the same page here will help us as we push forward.