Top 10 Best Practices for SSIS and SSRS I've Learned the Hard Way
Over the past 7 or 8 years, I've gone from "0 to 60" when it comes to database design, development, ETL and BI. Most of the skills I've picked up came from mistakes I've made, and I've been mostly self-taught, with the exception of some more recent "formalized" learning through programs from Pragmatic Works. As I've gotten more and more satisfaction from the process, I've started working towards my MCSA in SQL Server 2012 (2 of 3 complete), and I've been speaking at SQL Saturdays and local user groups in New England. It's become one of the most rewarding, exciting and challenging aspects of my career. Along the way, I've posted some blog articles about challenges I've overcome, though not frequently enough, and I've tried to become more active in some SQL forums. The list below is far from a complete account of the best practices I've learned over the years, but many of these lessons have really helped me stay organized when it comes to good BI architecture. I hope to provide at least one item that will benefit a newbie or even a seasoned pro. So, without further ado, here are the top 10 best practices I've learned the hard way...
- Use Source Control
- For anyone who was a developer in a past life, or is one now, this is a no-brainer, no-alternative best practice. In my case, because I come from a management and systems background, I had to learn this the hard way. If this is your first foray into development, get ready, because you're in for some mistakes, and you're going to delete or change some code you really wish you hadn't. Whether you want it back for reference or you removed it by accident, you're going to need that code you got rid of yesterday, and we both know you didn't back up your Visual Studio jobs... Hence, source control. GitHub and Microsoft offer great solutions for Visual Studio, and Redgate offers a great solution for SSMS. I highly recommend checking them out and using the tools! There are other options out there that are free, or that save your code to local storage, but the cloud is there for a reason, and many of us are on the go, so having your code available from anywhere is very helpful.
- Standardize Transform Names
- Leaving the default name on an OLE DB connector or an Execute SQL Task seems like a silly thing to do when it's so easy to label each one for what it actually does, but I have to admit, I'm guilty of it. I've thrown together a quick ETL package for testing, or for further work on the database side, and then forgotten to go back and fix the names. Fast-forward 6 months, and I'm saying to myself, "I think I had a package like that once before", only to find it, open it, and not have a clue what it's actually doing. That, of course, means going through each component to refresh my memory on what I did. Truth be told, in this day of plentiful resources, memory, etc., there is absolutely no reason not to put all the detail you need in the name of each component. Don't be lazy, you never know when it might bite you!
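If you want a quick way to catch the names you forgot to fix, here's a minimal sketch in Python. It leans on a few assumptions of mine: that your packages are in the XML-based .dtsx format used by SSIS 2012 and later, that control flow task names live in the `DTS:ObjectName` attribute of `DTS:Executable` elements, and that the list of default names below (which I made up for illustration) matches what your designer produces. It only looks at control flow tasks, not at components inside a data flow, and the sample package is a tiny stand-in rather than a real file.

```python
import re
import xml.etree.ElementTree as ET

# XML namespace assumed for .dtsx package files (SSIS 2012 and later).
DTS_NS = "www.microsoft.com/SqlServer/Dts"

# Hypothetical list of designer default task names; extend it as needed.
DEFAULT_NAMES = [
    "Execute SQL Task",
    "Data Flow Task",
    "Script Task",
    "File System Task",
    "Send Mail Task",
]

# Matches a default name, optionally followed by a numeric suffix
# such as " 1" or " 2" on duplicated tasks.
DEFAULT_PATTERN = re.compile(
    r"^(%s)( \d+)?$" % "|".join(re.escape(name) for name in DEFAULT_NAMES)
)


def find_default_names(dtsx_xml):
    """Return the names of control flow tasks that still look like defaults."""
    root = ET.fromstring(dtsx_xml)
    flagged = []
    for element in root.iter("{%s}Executable" % DTS_NS):
        name = element.get("{%s}ObjectName" % DTS_NS, "")
        if DEFAULT_PATTERN.match(name):
            flagged.append(name)
    return flagged


# A made-up, heavily trimmed package used only to exercise the function.
SAMPLE_PACKAGE = """
<DTS:Executable xmlns:DTS="www.microsoft.com/SqlServer/Dts"
                DTS:ObjectName="LoadCustomers">
  <DTS:Executables>
    <DTS:Executable DTS:ObjectName="Execute SQL Task" />
    <DTS:Executable DTS:ObjectName="SQL Truncate stg_Customer" />
    <DTS:Executable DTS:ObjectName="Data Flow Task 1" />
  </DTS:Executables>
</DTS:Executable>
""".strip()

if __name__ == "__main__":
    for task_name in find_default_names(SAMPLE_PACKAGE):
        print("Still has a default name:", task_name)
```

To check a real package you would read the file's text and pass it to `find_default_names`. Treat anything it flags as a prompt to go look, not as a verdict.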
- Document inside your code and ETL workflow
- If you're using the script transforms, they open a pretty standard development window that anyone familiar with Visual Studio will recognize. As with any development, good code comes with good documentation. Use it! Your successors, if not you yourself, will be very appreciative down the road. Name your functions appropriately, and explain what you're doing throughout the code. Further, as you build your workflows, you have the ability to document what each step of the process is doing. Use that too! This also goes back to point 2: standardized names for your transforms, alongside documentation of the workflow as you go, paint a very nice picture of what your workflow is doing.
- Set up detailed alerts for failures
- The traditional workflows in SSIS let you create mail notifications for successful and unsuccessful steps within the workflow. Of course, depending on how your packages are run, you could get the same type of notification directly from the SQL Server that runs the job. But why rely on a SQL alert that tells you "Job A" failed, with no real detail, when your package can tell you exactly which transform or component failed and what the error was? The canned SSIS mail component is limited in the methods it can use to send an email, but there are workarounds, like the one here on Stack Overflow that shows how to use the script task to connect to and send mail through Gmail. Either way, there is plenty of functionality to keep you informed of exactly what is happening inside a job, including the warnings and errors for each component. Taking the time to do this is much better than getting a phone call with complaints about data not being updated properly!
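To make "detailed" a bit more concrete, here's a rough sketch of the kind of message I mean. It's written in Python with the standard library instead of the C# you'd use inside an SSIS script task, so read it as an illustration of the idea and not as the workaround from that Stack Overflow post. Every name, address, host and error text in it is a placeholder I invented.

```python
import smtplib
from email.message import EmailMessage


def build_failure_alert(package, component, error, sender, recipient):
    """Build an email that names the failing component and its error text."""
    message = EmailMessage()
    message["Subject"] = "SSIS failure: %s / %s" % (package, component)
    message["From"] = sender
    message["To"] = recipient
    message.set_content(
        "Package:   %s\nComponent: %s\nError:     %s\n" % (package, component, error)
    )
    return message


def send_alert(message, host, port, user, password):
    """Send the alert over SMTP with STARTTLS; host and credentials are yours."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(user, password)
        server.send_message(message)


if __name__ == "__main__":
    # Placeholder values standing in for what a failure handler would supply.
    alert = build_failure_alert(
        package="LoadCustomers.dtsx",
        component="DFT Load stg_Customer",
        error="Violation of PRIMARY KEY constraint",
        sender="etl@example.com",
        recipient="bi-team@example.com",
    )
    print(alert["Subject"])
    # Uncomment and fill in real settings to actually send the message:
    # send_alert(alert, "smtp.example.com", 587, "etl@example.com", "app-password")
```

The point is the subject line: when it already says which package and which component failed, you know where to look before you even open the email.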
- Standardize File and Folder Names and Locations
- Ok, ok, I know this is getting a bit redundant... but remember, these are the mistakes I made as a newb, and I really want to help you out as you use the software more and more and your workflows get more complex. This one was a really big one for me. Because I do all of my work in Visual Studio, and all of the BI jobs look the same at the file level, I really needed a way to show the separation between my ETL jobs, SSRS jobs and SSAS jobs. This helped with my source control structure as well. I put SSIS, SSRS, and SSAS jobs into separate folders (you can even go as far as separating Tabular and Multidimensional if necessary), then labeled each type with "ETL" or "reports" as part of the file name. It saves me time when I'm opening a recent job, because I'm typically tweaking the ETL at the same time as developing the reports to get just the right mix.
- Have a "Report Request Form"
- When you first start writing reports, it's really exciting. You're delivering value to the business and really helping people do their jobs, especially when you start transforming and aggregating data... but soon you're relied upon more and more for those reports, and it seems no two reports are alike. A common best practice for people who are spitting out report after report is to have a report requirements request form, like the one here from SQL Chick. That form is pretty in-depth, so tweak it as necessary, but it will really help you prioritize and design reports going forward.
- Experiment to fine tune and improve performance
- This best practice is really a whole blog post unto itself, but it's something to be aware of. As a quick for-instance, the "OLE DB Command" transform is a great tool in theory, but it runs its SQL statement once for every row that passes through it, so on a large dataset it can be significantly slower than doing the same work in an "Execute SQL Task". The only way to know for sure is to compare them side by side, which I had to do: the Execute SQL Task took about 3 minutes, and the OLE DB Command took about 45 minutes. Moral of the story: if something seems to take a long time, there most likely is a better way to do it. Go out and play!
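If you want to see the "one statement per row versus one statement for everything" effect without opening SSIS, here's a small experiment in Python. It uses the built-in sqlite3 module as a stand-in database, so it is not SSIS and not SQL Server, and the table names, row count and loader functions are all mine. An in-process SQLite database has none of the per-statement round trips a real server has, so don't expect anything like my 3 minutes versus 45 minutes; the habit of timing both approaches side by side is the part worth copying.

```python
import sqlite3
import time


def make_database(row_count):
    """Create an in-memory database with a filled source and an empty target."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany(
        "INSERT INTO source (id, amount) VALUES (?, ?)",
        ((i, i * 2) for i in range(row_count)),
    )
    conn.commit()
    return conn


def load_row_by_row(conn):
    """Issue one INSERT per source row (the per-row style of loading)."""
    rows = conn.execute("SELECT id, amount FROM source").fetchall()
    for row in rows:
        conn.execute("INSERT INTO target (id, amount) VALUES (?, ?)", row)
    conn.commit()


def load_set_based(conn):
    """Issue a single INSERT ... SELECT that moves every row at once."""
    conn.execute("INSERT INTO target (id, amount) SELECT id, amount FROM source")
    conn.commit()


def timed_load(loader, row_count):
    """Run a loader on a fresh database; return (seconds, rows loaded)."""
    conn = make_database(row_count)
    started = time.perf_counter()
    loader(conn)
    elapsed = time.perf_counter() - started
    loaded = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
    conn.close()
    return elapsed, loaded


if __name__ == "__main__":
    for loader in (load_row_by_row, load_set_based):
        seconds, loaded = timed_load(loader, 50000)
        print("%-16s %6d rows in %.3f s" % (loader.__name__, loaded, seconds))
```

Each loader gets its own fresh database so the two runs don't interfere with each other, and the row count returned alongside the timing confirms both approaches did the same amount of work.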
- Set your recovery model to Simple or Bulk-logged on staging databases
- Ok, so normally I would stress the importance of backups (always do them automatically, including logs, and before you change anything), but that belongs in a blog post about maintenance and configuration, and this one is focused on ETL and reporting. So let's talk about the effect loading lots of data will have on your database, or better yet, why not check out this article that has a very good description of the 3 different recovery models. The basics are just this... the recovery model is set per database, not per table, so if it's a staging db for data you can reload, you can probably get away with the simple model. If it's a critical db, but you're doing lots of data loads, the bulk-logged model can minimally log bulk operations such as SELECT INTO, bcp, INSERT ... SELECT and BULK INSERT, so your transaction logs don't get huge, fast.
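In SQL Server this setting is the database's recovery model, and it's changed with an `ALTER DATABASE ... SET RECOVERY` statement. Here's a tiny Python helper that builds that statement for you. It only produces the T-SQL text; running it against a server (through SSMS or whatever driver you use) is left to you, and the database names are hypothetical.

```python
# Recovery model keywords accepted by ALTER DATABASE ... SET RECOVERY.
RECOVERY_MODELS = {"SIMPLE", "BULK_LOGGED", "FULL"}


def quote_identifier(name):
    """Wrap a name in square brackets, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def set_recovery_statement(database, model):
    """Build the T-SQL that switches a database's recovery model."""
    keyword = model.upper().replace("-", "_").replace(" ", "_")
    if keyword not in RECOVERY_MODELS:
        raise ValueError("Unknown recovery model: %r" % model)
    return "ALTER DATABASE %s SET RECOVERY %s;" % (
        quote_identifier(database),
        keyword,
    )


if __name__ == "__main__":
    # "Staging" and "Warehouse" are made-up database names.
    print(set_recovery_statement("Staging", "simple"))
    print(set_recovery_statement("Warehouse", "bulk-logged"))
```

Before you switch a database that matters, check what the change means for your log backups and point-in-time restores, because that's the trade you're making for the smaller logs.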
- Temp Tables will likely make your ETL run faster than staging tables
- I can't really take credit for this little nugget. While I was doing a project with one of my coding buddies, he came into my office one day and said, "Hey, did you know that using temp tables in SQL will allow you to use multiple processors on the server at one time?" I did not know this, and man, did it make a difference. What a huge performance boost it was for my project. Now, like everything else, there are exceptions to the rule. Some guys who are much smarter than me had this discussion on a forum that sheds some more light on the topic, and here are some more scenarios where it might not make sense. I think it's good information overall, and the more information you have, the better off you'll be when you're designing your BI architecture.
- Have a development and testing platform
- As with some of the other best practices listed above, for some, this is a no-brainer. For others... we have no brains when it comes to this stuff and we need to have it beaten into our heads (I'm the latter, if that wasn't already clear). I can't stress enough how much this will save you. You should never, ever, be doing development in a production environment. There are just too many things that can go wrong. Even those "quick" or "minor" changes can cause a calamity and ruin your day quickly. Now, there can be challenges if you don't have a proper production/dev/test environment at your office or your client's location. However, with SQL Server Developer Edition now being free, and PCs these days having tons of resources, you should be able to do your testing on even the simplest of computers and get a warm and fuzzy feeling that you're going to be able to deploy this latest package, report, or code successfully. You might not be able to make true performance comparisons against a beefy production server, but you should be able to establish a baseline and get a general idea of how performance will be for various configurations.
Well, that's it for this post. I really hope you learned at least a little something new here, because that's always my goal. If you have questions, you can follow me or send me a message on Twitter @bizdataviz. I'm always happy to hear how I can write better blog posts to help out people who are just getting their feet wet.