Over the years I have worked in various roles, but the most challenging has been taking over an already established system administration team.
Several times, a former colleague moved to a new company and inherited a system administration organization that needed some improvements.
I am proud to say that those colleagues convinced me to work for them again to overhaul the system administration operations.
Each of us has our own leadership style, and everyone's approach may differ from organization to organization; nonetheless, I wanted to write about my approach. First of all, I have a handful of generic questions I ask:
- What am I responsible for?
- Does the architecture make sense? Are there any overlaps of responsibility?
- Who has access to the assets I am responsible for?
- What are we doing to maintain system availability? Backups?
- What is our configuration management process? How fast do we turn up systems and introduce new applications?
- What are the strengths and weaknesses of my team?
- What are we missing?
When I enter the organization, I don't immediately request access to the systems. Instead, I begin by going over architectural diagrams and operational procedures, and by peering over the shoulders of the system administrators. If those documents are not present, then we have a problem.
It is imperative that there is a clear understanding of the system components within the architecture. I would immediately have the team begin compiling diagrams and workflows so that we can understand the system's architecture.
This would include high-level diagrams as well as a detailed asset management inventory of EVERY host. I want to see every host, its operating system version, and respective application versions (e.g., Tomcat, Oracle, Apache).
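As a rough illustration of what one inventory entry could hold, here is a minimal Python sketch. The `Host` record, its field names, and the sample hosts and version numbers are all made up for the example; a real inventory would more likely live in a CMDB or spreadsheet. The point is only that every host carries its operating system, OS version, and application versions, and that gaps are easy to report.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """One row of the asset inventory (field names are illustrative)."""
    name: str
    os_name: str
    os_version: str
    # Application name -> installed version, e.g. {"Tomcat": "9.0.85"}
    applications: dict[str, str] = field(default_factory=dict)


def missing_fields(host: Host) -> list[str]:
    """Return the names of inventory fields that are still blank."""
    gaps = []
    if not host.os_name:
        gaps.append("os_name")
    if not host.os_version:
        gaps.append("os_version")
    if not host.applications:
        gaps.append("applications")
    return gaps


# Hypothetical sample data; the versions are placeholders.
inventory = [
    Host("db01", "Red Hat Linux", "8.9", {"Oracle": "19c"}),
    Host("web01", "Red Hat Linux", "", {"Apache": "2.4.57", "Tomcat": "9.0.85"}),
    Host("legacy01", "", "", {}),
]

# Hosts whose inventory entry is incomplete, with the fields to chase down.
incomplete = {h.name: missing_fields(h) for h in inventory if missing_fields(h)}
print(incomplete)
```

A host that genuinely runs no applications would also be flagged here, which is probably acceptable for a first pass since it forces someone to confirm that fact.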
I would also require networking diagrams and a mapping of each system component to a particular organizational group. For example, a particular database contains billing information and it is used by group XYZ.
Who maintains that database schema and the software that manages the data? I would begin mapping those groups to system components so my team has a clear understanding of which organizations they are supporting.
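The mapping of components to organizational groups can be kept as simple structured data. The sketch below is one possible shape, assuming a hypothetical `Component` record with `used_by` and `maintained_by` fields; the component and group names are invented. It answers two questions: which groups does my team support, and which components have no known owner?

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    """A system component and the groups tied to it (illustrative fields)."""
    name: str
    used_by: Optional[str]        # group that consumes the component
    maintained_by: Optional[str]  # group that owns the schema or software


def unowned(components: list[Component]) -> list[str]:
    """Components with no recorded consumer or no recorded maintainer."""
    return [c.name for c in components if not c.used_by or not c.maintained_by]


def by_group(components: list[Component]) -> dict[str, list[str]]:
    """Group name -> components that group depends on."""
    groups: dict[str, list[str]] = {}
    for c in components:
        if c.used_by:
            groups.setdefault(c.used_by, []).append(c.name)
    return groups


# Hypothetical sample data.
components = [
    Component("billing-db", used_by="XYZ", maintained_by="billing-dev"),
    Component("report-server", used_by="XYZ", maintained_by=None),
    Component("old-ftp", used_by=None, maintained_by=None),
]

print(by_group(components))
print(unowned(components))
```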
Next, I would set out to change the password of every privileged account in the system. This is where some people become upset, but remind yourself it is for the best.
As we make changes to the system to stabilize it or improve performance, we need to know exactly what changes were made. This password-changing step is imperative in order to clearly identify who has access to the system.
I would ensure that every system is logging and auditing accordingly so you can see who is attempting and gaining access to privileged accounts. Furthermore, I would no longer allow any privileged account (e.g., root and oracle) to be logged into directly.
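To make the auditing idea concrete, here is a small Python sketch that scans authentication log lines for direct login attempts on privileged accounts. It assumes OpenSSH-style messages of the form `Accepted password for root from ...` and `Failed password for root from ...`; the exact log format varies between systems and versions, so the regular expression, the `PRIVILEGED` set, and the sample lines are assumptions to adapt, not a definitive parser.

```python
import re

# Accounts that should never be logged into directly (adjust to your site).
PRIVILEGED = {"root", "oracle"}

# Assumed OpenSSH-style message; real logs may differ.
LOGIN_RE = re.compile(
    r"(Accepted|Failed) \S+ for (?:invalid user )?(\S+) from (\S+) port \d+"
)


def privileged_logins(lines: list[str]) -> list[tuple[str, str, str]]:
    """Return (outcome, account, source address) for privileged accounts."""
    events = []
    for line in lines:
        match = LOGIN_RE.search(line)
        if match and match.group(2) in PRIVILEGED:
            events.append((match.group(1), match.group(2), match.group(3)))
    return events


# Hypothetical sample log lines.
sample = [
    "Jan 10 09:14:02 db01 sshd[2211]: Accepted password for root from 192.0.2.10 port 51022 ssh2",
    "Jan 10 09:15:40 db01 sshd[2230]: Failed password for oracle from 192.0.2.44 port 40112 ssh2",
    "Jan 10 09:16:05 web01 sshd[1190]: Accepted publickey for alice from 192.0.2.10 port 51100 ssh2",
]

events = privileged_logins(sample)
for outcome, account, source in events:
    print(outcome, account, source)
```

Once direct logins are disabled, any `Accepted` entry in this report is something to investigate.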
I would start by changing the root password on every system and only give it out to a select few senior system administrators. We can sort out sudo access for junior administrators as we move forward. Next, I would work with the lead database administrator and have the appropriate account passwords changed.
As people start complaining they no longer have access, we will evaluate each individual's role and determine if they truly need access. Too often system developers have root access to production systems. If developers need access to production systems, then, in my opinion, the application isn't ready for production.
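The access review itself boils down to comparing who currently has privileged access with who has been approved for it. A minimal sketch, assuming both lists are available as a mapping from system name to a set of usernames (the names here are invented):

```python
def excess_access(
    current: dict[str, set[str]], approved: dict[str, set[str]]
) -> dict[str, set[str]]:
    """System -> users who hold access but are not on the approved list."""
    excess = {}
    for system, users in current.items():
        extra = users - approved.get(system, set())
        if extra:
            excess[system] = extra
    return excess


# Hypothetical sample data.
current_root = {"db01": {"alice", "bob", "dev_carol"}, "web01": {"alice"}}
approved_root = {"db01": {"alice"}, "web01": {"alice"}}

to_review = excess_access(current_root, approved_root)
print(to_review)
```

A system missing from the approved mapping is treated as having no approved users, so everyone on it shows up for review.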
Of course, while reviewing these critical system and application accounts, the system administrator accounts should also be reviewed. Sometimes you will find accounts for individuals who are no longer employed; these should be removed immediately.
I would also set password aging on system administrator accounts so that unused accounts are locked. This will help identify dormant accounts.
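Password aging is configured on the operating system itself; the sketch below only illustrates the reporting side, that is, flagging accounts that have not been used recently. It assumes you can export each account's last login date by some means, and the 90-day threshold and the account names are arbitrary choices for the example.

```python
from datetime import date, timedelta
from typing import Optional


def dormant_accounts(
    last_login: dict[str, Optional[date]],
    today: date,
    max_idle_days: int = 90,
) -> list[str]:
    """Accounts never used, or not used within max_idle_days of today."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        user
        for user, seen in last_login.items()
        if seen is None or seen < cutoff
    )


# Hypothetical sample data; None means the account has never logged in.
last_login = {
    "alice": date(2024, 5, 20),
    "bob": date(2023, 11, 2),
    "carol": None,
}

dormant = dormant_accounts(last_login, today=date(2024, 6, 1))
print(dormant)
```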
Once I have narrowed down the systems I am responsible for and who has access to them, I will closely examine the architecture and processes to ensure business continuity. Are we doing backups? Do we have redundant or mirrored storage solutions? How often do we test failover and recovery procedures?
By this time, you should already have a good understanding of the existing change management processes. If they are insufficient, then lobby to get them fixed! Do you have a high provisioning rate? In other words, are new systems routinely being inserted into production with little-to-no testing?
Understanding the team is critical. You're always going to find an eclectic blend of personalities and talent in a group of system administrators. It's your job as a leader to determine who really “knows their stuff” and who has everyone bamboozled. Find out who the information hoarders are and those who have been stuck in a role they aren't happy in.
Lastly, you might stumble across some strange component in the architecture that is either antiquated or just completely different from everything else. For example, the architecture is 99% Red Hat Linux, but you have one HP-UX box running one small application, and none of the system administrators knows anything about it.
My first question is: how did it get into the architecture? What's the long-term plan to maintain and support it? Will the team get any training on it? Or is this one of those situations where someone outside of the operations group has been granted exclusive root access to the system? [Cringe]
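Oddities like a lone HP-UX box are easy to surface once the host inventory exists. The following sketch assumes a simple mapping of host name to operating system and flags any platform that makes up no more than a chosen share of the fleet; the 5% threshold and the host names are arbitrary values for illustration.

```python
from collections import Counter


def rare_platforms(
    host_os: dict[str, str], max_share: float = 0.05
) -> dict[str, list[str]]:
    """Operating system -> hosts, for platforms at or below max_share of the fleet."""
    counts = Counter(host_os.values())
    total = len(host_os)
    rare: dict[str, list[str]] = {}
    for host, os_name in host_os.items():
        if counts[os_name] / total <= max_share:
            rare.setdefault(os_name, []).append(host)
    return rare


# Hypothetical fleet: 99 Red Hat Linux hosts and one HP-UX box.
fleet = {f"app{i:02d}": "Red Hat Linux" for i in range(1, 100)}
fleet["oldbox"] = "HP-UX"

outliers = rare_platforms(fleet)
print(outliers)
```

Each platform in the result is a prompt for the questions above: how it got there, who supports it, and who holds root on it.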
In the end, it is really your experience and leadership which can help to improve a system administration team. It has been my experience that having a clear picture of the environment you are expected to build and maintain is a critical first step to ensuring the success of the team.