Posts Tagged ‘computer’

I have been at my new job for around 8 weeks now. There have been 2 major outages, and one employee has put in their notice. After this week my equal in one of our offices will no longer be here. The guy has been around IT for a while and had a lot of resources. He knew a lot of different people he could count on and different ways to find answers. That is a big loss for the company.

The First Outage:
The systems have needed to be replaced for a while. I don’t blame the people that were here before me. They were doing the best they could with the cards dealt to them. The first outage was caused by bad power from our power company, Lincoln Electric System (LES). The power dipped low enough to damage the equipment but not low enough to trip the UPS. The dip blew the power supply in one of the switches in our core stack. Cisco was able to send a new switch, but at first they couldn’t get us one until Monday. The outage happened Thursday night / Friday night. After we worked with Alexander Open Systems (AOS), they were able to get a switch from Cisco to us the next day. We had a former employee help with the configuration of the switch, since the network has multiple VLANs and VLANs are configured at the port level. Normally this isn’t a problem, but we didn’t have a backup of the switch configuration. The outage also caused the firewall to lose its configuration. We had a person from AOS help us reconfigure the switches and firewall so that everything was good again.

The Second Outage:
First our file and print server decided to deny access to the share, and then the share no longer showed online. Next all the printers on the server disappeared. Shortly after this one of our main SANs decided to stop working. So the team got to working on these issues, and then we could no longer remote in. Wait.. We can’t get to anything from outside. We have no email, no websites, and no VPN. Looking at the firewall, we noticed it was denying everything from the outside. I started looking into that and then noticed that our external DNS names weren’t resolving. At first I thought it was due to the firewall blocking everything, but from an external location I queried Google’s DNS server for a site of ours. Nothing came back. Normally it takes hours before cached DNS records expire. So I try to log in to our DNS server and I can’t get connected. So I jump to the VM server and notice an error. The ESX server doesn’t have enough room for the VM. The physical disk is out of space, so it halted the VM. Just great. I go to reboot the server and get the equivalent of the BSOD on a Windows server. I hard power the server, and after the 5 minutes it takes to start loading the OS I get a kernel panic. The OS has an issue with the hard drive. At this point I need a DNS server up and I need it now. I start building a new ESX server out of scraps of different servers. The idea is to build a server and move the hard drives to the new server. While I am doing this I am taking the DNS zone files and creating a new DNS server on another VM host just in case the ESX server idea doesn’t pan out. Plus I go to a DNS hosting company and start moving our DNS zone to this host. I got the new ESX server built. I got the new DNS server built. I have the DNS moved to the DNS hosting company. I created a CNAME for our name servers pointing to the name server names of our new DNS hosting company. This got us back in business.
A coworker got AOS on the phone and got them remote access so they could configure the firewall. It looks like the firewall had flipped back to a newer version of the firmware that we had previously downgraded from. Once they downgraded us again we were back in business. My coworkers already had the file server and SAN figured out. It is Sunday mid-day, so at this point the systems have been down for 48 hours. I have been helping since 11am on Sunday when I returned home from Church. At around 5pm I went in to work to do all the things that needed to be done physically (building the ESX server and trying to copy the files off the drive). At 6:45am the next day I have the new ESX server up, files being copied down from the old ESX server (which is back up), and the zones built for the external DNS hosting provider. By 10am the files were done copying down to a portable drive. I was able to start up the VM on the new host. Everything was basically up and going. Tuesday we noticed that some emails were not going through and some sites were down. The CNAME I created to fix the issue caused a problem for the other sites we were hosting, because the external provider didn’t have zone files for them. So Tuesday afternoon I removed the CNAME. By Tuesday evening some of the sites were up and happy again.
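The external DNS check from the second outage (asking Google’s public resolver directly for one of our names) can be sketched roughly like this in Python. This is only an illustration using the standard library: the function names are made up, example.com stands in for a real site, and 8.8.8.8 is assumed as the resolver. It only reports the response code and how many answers came back.

```python
import socket
import struct


def build_dns_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, no other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label is prefixed with its length, ending with a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = IN (internet)
    return header + qname + struct.pack("!HH", 1, 1)


def answer_count(response):
    """Return (rcode, number of answer records) from a raw DNS response."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    # The low 4 bits of the flags field are the response code (0 = no error)
    return flags & 0x000F, ancount


def query_resolver(hostname, resolver="8.8.8.8", timeout=3.0):
    """Send the query over UDP to the given resolver (assumed: Google public DNS)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_dns_query(hostname), (resolver, 53))
        response, _ = sock.recvfrom(512)
    return answer_count(response)


if __name__ == "__main__":
    # example.com is a placeholder; a timeout raises an exception instead
    rcode, answers = query_resolver("example.com")
    print("rcode:", rcode, "answers:", answers)
```

A non-zero response code or zero answers from an outside resolver points at the name servers themselves rather than the firewall, which is the distinction that mattered here.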

Facts about IT:

  • You will get yelled at
  • Very rarely will you get a thank you from anyone outside of IT
  • You will lose a lot of sleep in your career.  Doing a 36 to 40 hour shift is typical.
  • Even if you spend all weekend fixing things, users still expect you to be there to help them get their new songs on their BlackBerry
  • In small to medium sized companies the burnout rate in IT is high.  Everyone wants everything but they don’t want to pay for it and why haven’t you got it done already?
  • In larger companies most IT people feel like they are just part of a large herd of cattle and that they can and will be replaced at any given time.

This search will help track down failed logins. These could be due to someone changing their password while still logged in to a server with the old credentials. On the other hand, someone could be trying to brute force an account.

Type="Failure Audit" sourcetype="WinEventLog:Security" | chart count by User_Name | sort -count
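For anyone who doesn’t read Splunk searches, here is roughly what that search does, sketched in Python. This is not how Splunk works internally; the function name and the sample events are made up for illustration, and the only field names taken from the search are Type and User_Name.

```python
from collections import Counter


def failed_logins_by_user(events):
    """Count failure-audit events per user, highest count first."""
    counts = Counter(
        event["User_Name"]
        for event in events
        if event.get("Type") == "Failure Audit" and "User_Name" in event
    )
    # most_common() sorts by count, descending, like "sort -count"
    return counts.most_common()


# Made-up sample events for illustration only
sample_events = [
    {"Type": "Failure Audit", "User_Name": "jsmith"},
    {"Type": "Success Audit", "User_Name": "jsmith"},
    {"Type": "Failure Audit", "User_Name": "svc_backup"},
    {"Type": "Failure Audit", "User_Name": "jsmith"},
]

if __name__ == "__main__":
    for user, count in failed_logins_by_user(sample_events):
        print(user, count)
```

An account sitting at the top of the list with a handful of failures is usually the stale-password case. One with hundreds is worth a closer look.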

So I am a full convert and prophet of Splunk now. I have been using it at work for around 4 months. I have rolled it out to our domain controllers and have started rolling it out to all our Windows and *nix servers. The ability to find out who did what has made my job so much easier. There was an incident where an OU was deleted in our AD. I was able to see exactly who did it and when. Normally this type of searching wasn’t possible, or was at least hard to do, due to the size of our infrastructure. Our Event Logs roll over around once an hour. The OU was deleted 8 hours before we were contacted.

Here are a few of the reports I have scheduled to run every morning so I can take a look at what has happened in my environment.

User Accounts deleted:

EventCode="630" | fields Caller_User_Name, Target_Domain, Target_Account_Name, host | collect | rename Caller_User_Name as Who_Did_It | rename Target_Account_Name as Deleted_Account | rename host as DomainController | rename Target_Domain as Users_Domain

User Accounts created:

EventCode="624" | fields Caller_User_Name, New_Domain, New_Account_Name, host | collect | rename Caller_User_Name as Who_Did_It | rename New_Account_Name as New_Account | rename host as DomainController | rename New_Domain as New_Account_Domain

Computer Accounts deleted:

EventCode="647" | fields Caller_User_Name, Target_Domain, Target_Account_Name, host | collect | rename Caller_User_Name as Who_Did_It | rename Target_Account_Name as Deleted_Computer | rename host as DomainController | rename Target_Domain as Removed_Domain

Computer Accounts created:

EventCode="645" | fields Caller_User_Name, New_Domain, New_Account_Name, host | collect | rename Caller_User_Name as Who_Did_It | rename New_Account_Name as New_Computer | rename host as DomainController | rename New_Domain as Joined_Domain
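All of these reports follow the same pattern: pick an event code, keep a few fields, and rename them to something readable. As a rough sketch of that pattern in Python (outside Splunk, with a made-up function name and sample event), here are two of the reports expressed as a lookup table. The field and column names come from the searches above; everything else is an assumption for illustration.

```python
# Event code -> {original field name: report column name}
REPORTS = {
    "630": {  # user account deleted
        "Caller_User_Name": "Who_Did_It",
        "Target_Account_Name": "Deleted_Account",
        "host": "DomainController",
        "Target_Domain": "Users_Domain",
    },
    "645": {  # computer account created
        "Caller_User_Name": "Who_Did_It",
        "New_Account_Name": "New_Computer",
        "host": "DomainController",
        "New_Domain": "Joined_Domain",
    },
}


def report_row(event):
    """Keep only the report fields for this event code and rename them.

    Returns None when the event code has no report defined.
    """
    mapping = REPORTS.get(str(event.get("EventCode")))
    if mapping is None:
        return None
    # Missing fields come through as None rather than raising an error
    return {new_name: event.get(old_name) for old_name, new_name in mapping.items()}


if __name__ == "__main__":
    # Made-up event for illustration only
    sample = {
        "EventCode": "630",
        "Caller_User_Name": "admin1",
        "Target_Account_Name": "jsmith",
        "Target_Domain": "CORP",
        "host": "dc01",
        "Logon_ID": "ignored",
    }
    print(report_row(sample))
```

The other two reports would be added to the table the same way.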