Posts Tagged ‘system’
On January 17th, 2011 I wrote about a script I wrote to gather information on my network (http://www.anthonyreinke.com/?p=384). I ran the scan a few times and got a lot of errors. The issues were due to the different domains. The script can’t scan all the machines because it runs under the credentials of whoever launches it, and it has no way to try multiple sets of credentials.
So I started looking for products to get the information. After a lot of trial and error I found Lansweeper. This software gives you insight into your network that is otherwise hard to get. It scans not only the machines but Active Directory as well. As for the price, it is hard to beat at $299 per year.
I have been at my new job for around 8 weeks now. There have been 2 major outages. We have also had one employee put in their notice. After this week my equal in one of our offices will no longer be here. The guy has been around IT for a while and had a lot of resources. He knew a lot of different people he could count on and different ways to find answers. That is a big loss for the company.
The First Outage:
The systems have needed to be replaced for a while. I don’t blame the people that were here before me. They were doing the best they could with the cards dealt to them. We have had two major outages. The first outage was caused by bad power from our power company, Lincoln Electric System (LES). The power dipped low enough to damage the equipment but not low enough to trip the UPS. The dip blew the power supply in one of the switches in our core stack. Cisco was able to send a new switch, but at first they couldn’t get us one until Monday. The outage happened Thursday night / Friday night. After we worked with Alexander Open Systems (AOS), they were able to get a switch from Cisco to us the next day. We had a former employee help with the configuration of the switch, since the network has multiple VLANs and VLANs are configured at the port level. Normally this isn’t a problem, but we didn’t have a backup of the switch configuration. The outage also caused the firewall to lose its configuration. We had a person from AOS help us reconfigure the switches and firewall so that everything was good again.
The Second Outage:
First our file and print server decided to deny access to the share, and then the share no longer showed online. Next all the printers on the server disappeared. Shortly after this one of our main SANs decided to stop working. So the team got to working on these issues, and then we could no longer remote in. Wait... We can’t get to anything from outside. We have no email, no websites, and no VPN. Researching the firewall, we noticed it was denying everything from the outside. I started looking into that and then noticed that the external DNS addresses weren’t resolving. At first I thought it was due to the firewall blocking everything, but from an external location I queried Google’s DNS server for a site of ours. Nothing came back. Normally it takes hours before cached DNS records expire. So I try to log in to our DNS server and I can’t get connected. So I jump to the VM server and notice an error. The ESX server doesn’t have enough room for the VM. The physical disk is out of space, so it halted the VM session. Just great. I go to reboot the server and get the equivalent of the BSOD on a Windows server. I hard power the server, and after the 5 minutes it takes to start loading the OS I get a kernel panic. The OS has an issue with the hard drive. At this point I need a DNS server up and I need it now. I start building a new ESX server out of scraps of different servers. The idea is to build a server and move the hard drives to the new server. While I am doing this I am taking the DNS zone files and creating a new DNS server on another VM host, just in case the ESX server idea doesn’t pan out. Plus I went to a DNS hosting company and started moving our DNS files to this host. I got the new ESX server built. I got the new DNS server built. I have the DNS moved to the DNS hosting company. I created a CNAME for our name servers pointing to the name servers of our new DNS hosting company. This got us back in business.
A coworker got AOS on the phone and set them up with remote access so they could configure the firewall. It looks like the firewall flipped back to a newer version of the firmware that we had previously downgraded from. Once they downgraded us again we were back in business. My coworkers had the file server and SAN already figured out. It is Sunday mid-day, so at this point the systems have been down for 48 hours. I have been helping since 11am on Sunday when I returned home from church. At around 5pm I went in to work to do all the things that needed to be done physically (building the ESX server and trying to copy the files off the drive). At 6:45am the next day I have the new ESX server up, files being copied down from the old ESX server (which is back up), and the zones built for the external DNS hosting provider. By 10am the files were done copying down to a portable drive. I was able to start up the VM server for the host. Everything was basically up and going. Tuesday we noticed that some emails were not going through and some sites were down. The CNAME I created to fix the issue caused a problem for the other sites we were hosting, because the external provider didn’t have zone files for those sites. So Tuesday afternoon I removed the CNAME. By Tuesday evening some of the sites were up and happy again.
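The external DNS check from the second outage (asking Google’s public resolver directly whether one of our names still resolves) can be scripted so it is ready before the next emergency. What follows is only a rough sketch using the Python standard library, not what was run that weekend. The function names are made up for illustration, 8.8.8.8 is Google’s public DNS address, and it only looks at the response header rather than parsing the answers.

```python
import socket
import struct

def build_dns_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.strip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)

def parse_header(response):
    """Return (rcode, answer_count) from the 12-byte DNS response header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return flags & 0x000F, ancount

def resolves_via(hostname, server="8.8.8.8", timeout=3.0):
    """True if `server` answers without error and with at least one record."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_dns_query(hostname), (server, 53))
            response, _addr = sock.recvfrom(512)
        except OSError:
            return False  # timeout or network failure counts as "not resolving"
    if len(response) < 12:
        return False  # too short to contain a DNS header
    rcode, ancount = parse_header(response)
    return rcode == 0 and ancount > 0
```

Calling `resolves_via("www.example.com")` from a machine outside the network (the hostname is a placeholder) gives a quick yes/no on whether the public side of DNS is still answering for that name.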
Facts about IT:
- You will get yelled at
- Very rarely will you get a thank you from anyone outside of IT
- You will lose a lot of sleep in your career. Doing a 36 to 40 hour shift is typical.
- Even if you spend all weekend fixing things, users still expect you to be there to help them get their new songs on their BlackBerry
- In small to medium sized companies the burnout rate in IT is high. Everyone wants everything but they don’t want to pay for it and why haven’t you got it done already?
- In larger companies most IT people feel like they are just part of a large herd of cattle and that they can and will be replaced at any given time.
I have begun building my own dashboards in Splunk. Once I have the custom views built, I will post them up here. So far everything I have been working on is with a systems administrator in mind because that is what I have been doing for the past 12 years (wow, that’s a long time). Currently I am building a view for searching failed logins and the source of lockouts. They tie into one another. Our technicians want to be more involved in the systems administration, and hopefully this will help them respond more quickly to our customers. Everything comes from Splunk being installed on all our domain controllers. From there we get all the logs into our central logging system (Splunk). Due to the amount of data we are now pushing every day, we might have to build a backup environment just for our Splunk data. How awesome is this!
I have used OSSEC in the past to watch the file system for changes. When I found that I could have the Splunk agent handle the monitoring itself, I was pretty excited. Since I would send my OSSEC data to Splunk anyway, it just seemed logical to have Splunk do everything.
In Windows, you need to edit the “C:\Program Files\Splunk\etc\system\local\inputs.conf” file. Of course your path could be different if you installed it in a different place. There are a lot of options and switches you can use. I went for the simplest set.
[fschange:d:\temp]
recurse=true
pollPeriod=3600
This will monitor the d:\temp folder and all files and folders under it. It will check the system every 3600 seconds (1 hour).
This has helped me keep track of the changes on my servers. I can see when a file was added/deleted/changed (due to the hash) and then look at who was logged in during the period that the file was changed.
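To give a feel for the idea behind hash-based change monitoring, here is a rough Python sketch: take a snapshot of file hashes, take another one later, and compare them. This is not how Splunk implements fschange internally. The function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import os

def snapshot(root):
    """Map each file path under root to the SHA-256 hash of its contents."""
    hashes = {}
    # os.walk descends into subfolders (like recurse=true) and does not
    # follow symbolic links by default
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    hashes[path] = hashlib.sha256(handle.read()).hexdigest()
            except OSError:
                continue  # skip files that vanish or cannot be read
    return hashes

def diff_snapshots(old, new):
    """Return sorted lists of added, deleted and changed paths."""
    added = sorted(set(new) - set(old))
    deleted = sorted(set(old) - set(new))
    changed = sorted(p for p in set(old) & set(new) if old[p] != new[p])
    return added, deleted, changed
```

Running `snapshot()` on a folder once per poll period and feeding the previous and current results to `diff_snapshots()` yields the same three kinds of events: added, deleted, and changed. A file whose contents are rewritten shows up as changed because its hash no longer matches.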
Splunk article on the switches and FSCHANGE.
http://www.splunk.com/base/Documentation/4.0.3/Admin/Monitorchangestoyourfilesystem
recurse=true
followLinks=false
pollPeriod=60