what did you learn today? (part 2)


Klockwerk

Ars Praefectus
3,757
Subscriptor
SandyTech said (http://meincmagazine.com/civis/viewtopic.php?p=24475155#p24475155):
kperrier said (http://meincmagazine.com/civis/viewtopic.php?p=24475031#p24475031):
SandyTech said (http://meincmagazine.com/civis/viewtopic.php?p=24474529#p24474529):
Big Wooly Mammoth said (http://meincmagazine.com/civis/viewtopic.php?p=24473919#p24473919): Your backup provider can't/won't dump the data to a disk and ship it to you?

They are, but it'll be Thursday before I get anything restored.
Well, Friday, at the latest!

*shrug* I warned them about the time for a large restore, but they said it was acceptable. It could be two weeks from now and I still wouldn't be overly concerned with the issue. CYA memos are a wonderful thing.

Have you tried repeatedly fscking the drives to get the data back? Perhaps making snowangels?

(Fond memories of viewtopic.php?f=25&t=238085)
 

Klockwerk

theevilsharpie said (http://meincmagazine.com/civis/viewtopic.php?p=24504897#p24504897):
Sunner said (http://meincmagazine.com/civis/viewtopic.php?p=24504713#p24504713):
M. Jones said (http://meincmagazine.com/civis/viewtopic.php?p=24503137#p24503137): I learned that having a freshly-minted devop 'chmod -R 777 /etc' isn't just a bedtime story to scare young children. I am amazed.

Bad permissions on /etc/securetty will prevent you from logging in on console, and of course sshd will fail-safe due to open permissions. Relaunch or redeploy the VM, or boot from media.
Yeah, had this happen on a Solaris zone, except said idiot did it on / instead. Even trying to repair all those permissions was a no-go, so we restored from backup and said idiot had to work a bit extra. Of course he did it again on /opt the next time, which screwed up all our monitoring and a bunch of other tools, so yeah, another restore.

I can understand screw ups (you're tired/distracted/finger slipped/whatever), but if I damaged a server's file system so badly that it had to be restored from a backup, I'd be damn sure to verify what I'm typing the next time around.

How the hell does someone make a mistake like that twice, and more importantly, why are they still employed? :scared: :confused:

Sheesh, I did somewhat the same. First mistake, on a Monday, was deleting the wrong redo file while trying to resolve a snapshot issue on a virtual machine. The VM was offline until I (with some help from a senior) recreated the vmdk descriptor file. I had been working for quite a few hours and was tired - I identified the right file to remove and then deleted exactly the wrong one.

Second VM I took down was on Wednesday, two days later. I thought I knew what I was doing, identified what was wrong, and removed the drive from the VSA proxy. I then deleted the files that weren't in use - only I didn't realise that removing the drive caused vSphere to take another snapshot, and the files I ended up removing were thus actually now in use. The VM was offline until I worked out what had happened and fixed it.

I learned that redo files are scary, scary things to play around with, that I was very lucky there was no data loss, that working tired was a stupid idea, and that I did not understand the snapshot process well enough - I need to verify conditions and assumptions after changes occur.
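The chmod -R story above suggests a cheap insurance policy: record permission bits before risky work so a stray recursive chmod can be rolled back. A minimal sketch (a scratch directory stands in for /etc; snapshot_modes/restore_modes are illustrative helpers, not a real tool):

```python
import os
import stat
import tempfile

def snapshot_modes(root):
    """Record the permission bits of every entry under root."""
    modes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            modes[path] = stat.S_IMODE(os.lstat(path).st_mode)
    return modes

def restore_modes(modes):
    """Put the recorded permission bits back."""
    for path, mode in modes.items():
        os.chmod(path, mode)

# Demo on a scratch directory standing in for /etc
root = tempfile.mkdtemp()
secret = os.path.join(root, "shadow")
open(secret, "w").close()
os.chmod(secret, 0o600)

saved = snapshot_modes(root)
os.chmod(secret, 0o777)              # the "freshly-minted devop" moment
restore_modes(saved)
print(oct(stat.S_IMODE(os.lstat(secret).st_mode)))  # → 0o600
```

Whether sshd lets you back in afterwards still depends on ownership as well as modes, so this is CYA material, not a full fix.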
 

Klockwerk

TIL that VMware hosts really, really like holding onto MAC addresses, to the point that if you move a network card from one host to another host in the same management VLAN, you're going to have a bad time *and* freak out the network guys as well.

They even have a KB article on it. It's 'by design', 'sucks to be you', and 'oh, if you don't want that to happen there's some stuff you can do. You'll have to reboot though afterwards'.

Then it got interesting: on top of their own ranges, they're using Xerox's first ever range when autogenerating MAC addresses.
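For the curious, spotting which range an autogenerated MAC came from is just a comparison on the OUI (the first three octets). A rough sketch - 00:50:56 and 00:0c:29 are VMware's well-known OUIs, 00:00:00 is Xerox's first registered block; the classify helper is made up for illustration:

```python
def oui(mac):
    """Return the first three octets (the OUI) of a MAC address."""
    return tuple(mac.lower().replace("-", ":").split(":")[:3])

# OUIs involved here: 00:50:56 and 00:0c:29 belong to VMware,
# and 00:00:00 is Xerox's very first registered block.
VMWARE = {("00", "50", "56"), ("00", "0c", "29")}
XEROX_FIRST = ("00", "00", "00")

def classify(mac):
    o = oui(mac)
    if o in VMWARE:
        return "vmware"
    if o == XEROX_FIRST:
        return "xerox-first-range"
    return "other"

print(classify("00:50:56:ab:cd:ef"))  # → vmware
print(classify("00:00:00:12:34:56"))  # → xerox-first-range
```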
 

Klockwerk

Danger Mouse said (http://meincmagazine.com/civis/viewtopic.php?p=24973779#p24973779):
ferzerp said (http://meincmagazine.com/civis/viewtopic.php?p=24969901#p24969901):
ambit said (http://meincmagazine.com/civis/viewtopic.php?p=24968435#p24968435):

Friday, 11:54pm SPB rebooted
Saturday, 4:01pm SPA rebooted
5:40pm SPB rebooted
10:54pm both reboot, the VNX goes down
Sunday ~5:25am the VNX is recovered and comes back up

Tuesday ~4pm SPA reboots again.
8:50pm we shutdown and replace SPA with new hardware
9:43pm we shutdown and replace SPB with new hardware


This chain of events concerns me. Were the first two events maintenance events, or did they spontaneously reboot? At that point, I'd have had support working on it. There is no reason at all for an SP to *ever* reboot on its own that doesn't warrant opening a high priority case in my opinion.

I've had it happen, but it wasn't spontaneous. A heat-related crash of the Cisco Nexus 5020s led to other horrible things, including the SPs and Control Station rebooting. The Data Movers did not reboot.

There was no data loss. :scared:

---------

Had a battery die in the SPA of one of our Celerras over the weekend.

The SPs in my production NS480 were randomly crashing from time to time, but not since the move, when I redid all the fiber in nice neat loops rather than the preexisting random jumble.

I'm thinking we may not wind up replacing the Celerras within 18 months as I'd hoped.

So not happy about it.

EMC emailed me asking for SPCollects. Told 'em it would be delayed for 2 hours, since I was at the movies. :D (Despicable Me 2)

At work we played with an Isilon cluster for a POC - I would take one of those over a Celerra any day of the week, and twice on Sundays.
 

Klockwerk

sporkme said (http://meincmagazine.com/civis/viewtopic.php?p=24980375#p24980375):
Barmaglot said (http://meincmagazine.com/civis/viewtopic.php?p=24980261#p24980261): Eeek. Turns out, Google is not renewing tenants' leases in 111 8th av. anymore :( Here comes fourth datacenter move in six years...

Oh fuck me.

Way too many cross-connects there, not moving.

Level3 said many things about their long-term agreements, not going anywhere, blah blah blah (while also refusing to sell new cabs, which seems a bit fishy); do you have any information about *which* tenants don't get a renewal? Would they really terminate someone like Switch & Data/Equinox or Telx?

OT: Have you seen a single Googler there wearing Glass? I haven't.

I understand all the words, but putting them together I'm still missing context. Can someone explain to me what is happening here and its significance?

What I understand is that Google owns 111 8th av. and at this time isn't renewing leases - what's the impact of this, and why is it unusual?
 

Klockwerk

PaveHawk- said (http://meincmagazine.com/civis/viewtopic.php?p=25072293#p25072293):
Cookie.Monster said (http://meincmagazine.com/civis/viewtopic.php?p=25071683#p25071683): When is the right time to make changes, out of curiosity? Make them during business hours and you can get all-hands-on-deck response, but your business suffers. Make them off hours, e.g. Friday night, and you don't get all-hands-on-deck and might burn out some folks, but you don't affect the business as much. I think it all depends on the change, the business, the employees, etc.

Not that I make the call, but I'm not sure I'm a full advocate of no changes on Friday.

edit: beat me to it.

edit2: Now, small changes to my own processes and systems that I have full leverage over? That's waiting until after the weekend : )

Unplanned break/fix type work, where it's not super critical, is prohibited on Friday from 10am onwards. Planned project or break/fix work, intended to stretch into the weekend, is allowed.

At least, that's how we operate.

The key is "is it planned, what are the contingencies, and when is the V1 point?" (to use aircraft take-off parlance)

If it's unplanned, it simply cannot occur on a Friday - we're not geared up for it.

If it's not urgent - and since if I break it, I'm the one fixing it until it works - I don't do major work on Friday afternoons. Friday mornings are a bit iffy as well.
 

Klockwerk

SandyTech said (http://meincmagazine.com/civis/viewtopic.php?p=25165743#p25165743): I learned 3 things today...

1: Apparently a former (though he doesn't know it yet) colleague thought it would be an ace idea to give our devs root access to their ESXi cluster
2: One of the aforementioned devs decided to delete a few .vmdk files, because apparently they weren't particularly important
3: Recreating vmdk files is (fortunately) very easy.

They're a couple kb of data. How important can they really be?
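For context on why recreating them is easy: the descriptor is just a small text file pointing at the big -flat extent. A rough sketch of what one might contain - all values below are illustrative, and the usual procedure is to clone a descriptor from a new dummy disk of identical size:

```python
# Minimal sketch of a monolithic-flat vmdk descriptor. Field values here
# (CID, geometry, HW version) are placeholders for illustration only.
DESCRIPTOR = """\
# Disk DescriptorFile
version=1
CID=fffffffe
parentCID=ffffffff
createType="vmfs"

# Extent description
RW {sectors} VMFS "{flat_file}"

# The Disk Data Base
ddb.virtualHWVersion = "8"
ddb.adapterType = "lsilogic"
"""

def make_descriptor(flat_file, size_bytes):
    sectors = size_bytes // 512          # extents are sized in 512-byte sectors
    return DESCRIPTOR.format(sectors=sectors, flat_file=flat_file)

text = make_descriptor("web01-flat.vmdk", 1 * 1024**3)  # a 1GB disk
print("RW 2097152 VMFS" in text)  # → True
```

The hard part isn't the file - it's knowing the exact size of the flat extent it has to describe.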
 

Klockwerk

sporkme said (http://meincmagazine.com/civis/viewtopic.php?p=25602225#p25602225): Is no one else curious why the VM went "poof"? Or how one gets it back?

I'd enjoy an explanation, but stuff like that happens.

Just yesterday night I had a VM that had been shut down go grayed out and 'not responding', the only one in its cluster like that. Somehow in that state I was able to power it on, and now it's working fine. It's what you deal with when you use VMware, or any other complex piece of software. If I do happen on something like that, I'll google it to find a solution, and if I can't find a clear-cut explanation of how to fix it, I'll lodge my own VMware case so I can make sure I don't make things worse.
 

Klockwerk

SandyTech said (http://meincmagazine.com/civis/viewtopic.php?p=25833757#p25833757): Email that went out to our desktop techs today:

When you guys are assigning static IPs to a piece of equipment, please make sure it's not in the DHCP pool.

:facepalm:

My reply:

Let's do things the right way, guys; send Networking the MAC address & name of the equipment so we can put a DHCP reservation in instead.

Or make sure it's not a NAS datamover IP address. That one got discovered fast.
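The pool check itself is trivial to automate. A sketch with Python's ipaddress module (the 10.0.0.100-200 pool is hypothetical):

```python
import ipaddress

# Hypothetical DHCP pool bounds for illustration
POOL_START = ipaddress.ip_address("10.0.0.100")
POOL_END   = ipaddress.ip_address("10.0.0.200")

def safe_for_static(ip):
    """True if ip lies outside the DHCP pool and can be statically assigned."""
    addr = ipaddress.ip_address(ip)
    return not (POOL_START <= addr <= POOL_END)

print(safe_for_static("10.0.0.50"))   # → True
print(safe_for_static("10.0.0.150"))  # → False (ask for a reservation instead)
```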
 

Klockwerk

TIL that the colour of LTO tapes is important to the troubleshooting process.

We received two boxes of LTO tapes last Friday from a new supplier, of a different (cheaper) brand than normal: one with 10 tapes (half full) and one with 20 tapes (full). I put the 10 tapes into the tape library and all was well.

Today I noticed that once again there were no spares, so I put in 5 tapes from the full box - and the auxiliary copy kept failing. I went through the CommVault media agent logs and then the CommCell logs. I force-mounted the new tapes and watched as they were loaded into the drive, and the drive rejected them with "There is no tape in the drive". I tried with an old full tape (no problem), a new full tape (no problem) and two of the new tapes (no dice). I pulled two of the new tapes out and noticed the colour was different than normal, but nothing clicked.

It wasn't until I pulled out the rest of the nonworking new tapes, along with one of the working new tapes, that I noticed they were different colours.

Somehow, our order of 30 LTO4 tapes had turned into an order of 10 LTO4 tapes and 20 LTO6 tapes. The LTO4 tape drives weren't recognising the LTO6 tapes at all (as you would expect). Glad I hadn't put all 20 tapes in at once.
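The compatibility rule at work here, as a sketch. This encodes the classic LTO rule (which held through LTO-7): a drive writes its own and the previous generation, reads two generations back, and never handles newer tapes; lto_drive_handles is an illustrative helper, not a real API:

```python
def lto_drive_handles(drive_gen, tape_gen):
    """Classic LTO compatibility rule (through LTO-7):
    write n and n-1, read down to n-2, never forwards."""
    if tape_gen > drive_gen:
        return "not recognised"      # e.g. an LTO6 tape in an LTO4 drive
    if drive_gen - tape_gen <= 1:
        return "read/write"
    if drive_gen - tape_gen == 2:
        return "read-only"
    return "not supported"

print(lto_drive_handles(4, 6))  # → not recognised
print(lto_drive_handles(6, 4))  # → read-only
print(lto_drive_handles(6, 5))  # → read/write
```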
 

Klockwerk

Danger Mouse said (http://meincmagazine.com/civis/viewtopic.php?p=26302337#p26302337):
westyx said (http://meincmagazine.com/civis/viewtopic.php?p=26300933#p26300933): <snip - the LTO4/LTO6 tape mix-up story quoted in full above>

Yah, LTO4 drives will read/write LTO4 and LTO3 tapes, and probably read LTO2.

There's backwards compatibility, but not forwards compatibility.

So, if you had a LTO6 drive, it would read/write LTO6 and LTO5 and read only LTO4.

You're sure those are LTO6 tapes? I thought there was a substantial price difference? And it would show on the invoice as such? And the tapes weren't marked as LTO6?

I ask all these questions, because I may wind up running into the same thing with my next tape order :p

The box containing the tapes had LTO6 all over it, and the tapes themselves have LTO6/6.5TB on them as well. Unfortunately, I zoned out while unpacking/labelling (because they're tapes, I've done this a million times, and if you fuck up with tapes... nothing bad happens), so I didn't notice.

There was no price difference, because the vendor fucked up when they sent out our order. I get the feeling they grabbed 10 LTO4 tapes and stuck them in a box, grabbed the next box to hand, wrapped both in plastic and passed them to the courier. The vendor ended up couriering out another box of 20 tapes post-haste, which are right now quite happily being written to. I happened to go over the alerts on the library, and they're all 'hey, dumbass, I can't read these tapes. Maybe you should have a looksee why'. ++ learning moment.
 

Klockwerk

Danger Mouse said (http://meincmagazine.com/civis/viewtopic.php?p=26312051#p26312051):
westyx said (http://meincmagazine.com/civis/viewtopic.php?p=26310653#p26310653):
Danger Mouse said (http://meincmagazine.com/civis/viewtopic.php?p=26302337#p26302337):
westyx said (http://meincmagazine.com/civis/viewtopic.php?p=26300933#p26300933): <snip - the LTO tape mix-up exchange quoted in full above>

Did they let you keep the LTO6 tapes? :D

I kid. No price difference :eek:

Nope. But we got a box of LTO4 real quick from them :)
 

Klockwerk

TIL that you can remove unused LUNs from a bunch of ESXi hosts with varying results: some will discover the LUNs are gone for good after a rescan, and others will still see some removed LUNs as unavailable (with each host seeing a different set of unavailable LUNs). You can also add space to a backend LUN, have the extension fail without reason ('a general error occurred' is not helpful) and see a 1TB VMFS datastore occupy 1.5TB of space.
 

Klockwerk

Barmaglot said (http://meincmagazine.com/civis/viewtopic.php?p=26409363#p26409363):
westyx said (http://meincmagazine.com/civis/viewtopic.php?p=26409325#p26409325): <snip - the LUN removal post quoted in full above>

NetApp Virtual Storage Console is great for provisioning/extending/destroying ESXi datastores. Assuming, of course, that your back-end is NetApp :)

Big Blue SVC (which is great at what it's aimed at, and not much else).
 

Klockwerk

afidel said (http://meincmagazine.com/civis/viewtopic.php?p=26438649#p26438649):
Rick25 said (http://meincmagazine.com/civis/viewtopic.php?p=26438549#p26438549): TIL not to touch servers on Friday.

RDP to a server to get an IP address for an AnywhereUSB console.
Launch the config utility and it comes up fine.
Launch the viewer and the server reboots... and doesn't come back.
It sits at the splash screen; tried Last Known Good with the same response.
When I do Safe Mode it freezes at ACPITABL.DAT.

Mounted the drive on another server and currently running chkdsk against it.

No idea what caused it and the two VM backups that I restored boot the exact same way.
ACPITABL.DAT in a VM is almost always caused by going from a uniprocessor to an SMP HAL; try either dropping it down to 1 vCPU if it's set up for more, or setting it to 2 if it's currently at 1.


Had the same issue (wouldn't boot past ACPITABL.DAT) after a VMware Tools upgrade. Eventually had to uninstall VMware Tools. There was a looping blue screen issue but I wasn't able to capture that. Hmm.

I'm really warming to the Linux boot style - so much more information on what is happening, rather than Windows' 'Hey, just look at this pretty image'.
 

Klockwerk

Not sure what you mean by detailed - is that where it shows each library as it loads? I'm trying to find a YouTube video of it and failing, as I did try a number of options.

At one point the system kinda booted, doing almost zero CPU and a fair chunk of IO. If it were Linux I think there would have been some information in dmesg, but in this case I got to watch a blank console and look at the VM performance metrics.
 

Klockwerk

ramases said (http://meincmagazine.com/civis/viewtopic.php?p=27004201#p27004201):
hawkbox said (http://meincmagazine.com/civis/viewtopic.php?p=27001425#p27001425): TIL that storage vMotions are only done during change windows at IBM. While not exactly a bad thing in my mind, it seems to demonstrate a lack of understanding of what the whole point of vMotion is. But hey, I'm a dirty contractor, so I'll bill extra hours if it makes them happy.

In 5.1 there was at least one issue where a hiccup on the source or destination datastore during a storage vMotion would not only cause that vMotion process to get stuck but also block all other new vMotion processes on that particular host. The only way out was to reboot the host, but since you couldn't move the VMs off that host anymore, it meant outage time for all of them.
Storage vMotion is a bit more brittle than regular vMotion.

IME those types of issues are relatively rare (we run around 60 hosts and I know of only one occurrence of this type of issue at our org), but I can see that they could become more common at a larger deployment to the point where "not rocking the boat unless required by exigent circumstances" becomes a policy.

There's also the point that svmotion can negatively impact the performance of the source/target datastore, which tends to upset the natives should it cause some sort of scheduled job to miss its window.

There's a purple-screen bug in the current 5.x release that we hit if you live storage vMotion a VM with RDMs. That said, we storage vMotion all the time and don't tend to have issues (excluding Exchange, because that stuff does *not* like being moved).
 

Klockwerk

Paladin said (http://meincmagazine.com/civis/viewtopic.php?p=28169465#p28169465):
sryan2k1 said (http://meincmagazine.com/civis/viewtopic.php?p=28168631#p28168631): We ordered like 800 Samsung Evos from CDW. It cleaned them out and some had to ship from Samsung directly.
Can I have one?

SandyTech said (http://meincmagazine.com/civis/viewtopic.php?p=28169271#p28169271): Not that I didn't already have some inkling of this, but unscheduled tests of your system's redundancy suck. :D

Not sure of the root cause yet, but 3 of our ESXi hosts simultaneously PSOD'd on us while the other three were down for scheduled RAM upgrades, forcing all of our Microsoft based hosted services to fail over to the DR facility. A pox on your house Mr. Murphy! Admittedly we brought this on ourselves to a degree by bringing 50% of our hosts down at the same time, but who the hell expects 3 boxes to simultaneously PSOD on them?

Someone like me who is generally considered 'too negative' by his co-workers.

And then I'm right. :p :D

Yeah, I've been bitten before by software/firmware upgrades. One at a time for me, and everything stops if one dies - proceeding with scheduled maintenance while a host is down isn't worth the risk of another failure. Far better to have to do another 10pm-1am upgrade than to work until everything is back up because systems are down.
 

Klockwerk

ZPrime said (http://meincmagazine.com/civis/viewtopic.php?p=28271277#p28271277):
I would bet the internal storage is rather limited and it couldn't queue that much. So if the functionality is going to be that limited, then why offer it?
Actually, it can queue up quite a bit, part of their selling point is that it can queue mail for you in case your actual mail server is down. This doesn't happen regularly for us with a 4-node DAG, but the point stands that it does have space for queuing.

I have to wonder why it doesn't keep a cache of the email addresses and only look when either the recipient isn't in the cache or at the end of a configurable TTL for the cache.

It supposedly does keep a cache, but there's no way to know what the TTL or size of that cache is. Our unit is admittedly a bit small for our environment - we have a Model 300 which they suggest for ~1000 recipients, and we actually have ~2500 users (but a large chunk of those might only receive one message per week, or only internal mails that never go through the Barracuda). I have considered getting support involved, but generally when this LDAP process hangs up like this, I'm under pressure to get mail moving again ASAP so I don't have time to rope their support in for in situ debugging to determine the root cause. I'm also worried they'll just blame it on the unit being too small and say they can't do anything for us. (Hoping to budget for a larger unit this year anyway.)

Lodge a ticket with Barracuda stating that the problem is occurring. They'll come back with troubleshooting/data collection steps to do next time the issue occurs. It may be that you can get it to dump a system image or something that they can look at later. If they come back stating they need to examine the issue in situ when it occurs, take it to your manager and they can decide whether it'll be worth the outage. If your manager approves, Barracuda support should be fairly good when you call - you'll be calling in with an established ticket, and with a high priority (service down).

If they come back with 'the unit is too small' you push that back to your manager, and then it might bump getting a bigger Barracuda higher in the budget priority list.

Worst case scenario is that Barracuda tells you the unit is too small, your manager tells you it's not worth the extended outage to troubleshoot, and you now have a cover-your-ass email trail showing you've done what you can to resolve/mitigate the email outage issues.
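The cache behaviour wondered about above is a plain TTL cache: only hit LDAP when an address is unknown or its entry has aged out. A minimal sketch (TTLCache and fake_ldap are illustrative, not Barracuda's actual implementation):

```python
import time

class TTLCache:
    """Only call the expensive lookup when an entry is missing or expired."""
    def __init__(self, ttl_seconds, lookup):
        self.ttl = ttl_seconds
        self.lookup = lookup          # the expensive LDAP query
        self.entries = {}             # addr -> (result, fetched_at)

    def valid_recipient(self, addr, now=None):
        now = time.time() if now is None else now
        hit = self.entries.get(addr)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]             # fresh: served from cache
        result = self.lookup(addr)    # stale or unknown: hit LDAP
        self.entries[addr] = (result, now)
        return result

calls = []
def fake_ldap(addr):
    calls.append(addr)
    return addr.endswith("@example.com")

cache = TTLCache(ttl_seconds=300, lookup=fake_ldap)
cache.valid_recipient("bob@example.com", now=0)    # miss: queries LDAP
cache.valid_recipient("bob@example.com", now=100)  # fresh: cache hit
cache.valid_recipient("bob@example.com", now=400)  # expired: queries again
print(len(calls))  # → 2
```

If the real appliance does something like this, a hung LDAP process would still wedge every expired or unknown address, which matches the failure mode described.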
 

Klockwerk

Paladin said (http://meincmagazine.com/civis/viewtopic.php?p=29035449#p29035449):
Digitlman said (http://meincmagazine.com/civis/viewtopic.php?p=29034365#p29034365): Is anybody else seeing this: after applying the latest Windows updates, after you reboot a VMware VM, sometimes the networking doesn't start until you fully power off and power on the VM?
I've seen that happen for years on Windows updates with VMware, KVM and Xen virtualization/hypervisors. About 1 in 15 times, rebooting for a Windows update will cause networking to just 'not work' until it is rebooted again or I disable/enable the NIC in Control Panel. Never caused enough pain to really track it down.

I had a similar issue with VMware Tools updates that might explain this. The issue was that VMware Tools updates in our environment, coupled with high storage IO wait times (middle of the backup window), meant that Windows would detect new hardware on boot and do stuff. By the time it finished all the detecting and updating and brought up the network interface, all of the services that required networking (IIS etc.) had given up. A reboot resolved everything.
 

Klockwerk

SandyTech said (http://meincmagazine.com/civis/viewtopic.php?p=29069767#p29069767): For some reason, the guys who maintain the rhsbl.ahbl.org, dnsbl.ahbl.org, and ircbl.ahbl.org blacklists decided to wildcard those zones so that any query to them responds with a positive. And to make things more interesting, the ahbl.org people have decided that the query/removal request load is too high so they're not accepting removal requests anymore.

This is the response you get from a look-up against rhsbl.ahbl.org:

Code:
Remote Server returned '<mail.mycompany.com #5.0.0 smtp; 550-Blacklisted in rhsbl.ahbl.org.: List shut down. See: 550 http://www.ahbl.org/content/last-notice-wildcarding-services-jan-1st>'

And now I'm dealing with a bunch of people whinging that they can't get mail to us because they're using one of those three zones in their filtering chain, with the obvious side effects. Fucking hate being an email provider sometimes.

Yeah, they do that when decommissioning lists. This change was announced in March 2014, so all the people who are complaining to you haven't been properly maintaining their mail filtering rules.
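For anyone unfamiliar with why a wildcard breaks everything: a DNSBL/RHSBL check just prepends the sender's domain (or reversed IP) to the list zone and looks for any answer, so a wildcarded zone makes every sender look listed. A sketch of the query construction (the helper names are made up; no actual DNS lookup is performed here):

```python
def rhsbl_query_name(sender_domain, zone):
    """RHSBL: the sender's domain is prepended to the list zone."""
    return f"{sender_domain}.{zone}"

def dnsbl_query_name(ip, zone):
    """DNSBL: the IPv4 address is reversed octet-wise, then prepended."""
    octets = ip.split(".")
    return ".".join(reversed(octets)) + "." + zone

print(rhsbl_query_name("mycompany.com", "rhsbl.ahbl.org"))
# → mycompany.com.rhsbl.ahbl.org
print(dnsbl_query_name("192.0.2.1", "dnsbl.ahbl.org"))
# → 1.2.0.192.dnsbl.ahbl.org
```

With the zone wildcarded, any of these names resolves, and a mail server treating "resolves" as "listed" rejects everything.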
 

Klockwerk

finni said (http://meincmagazine.com/civis/viewtopic.php?p=30207267#p30207267):
Danger Mouse said (http://meincmagazine.com/civis/viewtopic.php?p=30206597#p30206597):
...my own potential failure points (Cisco DMM server)
Cisco DangerMouseMouse server?

DangerMouseMurderer?

It's a pretty big failure point.

That's alright, we have a RAID (Redundant Array of Inexpensive Dangermice) - any failures or deaths and his ability to post in this thread is protected.
 

Klockwerk

afidel said (http://meincmagazine.com/civis/viewtopic.php?p=30582245#p30582245):
sryan2k1 said (http://meincmagazine.com/civis/viewtopic.php?p=30581945#p30581945): The restart management agent .sh is always safe to run in production. It has roughly the same effect as rebooting vCenter. vMotion and other tasks that rely on vCenter being alive won't work until they reconnect, but everything else will chug along.
I would evacuate the host first though, I've had it where the agent couldn't restart and then you have no way to get VM's off the host other than powering them off and starting them on a different host.

Very, very occasionally I would have this happen, but things had to be really, really broken. Most of the time, restarting the management agents fixed the problem without issue.
 

Klockwerk

Looking at solutions for fastest possible, ideally automated failover/recovery between two server rooms located ~500 meters apart (different buildings, same site), at SMB scale (no more than half a dozen VMware hosts, up to 10TB of space, 1-2TB of it on SSD) and with plenty of legacy applications involved (so no application-based multi-site HA). So far the options seem to be:

  • HP 3PAR Remote Copy with Peer Resilience - seems fairly straightforward, simple IP or FC replication and ALUA for clients, they claim they have some special sauce for reducing latency (important with SSDs I suppose). Also seems to be available across their entire product range, so don't need to go for the big boy arrays to get the features.
  • NetApp MetroCluster - conceptually similar to 3PAR, also does automatic transparent failover, but topology is significantly more complex (need ethernet for cluster replication plus FC for cache replication plus at least four SAS-fiber bridges for the disks), and only available with FAS8020 and up.
  • NetApp MetroCluster + FlexArray - potentially somewhat simpler topology (can utilize the same FC fabric for connecting to LUNs and FCVI adapters), plus can possibly reuse existing storage.
  • IBM/Lenovo Storwize V3700 Metro Mirror - configuration seems fairly simple, but actual failover seems to require SRM ($$$$$$$$).
  • EMC VNX with MirrorView - seems to suffer the same caveats as IBM's Metro Mirror.
  • VSAN, Nutanix - both support stretched cluster configurations, but stretched clustering is a fairly new feature for both (very new for VSAN, and I'm quite wary of using VMware's latest and greatest), and I'm concerned about stability.

Anything else I'm missing from the major vendors?

EMC Vplex does cache coherent storage at separate sites, and don't quote me but I believe you can use qualified third party storage (as well as EMC stuff). Going to be pricey though.
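Whichever array wins, the ALUA presentation is easy to sanity-check from the ESXi side; a sketch using the standard esxcli storage namespaces (run on a host, nothing here is vendor-specific):

```shell
# With an ALUA-capable metro array you'd expect devices claimed by
# VMW_SATP_ALUA, with active/optimized paths to the local array and
# active/unoptimized paths to the remote one.
check_alua_paths() {
    esxcli storage nmp device list   # claiming SATP and path selection policy per device
    esxcli storage core path list    # per-path state: active, standby, dead
}
```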
 

Klockwerk

Ars Praefectus
3,757
Subscriptor
The statute of limitations on not sharing stories from $A_PREVIOUS_WORK has passed (I've stopped caring), so:

At interview:
"We anticipate you moving to a more senior role more fitting to your experience after about 3 months, doing more in $AREA"
At the post-notice meeting a month or so later, after seeing the writing on the wall:
"We knew hiring you with your experience was a risk. We had planned to move you up after around 18 months in the role"

Manager: "We were expecting to get 5 $ISSUE from the customer per week, as per discussions with the customer before the contract was signed"
Me: "We got 5 this afternoon"
Each issue could take 2 to 4 hours or more of wall clock time to either fix or mark as not fixable, and could drag on, in the worst case, for three weeks of telling the end customer their internet was down and to expect a technician "on or before the $TELCO commitment date of Y".

The customer was steadily outsourcing parts of their business, and their helpdesk morale was low. They had sites numbering in the mid hundreds scattered around. Their helpdesk was supposed to do basic troubleshooting and include contact information for followup. One ticket I received:
"A cable was unplugged"
 

Klockwerk

Ars Praefectus
3,757
Subscriptor
Oh, how much I love Microsoft.... Building my first Server 2016 VM and ready to scream. I loaded the OS and VMware Tools, try to do a Windows update and it finds a couple of updates needed. Started to download and it hangs at 88% for a couple of hours. I give it a reboot and it hangs downloading updates again, no network traffic. Wiped and reloaded the OS and let it sit at downloading updates 88% all last night and still the same thing. Anyone out there been able to get Server 2016 to update yet? I am dreading a Microsoft call.

There are a couple of test boxes here that are updating without issue, running on VMware.
 

Klockwerk

Ars Praefectus
3,757
Subscriptor
TIL that troubleshooting Dell server issues goes like this:
1. Issue occurs
2. Told to update iDRAC a different way
3. No, wait, there's even a safer way to update iDRAC
4. Your iDRAC and BIOS are Current-1, thus the physical problem wasn't captured in the logs. Better update right now and run diagnostics. There might be an ECC memory issue, in which case you need to go into the datacenter and remove power for 30 seconds. Or until you complain and they decide that might not be necessary. Seriously, I need to remove power for 30 seconds to troubleshoot a memory issue?
5. Your iDRAC install is broken? You'll need to physically be present in the datacenter to wipe all iDRAC settings and pull the power cords for 30 seconds, or 2 minutes, whichever number Dell support picks out of the air.
6. Wait, you're not happy about having to travel to the datacenter each and every time you update firmware because there's a 30% chance it'll break? Uh, let me have a look - ooh, perhaps if you try this command to reset iDRAC while logged into the iDRAC via ssh it might work!
7. Oh, wow, that didn't work, but if I use this other ssh command to reset the power state of the server itself it'll magically fix iDRAC. And I won't have to go into the datacenter.

I don't know if they don't know how to fix these things, or if it's an easy 'throw stuff at the customer until they grumble and go into the datacenter' tech support thing.

Let's ignore the firmware ISO that doesn't actually update all the firmware on the first pass. Nice to find that out when I'm halfway through updating servers.
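The two ssh-side fixes from steps 6 and 7 can be wrapped up for next time. A sketch using real racadm subcommands (`racreset`, `serveraction`), but the iDRAC hostname and credentials here are placeholders, and the exact options vary by iDRAC generation, so check the racadm reference before relying on it:

```shell
# Step 6: soft-reboot the iDRAC itself without touching the host's power state.
reset_idrac() {
    ssh root@idrac.example.com "racadm racreset soft"
}

# Step 7: power cycle the server itself via the iDRAC (the last-resort fixup).
powercycle_server() {
    ssh root@idrac.example.com "racadm serveraction powercycle"
}
```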
 

Klockwerk

Ars Praefectus
3,757
Subscriptor
[url=http://meincmagazine.com/civis/viewtopic.php?p=32415723#p32415723:encknutm said:
Klockwerk[/url]":encknutm]TIL that troubleshooting Dell server issues goes like this:

Is this with regular Dell support, or ProSupport? I've found the latter to be pretty good.

ProSupport. Some seem to be good. My impression is that their training or process heavily favours removing power/resetting iDRAC.


[url=http://meincmagazine.com/civis/viewtopic.php?p=32415723#p32415723:encknutm said:
Klockwerk[/url]":encknutm]You'll need to physically be present in the datacenter

Protip: People on phone calls can't tell where you are in the world.

Clarification: Lie to Dell Support next time to get what you need: SUPPORT.

I've done tech support, and I make mistakes too, so I'm happy to follow the process in case I missed something, even if it potentially wastes everyone's time by requiring a tech to come out or similar. I *will* push back if what they're saying doesn't make sense to me, or if their remote management boils down to "We don't have remote management". It's working well so far in terms of outcomes, but not so well in terms of frustration levels (things will quieten down during the change freeze, thankfully).

Or get some network controlled PDUs.


Everything in all of our datacenters has per outlet remote power control.

We don't have these, and Boss^2 has already decided it would be helpful to have these. Perhaps Santa will be nice to me sometime in the new year.

But Dell Support said you have to physically pull the cables.

So about that lying thing....

I'm new to this position, and some things have been left to fester. I'm hoping that the next round of updating will go much smoother as we won't need to jump so many levels. Fingers crossed, anyways.
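For what it's worth, once those network-controlled PDUs arrive, per-outlet power cycling is scriptable over SNMP. A sketch for an APC-style PDU using net-snmp's snmpset; the hostname, community string, and outlet number are placeholders, and the OID is APC's sPDUOutletCtl, so verify against your PDU's MIB first:

```shell
# Reboot a single PDU outlet via SNMP.
# Writing integer 3 (outletReboot) to sPDUOutletCtl.<outlet> cycles that outlet.
reboot_outlet() {
    outlet="$1"
    snmpset -v1 -c private pdu.example.com \
        ".1.3.6.1.4.1.318.1.1.4.4.2.1.3.${outlet}" i 3
}
```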
 

Klockwerk

Ars Praefectus
3,757
Subscriptor
Or, the drive was an internal, non-hot-swap type, and there was a bunch of production gear piled on top of the server in question.

Not quite. Hyperconverged special sauce node.

I arrived at the datacenter to find the service company technician and the replacement harddisk. Day was as follows:
* About 2 hours of $TECH trying to get an intelligible phone conversation with the vendor. Once we'd had enough of playing around and told the vendor to just call us back on one of our mobiles, we got to ..
* Half hour to an hour getting everything ready via webex
* 10 minutes imaging the drive (definitely not hot-swappable, and physically couldn't be mirrored because there was only one SATA port on the unit) off onto other local storage
* 10 minutes swapping the drive after powering down and removing the unit (inclusive)
* One and a halfish hours of imaging the drive back, which included sitting there watching dd do nothing (would have been nice to see progress and speed, but those arguments weren't used), finally going 'screw this, I'm going to lunch' and coming back to find the dd finished
* Much more time getting the virtualisation layer all working again, plus a bunch of checks and waiting for the hyperconverged cluster to resync.

** Timeline isn't exact, but I arrived at 08:30 and finished work on the unit around 15:00, including a half hour break for lunch.

Vendor support had prepped me with horror stories, starting with ".. and if the image off has taken more than 4 hours, we recommend cancelling the image, putting in the new part and installing from scratch".
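Incidentally, the progress output missed above does exist: GNU dd's `status=progress` prints live bytes copied and throughput. A sketch, with the real-disk command left as a comment since the device and destination paths are hypothetical:

```shell
# What the imaging pass might have looked like with progress reporting
# (device/destination paths hypothetical):
#   dd if=/dev/sda of=/mnt/scratch/sda.img bs=4M status=progress conv=fsync
# Harmless demo on a scratch file, safe to run anywhere:
dd if=/dev/zero of=/tmp/dd_demo.img bs=1M count=8 status=progress
ls -lh /tmp/dd_demo.img
```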
 