Tag Archive: NetApp


This is a blog post that I’ve had at the back of my mind for a good 6 months or so. The pieces of the puzzle came together after the Gestalt IT Tech Field Day event in Boston. After spending the best part of a week with some very, very clever virtualisation pros, I think I’ve managed to marshal the ideas that have been trying to make the cerebral cortex to WordPress migration for some time!

Managing an environment, be it physical or virtual, for capacity & performance requires tools that can provide you with a view along the timeline. Often the key difference between dedicated “capacity management” offerings and performance management tools is the scale of that timeline.


Short Term : Performance & Availability

Here we are looking at timings within a few seconds / minutes (or less). This is where a toolset is going to be focused on current performance for a particular metric, be it the response time to load a web application, utilisation of a processor core or command operations rate on a disk array. The tools best placed to give us that information need to be capable of processing a large volume of data very quickly, because they have to pull in a given metric at a very frequent interval. The more frequently you can sample the data, the better quality output the tool can give.

This can present a problem in large scale deployments, because many tools have to write this data out to a table in a database. That potentially tethers the performance of a monitoring tool to the underlying storage available to it, which of course can be increased, but sometimes at quite a significant cost. As a result you may want to scope the use of such tools only to the workloads that require that short term, high resolution monitoring.

In a production environment with a known baseline workload, tools that use a dynamic threshold / profile for alerting on a metric can be very useful here (for example Xangati or vCenter Operations). If you don’t have a workload that can be suitably baselined (and note that the baseline can vary with your business cycle, so may well take 12 months to establish!) then dynamic thresholds are not of as much use.
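To make the dynamic threshold idea a little more concrete, here is a minimal Python sketch. It is not how Xangati or vCenter Operations work under the covers; it simply keeps a rolling window of recent samples as the baseline and flags anything more than a few standard deviations away from it. The class name, window size and sigma value are assumptions made up for illustration.

```python
from collections import deque
from statistics import mean, stdev


class DynamicThreshold:
    """Flag samples that stray too far from a rolling baseline (illustrative only)."""

    def __init__(self, window=60, sigmas=3.0, min_samples=10):
        self.samples = deque(maxlen=window)  # most recent "normal" samples
        self.sigmas = sigmas
        self.min_samples = min_samples       # no alerting until a baseline exists

    def observe(self, value):
        """Return True if value is anomalous against the current baseline."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            spread = stdev(self.samples)
            anomalous = abs(value - baseline) > self.sigmas * spread
        # Only fold normal samples into the baseline, so a single spike
        # does not immediately widen the threshold.
        if not anomalous:
            self.samples.append(value)
        return anomalous
```

Note the `min_samples` guard: until enough history has been collected the sketch never alerts, which is the small scale version of the “it may take 12 months to establish a baseline” problem.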

Availability tools have less of a reliance on a high performance data layer, as they are essentially storing a single bit of data (up or down) for a given metric. This means the toolset can scale pretty well. The key part of availability monitoring is the visualisation and reporting layer. There is no point only displaying that data on a beautiful and elegant dashboard if no-one is there to see that dashboard (and according to the Zen theory of network operations, would it change if there was no one there to watch it!). The data needs to be fed into a system that best allows an action to be taken, even if it’s an SMS / page to someone who is asleep. In this kind of case, having suitable thresholds is important: you don’t want to be setting fire alarms off for a blip in a system that does not affect the end service. Know the dependencies of the service and try to ensure that the root cause alert is the first one sent out. You need to know that the router that affects 10,000 websites is out long before you get alerts for those individual websites.
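The router and websites example can be sketched in a few lines of Python. This is a rough illustration rather than any particular product’s correlation engine: given the set of things currently failing and a map of what depends on what, it only reports the failures that have no failed component upstream of them. The function name and the host names in the usage note are hypothetical.

```python
def root_cause_alerts(failed, depends_on):
    """Return only the failed items that have no failed dependency upstream.

    failed      -- set of names currently failing their availability check
    depends_on  -- dict mapping a name to the names it directly depends on
    """
    def has_failed_upstream(name, seen):
        for parent in depends_on.get(name, ()):
            if parent in seen:
                continue  # guard against loops in the dependency map
            seen.add(parent)
            if parent in failed or has_failed_upstream(parent, seen):
                return True
        return False

    return sorted(n for n in failed if not has_failed_upstream(n, {n}))
```

With `depends_on = {"site-a": ["web-1"], "site-b": ["web-1"], "web-1": ["router-1"]}` and everything down, only `router-1` is returned, so the person being paged sees the router first and the individual sites can be held back or attached to that alert.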

Medium Term : Trending & Optimisation

Where the timeline goes beyond “what’s wrong now”, you can start to look at what’s going to go wrong soon. This is edge of the crystal ball stuff, where predictions are being made in the order of days / weeks. Based on utilisation data collected over a given period, we can assess whether we have sufficient capacity to provide an acceptable service level in the near future. At this stage, adjustments can be made to the infrastructure in the form of resource balancing (by storage or traditional load), and tweaks can also be made to virtual machine configuration to “rightsize” an environment. By using these techniques it is possible to reclaim over-allocated space and delay potential hardware expansions. This is especially valuable where there may be a long lead time on a hardware order. The recommendations generated by the capacity optimisation components of VKernel, NetApp (Akorri) and SolarWinds products are great examples of rightsizing calculations. As the environment scales up, we are not only looking for optimisations: potential automated remediation (within the bounds of a change controlled environment) would save time and therefore money.
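As a very simple illustration of this kind of days / weeks prediction, the sketch below fits a straight line through daily usage readings and estimates how long until a datastore fills up. It assumes roughly linear growth and one reading per day, which real tools will not limit themselves to; the function name and units are made up for the example.

```python
def days_until_full(used_gb, capacity_gb):
    """Estimate days until capacity is reached from daily usage samples.

    used_gb is a list with one reading per day, oldest first. Returns None
    when usage is flat or shrinking, since no exhaustion date can be projected.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two daily samples")
    days = range(n)
    mean_x = sum(days) / n
    mean_y = sum(used_gb) / n
    # Ordinary least-squares slope: growth in GB per day.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, used_gb))
    variance = sum((x - mean_x) ** 2 for x in days)
    slope = covariance / variance
    if slope <= 0:
        return None
    return max(0.0, (capacity_gb - used_gb[-1]) / slope)
```

If the answer comes back shorter than the lead time on a hardware order, that is the cue to start rightsizing or rebalancing now rather than later.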

Long Term capacity analysis : When do we need to migrate Data centers ?

Trying to predict what is going to happen to an IT infrastructure in the long term is a little like trying to predict the weather in 5 years’ time: you know roughly what might happen but you don’t really know when. Taking a tangent away from the technology side of things, this is where the IT strategy comes in: knowing what applications are likely to come into the pipeline. Without this knowledge you can only guess how much capacity you will need in the long term. The process can be bidirectional though, with the information from a capacity management function being fed back into the wider picture for architectural strategy. For example, should a lack of physical space be discovered, this may combine with a strategy to refresh existing servers with blades. Larger enterprises will often deploy dedicated capacity management software to do this (for example Metron’s Athene product, which will model capacity for not only the virtual but the physical environment).

Long term trending is a key part of a capacity management strategy, but it needs to be blended with a solution that allows environmental modelling and what-if scenarios. Within the virtual environment, the scheduled modelling feature of VKernel’s vOperations Suite is possibly the best example of this that I’ve come across so far. All that is missing is an API to link to any particular enterprise architecture applications. When planning for growth, not only must the growth of the application set be considered but also the expansion in the management framework around it, including but not limited to backup and the short-medium term monitoring solutions. Unless you are consuming your IT infrastructure as a service, you will not be able to get away with a suite that only looks at the virtual piece of the puzzle. Power, cooling & available space need to be considered. Look far enough into the future and you may want to look at some new premises!
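A what-if scenario does not have to be complicated to be useful. The toy model below is a sketch only, and has nothing to do with how Athene or vOperations Suite model things: it combines steady organic growth with known projects from the application pipeline and reports the first month in which demand outgrows capacity. The function name, parameters and the ten year horizon are all assumptions for illustration, and the same arithmetic works whether the unit is TB, rack units or kW.

```python
def months_until_exhausted(capacity, in_use, monthly_growth, planned=()):
    """Walk forward month by month until demand exceeds capacity.

    capacity        -- total units available (e.g. TB, rack units, kW)
    in_use          -- units consumed today
    monthly_growth  -- organic growth as a fraction, e.g. 0.03 for 3% a month
    planned         -- iterable of (month, extra_units) for known projects
    Returns the first month in which demand exceeds capacity, or None if
    that does not happen within ten years.
    """
    step_changes = {}
    for month, extra in planned:
        step_changes[month] = step_changes.get(month, 0) + extra
    demand = in_use
    for month in range(1, 121):
        demand = demand * (1 + monthly_growth) + step_changes.get(month, 0)
        if demand > capacity:
            return month
    return None
```

Running it once with and once without a planned project shows how much runway that project costs you, which is exactly the sort of number that is worth feeding back into the architectural strategy.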

We’re going to need a bigger house to fit the one pane of glass into…

“One pane of glass” is a phrase I hear very often, but not something I’ve really seen so far. Given the many facets of a management solution I have touched on above, that single pane of glass is going to need to display a lot! With so many metrics and visualisations to put together, you’d have a very cluttered single pane. Consolidating data from many systems into a mash-up portal is about the best that can be done, and yet there isn’t a single framework to date that can really tick all the boxes. Given the lack of a “saviour” product you may feel disheartened, but have faith! The ecosystem is beginning to realise that no single vendor can give you everything, and that an integrated management platform that can not only display consolidated data but also act as a databus to share it between those discrete facets is very high on the enterprise wishlist, so we may see something yet.

I’d like to leave you with some of the inspiration for this post, as seen on a recent “Demotivational Poster”: a quick reminder of perfection being in the eye of the beholder.

“No matter how good she looks, some other guy is sick and tired of putting up with her s***”

I’ve been lucky enough in the last couple of days to get hands on with some Cisco UCS kit. Coming from a 99% HP environment, it’s been a very new experience. I’ll try not to get too bogged down in technical details, but wanted to note down what I liked and what I didn’t like about my initial ventures into UCS.

 

As ever with things like this, I didn’t spend weeks reading the manual. If I did that I’d do nothing but read manuals, with no time to do any actual work :) I did go through a few blog posts and guides by fellow bloggers who have covered UCS in much more detail than I will (at this stage at least).

 

It seems that the unique selling point of the UCS system is the “service profile”. Rather than setting up a given blade in a given slot, a profile is created and then either assigned to a specific server or allocated from a pool of servers. The profile contains a number of configuration items, such as the number and configuration of NICs & HBAs that a blade will have, and the order in which the server will try devices for boot.

 

The last item seems the most critical, because in order to turn our UCS blades into stateless bits of tin, I am building the service profiles to boot from SAN. Specifically they will be booting into ESXi, stored on a LUN on a NetApp FAS2020 storage unit. The NetApp kit was also a little on the new side to me, so I’m looking forward to documenting my journey with that too!

 

Before heading deep into deploying multiple service profiles from a template, I thought I would start with some (relative) baby steps: create a single service profile, apply that profile to a blade and install ESXi onto an attached LUN, which I would then boot from. A colleague had predefined some MAC & WWN pools for me, so I didn’t have to worry about what was going to happen with those.

 

Creating the service profile from scratch using the expert mode ran me through a fairly lengthy wizard that allowed me to deploy a pair of vNICs and a pair of vHBAs on the appropriate fabrics. A boot policy was also defined to enable boot from a virtual CD-ROM, followed by the SAN boot. At this point I found my first gotcha. It was a lot easier to give the vHBAs a generic name, such as fc0 and fc1, rather than a device specific one, e.g. SRV01-HBA-A. Using the generic name would later allow me to use the same boot policy for all servers at a template level. You also have to specify the WWPN for the SAN target, and as the lab only had a single SAN at the time of writing, a single set of WWPNs could be put in. If you had requirements for different target WWPNs you would need a number of boot policies.

Working our way back down the stack to the storage, the next task was to create the zone on the Nexus 5000 fabric switches. For Cisco “old hands”, here is a great video on how to do this via an SSH session.

 

Video thanks to : http://blogs.cisco.com/datacenter/the-san-boot-soon-will-be-making-another-lun/

 

I had just spent a bit of time getting a local install of Fabric Manager to run, due to the local Postgres DB service account losing its rights to run as a service, which was nice :) so I was determined to use Fabric Manager to define the zones. As with zoning on any system, you need to persuade the HBA to log into the fabric. As a boot target had already been defined, the blade will attempt to log into the fabric on startup, but it did mean powering it on and waiting for the SAN boot to fail. Once this was done the HBAs could be assigned an alias, then dropped into a zone along with the WWPN of the storage, and finally rolled up into a zone set. Given that the UCS is supposed to be a unified system, this particular step seems a little bit clunky and would take me quite some time if I had 100 blades to configure. I will be interested to see if I can find a more elegant solution in the upcoming weeks.
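One low-tech way to take some of the pain out of the 100 blade case is to generate the zoning commands rather than click through them. The Python sketch below is not something used in this lab build. It takes a table of alias names and initiator WWPNs and spits out NX-OS style device-alias, zone and zoneset commands, one single-initiator zone per HBA. The command syntax is as I understand it for NX-OS, and the alias names, WWPNs, VSAN number and zone set name in the usage note are invented, so check everything against the switch’s own command reference before pasting anything in.

```python
def zoning_commands(hosts, target_pwwn, vsan, zoneset):
    """Build NX-OS style zoning commands for single-initiator zones.

    hosts       -- dict of alias name -> initiator pWWN on this fabric
    target_pwwn -- pWWN of the storage port on this fabric
    vsan        -- VSAN number the zones live in (assumed value)
    zoneset     -- name of the zone set to add the zones to and activate
    """
    lines = ["device-alias database"]
    for alias, pwwn in sorted(hosts.items()):
        lines.append(f"  device-alias name {alias} pwwn {pwwn}")
    lines.append("device-alias commit")

    zone_names = []
    for alias in sorted(hosts):
        zone = f"{alias}_to_storage"  # hypothetical naming convention
        zone_names.append(zone)
        lines.append(f"zone name {zone} vsan {vsan}")
        lines.append(f"  member device-alias {alias}")
        lines.append(f"  member pwwn {target_pwwn}")

    lines.append(f"zoneset name {zoneset} vsan {vsan}")
    lines.extend(f"  member {zone}" for zone in zone_names)
    lines.append(f"zoneset activate name {zoneset} vsan {vsan}")
    return lines
```

Calling `print("\n".join(zoning_commands({"esx01-fc0": "20:00:00:25:b5:00:0a:01"}, "50:0a:09:81:00:00:00:01", 10, "fabric-a")))` gives a block of text to review for fabric A; the same call with the fc1 WWPNs and the other storage port covers fabric B. Because the WWPNs come from pools defined in UCS, the table can be built before a blade has ever logged into the fabric.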

 

Last but not least, I had to configure a disk. For this I used NetApp System Manager to create a LUN and associated volume. I then added an initiator group containing the two HBA WWPNs and presented the LUN to that group. Again this seems like quite a lot of steps to be doing when provisioning a large number of hosts. Any orchestration system that set out to scale this up would have to be able to talk to UCS or the fabric to pull the WWPNs, provision the storage and present it accordingly.
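The storage side can be scripted in the same spirit. The sketch below is an assumption-heavy illustration rather than what was done here (System Manager did the work in this build): it produces the volume, LUN, igroup and mapping commands for one host, using Data ONTAP 7-Mode style CLI syntax as I understand it. The aggregate name, volume naming convention, LUN size and the 2 GB of headroom on the volume are all made up, so treat the output as something to review against the filer’s documentation, not something to run blind.

```python
def boot_lun_commands(host, wwpns, aggregate="aggr0", lun_size_gb=10):
    """Build 7-Mode style commands for a per-host ESXi boot LUN.

    host         -- short host name, reused as the igroup name
    wwpns        -- list of the host's HBA WWPNs (one per fabric)
    aggregate    -- aggregate to create the volume in (assumed name)
    lun_size_gb  -- size of the boot LUN in GB (assumed value)
    """
    volume = f"{host}_boot"       # hypothetical naming convention
    lun_path = f"/vol/{volume}/esxi"
    return [
        # Volume sized slightly larger than the LUN it holds.
        f"vol create {volume} {aggregate} {lun_size_gb + 2}g",
        f"lun create -s {lun_size_gb}g -t vmware {lun_path}",
        # FCP initiator group containing both HBAs.
        f"igroup create -f -t vmware {host} {' '.join(wwpns)}",
        # Present the LUN as ID 0 so the boot policy can find it.
        f"lun map {lun_path} {host} 0",
    ]
```

Fed from the same host and WWPN table as the zoning step, a loop over 100 hosts gives 400 lines of reviewable commands, which is a lot less clicking than repeating the wizard 100 times.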

 

The last step was to mount an ISO to the blade and install ESXi. This is the only step where I’m not really left pondering how I would do the install if it was not 1 but 100 hosts I had to deploy: I’d certainly look to PXE boot the servers and deploy ESXi with something like the EDA. By this stage I figured it was time to sit back with a cup of tea and ponder further about how to scale this out a bit. However, when I rebooted the server post ESXi install, instead of ESXi starting I was dumped back to the “no boot device found: hit any key” message.

 

This was a bit of a setback as you can imagine, so I started to troubleshoot from the ground up. Had I zoned it correctly? Did I present it correctly? Had I got the boot policy correct? I worked my way through every blog post and guide I could find, but to no avail. I even attempted to recreate the service profile on the same blade, but again no joy. It would see the LUN to install to, but not to boot from. As Charlie Sheen has shown, “when the going gets tough, the tough get tweeting”, so I reached out to the hive mind that is Twitter. I had some great replies from @ChrisFendya and @Mike_Laverick, who both suggested a hardware reset (although Mike suggested it in a non UCS way). The best way for me to achieve this was to “migrate” the service profile to another blade. This was really easy to do, and one reboot later I was very relieved to see it had worked. It seems that sometimes UCS just doesn’t set the boot policy on the HBA, which is resolved by reassociating the profile.

 

I look forward to being able to deploy a few more hosts and making my UCS setup as agile as the marketing materials would suggest!

Reading , wRighting and Recording – Measure how your applications hit your disks!

I’ve spent the last week thinking more about storage than I usually would, particularly in the light of some of the conversations I’ve been having over Tech Field Day with the other delegates & sponsors, who have had varying levels of interest & expertise within the storage world. If, like me, you have a basic appreciation of storage but want to get in that little bit deeper, a good primer would be Joe Onisick’s storage protocols guide at DefinetheCloud.net

Admins working in smaller shops probably have a little closer control over the storage they buy, as they are likely to be the ones specifying, configuring and crying over it when it goes wrong. One of the cons of working for a large enterprise is that the storage team tends to be separate. They guard their skills and disk shelves quite closely, sometimes a little too closely; I do wonder if their school reports used to say “does not play well with others”. The SAN is seen as a bit of a black box by the rest of the department, and generally as long as the required capacity is available to someone when they ask for it, be it a LUN or VMware datastore, everyone is happy to let them get on with it.

As soon as there is a performance issue, however, that happy boat starts to rock. The storage team starts to get defensive, casting forth whitepapers & best practice guides as if they were a World of Warcraft character holding a last stand. At some point you may well find that you hit the underlying best performance of the SAN, no matter how well tuned it is. You are then left in a bit of a quandary about what to do. In the worst case you have to bite the bullet and move that application, which looked like the lowest of the low hanging fruit, back onto a physical server with direct attached storage, where it’ll smugly idle at 5% utilisation for the rest of its life, forever causing reproachful looks when you walk past it in the datacenter.

How do you avoid the sorry tale above? In a nutshell, “Know your Workload!” When you start to map what your applications are actually using, you can start to size your environment accordingly. One of the bigger shocks that I’ve come across when doing such an exercise is a much heavier proportion of writes than the industry would have us expect. This causes a big problem for storage vendors who rely on flash based cache to hit their headline performance figures. When reading from a cache, of course the performance will be great, but under a write intensive load the performance of the underlying disk starts to be exposed, and it seems to come down to the number and speed of spindles. Running a system that uses intelligent tiering to write hot blocks in the fastest way, then cascade them down the array as they get cooler, could help in this instance. Depending on your preference for file or block level storage, there are a number of vendors who could help you with this, for example Avere Systems, 3PAR or the next generation of EMC’s FAST technology.
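To see why the write proportion matters so much, it helps to run the classic rule-of-thumb arithmetic. The sketch below uses the commonly quoted RAID write penalties (2 back-end I/Os per write for RAID 10, 4 for RAID 5, 6 for RAID 6) to turn a front-end workload into back-end disk I/O and a rough spindle count. It ignores cache entirely and will not describe arrays with write-anywhere layouts or clever tiering, so treat it as a back-of-an-envelope illustration; the function names and the per-disk IOPS figure you feed it are assumptions.

```python
import math

# Commonly quoted rule-of-thumb back-end I/Os per front-end write.
RAID_WRITE_PENALTY = {"raid0": 1, "raid10": 2, "raid5": 4, "raid6": 6}


def backend_iops(frontend_iops, write_fraction, raid_level):
    """Back-end disk IOPS needed to serve a front-end workload, ignoring cache."""
    penalty = RAID_WRITE_PENALTY[raid_level]
    reads = frontend_iops * (1 - write_fraction)
    writes = frontend_iops * write_fraction
    return reads + writes * penalty


def spindles_needed(frontend_iops, write_fraction, raid_level, iops_per_disk):
    """Rough number of disks needed, given an assumed per-disk IOPS figure."""
    return math.ceil(backend_iops(frontend_iops, write_fraction, raid_level) / iops_per_disk)
```

By this arithmetic, 1,000 front-end IOPS at 30% writes on RAID 5 works out at 700 reads plus 300 × 4 writes, or 1,900 back-end IOPS. Push the same workload to 70% writes and it becomes 3,100, which is why measuring the real read / write mix before sizing is worth the effort.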

At Tech Field Day, NetApp, VMware and Cisco presented their FlexPod solution for a scalable and secure multi tenant virtualised infrastructure. If you’d like to watch the recording of the presentation, it’s available here. What would appear to differentiate the FlexPod from other products is that it is not a black box device, designed to drop into a data centre to provide X number of VMs, where when you have X+1 VMs you just go out and buy another device.

While you can approach a VAR and order a FlexPod as a single unit, the design and architecture is what makes it a “FlexPod”: a single bill of materials that can be put together to give a known configuration. The view is that this offers greater agility of design, for example using a NetApp V-Series head to present storage from another vendor to the solution.

To me , this seems a little bit like buying a kit car.

You get a known design and a list of components you have to source, although the design may well recommend where you source them. Sometimes you can get them part built or pre-built, but if you want to run it with a different engine, you can drop one in should you so desire.

 

The VBlock from the VCE guys is a different kettle of fish. It’s not a design guide, it’s a product. You choose the VBlock that suits the size of deployment that you want to do, order it, then sit back and wait for the ready built solution to arrive on the back of a lorry (truck to our US friends ;) ). This is like ordering a car from a dealership.


Of course you could just go to any reseller, buy a bunch of servers, network & hardware and then install ESX on it. The stack vendors might compare this to trying to hand cast your car from a single block of metal!


At the moment many of us who can already design a solution from scratch are at that hand casting level, and while I won’t deny we’ve been through a few pain points, we’ve usually been able to fix them. It’s part of the skill that keeps us employed. By going for an “off the shelf” product, the pain of that part of a system design is divorced from the solution, and perhaps that allows focus on what may be the next part of the design at the service and application level: don’t worry about building a car, worry about driving it! If you need a car to drive to work and do the weekly shopping in, you buy one from a dealership, but if you have a specific need, then you may have to get into the workshop and build a car that meets those needs. If you want to concentrate

When a prebuilt solution develops a problem that requires support, the offerings from the major vendors seem to differ a little. If you have a VBlock, you have one throat to choke (presumably not your own; it’s only a computer problem, don’t let it get to you ;) ) and one number to call. They will let the engineers from the different divisions fight it out and fix your problem, which is ultimately the only thing of concern to you as an owner.

The situation with a FlexPod seems a little less intuitive. As it’s not a single SKU, you would require a separate support contract with each vendor (of course this may be marshalled by the VAR you purchase through). You would initiate contact with the vendor of your choice; they then have a channel under the skin to work with the engineering functions of the other partners at the network, storage, compute & hypervisor arms as required. I would like to think this does not mean that the buck gets passed for a couple of rounds before anyone takes ownership of the problem, but I’ve yet to hear of anyone requiring this level of support. If you have, and had a positive or negative experience, please get in contact.

If you have “rolled your own” solution, then support is up to you! Make sure that you have a similar SLA across the stack, or you could find yourself in a situation where you have a very fast response from your hypervisor people, but when they work out it’s your storage at fault, your storage support might make you wait till the next day / end of the week. If this does happen to you, then I’m sure you’ll have plenty of time to clear your desk….

 


I’ve almost recovered from my hectic week of jet-setting for this year, starting with the VCAP-DCD beta exam in Amsterdam and culminating in a few days of visiting vendors for talks and roundtables in Silicon Valley. It was my first visit to the west coast, so I was initially star struck by it all. Seeing names on buildings that you normally only ever see as a URL really pushes home how close you are to the technology, and it’s not hard to get caught up in the buzz of it. I lost count of the number of startup ideas I heard over the course of the event!

For those of you who haven’t heard of the Tech Field Day concept before, here is a brief guide. Following on from a concept launched by HP, the field day brings a number of delegates from the user community together with a vendor or vendors for a session that should be a little bit more in depth than your average marketing pitch. The delegates are not there to buy anything, and are in no way obliged to write about their experiences, although food & drink, travel & accommodation expenses are covered by the sponsoring vendors.

This particular event marked a new direction for TFD in that it was streamed live over the web via ustream.tv. This potentially changed things in a couple of ways. The cameras were far from hidden, and I wonder if the fact that they were being broadcast affected some people’s candour; in a couple of cases the sponsors were prepared to say some things off camera that they were not prepared to say when the cameras were rolling. That said, the greater audience did mean that a few questions were asked that may not have been brought up had they not been mentioned on Twitter by someone watching the stream. I would like to think that I was as honest on camera as I’d have been off it!

I think the event is possibly better suited to the smaller vendors with a less refined marketing function. Of the larger vendors that we saw, the sessions felt a little pre-canned, with PowerPoint hitting a critical mass at one particular site. Making use of an “Executive Briefing Centre”, while it gives you access to nice comfy rooms with wireless internet access, does nudge conversations towards the more marketing side of things. Just using a regular conference room facilitated a more in depth discussion and 2 way communication. Perhaps there is a case for presentations to be done “in the round”, to use a theatrical example, with delegates sitting in a “doughnut” around the presenter.

Presenters that had a real passion for their product held the audience much better, a prime example of which was Dave Hitz, founder of NetApp. He was only booked in for a 15 minute slot, but stayed for most of the 4 hour session, which is a lot of time to dedicate for a guy in his position. Outside of his own slides he was active in the discussions around the topics. It was a shame he wasn’t able to stay for lunch, where I believe the best dialogue with the NetApp guys occurred.

In my next few blog posts I’m going to try and write about subjects that came up during the sessions, rather than a summary of each session, which you would be better off getting from watching the excellent recordings made by the PrimeImage Media guys.

 

For those that missed it, have a look at the following video from the day (my wonderful piece to camera is at about 1:41)

Tech Field Day 4 – Day 1 – NetApp 2 from Stephen Foskett on Vimeo.

 

One last thing – you may well have noticed my fledgling upper lip furniture – I’m growing a moustache this month as part of Movember – donating my face to men’s health. If you would like to donate to help men who have problems growing good facial hair like myself , then my MoSpace page is at http://uk.movember.com/mospace/1067584/