Wednesday, October 19, 2011

Simple auto-scale with HAProxy

A few years back I started to collect some "old" hardware at work, trying to turn it into a little "cluster" where I could do my own experiments and run some web apps (Ruby, Sinatra, Rails, Django, etc.), HDFS, Map-Reduce and stuff like that.

Basically, it goes like this:


Very simple, right?
Well, we started to fill it up with more apps and recently I realized we'll run out of memory sooner or later. So I changed the way Passenger was started: I reduced "--min-instances" to 0 and left it at the minimum (1) on one server only.

It worked... but not the way I wanted. While Nginx load-balances incoming requests it naturally health-checks the nodes (server 1, 2, ...N), which is a good thing, but my "--min-instances 0" startup parameter had very little impact, because:
  1. Nginx doesn't know about my "min-instances" parameter and considers all the nodes to always "be ready to serve";
  2. when a new request gets routed (by Nginx) to a node with "min-instances 0", it might take quite a few moments for the first response to be spit out by the Rails app instance (i.e. a sort of "warming up"), so the whole thing started to feel much slower;
  3. since Nginx load-balances round-robin, my Rails instances kept starting up and then getting shut down (by Phusion Passenger which is, by the way, doing a great job) because the activity, i.e. incoming requests, was actually not that high.
So, I wanted to make something very simple, with a minimum of code writing and configuring.
Here's what I came up with:
Nothing fancy, you'll say. A "standard" stack... - exactly! There's one little thing though: this "haproxy_autoscale" Python script I wrote. What it does is very simple:
  1. hooks into the HAProxy log file (I actually use syslog and option httplog)
  2. calculates the average response time over the latest 10-15 requests (i.e. how fast my Rails instances are responding back to the client/browser)
  3. if the average stays above a threshold for a while, starts to scale up
  4. when request activity goes down (e.g. no high load within 5 min), scales down.
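Steps 2 and 3 can be sketched on their own, separately from the log watching. This is only a minimal illustration in Python 3 syntax, not the script itself: the class name ResponseWindow and its parameters are made up for this example, and the defaults simply mirror the numbers mentioned in this post.

```python
from collections import deque


class ResponseWindow:
    """Rolling average over the latest `size` response times (milliseconds)."""

    def __init__(self, size=15, threshold_ms=500, strikes=3):
        # A deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=size)
        self.threshold_ms = threshold_ms
        self.strikes = strikes      # slow averages needed before scaling up
        self.slow_count = 0

    def add(self, response_ms):
        """Record one sample; return True when it is time to scale up."""
        self.samples.append(response_ms)
        if len(self.samples) < self.samples.maxlen:
            return False            # not enough data yet
        average = sum(self.samples) / len(self.samples)
        if average <= self.threshold_ms:
            return False            # fast enough, nothing to do
        self.slow_count += 1
        if self.slow_count < self.strikes:
            return False            # slow, but not for long enough yet
        self.slow_count = 0         # reset the counter after triggering
        return True
```

Feeding it one response time per log line, e.g. `window.add(742)`, returns True only once the window is full and the average has been above the threshold the required number of times.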
By "scale up" and "scale down" I simply mean taking a backend server out of MAINT status, or putting it back into it, using HAProxy socket commands:

scale up:

echo "enable server app1-backend/server1" | socat stdio /haproxy/socket

scale down:

echo "disable server app1-backend/server1" | socat stdio /haproxy/socket

That's it. What happens is most of my app servers and Rails instances do not get bothered anymore unless there really is a high load. That way I got my RAM back and can stuff in even more apps/hadoop map-red/whatever.
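If you'd rather not depend on socat, the socket commands above can also be sent straight from Python. What follows is a rough sketch in Python 3 syntax, not what the script does: haproxy_command is a name made up for this example, /haproxy/socket is the same placeholder path as above, and it assumes the stats socket is configured with a level that allows enable/disable commands.

```python
import socket


def haproxy_command(command, socket_path="/haproxy/socket", timeout=2.0):
    """Send one command to the HAProxy stats socket and return the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(socket_path)
        # Commands are plain text terminated by a newline.
        sock.sendall(command.encode("ascii") + b"\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:        # the other side closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", "replace")
```

Usage would look like `haproxy_command("enable server app1-backend/server1")`; whatever HAProxy answers comes back as a string.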

In case you're interested, here's that haproxy_autoscale.py script.
I have to warn you, though: by no means should you use it in a production environment as is. It's an ongoing experiment I'm running these days. This little script still needs quite a few touches, but it'll give you an idea.


import sys
import os
import time
import re
from threading import Timer
from datetime import datetime

import urllib2
import random

# response time threshold in milliseconds: when backend starts responding 
# slower than the threshold we scale up, otherwise scale down.
THRESHOLD = 500

# num of requests to calc average
NUM_REQ = 15

# need this backend to set correct initial status of backend servers
BACKEND = "app1backend"

# any url that goes straight to the backend is fine as warmup_url
# active == True will initially set its status as UP, MAINT otherwise
# always_up - never set MAINT on that backend (leave at least one host as always_up)
SERVERS = {
  'server1': { 'active': True,  
              'always_up': True,  
              'warmup_url': 'http://server1:1234/' },
              
  'server2'  : { 'active': False, 
              'always_up': False, 
              'warmup_url': 'http://server2:1234/' },
              
  'server3' : { 'active': False, 
              'always_up': False, 
              'warmup_url': 'http://server3:1234/' },
              
  'server4' : { 'active': False, 
              'always_up': False, 
              'warmup_url': 'http://server4:1234/' },
}

# see http://code.google.com/p/haproxy-docs/wiki/UnixSocketCommands
CMD_DISABLE    = 'echo "disable server b-%s/%s" | socat stdio /haproxy/socket'
CMD_ENABLE     = 'echo "enable server b-%s/%s" | socat stdio /haproxy/socket'
CMD_SET_WEIGHT = 'echo "set weight b-%s/%s %d" | socat stdio /haproxy/socket'

def watch(thefile):
  """
  opens thefile and keeps reading new lines.
  this is supposed to be a syslog log file.
  """
  thefile.seek(0,2)      # Go to the end of the file
  while True:
    line = thefile.readline()
    if not line:
      time.sleep(0.1)    # Sleep briefly
      continue
    yield line

def host_to_scaleup():
  """
  searches through the list of not yet active backends 
  and returns a random choice, otherwise returns None
  """
  hosts = filter(lambda h: not SERVERS[h]['active'], SERVERS)
  if len(hosts):
    return random.choice(hosts)
  # otherwise return None, nothing to scale up
  
def host_to_scaledown():
  """
  filters only active hosts and returns a random choice, 
  None otherwise.
  """
  hosts = filter(lambda h: SERVERS[h]['active'] and not SERVERS[h]['always_up'], SERVERS)
  if len(hosts):
    return random.choice(hosts)
  # otherwise return None, nothing to scale down
  
def scale_up(backend, host):
  """
  send a 'warmup' request to the host in question
  and adds it to the HAProxy's active backend servers list,
  i.e. sets UP status
  """
  warmup_url = SERVERS[host]['warmup_url']
  print "%s: warming up at %s" % (datetime.now(), warmup_url)
  req = urllib2.Request(warmup_url)
  req.add_header('User-Agent', 'haproxy_autoscale')
  try: 
    r = urllib2.urlopen(req)
    #print r.info()
  except urllib2.HTTPError, e:
    print "*** didn't get a 200/OK response, sorry: ", e.code
  except urllib2.URLError, e:
    print "*** couldn't reach the backend server: ", e.reason
  else:
    # send socket commands to (re-)enable the backend
    cmd1 = CMD_ENABLE % (backend, host)
    cmd2 = CMD_SET_WEIGHT % (backend, host, 10)
    os.system(cmd1)
    os.system(cmd2)
    SERVERS[host]['active'] = True
  
def scale_down(backend, host):
  """
  removes host from HAProxy active backend servers list,
  i.e. sets MAINT status
  """
  print "%s: turning DOWN b-%s/%s" % (datetime.now(), backend, host)
  cmd1 = CMD_SET_WEIGHT % (backend, host, 0)
  cmd2 = CMD_DISABLE % (backend, host)
  # for some reason "disable server" does not always work,
  # so we set the weight to 0 first, just in case.
  os.system(cmd1)
  os.system(cmd2)
  SERVERS[host]['active'] = False

# this is where we store response times
resps = []

def avg_resp_time(new_val):
  """
  adds new_val to the resps list
  and returns the average over all requests in the list
  (or None while there is not enough data yet).
  """
  resps.append(new_val)
  if len(resps) > NUM_REQ: 
    # keep list length up to the NUM_REQ maximum items
    del(resps[0])
    return sum(resps) / len(resps)
  # otherwise we return None: not enough data

def random_scale_up(backend):
  """
  picks a random inactive host (if any), scales it up
  and restarts the cooldown timer
  """
  h_up = host_to_scaleup()
  if h_up: 
    scale_up(backend, h_up)
  reset_cooldown_timer(backend)
  
def random_scale_down(backend):
  """
  runs after about 5 mins of inactivity 
  (e.g. no incoming requests)
  """
  h_down = host_to_scaledown()
  if h_down:
    print "%s: scaling down %s" % (datetime.now(), h_down)
    scale_down(backend, h_down)  # also marks the host as not active
    reset_cooldown_timer(backend)
  
# when no requests are coming in anymore we still 
# want to scale down automatically, after some time.
cooldown_timer = None

def reset_cooldown_timer(backend):
  """
  creates new timer to scale down 
  after 5 min of inactivity
  """
  global cooldown_timer
  if cooldown_timer: cooldown_timer.cancel()
  cooldown_timer = Timer(60*5, random_scale_down, [backend])
  cooldown_timer.start()
  
# set initial status of every backend server
for h in SERVERS:
  if SERVERS[h]['active']:
    scale_up(BACKEND, h)
  else:
    scale_down(BACKEND, h)
  
  
print "watching %s ..." % sys.argv[1]

# regexp to match against haproxy log file
p = re.compile(r'.*b-([a-zA-Z0-9\-_]+)/([a-zA-Z0-9\-_]+) \d+/\d+/\d+/(\d+)/.*')

# scale up/down count threshold
scale_threshold_count = 0

# endless loop
for line in watch(open(sys.argv[1])):
  r = p.match(line)
  if r:
    backend, host, rt = r.groups()
    if host in SERVERS:
      SERVERS[host]['active'] = True # set as active since it's in the logs
      
    # calculate average response time
    resp_time = avg_resp_time(int(rt))
    
    # check whether we have enough data to reason
    if resp_time is None: 
      continue
      
    elif resp_time > THRESHOLD:
      # scale up, if we can and need to
      scale_threshold_count += 1
      
      if scale_threshold_count < 3:
        # haven't reached count max
        continue
        
      # please, do scale
      print "%s: avg resp time: %d" % (datetime.now(), resp_time)
      
      random_scale_up(backend)
      scale_threshold_count = 0 # reset the counter

If you want to try it, just change /haproxy/socket in CMD_DISABLE, CMD_ENABLE and CMD_SET_WEIGHT (at the beginning of the script) to where your haproxy socket is and run it like this:

python haproxy_autoscale.py /path/to/your/haproxy/httplog
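Before pointing the script at a real log, it may help to check that the regexp actually matches your log lines. Here's a small stand-alone sketch (Python 3 syntax) using the same pattern as the script; the sample line is made up, laid out the way "option httplog" lines usually look, and parse_line is just a helper name for this example. Note the pattern expects backend names prefixed with "b-", as in the script.

```python
import re

# Same pattern as in the script, written as a raw string.
LOG_PATTERN = re.compile(
    r'.*b-([a-zA-Z0-9\-_]+)/([a-zA-Z0-9\-_]+) \d+/\d+/\d+/(\d+)/.*')

# Made-up sample line; replace it with one from your own log.
SAMPLE = ('Oct 19 10:00:00 localhost haproxy[1234]: 10.0.0.5:51000 '
          '[19/Oct/2011:10:00:00.123] http-in b-app1backend/server2 '
          '0/0/1/742/745 200 1532 - - ---- 3/3/1/1/0 0/0 "GET / HTTP/1.1"')


def parse_line(line):
    """Return (backend, host, response_ms), or None if the line doesn't match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    backend, host, rt = m.groups()
    return backend, host, int(rt)


if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

The fourth of the slash-separated timers is the one the pattern captures, which is what the script averages as the response time.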

Let me know what you think.

Sunday, September 25, 2011

ISP in Italy: Tiscali

I'd like to share my experience with local Internet Providers as a regular user, not a tech guy. Hope it'll help those looking for ISP offers.

Disclaimer: I'm not affiliated with any of them; nobody pays me for doing this. This post is valid for Northern Italy. The deal might be different in your region.


This is the second post of the "ISP in Italy" series. Follow this link if you're interested in Fastweb: alex.cloudware.it/2011/09/isp-in-italy-fastweb.html

Tiscali

Things I currently don't like

  1. Really bad PR moves I've seen so far. It is NOT cool to remove your clients' feedback, be it positive or negative.
  2. As you can guess, the tech support sucks even MORE than Fastweb's. Who the hell designed your tech support workflow and how the f..k do they think your clients would reach you by your internal phone number or an online form if the ADSL line simply DOES. NOT. WORK?

    They do have "alternative" phone numbers reachable from outside the Tiscali network, but those are pay-per-minute tech support lines. No, I don't mean regular call costs: it's an extra per-minute charge just to get this precious tech support from the ISP that I am a client of! This is more than ridiculous.

    I did try other methods at the beginning when my ADSL line just stopped working. There's an online guide (again, how the f..k could I reach that online guide if I didn't have some side internet connection?) which, at the end, says something like:

    "A notification has been sent to our tech staff.
    You'll get notified about the status on your cell phone"

    Well, I waited about 40 hours. Not a word. The funny thing is there isn't even a ticket ID or something to check its status online. So I decided to call their paid (!) service. After about the 7th or 8th try (!) I finally got through. Previous attempts were just hangups which, by the way, cost me money 'cuz they do charge you for the phone pick-up.

    Obviously, we went over a "standard procedure" with their operator (which, by the way, I had tried by myself about 10 times before calling them), like turning my ADSL modem off and on and, no, it didn't work. Guess what, 
    - after the expected "alright, I'll take it to the next level and will have you notified over an SMS" response,
    - and my "any idea how long it will take?",
    - I got this: "well, from 24 hours up to 7 days". What?! A week? Are you guys f..king kidding me? 
  3. Some mixup with their initial setup workflow/timings. You get your ADSL line activated BEFORE your modem arrives. And no, you can't use your own hardware even if it is technically compatible.
  4. They don't provide you with a phone set (Fastweb did), just the modem, even though in my region they can only subscribe you to the "internet connection plus voice" plan. There's no way to get rid of the useless (in my case) voice service.
Things I liked so far
  1. You can do VPN connections all you want, no problem.
  2. The speed is great but of course it depends on where you're connecting to. I guess Fastweb has some major backbone networks, that's why sites like http://speedtest.net will show you a faster connection from a Fastweb ADSL.
  3. Easy setup. I did everything online. There was no need for their tech staff to stop by in person and check my phone line, etc. Hardware arrived by regular mail, even though with a little delay (see point 3 above).
  4. The Tiscali ADSL modem comes with a built-in DHCP server and NAT support so you'll have no problem connecting 3+ devices to your home network.
  5. Clean web interface of "My tiscali" pages where you can get all the info about your contract, bills, etc.
  6. Well, maybe their logo/design, but it doesn't help even a little to get my internet connection working.