During a road trip we discussed how this tool was interesting but not necessarily useful. My friend could plug in numbers and see what happened, but that didn’t help make staffing decisions any easier, as it was basically educated trial and error. What would be more helpful was a system where you could set a target SLA for throughout the day and the system would give you the best agent schedule to match that target, given variable incoming call parameters and agent handle times. That is a third great pillar of OR: **optimization**.

The issue was that with optimization you needed a formula to optimize, and we had a simulation. However, it dawned on me that there was no reason we couldn’t use the simulation as the objective function for an optimization problem. It takes a set of inputs, and with enough runs of the simulation, gives you a confident output. Luckily, 30 years ago researchers had the same thought and have been studying the field of Simulation Optimization ever since. These smart people had done all the hard work for me. [3, 4]

So, again, I got to work. Now the core of Simulation Optimization is creating a set of inputs, running it on your simulation, and choosing the next set of inputs in an intelligent way until you reach your objective. My optimization problem looked something like this:

My objective is to build a schedule. Schedules are made out of shifts. Shifts are set times throughout the simulation “day” (say 9am — 5pm, 10am — 4pm) during which an agent can work. A schedule is the set of all shifts, with how many agents are working on each shift. A good schedule is one that keeps the SLA hovering right around the target and minimizes the total number of worker hours (number of workers * length of their shifts). Certain shifts have a cap with how many people can work (ex. only 10 people can work the half day morning shift) and each shift has a certain break schedule. The SLA should never cross a hard ceiling above the target (ex. our target is an SLA of 3 min but no one should ever wait more than 10 min).

I set up the framework so all of these constraints were accounted for, and now I was ready to optimize. Some problems talked about in Simulation Optimization research are difficult because their simulations are black boxes, meaning you don’t get any useful signals besides the simulation answer. Luckily, this simulation is different. We are measuring the time-in-queue throughout the simulation “day” so we can see how certain shifts do in comparison to each other.

For example, if the morning shift has more agents than the late shift but the late shift has more incoming calls, we will likely see that the late shift is more off target than the morning shift. In our next attempt we should probably increase the number of people on the late shift.

It’s not as simple as this — with complicated overlapping shifts and other parameters there will be interdependency, but we can plug any schedule into the simulation and see what happens.

These signals that we get from the simulation led me to a clear strategy, gradients.

Gradient Based Search can be understood as a ball rolling down a hill, where with every step you are trying to get the ball to a lower state until it reaches the bottom. This is complicated by the fact that it is not an even hill. There will be false bottoms where any direction you pick goes back up the hill, but the true bottom requires you to start rolling from a different location. In practice, we’ll never know if we have reached the true bottom, but there are smart ways to make sure we do enough tests to know that we are pretty close. In our case the ball is a schedule, and the bottom of the hill is the perfect line where all incoming calls wait for exactly the target SLA.

So, what is the most natural gradient step? Take the worst schedule (the one on average farthest away from the target) and add an agent to the shift if the average is above your target, or take away an agent if the average is below the target. If you are stuck where all shift changes make the schedule worse but the SLA is not close to the target, go back to a previously good schedule and start from there again. I tried other methods of picking a next schedule but this was the most consistent. How do you know when to stop? Set a tolerance to say that if every shift is on average less than 30 seconds (for example) from the target, consider this solved. Stop if too many steps have been taken, and pick the best schedule from the ones simulated.

In an ideal world we’d run this program forever, trying every single schedule until we find the best, but we needed this program to pick a schedule in under 5 min. The most interesting part of this project was testing optimization algorithms to find the one that can get a good solution the quickest.

The solution with the best results used the fact that you can control the number of times you run a simulation, the trade off being that the fewer times you run a simulation, the less confident you are in its results. Here’s what I found to work:

Start off with fewer simulations when beginning to run the optimization algorithm, and as the objective approaches, increase the number. This works because in the beginning many shift changes will put the schedule in the direction of the objective, so it’s not as important to be as confident in the direction you chose. As a good solution inches closer, it matters more where the schedule goes. Combine this with layered runs all starting from the same null point to ensure a wider breadth of explored solutions.

What’s great about this approach is that I am able to control how long the algorithm runs. I can see how long simulation runs take, and adjust the number of iterations accordingly, I can also pick how many times I begin from the starting spot. Every machine you run this algorithm on will operate a little differently, and consistent algorithm run time is of paramount importance to make sure this project is useful in the real world.

`#schedule is {(start_time, end_time): num_agents}`{'Schedule':

{(0, 540): 5,

(30, 570): 2,

(60, 600): 1,

(90, 630): 1,

(120, 660): 4,

(150, 690): 1,

(180, 720): 0,

(210, 750): 1,

(240, 780): 18,

(0, 660): 0,

(30, 690): 0,

(60, 720): 0,

(90, 750): 0,

(120, 780): 3}

'Average Wait': 0.5738814985851588

'Worker Units': 660.0

'Worst Shift Time In Queue, Relative to Target': 0.5965600329117189}