Building a New IPAM System using Netbox and Batfish
At the University of Arizona we had an old IP Address Management (IPAM) tool based on a MySQL database that we called CID. It worked pretty well at the time, but the lack of an API was limiting our ability to automate our network. It also lacked the ability to track devices, interfaces, VLANs, or VRFs. We wanted to move to a more API-friendly system and away from maintaining local code, and Netbox fit the bill. Being designed and maintained by and for network engineers, its data model made a lot of sense for us.
Moving to Netbox is a tremendous task. We had to make whatever data our old IPAM held conform to the Netbox data model. We also wanted to comb through our IPAM data and double-check everything before we moved it. Otherwise we would run into the garbage-in-garbage-out problem, where we couldn’t totally trust what Netbox was saying.
Our requirements
We wanted to start with a firm source of truth. This started by modeling what information we wanted to put in Netbox. Everyone is going to have a slightly different way of modeling their network into the Netbox style. I actually made a couple mock ups and documented in git what each model should be filled with. Here are the required items for creating a prefix in our Netbox model:
Site
VLAN group (if attached to a VLAN)
VLAN (if attached to a VLAN)
Tenant
VRF
Description (optional if it isn’t in CID or the config; required moving forward)
Status (Active, Reserved, Container, Deprecated)
Role (matching the vlan group role)
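The checklist above could be enforced in code before a prefix is pushed to Netbox. The following is a minimal sketch, not our actual migration code; the function name `missing_required_fields` and the dictionary keys are assumptions made for illustration.

```python
# Hypothetical field names mirroring the checklist above.
ALWAYS_REQUIRED = ("site", "tenant", "vrf", "status", "role")
VLAN_FIELDS = ("vlan_group", "vlan")
VALID_STATUSES = {"Active", "Reserved", "Container", "Deprecated"}


def missing_required_fields(record, attached_to_vlan):
    """Return a sorted list of problems found in a candidate prefix record."""
    required = list(ALWAYS_REQUIRED)
    if attached_to_vlan:
        # VLAN and VLAN group are only required when the prefix sits on a VLAN.
        required.extend(VLAN_FIELDS)
    problems = [name for name in required if not record.get(name)]
    status = record.get("status")
    if status and status not in VALID_STATUSES:
        problems.append("status (invalid value)")
    return sorted(problems)
```

Description is left out of the check on purpose, since the list treats it as optional for data that predates the migration.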
We wanted to track VLANs, VLAN groups, VRFs, sites, and prefixes. We needed the building number, the VLAN, and the prefix to fill in site and VLAN group. VLAN group was useful for determining which router a VLAN/prefix was attached to in our Netbox instance. It was especially helpful when we had more than one set of routers in a particular site, so we could tell which device was really serving as the “IP owner”. We had to extract the route target and route description to import the VRF of the prefix into Netbox. Ultimately I wanted to fill out the prefix view with something like this:
Populating Netbox using Batfish
I had an idea that we could scrape the route table for each VRF on our central PE router, then we would determine which prefixes were live on the network. This worked pretty well but it would not have captured information on non-preferred paths.
Batfish was a logical choice to get non-preferred paths. I fed all of our configurations into Batfish for analysis. This was my first time using Batfish, so I was a bit skeptical that it could parse everything in one shot, but it did parse all of our configs quickly. Using the PyBatfish IP owners question, bfq.ipOwners().answer().frame(), Batfish provided output that looked something like this:
Index,Node,VRF,Interface,IP,Mask,Active
0,139-1-router2,GREEN,vlan100,192.0.2.253,24,True
1,139-1-router1,GREEN,vlan100,192.0.2.252,24,True
2,139-1-router2,default,vlan101,192.0.3.1,24,True
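Rows like these can be rolled up into prefixes per VRF. The sketch below is only an illustration: it reads the sample rows as CSV text rather than the live PyBatfish frame, and the function name `prefixes_from_ip_owners` is an assumption.

```python
import csv
import io
import ipaddress
from collections import defaultdict

SAMPLE = """Index,Node,VRF,Interface,IP,Mask,Active
0,139-1-router2,GREEN,vlan100,192.0.2.253,24,True
1,139-1-router1,GREEN,vlan100,192.0.2.252,24,True
2,139-1-router2,default,vlan101,192.0.3.1,24,True
"""


def prefixes_from_ip_owners(csv_text):
    """Map (VRF, network) to the (node, interface) pairs that own an IP in it."""
    owners = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Active"] != "True":
            continue  # skip addresses on inactive interfaces
        # An interface address plus mask length gives the enclosing network.
        network = ipaddress.ip_interface(f"{row['IP']}/{row['Mask']}").network
        owners[(row["VRF"], network)].append((row["Node"], row["Interface"]))
    return dict(owners)


owners = prefixes_from_ip_owners(SAMPLE)
```

Keying on the VRF as well as the network keeps overlapping address space in different VRFs apart.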
Batfish was key because it provided a concrete place where I could look up where networks lived: it listed the hostname of every device that owned an address, not just the BGP next-hop. From Batfish I could also get VLAN data, since we use SVIs and the SVI number is the VLAN ID. We were not tracking the mapping from VLAN to prefix before, so Batfish was critical to establishing that relationship. Batfish also helped with assigning prefixes to sites, because I could parse our router hostnames: the first section of the hostname is the site number.
Coupling our route table scraping and the information from CID with the configured state of our network provided by Batfish, we had three sources of truth about our network that could be used to clean up our IPAM data. To tackle this migration I made a prefix class in Python:
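Both lookups are simple string parsing. Here is a minimal sketch, assuming interface names of the form vlan100 or Vlan100 and hostnames like 139-1-router2; the helper names are hypothetical.

```python
import re


def vlan_id_from_svi(interface_name):
    """Return the VLAN ID encoded in an SVI name, or None for non-SVI interfaces."""
    match = re.fullmatch(r"vlan(\d+)", interface_name, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def site_number_from_hostname(hostname):
    """Return the first dash-separated section of the hostname (the site number)."""
    return hostname.split("-", 1)[0]
```

For example, `vlan_id_from_svi("vlan100")` gives the integer 100 and `site_number_from_hostname("139-1-router2")` gives the string "139".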
class prefix:
    def __init__(self, network_object):
        self.network_object = network_object
        self.in_CID = False
        self.in_route_table = False
        self.in_batfish = False
        self.batfish_devices = list()
        self.description = None
        self.vrf = None
        self.role = None  # Dev: 2, Prod: 1, Management: 3
        self.interfaces_batfish = list()
        self.status = None  # Active, Container, Reserved, Deprecated
        self.owners_in_CID = [
            'not_in_route_table',
            'not_in_Batfish',
            'not_in_CID',
            'api',
            'joel'
        ]
        self.vlan_group = None
        self.vlan = None
        self.in_netbox = False
        self.site_id = None
        self.vlan_id = None
In this example, the owners_in_CID variable is preloaded with five owners. When the prefix was found in one of the three sources of truth, I would remove the corresponding tag from that object. For example, when the code found the prefix in the route table, it would remove the not_in_route_table tag. This was a bonus use of the field, since its intended purpose was to track the point of contact for a network.
After this I created a method to onboard class objects for each of the three sources of truth. I instantiated the prefixes using route table data, then Batfish, then CID. I used the network_object variable to check whether two entries were the same prefix, and added data along the way as I initialized the objects. Afterward, they were added to Netbox using the API. Since I used classes, I could write methods on the objects rather than standalone functions, where I would have to pass along a lot of data.
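The onboarding order can be sketched as follows. This is an illustration under assumptions, not the original method: `TrackedPrefix` is a trimmed-down, hypothetical stand-in for the prefix class, and `onboard` is an invented helper name.

```python
import ipaddress


class TrackedPrefix:
    """Trimmed-down, hypothetical stand-in for the prefix class."""

    def __init__(self, network_object):
        self.network_object = network_object
        # Every prefix starts out flagged as missing from all three sources.
        self.tags = {"not_in_route_table", "not_in_Batfish", "not_in_CID"}


def onboard(prefixes, networks, found_tag):
    """Create or update one TrackedPrefix per network and clear the matching tag."""
    for text in networks:
        network = ipaddress.ip_network(text)
        # The network object is the dictionary key, so equal prefixes merge.
        entry = prefixes.setdefault(network, TrackedPrefix(network))
        entry.tags.discard(found_tag)


prefixes = {}
onboard(prefixes, ["192.0.2.0/24", "192.0.3.0/24"], "not_in_route_table")
onboard(prefixes, ["192.0.2.0/24"], "not_in_Batfish")
onboard(prefixes, ["192.0.2.0/24", "198.51.100.0/24"], "not_in_CID")
```

After the three passes, any tag still attached to an object names a source that never mentioned that prefix.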
Once the class objects were filled out, I could print them, which yielded something like this:
{
    "network_object": "192.0.2.0/24",
    "in_CID": true,
    "in_route_table": false,
    "in_batfish": true,
    "batfish_devices": [
        "139-1-router2",
        "139-1-router1"
    ],
    "description": "student_labs",
    "vrf": "GREEN",
    "role": null,
    "interfaces_batfish": [
        "Vlan100",
        "Vlan100"
    ],
    "status": null,
    "owners_in_CID": [
        "api",
        "joel",
        "Juan Miller",
        "not_in_route_table",
        "Desiree Martinez"
    ],
    "vlan_group": null,
    "vlan": "100",
    "in_netbox": false,
    "site_id": 14470,
    "vlan_id": 31159
}
With this data in hand I combed through each prefix in our network, using those tags to find early problems. In particular, I filtered by VRF to break up the work and pulled up the prefixes that still carried a not_in_* tag. Then I’d dig through the output of those class objects by hand. I assumed that class objects I found in all three sources were correct. We have roughly 5,000 allocated prefixes in our network but only around 500 were not in any of the sources of data. This took some time, but was worth it to get the full picture.
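The per-VRF filtering can be expressed in a few lines. This sketch assumes each prefix has been dumped to a dictionary shaped like the printed object; `needs_review` is a hypothetical helper name and the second record is made up for the example.

```python
def needs_review(records, vrf):
    """Return the records in one VRF that still carry a not_in_* tag."""
    return [
        record
        for record in records
        if record["vrf"] == vrf
        and any(owner.startswith("not_in_") for owner in record["owners_in_CID"])
    ]


records = [
    {"network_object": "192.0.2.0/24", "vrf": "GREEN",
     "owners_in_CID": ["api", "joel", "not_in_route_table"]},
    # Illustrative record: found in all three sources, so no tag remains.
    {"network_object": "198.51.100.0/24", "vrf": "GREEN",
     "owners_in_CID": ["api", "joel"]},
]
flagged = needs_review(records, "GREEN")
```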
All in all this was a good way to migrate our IPAM system into Netbox. Using Python classes kept the code simple. Honestly, if I did this again I would use the excellent pyNetbox to interact with Netbox. It’s a lot easier to use than calling the Netbox API directly once you are already in Python. pyNetbox doesn’t support bulk updates, but if you thread or multi-process your code it won’t make that much of a difference.
pyNetbox is now our main way of making changes to Netbox. The GUI is mostly used to inspect what things look like, and it is helpful because it allows people other than network engineers to understand what is provisioned on the network. Everyone on the team is strongly encouraged to keep it up to date.
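Threading individual create calls might look like the sketch below. It is not the code we ran: `create_all` is a hypothetical helper that accepts any callable, and whether a single pyNetbox client can safely be shared between threads is an assumption you should verify for your version.

```python
from concurrent.futures import ThreadPoolExecutor


def create_all(create, payloads, max_workers=8):
    """Call create(**payload) for each payload in parallel.

    Returns a list of (payload, exception) pairs for the calls that failed.
    """
    def attempt(payload):
        try:
            create(**payload)
            return None
        except Exception as exc:  # collect failures instead of aborting the run
            return (payload, exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(attempt, payloads))
    return [result for result in results if result is not None]


# With pyNetbox this might be wired up roughly as follows
# (the URL and token are placeholders):
#   import pynetbox
#   nb = pynetbox.api("https://netbox.example.edu", token="...")
#   failures = create_all(nb.ipam.prefixes.create, payloads)
```

Passing the create function in as an argument keeps the helper testable without a running Netbox instance.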
Looking ahead: Automatic, continuous validation
Manual updates are challenging and error-prone. Having used Netbox in production for a year, it’s obvious at this point that we need to build tooling to ensure that our source of truth stays up to date. Our next step is to use Batfish and some of the same code to run a regular audit (e.g., nightly or weekly) that produces a report about prefixes and devices. It would be like a “build your own network discovery” tool. That would really drive out errors both in our configurations and in our source of truth.
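At its core such an audit is a set comparison between what is configured and what is documented. This is a minimal sketch of the idea, not something we have built yet; the `audit` function and its report keys are assumptions.

```python
import ipaddress


def audit(configured, documented):
    """Compare prefixes seen in configs (e.g. via Batfish) with those in Netbox."""
    configured = {ipaddress.ip_network(p) for p in configured}
    documented = {ipaddress.ip_network(p) for p in documented}
    return {
        # On the network but not recorded in the source of truth.
        "missing_from_netbox": sorted(str(n) for n in configured - documented),
        # Recorded in the source of truth but not found on the network.
        "missing_from_network": sorted(str(n) for n in documented - configured),
    }


report = audit(
    configured=["192.0.2.0/24", "192.0.3.0/24"],
    documented=["192.0.2.0/24", "198.51.100.0/24"],
)
```

Converting both sides to network objects first means that differently formatted strings for the same prefix still compare as equal.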
If you are interested in learning more about what we did please reach out to me and I’ll see how I can help.