Since launching IPFire Location, we have gradually been changing the way we aggregate and process our database sources. In fact, we have been adding more sources, like Geofeeds, which have become hugely popular. With that much data, how do we make sure the database doesn't explode in size?
A Deep Dive Into Data Structures
IPFire Location is a classic geolocation database. But instead of distributing it as a plain CSV file like many others do, we store the database in a specially created format. This format is crucial for us, since it allows us to handle the data very easily, search through it very quickly, and store it compactly. To understand how, we will need to take a little deep dive into data structures...
An IP address is just a collection of bits. An IPv6 address has 128 of them, and an IPv4 address has 32. In fact, not a single plain IP address is stored in the IPFire Location database. Instead, we have created a tree structure that allows us to save space and perform a very fast and efficient search. But more about that later. First, we have to understand how this data is stored.
If you resolve ''www.ipfire.org'', you will get the following IP address: ''2001:678:b28::''. If you split that into individual bytes, and those again into individual bits, you get the following result:
Binary: 00100000 00000001 : 00000110 01111000 : 00001011 00101000 : 00000000 00000000 : ...
Hexadecimal: 20 01 : 06 78 : 0b 28 :: / 48
If you know the format of an IPv6 address, you will have noticed that this address is rather short: the double colon (::) represents zeroes only, which fill the IP address up to its 128 bits. We actually only use the first 48 of them and have 80 zeroes at the end.
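You can reproduce this expansion yourself with a couple of lines of Python (shown here purely for illustration using the standard ipaddress module - libloc itself is written in C):

```python
# Expand an IPv6 address into its full 128-bit pattern. This is only
# an illustration using Python's standard library, not libloc code.
import ipaddress

address = ipaddress.ip_address("2001:678:b28::")
bits = f"{int(address):0128b}"  # all 128 bits, zero-padded

# The first 48 bits, grouped into bytes - the remaining 80 are all zeroes
print(" ".join(bits[i:i + 8] for i in range(0, 48, 8)))
# 00100000 00000001 00000110 01111000 00001011 00101000
```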
Why does binary matter? Because it is an easy way to store data. In the database, the tree starts at a root node that simply has two paths to follow: one and zero. For the address in our example, that would be a zero, which brings us to the next node in the tree. From there, we follow another zero, then a one, then a long run of zeroes, a one again, and so on. Once we have done this 128 times, we have reached the end of the tree and found our IP address. This is called a binary tree and is a well-studied concept in computer science. All sorts of databases and filesystems are based on it, as trees are a great way to structure large amounts of data.
But how do we store a whole network? The next IP address in the network would be ''2001:678:b28::1''. It is the same in binary, except that the very last bit is a one. And of course, we can keep counting until we find the last IP address in this particular network, which would be ''2001:678:b28:ffff:ffff:ffff:ffff:ffff''. That would be a huge number of IP addresses to store (2^80 of them, to be exact, since a /48 prefix leaves 80 free bits), all of which will of course be in the same country as the first one. And even if we wanted to, since the IPv6 address space is so large, there is no possible way to store this much data.
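A quick back-of-the-envelope calculation shows the scale:

```python
>>> 2 ** (128 - 48)  # a /48 prefix leaves 80 free host bits
1208925819614629174706176
```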
So we had to slightly modify our tree to only store the part that we care about and drop the rest. That is done by encoding the prefix - which in our case is /48 - into the tree as well. That way, the base address and the prefix together tell us which IP addresses fall into a subnet. Simple! Just count the steps that we make on the way, and there we have it. But what if we have reached a node that is not at the end of the tree? No problem either: instead of only storing this information at the end, any node can carry a pointer to a list of networks that holds the required information about the network. That way, a search can stop at any point in the tree.
Any search for an arbitrary IP address now starts at the root and walks the tree for as long as possible. Once we reach a node that has no more pointers to any following nodes, we check whether the IP address matches the network at the last node that we found. If so, we have a match. If not, the IP address is not known to the database.
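To make the idea concrete, here is a minimal sketch of such a prefix tree in Python. It is purely illustrative - the names and the in-memory layout are made up for this example and are not how libloc actually implements this in C - but the walk is the same: insert a network by following only its prefix bits, and look an address up by remembering the last network seen on the way down.

```python
import ipaddress

class Node:
    """One node in the tree, with a child for a zero bit and one for a one bit."""
    def __init__(self):
        self.children = [None, None]
        self.network = None  # set on the node where a stored prefix ends

def bits(address):
    """Yield the 128 bits of an IPv6 address, most significant bit first."""
    value = int(ipaddress.IPv6Address(address))
    for i in range(127, -1, -1):
        yield (value >> i) & 1

def insert(root, network, data):
    """Store a network by walking only its prefix, not every single address."""
    net = ipaddress.IPv6Network(network)
    node = root
    for _, bit in zip(range(net.prefixlen), bits(str(net.network_address))):
        if node.children[bit] is None:
            node.children[bit] = Node()
        node = node.children[bit]
    node.network = (net, data)

def lookup(root, address):
    """Longest-prefix match: remember the last network seen on the way down."""
    node, match = root, None
    for bit in bits(address):
        node = node.children[bit]
        if node is None:
            break  # no more pointers to follow
        if node.network is not None:
            match = node.network
    return match  # None means the address is not known to the database

root = Node()
insert(root, "2001:678:b28::/48", "DE")  # "DE" stands in for the real metadata
print(lookup(root, "2001:678:b28::1"))   # (IPv6Network('2001:678:b28::/48'), 'DE')
print(lookup(root, "2001:db8::1"))       # None
```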
This has been the way to store data since the inception of IPFire Location, and there is no better way known to science. A tree like this is very quick to search through, as there are at most 128 steps to take. It is also easy to walk in order, so that we can list everything stored in this tree without any post-processing like external sorting.
Deduplicating Networks
When we source all the data that makes up the database in the end, we have a lot of overlaps. We do not care about single IP addresses but about networks, to avoid exploding in disk space usage and to keep processing easy. But sometimes, networks don't match exactly. For example, there could be an IP network allocated to an organisation, but only a part of it is actually in use and announced over BGP. Or we could parse a Geofeed that splits an allocated network into several smaller ones that are used in different data centres across a continent. There are thousands of different scenarios like that.
To organise all that data, we store everything in a large PostgreSQL database first and add a little bit of extra metadata. How this works precisely is maybe a story for another article, but all you need to know is that at the end, we have a large number of networks with lots of different pieces of information attached to them from all sorts of different sources.
Formerly, we simply took all this data, put it into the tree, and shipped it. That worked well and was a very important first step for this project. However, for various reasons, it would be nice to remove any duplicate information so that we can keep the database smaller. Obviously, with a smaller database, there is less bandwidth used to download it and less disk space used to store it. But searches will also be a lot quicker, since there are fewer nodes and networks to process in the tree.
For example, as mentioned before, we store a network together with its prefix. That means that if we have the same base address, we can go back up the tree and find any larger networks that enclose it. Very often, all subnets of a network share the same information - for example, as illustrated before, when a company has split one allocated network across various data centres. Sometimes those are in the same country and sometimes not, but if they are, we don't need to store the subnets at all. And this is where we save a huge amount of space.
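As a simplified model of this idea (again in Python and not the actual implementation, with a bare country code standing in for everything we really attach to a network), dropping redundant subnets could look like this:

```python
import ipaddress

def deduplicate(networks):
    """Drop subnets that carry no new information.

    `networks` maps ipaddress network objects to their metadata.
    """
    kept = {}
    # Handle enclosing networks first by sorting by prefix length
    for net in sorted(networks, key=lambda n: n.prefixlen):
        # Find the closest enclosing network that we have already kept
        parent = next(
            (p for p in sorted(kept, key=lambda n: -n.prefixlen)
             if p.version == net.version and net.subnet_of(p)),
            None,
        )
        # Keep the subnet only if it differs from its enclosing network
        if parent is None or kept[parent] != networks[net]:
            kept[net] = networks[net]
    return kept

nets = {
    ipaddress.ip_network("2001:db8::/32"): "DE",
    ipaddress.ip_network("2001:db8:1::/48"): "DE",  # redundant - dropped
    ipaddress.ip_network("2001:db8:2::/48"): "NL",  # different - kept
}
print(list(deduplicate(nets)))
# [IPv6Network('2001:db8::/32'), IPv6Network('2001:db8:2::/48')]
```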
Merging Networks
Another optimisation that we have discovered and implemented is that we can merge neighbouring networks that share the same metadata. For example, consider two IP networks like 10.0.0.0/24 and 10.0.1.0/24. They can be represented together as 10.0.0.0/23 and stored as only one network. As some sources cause networks to be split, or organisations sometimes do this themselves for various reasons, we can put it all back together again to save space.
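As it happens, Python's standard library ships a building block for exactly this kind of merge, so a toy model of the idea (not libloc's actual code, and with a country code again standing in for the full metadata) fits into a few lines:

```python
import ipaddress
from itertools import groupby

def merge(networks):
    """Merge neighbouring networks that carry the same metadata."""
    merged = {}
    # Group the networks by their metadata, then collapse each group.
    # Note: collapse_addresses() requires all networks in one group to
    # be of the same address family.
    by_data = sorted(networks.items(), key=lambda item: item[1])
    for data, group in groupby(by_data, key=lambda item: item[1]):
        for net in ipaddress.collapse_addresses(net for net, _ in group):
            merged[net] = data
    return merged

nets = {
    ipaddress.IPv4Network("10.0.0.0/24"): "DE",
    ipaddress.IPv4Network("10.0.1.0/24"): "DE",
}
print(list(merge(nets)))  # [IPv4Network('10.0.0.0/23')]
```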
It's All About That Space
Both changes together have allowed us to bring down the size of the database by quite a bit. Initially, the file was around 56 MiB; it has now been reduced to around 30 MiB - a saving of roughly 50%. Remember that saving disk space is great, but being able to search the database faster is even better for many applications.
Since then, the size of the database has grown slightly again, since we now have the space to include more information: networks smaller than a /48 for IPv6 and a /24 for IPv4 had previously been dropped from the database due to space concerns. Those could now be added back, and the database is still a lot smaller.
There are lots of other nice side-effects now as well:
- The text dump is a lot shorter. We use that to store the database in a Git repository to have a human-readable and auditable trail of changes. Git isn't that great with large files, and so this has become much faster with a smaller text file to handle.
- In IPFire, we export some networks to be loaded into the firewall. That has to be done in text form, which is now a lot smaller and therefore wastes less memory and is faster to search through, too. With a lot of location rules, there is now a smaller impact on opening new IP connections.
- Our database is cryptographically signed, so that attackers cannot convince consumers to use a crafted version of the database. When using location information in places like firewalls, Intrusion Prevention Systems, and anything else security-related, this is an absolute must for us. Cryptography is never really fast, but with a smaller file, both the download and the verification process are faster, leading to faster database updates.
All these changes are already available with the official database that we are publishing, and they do not require any update to your version of libloc, the lightweight library that is used to read and write our database. So please go forth and enjoy these great improvements now. A lot of engineering effort has been invested into them to finalise this project that we started a little while ago and to make it more versatile for many more applications.