Why does generating a static HashMap with ~30K entries at compile time consume so many resources?

I'm trying to write a build.rs script that creates an up-to-date HashMap mapping the first 6 characters of a MAC address to its corresponding vendor.

It has 29231 key-value pairs, which causes cargo check to spend more than 7 minutes on my source code; before this, it took less than 20 seconds. It also uses all 8 GB of RAM available on my laptop, so I cannot use the machine during those 7-8 minutes.

I think this is either a rustc/cargo bug or I am doing something wrong, and I'm pretty sure it is the latter. What is the correct way of generating code like this?

main.rs

use std::collections::{HashMap, HashSet};
use std::hash::BuildHasherDefault;
use lazy_static::lazy_static;
use rustc_hash::{FxHashMap, FxHashSet, FxHasher};
type CustomHasher = BuildHasherDefault<FxHasher>;
include!(concat!(env!("OUT_DIR"), "/map_oui.rs"));

map_oui.rs

#[rustfmt::skip]
lazy_static! {
    static ref MAP_MACS: FxHashMap<&'static [u8; 6], &'static str> = {
    let mut map_macs = HashMap::with_capacity_and_hasher(29231, CustomHasher::default());
    map_macs.insert(b"002272", "American Micro-Fuel Device Corp.");
    map_macs.insert(b"00D0EF", "IGT");
//...

build.rs

use std::env;
use std::fs::File;
use std::io::prelude::*;
use std::io::{BufReader, BufWriter};
use std::path::Path;

fn main() {
    let out_dir = env::var_os("OUT_DIR").unwrap();
    let dest_path = Path::new(&out_dir).join("map_oui.rs");
    let handle = File::create(dest_path).unwrap();
    let mut writer = BufWriter::new(handle);
    let response = ureq::get("http://standards-oui.ieee.org/oui.txt")
        .call()
        .expect("Connection Error");
    let mut reader = BufReader::new(response.into_reader());
    let mut line = Vec::new();

    writer
        .write_all(
            b"#[rustfmt::skip]
lazy_static! {
    static ref MAP_MACS: FxHashMap<&'static [u8; 6], &'static str> = {
    let mut map_macs = HashMap::with_capacity_and_hasher(29231, CustomHasher::default());\n",
        )
        .unwrap();
    loop {
        // Stop at EOF; a read error aborts the build instead of retrying forever.
        let bytes_read = reader.read_until(b'\n', &mut line).expect("Read Error");
        if bytes_read == 0 {
            break;
        }
        // Only the "base 16" lines carry the 6-character OUI followed by the vendor.
        if line.get(12..=18).map_or(false, |s| s == b"base 16") {
            let mac_oui = String::from_utf8_lossy(&line[0..6]);
            let vendor = String::from_utf8_lossy(&line[22..]);
            writer.write_all(b"    map_macs.insert(b\"").unwrap();
            writer.write_all(mac_oui.as_bytes()).unwrap();
            writer.write_all(b"\", \"").unwrap();
            writer.write_all(vendor.trim().as_bytes()).unwrap();
            writer.write_all(b"\");\n").unwrap();
        }
        line.clear();
    }
    writer
        .write_all(
            b"    map_macs
    };
}
",
        )
        .unwrap();
    writer.flush().unwrap();
    println!("cargo:rerun-if-changed=build.rs");
}
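The offsets used in build.rs (0..6, 12..=18 and 22..) encode the layout of the "base 16" lines in oui.txt. As a minimal sketch, that parsing step could be pulled into a standalone function. The sample line in the comment is an assumption inferred from those offsets, and parse_oui_line is a hypothetical name, not part of the original script.

```rust
/// Extracts the OUI and vendor from one line of oui.txt, or returns `None`
/// if the line is not a "base 16" record.
///
/// Assumed layout (inferred from the offsets in build.rs):
/// b"002272     (base 16)\t\tAmerican Micro-Fuel Device Corp.\n"
///   ^0..6       ^12..=18    ^22..
fn parse_oui_line(line: &[u8]) -> Option<([u8; 6], String)> {
    // Lines that are too short or carry a different marker are skipped.
    if line.get(12..=18)? != b"base 16" {
        return None;
    }
    let mut oui = [0u8; 6];
    oui.copy_from_slice(line.get(0..6)?);
    // `get` returns `None` instead of panicking when the vendor part is missing.
    let vendor = String::from_utf8_lossy(line.get(22..)?).trim().to_owned();
    Some((oui, vendor))
}
```

Unlike the direct slicing with &line[22..], this version does not panic on a truncated line; it simply skips it.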
Micra answered 11/1, 2021 at 16:44 Comment(8)
Maybe github.com/sfackler/rust-phf can help you? - Wadleigh
If you need a workaround: you could embed the data file directly as a string. Sort it and zero-pad to the length of the longest string, so you can do a binary search inside it. Or, if you can afford the space, just stick those zero-padded strings in an array indexed on the 5 hex digits, which is at most 1M entries, leaving the unused entries blank. - Becalmed
I wouldn't be surprised if creating a huge array of tuples ended up being faster to compile and execute: [(key, value)]. Another workaround would be to move all of this to a different crate completely; that way it should be built less frequently. - Sforza
Outputting a slice of tuples and compiling that takes 2 seconds and my shell reports it took ~145 MiB of RAM. - Sforza
rust-phf seemed promising but it is slightly slower. I followed @Becalmed's and @Sforza's suggestions and it worked. Currently build.rs generates a const MAP_MACS: [([u8; 6], &str); 29246] and I wrote a wrapper function called vendor_lookup around a binary search of the array. Should I post the code as an answer for future reference? - Archibaldo
An answer is certainly appropriate. You might want to withhold accepting in case someone can answer the underlying question (which is IMHO still pertinent): why is the creation of this hash map at compile time so slow and memory intensive? - Rovelli
So the question is "Why does compiling a function with 30,000 instructions use a lot of resources?" - Overstuffed
@Overstuffed maybe with the implied "compared to this other way of doing it (slice of tuples) that is way faster"? - Sforza
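The first workaround from the comments (embed the sorted data as fixed-width, zero-padded records and binary-search inside it) might look roughly like the sketch below. The record width, the tiny inline table, and the shortened vendor names are made up for illustration; in a real project the blob would presumably be generated by build.rs and pulled in with include_bytes!.

```rust
use std::cmp::Ordering;

// Hypothetical record layout: 6 bytes of OUI, then the vendor name
// padded with NUL bytes up to VENDOR_WIDTH.
const VENDOR_WIDTH: usize = 10;
const RECORD_WIDTH: usize = 6 + VENDOR_WIDTH;

// Stand-in for the generated blob; records are sorted by OUI.
static TABLE: &[u8] = b"000000XEROX\0\0\0\0\0\
                        00D0EFIGT\0\0\0\0\0\0\0\
                        4C3C16Samsung\0\0\0";

/// Binary search over the fixed-width records.
fn vendor_lookup(oui: &[u8; 6]) -> Option<&'static str> {
    let (mut lo, mut hi) = (0, TABLE.len() / RECORD_WIDTH);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let rec = &TABLE[mid * RECORD_WIDTH..(mid + 1) * RECORD_WIDTH];
        match rec[..6].cmp(&oui[..]) {
            Ordering::Less => lo = mid + 1,
            Ordering::Greater => hi = mid,
            Ordering::Equal => {
                // Strip the NUL padding from the vendor field.
                let name = &rec[6..];
                let end = name.iter().position(|&b| b == 0).unwrap_or(name.len());
                return std::str::from_utf8(&name[..end]).ok();
            }
        }
    }
    None
}
```

Because the table is a single byte blob, the compiler only has to handle one literal rather than tens of thousands of statements or tuple expressions.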

I followed @Thomas's and @Shepmaster's suggestions and it worked. Currently, build.rs generates a const MAP_MACS: [([u8; 6], &str); 29246], and I wrote a wrapper function called vendor_lookup around a binary search of the array. However, it would still be good to know how to use a HashMap with a custom Hasher.

main.rs

include!(concat!(env!("OUT_DIR"), "/map_oui.rs"));

fn vendor_lookup(mac_oui: &[u8; 6]) -> &'static str {
    let idx = MAP_MACS
        .binary_search_by(|probe| probe.0.cmp(mac_oui))
        .unwrap(); // this should be a `?`
    MAP_MACS[idx].1
}
fn main() {
    assert_eq!(vendor_lookup(b"4C3C16"), "Samsung Electronics Co.,Ltd");
}
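As the inline comment hints, unwrap() panics for any OUI that is not in the table. A minimal sketch of a non-panicking variant that returns an Option is shown below. The three-entry table is a stand-in for the generated map_oui.rs, and it is declared static rather than const on the assumption that a single shared copy of a large table is preferable to a constant that can be duplicated at each use.

```rust
// Stand-in for the generated table; the real one comes from map_oui.rs
// and must be sorted by key for the binary search to work.
static MAP_MACS: [([u8; 6], &str); 3] = [
    (*b"000000", "XEROX CORPORATION"),
    (*b"00D0EF", "IGT"),
    (*b"4C3C16", "Samsung Electronics Co.,Ltd"),
];

/// Returns the vendor for a 6-character OUI, or `None` if it is unknown.
fn vendor_lookup(mac_oui: &[u8; 6]) -> Option<&'static str> {
    MAP_MACS
        .binary_search_by(|probe| probe.0.cmp(mac_oui))
        .ok()
        .map(|idx| MAP_MACS[idx].1)
}
```

A caller can then decide how to treat an unknown prefix, for example with vendor_lookup(b"4C3C16").unwrap_or("unknown").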

map_oui.rs

const MAP_MACS: [([u8; 6], &str); 29246] = [
    ([48, 48, 48, 48, 48, 48], "XEROX CORPORATION"),
    ([48, 48, 48, 48, 48, 49], "XEROX CORPORATION"),
    ([48, 48, 48, 48, 48, 50], "XEROX CORPORATION"),
    //---snip---
];

build.rs

use std::env;
use std::fs::File;
use std::io::prelude::*;
use std::io::{BufReader, BufWriter};
use std::path::Path;

fn main() {
    let response = ureq::get("http://standards-oui.ieee.org/oui.txt")
        .call()
        .expect("Connection Error");
    let mut reader = BufReader::new(response.into_reader());

    let mut data: Vec<(Vec<u8>, String)> = Vec::new();
    let mut line = Vec::new();
    while reader.read_until(b'\n', &mut line).unwrap() != 0 {
        if line.get(12..=18).map_or(false, |s| s == b"base 16") {
            let mac_oui = line[0..6].to_owned();
            let vendor = String::from_utf8_lossy(&line[22..]).trim().to_owned();
            data.push((mac_oui, vendor));
        }
        line.clear();
    }
    data.sort_unstable();

    let out_dir = env::var_os("OUT_DIR").unwrap();
    let dest_path = Path::new(&out_dir).join("map_oui.rs");
    let handle = File::create(dest_path).unwrap();
    let mut writer = BufWriter::new(handle);
    writeln!(
        &mut writer,
        "const MAP_MACS: [([u8; 6], &str); {}] = [",
        data.len()
    )
    .unwrap();
    for (key, value) in data {
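        // `{:?}` on the vendor emits an escaped string literal, so quotes or
        // backslashes in a vendor name cannot break the generated code.
        writeln!(&mut writer, "    ({:?}, {:?}),", key, value).unwrap();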
    }
    writeln!(&mut writer, "];").unwrap();
    writer.flush().unwrap();
    println!("cargo:rerun-if-changed=build.rs");
}
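On the open point about using a HashMap with a custom hasher: one possible approach, sketched below under stated assumptions, is to keep the generated file as plain data and build the map once at run time, on first use, by looping over the array. That avoids emitting one insert statement per entry, which a comment above suggests is what made the original version so expensive to compile. The small FNV-1a hasher stands in for FxHasher so the sketch has no external dependencies, oui_map is a hypothetical helper name, and std::sync::OnceLock requires Rust 1.70 or later.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};
use std::sync::OnceLock;

/// Minimal FNV-1a hasher, used here only as a dependency-free stand-in
/// for `rustc_hash::FxHasher`.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

type CustomHasher = BuildHasherDefault<Fnv1a>;
type OuiMap = HashMap<[u8; 6], &'static str, CustomHasher>;

// Stand-in for the generated table from map_oui.rs.
static MAP_MACS: [([u8; 6], &str); 3] = [
    (*b"000000", "XEROX CORPORATION"),
    (*b"00D0EF", "IGT"),
    (*b"4C3C16", "Samsung Electronics Co.,Ltd"),
];

/// Builds the map on first call and returns the same instance afterwards.
fn oui_map() -> &'static OuiMap {
    static MAP: OnceLock<OuiMap> = OnceLock::new();
    MAP.get_or_init(|| {
        let mut map =
            OuiMap::with_capacity_and_hasher(MAP_MACS.len(), CustomHasher::default());
        map.extend(MAP_MACS.iter().copied());
        map
    })
}

fn vendor_lookup(mac_oui: &[u8; 6]) -> Option<&'static str> {
    oui_map().get(mac_oui).copied()
}
```

The trade-off is a one-time cost at first lookup to fill the map, in exchange for keeping the generated code as a simple array. Whether the hash lookup actually beats the binary search for about 30K entries would need to be measured.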
Micra answered 11/1, 2021 at 20:44 Comment(0)
