Any good surname databases?
I'm looking to generate some database test data, specifically table columns containing people's names. To get a good indication of how well indexing works for name-based searches, I want to get as close as possible to real-world names and their true frequency distribution, i.e. lots of different names whose frequencies follow some power-law distribution.

Ideally I'm looking for a freely available data file with names followed by a single frequency value (or equivalently a probability) per name.

Anglo-Saxon names would be fine, although names from other cultures would also be useful.

Leuco answered 13/6, 2011 at 14:55 Comment(3)
Google's first hit: surnamedb.com - Shovelnose
@Shovelnose I don't think you can download their dataset, though, and it's name origin, not frequency. - Heaume
@Rup: I didn't look at it much, hence why it's a comment and not an answer. I figured it might be a place to start looking. - Shovelnose
L
5

I found some US census data which fits the requirement. The only caveat is that it lists only names that occur at least 100 times...

Found via this blog entry that also shows the power law distribution curve
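
As a rough illustration of loading a frequency list, here is a minimal Java sketch. It assumes each line holds a name followed by a whitespace-separated frequency value (the shape the question asks for), that names contain no spaces, and that any further columns can be ignored. The real census file may be laid out differently, and NameFrequency and NameFileParser are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for one "NAME FREQUENCY" line.
final class NameFrequency {
    final String name;
    final double frequency;

    NameFrequency(String name, double frequency) {
        this.name = name;
        this.frequency = frequency;
    }
}

final class NameFileParser {
    // Parses lines such as "SMITH 1.006". Blank lines are skipped and
    // any columns after the second are ignored (an assumption).
    static List<NameFrequency> parse(List<String> lines) {
        List<NameFrequency> entries = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] cols = trimmed.split("\\s+");
            if (cols.length < 2) {
                throw new IllegalArgumentException("Expected name and frequency: " + line);
            }
            entries.add(new NameFrequency(cols[0], Double.parseDouble(cols[1])));
        }
        return entries;
    }
}
```

The lines could come from java.nio.file.Files.readAllLines on the downloaded file.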

Further to this you can sample from the list using Roulette Wheel Selection, e.g. (not tested)

struct NameEntry
{
    public string _name;
    public int _frequency;
}

int _frequencyTotal; // Precalculate this: the sum of all _frequency values.


public string SampleName(NameEntry[] nameEntryArr, Random rng)
{
    // Throw the roulette ball: a uniform integer in [0, _frequencyTotal).
    int throwValue = rng.Next(_frequencyTotal);
    int accumulator = 0;

    for(int i=0; i<nameEntryArr.Length; i++)
    {
        accumulator += nameEntryArr[i]._frequency;
        // Strict comparison, so zero-frequency entries are never selected.
        if(throwValue < accumulator) {
            return nameEntryArr[i]._name;
        }
    }

    // If we get here then we have an array of zero frequencies.
    throw new ApplicationException("Invalid operation. No non-zero frequencies to select.");
}
Leuco answered 13/6, 2011 at 21:36 Comment(2)
Although in hindsight it's more efficient to use a binary search approach on an accumulating list of frequencies. Or, for a fixed number of samples, you can use stochastic universal sampling (en.wikipedia.org/wiki/Stochastic_universal_sampling). - Leuco
The links in this answer are no longer valid. 1990 name data is available: census.gov/topics/population/genealogy/data/1990_census/… - Hebbe
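
The binary-search idea from the first comment above could be sketched in Java roughly as follows. This is an illustration under stated assumptions, not the answerer's code: CumulativeNameSampler and its methods are hypothetical names, and frequencies are assumed to be non-negative integer counts.

```java
import java.util.Random;

// Hypothetical sampler: precomputes running totals once, then finds
// each sample with a binary search instead of a linear scan.
final class CumulativeNameSampler {
    private final String[] names;
    private final long[] cumulative; // cumulative[i] = sum of frequencies[0..i]
    private final long total;

    CumulativeNameSampler(String[] names, int[] frequencies) {
        if (names.length == 0 || names.length != frequencies.length) {
            throw new IllegalArgumentException("Need one frequency per name.");
        }
        this.names = names.clone();
        this.cumulative = new long[names.length];
        long running = 0;
        for (int i = 0; i < frequencies.length; i++) {
            if (frequencies[i] < 0) {
                throw new IllegalArgumentException("Negative frequency at index " + i);
            }
            running += frequencies[i];
            cumulative[i] = running;
        }
        if (running == 0) {
            throw new IllegalArgumentException("No non-zero frequencies to select.");
        }
        this.total = running;
    }

    String sample(Random rng) {
        // Uniform throw in [0, total); the clamp guards against rounding.
        long throwValue = Math.min(total - 1, (long) (rng.nextDouble() * total));
        return nameFor(throwValue);
    }

    // Returns the name at the first index whose running total is
    // strictly greater than throwValue (expects 0 <= throwValue < total).
    String nameFor(long throwValue) {
        int lo = 0;
        int hi = cumulative.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cumulative[mid] > throwValue) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return names[lo];
    }
}
```

Building the table is O(n) once; each sample is then O(log n), which matters when drawing many names from a list with tens of thousands of entries.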
D
4

Oxford University provides word lists on their public FTP site as compressed .gz files at ftp://ftp.ox.ac.uk/pub/wordlists/names/.

Domingodominguez answered 13/6, 2011 at 15:28 Comment(1)
Thanks. There's no frequency data there but I could sample from a given list with an appropriate distribution to give more realistic test data. - Leuco
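
One way to impose a distribution on a list that has no frequency data is to assign Zipf-style weights by position. The Java sketch below is only an illustration: it assumes the list order can stand in for popularity rank (which a plain word list does not guarantee), the exponent is a free parameter to tune, and ZipfWeights is a hypothetical name.

```java
import java.util.Random;

final class ZipfWeights {
    // Probability for 1-based rank r is proportional to 1 / r^exponent;
    // the returned values are normalised to sum to 1.
    static double[] probabilities(int count, double exponent) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        double[] p = new double[count];
        double sum = 0.0;
        for (int rank = 1; rank <= count; rank++) {
            p[rank - 1] = 1.0 / Math.pow(rank, exponent);
            sum += p[rank - 1];
        }
        for (int i = 0; i < count; i++) {
            p[i] /= sum;
        }
        return p;
    }

    // Linear roulette-wheel scan over the probabilities.
    static String sample(String[] names, double[] probabilities, Random rng) {
        double throwValue = rng.nextDouble();
        double accumulator = 0.0;
        for (int i = 0; i < names.length; i++) {
            accumulator += probabilities[i];
            if (throwValue < accumulator) {
                return names[i];
            }
        }
        // Only reachable through floating-point rounding of the sum.
        return names[names.length - 1];
    }
}
```

An exponent near 1 gives the classic Zipf curve; larger values concentrate more of the samples on the first few names.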
S
3

You can also check out the jFairy project. It's written in Java and produces fake data (names, for example). http://codearte.github.io/jfairy/

Fairy fairy = Fairy.create(); 
Person person = fairy.person();
System.out.println(person.firstName());           // Chloe
System.out.println(person.lastName());            // Barker
System.out.println(person.fullName());            // Chloe Barker
Saharanpur answered 22/11, 2013 at 11:48 Comment(0)
C
0

For generating realistic database test data with true frequency distributions of names, I recommend exploring the free preview offered by census.name. They provide a comprehensive database with millions of names from various cultures, including frequency distributions that can help you simulate real-world scenarios. While the full database is paid, you can access the Name Census Top 100 for free on GitHub and Kaggle, which includes names with frequency values—a great starting point for your needs.

Cutlerr answered 10/8 at 14:33 Comment(0)
