Randomly distribute files into train/test given a ratio

I am currently trying to write a setup script that sets up a workspace for me, so that I don't need to do it manually. I started doing this in bash, but quickly realized that it would not work very well.

My next idea was to do it in Python, but I can't seem to find a proper way. My idea was to make a list (a .txt file containing the paths of all the data files), shuffle this list, and then move each file to either my train directory or my test directory according to the given ratio.

But this is Python; isn't there a simpler way to do it? It seems like I am using an unnecessary workaround just to split the files.

Bash Code:

# Partition data randomly into train and test. 
cd ${PATH_TO_DATASET}
SPLIT=0.5 #train/test split
NUMBER_OF_FILES=$(ls ${PATH_TO_DATASET} |  wc -l) ## number of directories in the dataset
even=1
echo ${NUMBER_OF_FILES}

if [ `echo "${NUMBER_OF_FILES} % 2" | bc` -eq 0 ]
then    
        even=1
        echo "Even is true"
else
        even=0
        echo "Even is false"
fi

echo -e "${BLUE}Seperating files in to train and test set!${NC}"

for ((i=1; i<=${NUMBER_OF_FILES}; i++))
do
    ran=$(python -c "import random;print(random.uniform(0.0, 1.0))")    
    if [[ ${ran} < ${SPLIT} ]]
    then 
        ##echo "test ${ran}"
        cp -R  $(ls -d */|sed "${i}q;d") ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data/test/
    else
        ##echo "train ${ran}"       
        cp -R  $(ls -d */|sed "${i}q;d") ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data/train/
    fi

    ##echo $(ls -d */|sed "${i}q;d")
done    

cd ${WORKSPACE_SETUP_ROOT}/../${WORKSPACE}/data
NUMBER_TRAIN_FILES=$(ls train/ |  wc -l)
NUMBER_TEST_FILES=$(ls test/ |  wc -l)

echo "${NUMBER_TRAIN_FILES} and ${NUMBER_TEST_FILES}..."
echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})

if [[ ${even} = 1  ]] && [[ ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES} != ${SPLIT} ]]
    then 
    echo "Something need to be fixed!"
    if [[  $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES}) > ${SPLIT} ]]
    then
        echo "Too many files in the TRAIN set move some to TEST"
        cd train
        echo $(pwd)
        while [[ ${NUMBER_TRAIN_FILES}/${NUMBER_TEST_FILES} != ${SPLIT} ]]
        do
            mv $(ls -d */|sed "1q;d") ../test/
            echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})
        done
    else
        echo "Too many files in the TEST set move some to TRAIN"
        cd test
        while [[ ${NUMBER_TRAIN_FILES}/${NUMBER_TEST_FILES} != ${SPLIT} ]]
        do
            mv $(ls -d */|sed "1q;d") ../train/
            echo $(calc ${NUMBER_TRAIN_FILES}/${NUMBER_OF_FILES})
        done
    fi

fi   

My problem was the last part. Since I pick the numbers at random, I cannot be sure that the data will be partitioned as hoped. The last if statement was meant to check whether the partition came out right and, if not, fix it. That did not work because I am comparing floating-point numbers, and the solution in general became more of a quick fix.

Phalanstery answered 29/8, 2016 at 16:17 Comment(5)
I'd be interested in seeing some sample data and the code you were having problems with in bash. What does "assign" mean? Are you moving files? Inserting data into arrays? If you could also include more information about the criteria you're using to decide what happens, it might be possible for us to provide helpful answers. - Classicize
The data is just .wav files. The problem with my bash code was that I was trying to work with floating point, which didn't seem ideal for bash. I am moving/copying the files from a data folder to either a train or a test folder. - Rodriquez
Okay, so what are the criteria you use to decide whether something gets sent to one folder or the other? Can you include your not-working code in your question? - Classicize
Code added. I only added part of the code, since it would be ridiculous to post the unnecessary part. - Rodriquez
I've added an answer that shows you how you might handle this using bash alone by leveraging the power of arrays and parameter expansion. For future reference, questions are best answered when they include a Minimal, Complete, Verifiable Example. - Classicize
L
12

scikit-learn comes to the rescue =)

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> y
[0, 1, 2, 3, 4]


# If I want 1/4 of the data for testing
# and I set a random seed of 42:
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_train
[2, 0, 3]
>>> y_test
[1, 4]

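See http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html (in current scikit-learn releases train_test_split lives in sklearn.model_selection; the older sklearn.cross_validation module was removed in version 0.20).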


To demonstrate:

alvas@ubi:~$ mkdir splitfileproblem
alvas@ubi:~$ cd splitfileproblem/
alvas@ubi:~/splitfileproblem$ mkdir original
alvas@ubi:~/splitfileproblem$ mkdir train
alvas@ubi:~/splitfileproblem$ mkdir test
alvas@ubi:~/splitfileproblem$ ls
original  train  test
alvas@ubi:~/splitfileproblem$ cd original/
alvas@ubi:~/splitfileproblem/original$ ls
alvas@ubi:~/splitfileproblem/original$ echo 'abc' > a.txt
alvas@ubi:~/splitfileproblem/original$ echo 'def\nghi' > b.txt
alvas@ubi:~/splitfileproblem/original$ cat a.txt 
abc
alvas@ubi:~/splitfileproblem/original$ echo -e 'def\nghi' > b.txt
alvas@ubi:~/splitfileproblem/original$ cat b.txt 
def
ghi
alvas@ubi:~/splitfileproblem/original$ echo -e 'jkl' > c.txt
alvas@ubi:~/splitfileproblem/original$ echo -e 'mno' > d.txt
alvas@ubi:~/splitfileproblem/original$ ls
a.txt  b.txt  c.txt  d.txt

In Python:

alvas@ubi:~/splitfileproblem$ ls
original  test  train
alvas@ubi:~/splitfileproblem$ python
Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> from sklearn.model_selection import train_test_split
>>> os.listdir('original')
['b.txt', 'd.txt', 'c.txt', 'a.txt']
>>> X = y= os.listdir('original')
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
>>> X_train
['a.txt', 'd.txt', 'b.txt']
>>> X_test
['c.txt']

Now move the files:

>>> for x in X_train:
...     os.rename('original/'+x , 'train/'+x)
... 
>>> for x in X_test:
...     os.rename('original/'+x , 'test/'+x)
... 
>>> os.listdir('test')
['c.txt']
>>> os.listdir('train')
['b.txt', 'd.txt', 'a.txt']
>>> os.listdir('original')
[]

See also: How to move a file in Python
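
If scikit-learn is not installed, roughly the same thing can be done with the standard library alone. The following is a minimal sketch, not part of the session above: split_directory and its parameter names are hypothetical, it uses random.Random(seed).shuffle rather than train_test_split (so a given seed will not pick the same files as scikit-learn), and it assumes the test count should be rounded up, which is how I understand train_test_split treats a fractional test_size.

```python
import math
import os
import random
import shutil


def split_directory(src, train_dir, test_dir, test_size=0.25, seed=None):
    """Move every regular file in src into train_dir or test_dir.

    Hypothetical helper: test_size is the fraction of files sent to test_dir.
    Returns (train_names, test_names).
    """
    # Sort first so the result depends only on the seed, not on listdir order.
    names = sorted(n for n in os.listdir(src)
                   if os.path.isfile(os.path.join(src, n)))
    random.Random(seed).shuffle(names)

    # Assumption: round the test count up, so 4 files at 0.25 gives 1 test file.
    n_test = int(math.ceil(len(names) * test_size))
    test_names, train_names = names[:n_test], names[n_test:]

    for target, chosen in ((train_dir, train_names), (test_dir, test_names)):
        if not os.path.isdir(target):
            os.makedirs(target)
        for name in chosen:
            shutil.move(os.path.join(src, name), os.path.join(target, name))
    return train_names, test_names
```

Calling split_directory('original', 'train', 'test', test_size=0.25, seed=0) on the four files above would leave three in train and one in test. Swap shutil.move for shutil.copy2 to keep the originals in place.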

Lapillus answered 29/8, 2016 at 16:21 Comment(5)
The files aren't loaded into Python. The actual files need to be moved from A to B. - Rodriquez
It'd be nice to see a bash solution; I suspect it involves shuffle, mv, awk, ls =) - Lapillus
The problem here is that the files have to be randomly divided into train and test with the given ratio/split. - Rodriquez
The test_size parameter is the "ratio" ;P - Lapillus
@Lapillus - I've provided a bash-only solution. It doesn't use shuffle, as that's vendor-specific (Linux-only I believe), or awk, as that's another language entirely, and everything can be achieved within bash. As for mv, I believe it's secondary to the central problem of "how do you split a random set". And in cases like this, I would hope not to see an answer that parses ls, due to the well-known pitfall. Great answer, btw. I didn't know about scikit-learn, and it's great to get more exposure to Python. - Classicize
C
3

Here's a simple example that uses bash's $RANDOM to move things to one of two target directories.

$ touch {1..10}
$ mkdir red blue
$ a=(*/)
$ RANDOM=$$
$ for f in [0-9]*; do mv -v "$f" "${a[$((RANDOM/(32768/${#a[@]})))]}"; done
1 -> red/1
10 -> red/10
2 -> blue/2
3 -> red/3
4 -> red/4
5 -> red/5
6 -> red/6
7 -> blue/7
8 -> blue/8
9 -> blue/9

This example starts with the creation of 10 files and two target directories. It sets an array to */ which expands to "all the directories within the current directory". It then runs a for loop with what looks like line noise in it. I'll break it apart for ya.

"${a[$((RANDOM/(32768/${#a[@]})))]}" is:

  • ${a[ ... the array "a",
  • $((...)) ... whose subscript is an integer math function.
  • $RANDOM is a bash variable that generates a random(ish) number from 0 to 32767, and our formula divides the denominator of that ratio by:
  • ${#a[@]}, effectively multiplying RANDOM/32768 by the number of elements in the array "a".

The result of all this is that we pick a random array element, a.k.a. a random directory.
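
The integer arithmetic is easy to check outside the shell. The sketch below mirrors it in Python, purely as an illustration: raw_index and bucket_index are hypothetical names, r stands in for a $RANDOM value, and the clamp with min is an addition of this sketch, not something the bash one-liner does.

```python
RANDOM_RANGE = 32768  # $RANDOM yields integers 0..32767


def raw_index(r, n):
    """Same integer formula as the bash subscript: RANDOM/(32768/n)."""
    return r // (RANDOM_RANGE // n)


def bucket_index(r, n):
    """raw_index clamped to a valid subscript (the clamp is an addition)."""
    return min(raw_index(r, n), n - 1)


# With two targets the range splits exactly in half.
print(raw_index(16383, 2), raw_index(16384, 2))   # 0 1

# With three targets 32768//3 truncates to 10922, so the top two values
# of r (32766 and 32767) land on subscript 3, one past the end of the array.
print(raw_index(32767, 3), bucket_index(32767, 3))  # 3 2
```

So the formula is exact when the number of targets divides 32768 (2, 4, 8, ...); for other counts a rare draw can produce an out-of-range subscript.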

If you really want to work from your "list of files", and assuming you leave your list of potential targets in the array "a", you could replace the for loop with a while loop:

while IFS= read -r f; do
  mv -v "$f" "${a[$((RANDOM/(32768/${#a[@]})))]}"
done < /dir/file.txt

Now ... these solutions split results "evenly", because every target gets an equal slice of the random range. And because they're random, there's no way to ensure that your random numbers won't put all your files into a single directory. So to get a particular split, you need to be more creative.

Let's assume we're dealing with only two targets (since I think that's what you're doing). If you're looking for a 25/75 split, slice up the random number range accordingly.

$ declare -a b=([0]="red/" [8192]="blue/")
$ for f in {1..10}; do n=$RANDOM; for i in "${!b[@]}"; do [ $i -gt $n ] && break; o="${b[i]}"; done; mv -v "$f" "$o"; done

Broken out for easier reading, here's what we've got, with comments:

declare -a b=([0]="red/" [8192]="blue/")

for f in {1..10}; do         # Step through our files...
  n=$RANDOM                  # Pick a random number, 0-32767
  for i in "${!b[@]}"; do    # Step through the indices of the array of targets
    [ $i -gt $n ] && break   # If the current index is greater than the random number, stop.
    o="${b[i]}"              # If we haven't stopped, name this as our target,
  done
  mv -v "$f" "$o"            # and move the file there.
done

We define our split using the index of an array. 8192 is 25% of 32768, the number of values $RANDOM can take (0 to 32767). You could split things however you like within this range, including amongst more than 2 targets.
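
The lookup performed by the inner loop amounts to "find the last threshold that is not greater than n". A minimal Python sketch of the same idea, using bisect from the standard library instead of a loop; pick_target and the two lists are illustrative names that mirror the "b" array above:

```python
from bisect import bisect_right

# Thresholds play the role of the array indices in b; targets are its values.
THRESHOLDS = [0, 8192]
TARGETS = ["red/", "blue/"]


def pick_target(n, thresholds=THRESHOLDS, targets=TARGETS):
    """Return the target whose threshold is the largest one <= n."""
    return targets[bisect_right(thresholds, n) - 1]


print(pick_target(0))      # red/
print(pick_target(8191))   # red/
print(pick_target(8192))   # blue/
print(pick_target(32767))  # blue/
```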

If you want to test the results of this method, counting results in an array is a way to do it. Let's build a shell function to help with testing.

$ tester() { declare -A c=(); for f in {1..10000}; do n=$RANDOM; for i in "${!b[@]}"; do [ $i -gt $n ] && break; o="${b[i]}"; done; ((c[$o]++)); done; declare -p c; }
$ declare -a b='([0]="red/" [8192]="blue/")'
$ tester
declare -A c='([blue/]="7540" [red/]="2460" )'
$ b=([0]="red/" [10992]="blue/")
$ tester
declare -A c='([blue/]="6633" [red/]="3367" )'

On the first line, we define our function. The second line sets the "b" array with a 25/75 split, then we run the function, whose output is the counter array. Then we redefine the "b" array with a 33/67 split (or so), and run the function again to demonstrate results.
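
The counts from tester are samples, so they wobble around the target. The proportions they should converge to can be computed directly from the thresholds. A small sketch, assuming $RANDOM is uniform over 0..32767; expected_shares is a hypothetical helper, not part of the shell session:

```python
def expected_shares(thresholds, targets, size=32768):
    """Fraction of the 0..size-1 range that falls to each target.

    thresholds must be ascending and start at 0, like the indices of b.
    """
    bounds = list(thresholds) + [size]
    return {t: (bounds[i + 1] - bounds[i]) / float(size)
            for i, t in enumerate(targets)}


print(expected_shares([0, 8192], ["red/", "blue/"]))
print(expected_shares([0, 10992], ["red/", "blue/"]))
```

For the first array this gives 0.25 for red/ and 0.75 for blue/. For the second, red/ gets 10992/32768, about 0.335, in line with the 3367 out of 10000 observed above.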

So... While you certainly could use python for this, you can almost certainly achieve what you need with bash and a little math.

Classicize answered 31/8, 2016 at 16:4 Comment(0)
F
2

Here's a first rough-cut solution in pure Python:

import os
import random
import sys


def splitdirs(files, dir1, dir2, ratio):
    """Symlink a random `ratio` share of files into dir1 and the rest into dir2."""
    shuffled = files[:]
    random.shuffle(shuffled)
    # int() matters on Python 2, where round() returns a float.
    num = int(round(len(shuffled) * ratio))
    to_dir1, to_dir2 = shuffled[:num], shuffled[num:]
    for d in dir1, dir2:
        if not os.path.exists(d):
            os.mkdir(d)
    for target_dir, chosen in ((dir1, to_dir1), (dir2, to_dir2)):
        for path in chosen:
            # Link to the absolute path so the symlink resolves from inside target_dir.
            os.symlink(os.path.abspath(path),
                       os.path.join(target_dir, os.path.basename(path)))


if __name__ == '__main__':
    if len(sys.argv) != 5:
        sys.exit('Usage: {} files.txt dir1 dir2 ratio'.format(sys.argv[0]))
    list_path, dir1, dir2, ratio = sys.argv[1:]
    with open(list_path) as fh:
        files = fh.read().splitlines()
    splitdirs(files, dir1, dir2, float(ratio))

[thd@aspire ~]$ python ./test.py ./files.txt dev tst 0.4

Here 40% of the files listed in files.txt go to the dev directory and 60% go to tst.

It makes symlinks instead of copies; if you need real files, change os.symlink to shutil.copy2.
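
How many files end up in each directory follows from rounding len(files) * ratio. A quick sketch of that arithmetic (split_counts is a hypothetical name, not part of the script). Note that Python 3's round sends exact halves to the nearest even integer, so 5 files at a ratio of 0.5 gives 2 and 3 rather than 3 and 2:

```python
def split_counts(n_files, ratio):
    """Return (count for dir1, count for dir2) for a given ratio."""
    n_dir1 = int(round(n_files * ratio))
    return n_dir1, n_files - n_dir1


print(split_counts(10, 0.4))  # (4, 6)
print(split_counts(4, 0.25))  # (1, 3)
print(split_counts(5, 0.5))   # (2, 3) on Python 3
```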

Futurity answered 29/8, 2016 at 16:45 Comment(0)
