Regex: How to remove extra spaces between strings in Perl

I am working on a program that takes user input for two file names. Unfortunately, the program can easily break if the user does not follow the specified input format. I want to write code that makes it more resilient to these kinds of errors. You'll understand when you see my code:

# Ask the user for the filename of the qseq file and barcode.txt file
print "Please enter the name of the qseq file and the barcode file separated by a comma:";
# user should enter filenames like this: sample1.qseq, barcode.txt

# remove the newline from the qseq filename
chomp ($filenames = <STDIN>);

# an empty array
my @filenames;

# remove the ',' and put the files into an array separated by spaces; indexes the files
push @filename, join(' ', split(',', $filenames))

# the qseq file
my $qseq_filename = shift @filenames;

# the barcode file.
my $barcode = shift @filenames;

Obviously this code can run into errors if the user enters the wrong type of filename (a .tab file instead of .txt, or .seq instead of .qseq). I want code that checks that the user has entered the appropriate file type.

Another error that could break the code is if the user enters too many spaces before the filenames. For example: sample1.qseq,(imagine 6 spaces here) barcode.txt (Notice the numerous spaces after the comma)

Another example: (imagine 6 spaces here) sample1.qseq,barcode.txt (This time notice the number of spaces before the first filename)

I also want lines of code that can remove extra spaces so that the program doesn't break. I think the user input has to be in the following kind of format: sample1.qseq, barcode.txt. The user input has to be in this format so that I can properly index the filenames into an array and shift them out later.

Thanks! Any help or suggestions are greatly appreciated.

Licit answered 9/6, 2012 at 1:30 Comment(1)
I forgot to mention: this is just one of the six scripts I have to modify for a piped run on the command line. In other words, I want the piped run to work like: Script00.pl | Script01.pl | Script02.pl | Script03.pl | Script04.pl | Script05.pl | Script06.pl. This is the first script in the pipe run. – Licit

The standard way to deal with this kind of problem is to use command-line options rather than gathering input from STDIN. Getopt::Long comes with Perl and is serviceable:

use strict; use warnings FATAL => 'all';
use Getopt::Long qw(GetOptions);
my %opt;
GetOptions(\%opt, 'qseq=s', 'barcode=s') or die;
die <<"USAGE" unless exists $opt{qseq} and $opt{qseq} =~ /^sample\d[.]qseq$/ and exists $opt{barcode} and $opt{barcode} =~ /^barcode.*\.txt$/;
Usage: $0 --qseq sample1.qseq --barcode barcode.txt
       $0 -q sample1.qseq -b barcode.txt
USAGE
printf "q==<%s> b==<%s>\n", $opt{qseq}, $opt{barcode};

The shell will deal with any extraneous whitespace; try it and see. You still need to validate the file names yourself; the regex checks in the example are something I made up. Employ Pod::Usage for a fancier way to output helpful documentation to users who are likely to get the invocation wrong.
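
Since the validation is the part most likely to need adapting, here is a minimal sketch that pulls the two made-up patterns from the example into a helper so each can be read on its own. The name names_look_valid is hypothetical, and the patterns are only the illustrative ones from above, not a rule for what real qseq or barcode files must be called.

```perl
use strict;
use warnings;

# Hypothetical helper: true when both names match the illustrative
# patterns used in the Getopt::Long example above.
sub names_look_valid {
    my ($qseq, $barcode) = @_;
    return 0 unless defined $qseq and defined $barcode;
    # "sample", exactly one digit, then a literal ".qseq"
    return 0 unless $qseq =~ /^sample\d[.]qseq$/;
    # "barcode", any characters, then a literal ".txt"
    return 0 unless $barcode =~ /^barcode.*\.txt$/;
    return 1;
}

for my $pair (['sample1.qseq', 'barcode.txt'], ['sample1.seq', 'barcode.tab']) {
    printf "%s %s => %s\n", @$pair, names_look_valid(@$pair) ? 'ok' : 'rejected';
}
```

The first pair is accepted and the second is rejected. Loosening the patterns (for example allowing more than one digit after "sample") is a matter of editing the two regexes.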

There are dozens of more advanced Getopt modules on CPAN.

Selima answered 9/6, 2012 at 2:6 Comment(4)
Thanks daxim! It seems like utilising command-line options with Getopt::Long is the way to go. Additionally, it looks like you even provide a check to see that the file name is correct. Thank you, I wouldn't have figured that out myself. Can you quickly explain how each line of the code works? With almost a year of experience, I'm still a relatively novice Perl programmer. I see that you store the file names in a hash %opt. But can you explain how the regex bit, the USAGE and the other parts work? I will look at the Getopt::Long module. – Licit
Also, do you think this module will work for the kind of overall project I'm working on? You see, this is just one of the six scripts I have to modify for a piped run on the command line. In other words, I want the piped run to work like: Script00.pl | Script01.pl | Script02.pl | Script03.pl | Script04.pl | Script05.pl | Script06.pl. Any follow-up feedback is greatly appreciated. – Licit
Piping commands works entirely based upon their output. Basically, the output of the first command needs to be what you require as the input for the next command. – Haricot
I don't have enough space to explain everything. With one year of experience, you should already know about heredocs and regex. These are your keywords to search for; go on and refresh your knowledge: learn.perl.org, perl-tutorial.org, p3rl.org/retut – I can't answer the question about the pipe chain, there is too little detail; it is best to open a separate new question. – Selima

First, put use strict; at the top of your code and declare your variables.

Second, this:

# remove the ',' and put the files into an array separated by spaces; indexes the files
push @filename, join(' ', split(',', $filenames))

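is not going to do what you want. split() takes a string and returns a list of pieces; join() takes a list of items and returns a single string. Joining the pieces back together leaves you with one string in the array, not two file names. You just want to split: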

my @filenames = split(',', $filenames);

That will create an array like you expect.

This function will safely trim white space from the beginning and end of a string:

sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

Access it like this:

my $file = trim(shift @filenames);
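
Putting the two pieces together, a minimal sketch of the STDIN route might look like the following. It reuses the trim sub from above; parse_filenames is a hypothetical helper name, and the input string stands in for what the user would type.

```perl
use strict;
use warnings;

# Same trim as above: strip leading and trailing whitespace.
sub trim {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}

# Hypothetical helper: split one line of input on commas and trim each piece.
sub parse_filenames {
    my ($line) = @_;
    return map { trim($_) } split /,/, $line;
}

# Extra spaces before the first name and after the comma are harmless.
my ($qseq_filename, $barcode) = parse_filenames("   sample1.qseq,      barcode.txt\n");
print "qseq=<$qseq_filename> barcode=<$barcode>\n";
```

Because the trailing newline counts as whitespace, trim also makes a separate chomp unnecessary here.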

Depending on your script, it might be easier to pass the strings as command-line arguments. You can access them through the @ARGV array, but I prefer to use Getopt::Long:

use strict;
use Getopt::Long;
Getopt::Long::Configure("bundling");

my ($qseq_filename, $barcode);

GetOptions (
    'q|qseq=s' => \$qseq_filename,
    'b|bar=s'  => \$barcode,
);

You can then call this as:

./script.pl -q sample1.qseq -b barcode.txt

And the variables will be properly populated without a need to worry about trimming white space.

Haricot answered 9/6, 2012 at 1:48 Comment(1)
Thanks Llion for revising my code. I might use the trim subroutine you provided. That should take care of any leading or trailing white space. The Getopt::Long module you suggested sounds like just the thing I need; however, this is just a snippet of the overall project. You see, this is just one of the six scripts I have to modify for a piped run on the command line. In other words, I want the piped run to work like: Script00.pl | Script01.pl | Script02.pl | Script03.pl | Script04.pl | Script05.pl | Script06.pl. I will definitely see if this module works well for that. Thanks again. – Licit

You'll need to trim spaces before handling the filename data in your routine. You could then check the file extension with yet another regular expression, as nicely described in "Is there a regular expression in Perl to find a file's extension?". If it's the actual type of file that matters to you, then it might be more worthwhile to check for that instead with File::LibMagicType.
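
As a rough sketch of the extension check, the core File::Basename module can split off the suffix so that it can be compared against what the script expects. The helper name has_extension and the sample names are assumptions for illustration; this only looks at the name, not at the file's contents.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: true when $name ends in the expected extension.
# The comparison ignores case, which is an assumption about what is wanted.
sub has_extension {
    my ($name, $expected) = @_;
    # The suffix pattern matches the last dot and everything after it.
    my (undef, undef, $suffix) = fileparse($name, qr/\.[^.]*/);
    return lc($suffix) eq lc($expected) ? 1 : 0;
}

for my $check (['sample1.qseq', '.qseq'], ['barcode.tab', '.txt']) {
    my ($name, $ext) = @$check;
    printf "%s: %s\n", $name, has_extension($name, $ext) ? 'ok' : "expected $ext";
}
```

The first name passes and the second is reported as not being a .txt file.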

Jarv answered 9/6, 2012 at 1:47 Comment(1)
@Selima thanks for these great links. Thanks for the answer, Harald. – Licit

While I think your design is a little iffy, the following should work:

my @fileNames = split(',', $filenames);
foreach my $fileName (@fileNames) {
  if($fileName =~ /\s/) {
    print STDERR "Invalid filename.";
    exit -1;
  }
}
my ($qseq, $barcode) = @fileNames;
Lancelot answered 9/6, 2012 at 1:49 Comment(2)
That doesn't really answer the question though. It just errors out when the format is unexpected. What if there are spaces in the file name? – Haricot
Yeah, I imagine something like this would quickly frustrate the user. I'm trying to write code that is user-friendly. Good suggestion though. – Licit

And here is one more way you could do it with regex (if you are reading the input from STDIN):

# read a line from STDIN
my $filenames = <STDIN>;

# parse the line with a regex or die with an error message
my ($qseq_filename, $barcode) = $filenames =~ /^\s*(\S.*?)\s*,\s*(\S.*?)\s*$/
    or die "invalid input '$filenames'";
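
For readers who find the one-line pattern dense, here is the same regex spelled out with the /x modifier so each part can carry a comment. This is only a sketch: parse_pair is a hypothetical wrapper name, and the input string stands in for a line read from STDIN.

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns the two captured names, or an empty list
# when the line does not look like "name , name".
sub parse_pair {
    my ($line) = @_;
    return $line =~ /
        ^ \s*        # optional leading whitespace
        ( \S .*? )   # first name: starts with a non-space, as short as possible
        \s* , \s*    # the comma, with optional whitespace on either side
        ( \S .*? )   # second name, same rule
        \s* $        # optional trailing whitespace, newline included
    /x;
}

my ($qseq_filename, $barcode) = parse_pair("      sample1.qseq,      barcode.txt\n")
    or die "invalid input";
print "<$qseq_filename> <$barcode>\n";
```

Because the captures are non-greedy and anchored by the surrounding \s*, spaces inside a name are kept while spaces around it are dropped.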
Gayle answered 9/6, 2012 at 2:13 Comment(0)
