How can I find extended ASCII characters in a file using Perl?

6

7

How can I find extended ASCII characters in a file using Perl? Can anyone share a script?

Thanks in advance.

Buiron answered 19/5, 2009 at 10:17 Comment(0)
10

Since the extended ASCII characters have values of 128 and higher, you can call ord on individual characters and handle those with a value >= 128. The following code reads from the files named on the command line (or from stdin if none are given) and prints only the extended ASCII characters:

while (<>) {
  while (/(.)/g) {
    print($1) if (ord($1) >= 128);
  }
}

Alternatively, unpack together with chr will also work. Example:

while (<>) {
  foreach (unpack("C*", $_)) {
    print(chr($_)) if ($_ >= 128);
  }
}

(I'm sure some Perl guru can condense both of these to two one-liners...)


To print the line numbers instead, you can use the following (this does not remove duplicates, so a line number is printed once per matching character, and it will behave oddly on Unicode input):

while (<>) {
  while (/(.)/g) {
    print($. . "\n") if (ord($1) >= 128);
  }
}

(Thanks Yaakov Belch for the $. tip.)
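
If the duplicates are a problem, one option is to test each line once against a byte range instead of looping over its characters. This is only a minimal sketch, assuming the input is read as raw bytes; the helper name high_byte_lines and the in-memory sample data are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the numbers of the lines that contain
# at least one byte with a value of 128 or higher.
sub high_byte_lines {
    my ($fh) = @_;
    my @hits;
    while (my $line = <$fh>) {
        # One match test per line, so each line number is recorded once.
        push @hits, $. if $line =~ /[\x80-\xFF]/;
    }
    return @hits;
}

# Demo on an in-memory "file"; line 2 contains the byte 0xE9 twice.
my $data = "plain\ncaf\xE9 \xE9\nplain again\n";
open my $fh, '<:raw', \$data or die "Cannot open in-memory file: $!";
print "$_\n" for high_byte_lines($fh);   # prints 2
close $fh;
```

For a real file, the in-memory handle would be replaced by a handle opened on the file with the same ':raw' layer.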

Heartworm answered 19/5, 2009 at 10:32 Comment(3)
This is a very slow and inefficient approach; see Dave Sherohman's solution (#882431), which is far faster and simpler. - Chromoprotein
This answer was posted before Dave's. I have seen Dave's approach, and it is to be preferred in most instances. This just shows that I'm a Perl novice. I choose not to delete this answer because the last part appears to do exactly what the questioner wants. Also see #882622 - Heartworm
...ah, that page has been deleted. Suffice it to say, the question stated that the line number should be printed for each extended ASCII character. This is what my solution does. - Heartworm
8

The first printable ASCII character is space (32). The last printable ASCII character is ~ (126). So I'd probably use

while (<>) {
  chomp;  # the trailing newline is itself outside the range ' ' to '~'
  print "$.\n" if /[^ -~]/;
}

although it will, admittedly, display lines containing control characters (a tab, for example) as well as extended ASCII.

Edit: Changed to print the line number rather than the line itself.
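
To tell the two cases apart, the single negated class can be split into one class for control characters and one for the high-bit range. A rough sketch, assuming byte-oriented input; classify_line is a hypothetical helper name and the sample lines are invented:

```perl
use strict;
use warnings;

# Hypothetical helper: reports which kinds of non-printable-ASCII
# bytes a line contains, ignoring its line ending.
sub classify_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # drop LF or CRLF before testing
    my @kinds;
    push @kinds, 'control'  if $line =~ /[\x00-\x1F\x7F]/;
    push @kinds, 'extended' if $line =~ /[\x80-\xFF]/;
    return @kinds;
}

my @samples = ("plain text\n", "tab\there\n", "caf\xE9\n", "both\t\xE9\n");
for my $i (0 .. $#samples) {
    my @kinds = classify_line($samples[$i]);
    # Only lines with something to report are printed.
    printf "%d: %s\n", $i + 1, join(',', @kinds) if @kinds;
}
```

Lines 2 to 4 of the sample are reported, as control, extended, and control,extended respectively.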

Rockbound answered 19/5, 2009 at 11:4 Comment(2)
It's easy to print the line number instead of the line: while (<>) { chomp; print "$.\n" if /[^ -~]/; } This should solve the stated problem. - Orlantha
Whoops! I was just reading the question itself and missed that the title specified that he wanted the line number. Thanks for the catch. - Rockbound
5

One-liner:

perl -nE'say$.if/[\xE0-\xFF]/'

For older Perl versions (before 5.10, which introduced say):

perl -lne'print$.if/[\xE0-\xFF]/'
Chromoprotein answered 19/5, 2009 at 12:27 Comment(0)
2

A crucial question is whether the

use bytes;

pragma should be in effect. The poster should decide that. For picking characters with codes greater than 127, the following will suffice:

print grep 127 < ord, split // while <>;

or

print grep /[^[:ascii:]]/, split // while <>;
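
The bytes-versus-characters decision can also be made when the file is opened, through the I/O layer. The sketch below is only an illustration under that assumption: high_values is a hypothetical helper, and the sample is the UTF-8 encoding of "café". Read as raw bytes it yields two values above 127; decoded as UTF-8 it yields one.

```perl
use strict;
use warnings;

# "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
my $bytes = "caf\xC3\xA9\n";

# Hypothetical helper: lists the code values above 127 as seen
# through the given I/O layer.
sub high_values {
    my ($raw, $layer) = @_;
    open my $fh, "<$layer", \$raw or die "Cannot open in-memory file: $!";
    my @values;
    while (my $line = <$fh>) {
        push @values, grep { $_ > 127 } map { ord($_) } split //, $line;
    }
    close $fh;
    return @values;
}

print "raw:   @{[ high_values($bytes, ':raw') ]}\n";              # 195 169
print "utf-8: @{[ high_values($bytes, ':encoding(UTF-8)') ]}\n";  # 233
```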
Divergence answered 19/5, 2009 at 12:38 Comment(0)
2

Hynek -Pichi- Vychodil's answer:

perl -nE'say$.if/[\xE0-\xFF]/'

only tests a limited part of the extended range (0xE0 to 0xFF), so it misses 0x80 to 0xDF. It should presumably be

perl -nE'say$.if/[\x80-\xFF]/'

instead.
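
A quick way to see the difference is to count how many of the 128 byte values from 0x80 to 0xFF each character class matches. This is just an illustrative check, not part of either one-liner:

```perl
use strict;
use warnings;

# Every single-byte value in the extended range 0x80..0xFF.
my @high = map { chr($_) } 0x80 .. 0xFF;

# grep in scalar context returns the number of matching elements.
my $narrow = grep { /[\xE0-\xFF]/ } @high;
my $full   = grep { /[\x80-\xFF]/ } @high;

print "narrow class matches $narrow of ", scalar(@high), " values\n";  # 32 of 128
print "full class matches $full of ", scalar(@high), " values\n";      # 128 of 128
```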

Mat answered 24/6, 2009 at 14:41 Comment(0)
1

What about grep? Assuming GNU grep (for -P), with the pattern quoted so the shell does not expand it:

LC_ALL=C grep -nP '[\x00-\x1F\x7F-\xFF]' *
Natica answered 8/1, 2010 at 22:21 Comment(0)
