Get last n lines or bytes of a huge file in Windows (like Unix's tail). Avoid time consuming options
G

6

40

I need to retrieve the last n lines of huge files (1-4 GB) in Windows 7. Due to corporate restrictions, I cannot run any command that is not built-in. The problem is that all the solutions I found appear to read the whole file, so they are extremely slow.

Can this be accomplished, fast?

Notes:

  1. I managed to get the first n lines, fast.
  2. It is ok if I get the last n bytes. (I used this https://mcmap.net/q/393436/-get-the-first-n-characters-of-a-large-file-with-powershell for the first n bytes).

The solutions in "Unix tail equivalent command in Windows Powershell" did not work. Using -wait does not make it fast. I do not have -tail (and I do not know whether it would be fast).

PS: There are quite a few related questions for head and tail, but not focused on the issue of speed. Therefore, useful or accepted answers there may not be useful here. E.g.,

Windows equivalent of the 'tail' command

CMD.EXE batch script to display last 10 lines from a txt file

Extract N lines from file using single windows command

https://serverfault.com/questions/490841/how-to-display-the-first-n-lines-of-a-command-output-in-windows-the-equivalent

powershell to get the first x MB of a file

https://superuser.com/questions/859870/windows-equivalent-of-the-head-c-command

Graphology answered 8/4, 2016 at 19:2 Comment(4)
Batch file is a bad choice for that, because it is very difficult or even almost impossible to handle binary files correctly (I suppose you are talking about such, as you want to extract a certain amount of bytes rather than characters or lines); so I would definitely go for PS... - Unsteady
@aschipfl: batch files are much simpler & faster than PS - Dump
@sancho: as a matter of interest, could you share your solution for reading the first n lines of a big file? I want to view the first couple of "lines" of a binary file that contains some text, but don't want to read the whole thing in... - Stranger
@Stranger - This is an old question. I am not sure I kept that old version, and I wouldn't know at the moment where to look. My apologies. - Graphology
C
20

How about this (reads last 8 bytes for demo):

$fpath = "C:\10GBfile.dat"
$fs = [IO.File]::OpenRead($fpath)
$fs.Seek(-8, 'End') | Out-Null
for ($i = 0; $i -lt 8; $i++)
{
    $fs.ReadByte()
}

UPDATE. To interpret bytes as string (but be sure to select correct encoding - here UTF8 is used):

$N = 8
$fpath = "C:\10GBfile.dat"
$fs = [IO.File]::OpenRead($fpath)
$fs.Seek(-$N, [System.IO.SeekOrigin]::End) | Out-Null
$buffer = new-object Byte[] $N
$fs.Read($buffer, 0, $N) | Out-Null
$fs.Close()
[System.Text.Encoding]::UTF8.GetString($buffer)

UPDATE 2. To read the last M lines, we read the file backwards in portions until the result contains at least M newline sequences:

$M = 3
$fpath = "C:\10GBfile.dat"

$result = ""
$seq = "`r`n"
$buffer_size = 10
$buffer = new-object Byte[] $buffer_size

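$fs = [IO.File]::OpenRead($fpath)
$pos = $fs.Length   # byte offset in the file where $result currently starts
# Read backwards until $result holds at least $M separators or the start of the file is reached
while ($pos -gt 0 -and ([regex]::Matches($result, $seq)).Count -lt $M)
{
    $count = [int][Math]::Min([long]$buffer_size, $pos)   # never seek before the start of the file
    $pos -= $count
    $fs.Seek($pos, [System.IO.SeekOrigin]::Begin) | Out-Null
    $fs.Read($buffer, 0, $count) | Out-Null
    $result = [System.Text.Encoding]::UTF8.GetString($buffer, 0, $count) + $result
}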
$fs.Close()

($result -split $seq) | Select -Last $M

Try a bigger $buffer_size: ideally it equals the expected average line length, so that fewer disk operations are needed. Also pay attention to $seq, which could be \r\n or just \n. This is very rough code without any error handling or optimizations.
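
For what it's worth, here is a rough sketch of the same idea wrapped in a function. It scans backwards for line feed bytes first and only decodes once at the end, so a multi-byte character is never cut in the middle of a chunk. The function name Get-TailLines and its parameters are made up for illustration, and it assumes an encoding where the byte 0x0A only ever means a line feed (ASCII, UTF-8 and similar, not UTF-16). It only uses features that should be available in PowerShell 2.

```powershell
# Hypothetical helper: name and parameters are illustrative, not part of the answer above.
function Get-TailLines {
    param(
        [string]$Path,
        [int]$Count = 10,
        [int]$ChunkSize = 4096,
        [System.Text.Encoding]$Encoding = [System.Text.Encoding]::UTF8
    )
    $fs = [IO.File]::OpenRead($Path)
    try {
        $len   = $fs.Length
        $pos   = $len        # start of the region scanned so far
        $start = 0           # where the wanted lines begin (whole file if it has too few lines)
        $seen  = 0           # line feeds counted so far
        $found = $false
        $chunk = New-Object byte[] $ChunkSize
        while ($pos -gt 0 -and -not $found) {
            $n = [int][Math]::Min([long]$ChunkSize, $pos)   # do not seek before the file start
            $pos -= $n
            $fs.Seek($pos, [IO.SeekOrigin]::Begin) | Out-Null
            $fs.Read($chunk, 0, $n) | Out-Null
            for ($i = $n - 1; $i -ge 0; $i--) {
                if ($chunk[$i] -eq 10) {
                    # A line feed as the very last byte ends the final line, it does not start one
                    if ($pos + $i -eq $len - 1) { continue }
                    $seen++
                    if ($seen -eq $Count) { $start = $pos + $i + 1; $found = $true; break }
                }
            }
        }
        # Read everything from the computed offset to the end and decode it in one go
        $fs.Seek($start, [IO.SeekOrigin]::Begin) | Out-Null
        $bytes = New-Object byte[] ([int]($len - $start))
        $fs.Read($bytes, 0, $bytes.Length) | Out-Null
    }
    finally { $fs.Close() }
    # Drop one trailing newline, then split on either \r\n or \n
    ($Encoding.GetString($bytes) -replace '\r?\n\z', '') -split '\r?\n'
}
```

Usage would look like Get-TailLines -Path "C:\10GBfile.dat" -Count 3. The default $ChunkSize of 4096 is an arbitrary choice.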

Canst answered 8/4, 2016 at 22:24 Comment(3)
This actually works fast, but it outputs the decimal code for each byte. I mean to get the corresponding string of chars. - Graphology
Updated, please check. Just noticed I forgot $fs.Close() at the first sample but I hope it's not that critical with this proof-of-concept code. Good luck! - Canst
Thanks! I was writing my own code, and posted an answer that works too. I do not usually code PS, so it may be rudimentary. - Graphology
L
117

If you have PowerShell 3 or higher, you can use the -Tail parameter for Get-Content to get the last n lines.

Get-content -tail 5 PATH_TO_FILE;

On a 34MB text file on my local SSD, this returned in 1 millisecond vs. 8.5 seconds for get-content |select -last 5
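
If you want to reproduce this kind of comparison on your own file, something along these lines should do. It is only a sketch: Compare-TailTiming is a made-up helper name, it needs PowerShell 3 or later because of -Tail, and the numbers will depend on the file, the disk and caching.

```powershell
# Hypothetical helper for timing both approaches; requires PowerShell 3+ because of -Tail.
function Compare-TailTiming {
    param([string]$Path, [int]$Lines = 5)
    # Get-Content -Tail reads from the end of the file
    $tail = Measure-Command { Get-Content -Path $Path -Tail $Lines }
    # The pipeline form streams every line of the file through Select-Object
    $pipe = Measure-Command { Get-Content -Path $Path | Select-Object -Last $Lines }
    New-Object PSObject -Property @{
        TailMs     = $tail.TotalMilliseconds
        PipelineMs = $pipe.TotalMilliseconds
    }
}
```

Call it as Compare-TailTiming -Path PATH_TO_FILE -Lines 5 and compare the two properties of the returned object.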

Lacework answered 8/4, 2016 at 19:13 Comment(8)
I do not have -Tail. - Graphology
Then get your environment upgraded to a recent release of PowerShell. Unless you have some weird compatibility issues that need to be preserved, there's no reason to not upgrade to at least v3, preferably 4 or 5 (whatever the highest one your systems support is). - Lacework
Due to the same corporate restrictions that I cannot run any command that is not built-in, I cannot upgrade either. I get what they give me. - Graphology
Then your corporate IT environment is broken and I'd recommend looking for someplace that at least attempts to stay current on its software. - Lacework
They may be "broken", or overburdened with work, or... This is not uncommon in large companies, where IT takes time to update the "standard environment". It may be a nuisance, but I would not change jobs because of this, unless it becomes a serious hurdle for performing my duties. This is not the case... - Graphology
Sorry, but if a piece of software which is considered a core windows component hasn't been considered for an upgrade in the more than 3 years since it was released, I perceive the environment as broken. What else is out of date, or even worse, left unpatched for security & bug fixes? How far can you really advance your own career and technical knowledge when you're saddled with out of date software? That is why you move on - because you can't improve your own skills in such an environment. - Lacework
I don't work for a software company. Not having PS3 is not a symptom of a need for going elsewhere (even if it would be convenient to have it!). That is my perception. Thanks for the vividness! - Graphology
I don't work for a software company either. But that doesn't mean that your corporate computing environment gets a free pass on being stuck on ancient software. Staying current on your software is part of the cost of doing business today and if they're not willing to invest there, they're probably not investing in their people or other things that are important to keeping operations running. - Lacework
H
7

When the file is already open in another process, it's better to use

Get-Content $fpath -tail 10

because of "exception calling "OpenRead" with "1" argument(s): "The process cannot access the file..."
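
If -tail is not available, a possible workaround (a sketch only) is to open the stream with a more permissive share mode. [IO.File]::OpenRead opens with FileShare.Read, which conflicts with a process that already has the file open for writing; [IO.File]::Open lets you pass FileShare.ReadWrite instead. This only helps when the other process itself allows readers. The function name Read-TailBytesShared is made up for illustration, and UTF8 is an assumption about the file's encoding.

```powershell
# Hypothetical helper: reads the last $Count bytes while tolerating an existing writer.
function Read-TailBytesShared {
    param([string]$Path, [int]$Count = 8)
    # Read-only access, but let other processes keep reading and writing the file
    $fs = [IO.File]::Open($Path, [IO.FileMode]::Open, [IO.FileAccess]::Read, [IO.FileShare]::ReadWrite)
    try {
        $n = [int][Math]::Min([long]$Count, $fs.Length)   # do not seek before the file start
        $fs.Seek(-$n, [IO.SeekOrigin]::End) | Out-Null
        $buffer = New-Object byte[] $n
        $fs.Read($buffer, 0, $n) | Out-Null
    }
    finally { $fs.Close() }
    # Assumes UTF8; pick the encoding that matches your file
    [System.Text.Encoding]::UTF8.GetString($buffer)
}
```

For example, Read-TailBytesShared -Path $fpath -Count 100 returns the last 100 bytes as a string.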

Hawkes answered 21/7, 2020 at 19:26 Comment(0)
G
3

With the awesome answer by Aziz Kabyshev, which solves the issue of speed, and with some googling, I ended up using this script

$fpath = $Args[1]
$fs = [IO.File]::OpenRead($fpath)
$fs.Seek(-$Args[0], 'End') | Out-Null
$mystr = ''
for ($i = 0; $i -lt $Args[0]; $i++)
{
    $mystr = ($mystr) + ([char[]]($fs.ReadByte()))
}
$fs.Close()
Write-Host $mystr

which I call from a batch file containing

@PowerShell -NoProfile -ExecutionPolicy Bypass -Command "& '.\myscript.ps1' %1 %2"

(thanks to How to run a PowerShell script from a batch file).

Graphology answered 9/4, 2016 at 11:56 Comment(2)
Byte-to-char is always dependent on encoding, don't forget about it - Canst
@AzizKabyshev - That is true. For the files I know I will have, this is ok. - Graphology
L
3

This is not an answer, but a large comment as reply to sancho.s' answer.

When you want to use small PowerShell scripts from a Batch file, I suggest using the method below, which is simpler and keeps all the code in the same Batch file:

@PowerShell  ^
   $fpath = %2;  ^
   $fs = [IO.File]::OpenRead($fpath);  ^
   $fs.Seek(-%1, 'End') ^| Out-Null;  ^
   $mystr = '';  ^
   for ($i = 0; $i -lt %1; $i++)  ^
   {  ^
      $mystr = ($mystr) + ([char[]]($fs.ReadByte()));  ^
   }  ^
   Write-Host $mystr
%End PowerShell%
Luger answered 9/4, 2016 at 14:46 Comment(1)
This is very useful for me. A caveat: the way to execute this is with myscript.bat nbytes 'myfile'. Putting the filename in single quotes is mandatory. Omitting the quotes or using double quotes did not work, as opposed to executing a batch file that calls a ps1 script. - Graphology
D
1

Get last n bytes of a file:

set file="C:\Covid.mp4"
set n=7

copy /b %file% tmp
for %i in (tmp) do set /a m=%~zi-%n%
FSUTIL file seteof tmp %m%
fsutil file createnew temp 1
FSUTIL file seteof temp %n%
type temp >> tmp
fc /b tmp %file% | more +1 > temp

REM problem parsing file with byte offsets in hex from fc, to be converted to decimal offsets before output
type nul > tmp
for /f "tokens=1-3 delims=: " %i in (temp) do set /a 0x%i >> tmp & set /p=": " <nul>> tmp & echo %j %k >> tmp

set /a n=%m%+%n%-1

REM output
type nul > temp
for /l %j in (%m%,1,%n%) do (find "%j: "<  tmp || echo doh: la 00)>> temp
(for /f "tokens=3" %i in (temp) do set /p=%i <nul) & del tmp & del temp

Tested on Win 10 cmd Surface Laptop 1
Result: 1.43 GB file processed in 10 seconds

Dump answered 12/10, 2021 at 8:18 Comment(0)
