Optimizing a file content parser class written in TypeScript
I have a TypeScript module (used by a VS Code extension) which accepts a directory and parses the content of the files within it. For directories containing a large number of files, this parsing takes a while, so I would like some advice on how to optimize it.

I don't want to copy/paste the entire class files, so I will be using mock pseudocode containing the parts that I think are relevant.

class Parser {
    private dir: string;

    constructor(_dir: string) {
        this.dir = _dir;
    }

    async parse() {
        let tree: any = getFileTree(this.dir);

        try {
            let parsedObjects: MyDTO[] = await this.iterate(tree.children);
        } catch (err) {
            console.error(err);
        }
    }

    async iterate(children: any[]): Promise<MyDTO[]> {
        let objs: MyDTO[] = [];

        for (let i = 0; i < children.length; i++) {
            let child: any = children[i];

            if (child.type === Constants.FILE) {
                let dto: FileDTO = await this.heavyFileProcessingMethod(child.path); // this takes time
                objs.push(dto);
            } else {
                // child is a folder
                let dtos: MyDTO[] = await this.iterate(child.children);
                let dto: FolderDTO = new FolderDTO();
                dto.files = dtos.filter(item => item instanceof FileDTO);
                dto.folders = dtos.filter(item => item instanceof FolderDTO);
                objs.push(dto);
            }
        }

        return objs;
    }

    async heavyFileProcessingMethod(file: string): Promise<FileDTO> {
        let content: string = readFile(file); // util method to synchronously read file content using fs

        return new FileDTO(await this.parseFileContent(content));
    }

    async parseFileContent(content: string): Promise<any[]> {
        // parsing happens here and the file content is parsed into separate blocks
        let ast: any = await convertToAST(content); // uses an asynchronous method of an external dependency to convert content to AST
        let blocks = parseToBlocks(ast); // synchronous method called to convert AST to blocks

        return await this.processBlocks(blocks);
    }

    async processBlocks(blocks: any[]): Promise<any[]> {
        for (let i = 0; i < blocks.length; i++) {
            let block: Block = blocks[i];
            if (block.condition === true) {
                // this can take some time because if this condition is true, some external assets will be downloaded (via internet) 
                // on to the caller's machine + some additional processing takes place
                await processBlock(block);
            }
        }
        return blocks;
    }
}

Still sort of a beginner to TypeScript/Node.js. I am looking for a multithreading/Java-esque solution here if possible. In Java terms, this.heavyFileProcessingMethod would be an instance of Callable, and these objects would be pushed into a List<Callable> which would then be executed in parallel by an ExecutorService, returning a List<Future<Object>>.

Basically I want all files to be processed in parallel, but the function must wait for all the files to be processed before returning (so the entire iterate method should only take as long as the time taken to parse the largest file).

Been reading about running tasks in worker threads in Node.js; can something like this be used in TypeScript as well? If so, can it be used in this situation? If my Parser class needs to be refactored to accommodate this change (or any other suggested change), that's no issue.

EDIT: Using Promise.all

iterate(children: any[]): Promise<MyDTO>[] {
    let promises: Promise<MyDTO>[] = [];

    for (let i = 0; i < children.length; i++) {
        let child: any = children[i];

        if (child.type === Constants.FILE) {
            let promise: Promise<FileDTO> = this.heavyFileProcessingMethod(child.path); // this takes time
            promises.push(promise);
        } else {
            // child is a folder
            let dtos: Promise<MyDTO>[] = this.iterate(child.children);
            let promise: Promise<FolderDTO> = this.getFolderPromise(dtos);
            promises.push(promise);
        }
    }

    return promises;
}

async getFolderPromise(promises: Promise<MyDTO>[]): Promise<FolderDTO> {
    return Promise.all(promises).then(dtos => {
        let dto: FolderDTO = new FolderDTO();
        dto.files = dtos.filter(item => item instanceof FileDTO);
        dto.folders = dtos.filter(item => item instanceof FolderDTO);
        return dto;
    });
}
Gudgeon answered 17/5, 2021 at 19:52 Comment(2)
You should avoid using await in a for loop - it will effectively block until the next loop, then block again, until it is done. Instead you can create an array of promises and then use await Promise.all(promises) to resolve them concurrently. – Neuman
@Neuman thanks for the reply. I refactored the code a bit to show an attempt at using Promise.all as suggested. Did not try out the code yet, but could you please check and let me know if my attempt seems correct? Thanks – Gudgeon
first: Typescript is really Javascript

Typescript is just Javascript with static type checking, and those static types are erased when the TS is transpiled to JS. Since your question is about algorithms and runtime language features, Typescript has no bearing; your question is a Javascript one. So right off the bat that tells us the answer to

Been reading on running tasks in worker threads in NodeJS, can something like this be used in TypeScript as well?

is YES.

As to the second part of your question,

can it be used in this situation?

the answer is YES, but...

second: Use Worker Threads only if the task is CPU bound.

Can does not necessarily mean you should. It depends on whether your processes are IO bound or CPU bound. If they are IO bound, you're most likely far better off relying on Javascript's longstanding asynchronous programming model (callbacks, Promises). But if they are CPU bound, then using Node's relatively new support for Thread-based parallelism is more likely to result in throughput gains. See Node.js Multithreading!, though I think this one is better: Understanding Worker Threads in Node.js.
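The difference between awaiting sequentially and letting promises run concurrently (the model this answer recommends for IO-bound work) can be sketched as follows; `task` is a hypothetical stand-in for per-file processing:

```typescript
// Hypothetical stand-in for an IO-bound per-file task (~50ms each).
function task(n: number): Promise<number> {
  return new Promise(resolve => setTimeout(() => resolve(n * 2), 50));
}

// Sequential: each await blocks the loop until the previous task finishes.
async function sequential(): Promise<number[]> {
  const out: number[] = [];
  for (const n of [1, 2, 3]) {
    out.push(await task(n));
  }
  return out; // takes roughly 150ms total
}

// Concurrent: all tasks start immediately; Promise.all waits for every one.
async function concurrent(): Promise<number[]> {
  return Promise.all([1, 2, 3].map(task)); // takes roughly 50ms total
}

concurrent().then(r => console.log(r)); // [ 2, 4, 6 ]
```

Both produce the same results in the same order; only the elapsed time differs.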

While worker threads are lighter weight than Node's previous option for parallelism (spawning child processes), they are still relatively heavyweight compared to threads in Java. Each worker runs in its own Node VM, and regular variables are not shared (you have to use special data types and/or message passing to share data). It had to be done this way because Javascript is designed around a single-threaded programming model. It's extremely efficient within that model, but that design makes support for multithreading harder. Here's a good SO answer with useful info for you: https://mcmap.net/q/631693/-is-node-js-considered-multithreading-with-worker-threads
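A minimal single-file sketch of the worker-thread fan-out pattern, analogous to the Java ExecutorService the question mentions. The worker body is passed as a string with `eval: true` just to keep the example self-contained; in a real project it would be its own compiled file, and `toUpperCase` is a hypothetical stand-in for CPU-heavy parsing of one file:

```typescript
import { Worker } from 'worker_threads';

// Worker body as plain JavaScript (workers load JS, not TS).
// `toUpperCase` stands in for CPU-heavy processing of one input.
const workerScript = `
  const { parentPort, workerData } = require('worker_threads');
  parentPort.postMessage(workerData.toUpperCase());
`;

function runInWorker(input: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerScript, { eval: true, workerData: input });
    worker.on('message', resolve);
    worker.on('error', reject);
  });
}

// Fan out one worker per input and wait for all of them, ExecutorService-style.
async function main(): Promise<string[]> {
  return Promise.all(['a', 'b', 'c'].map(runInWorker));
}

main().then(results => console.log(results)); // [ 'A', 'B', 'C' ]
```

Note that spawning one worker per file does not scale to large directories; a real implementation would use a fixed-size pool of workers.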

My guess is your parsing is more IO bound, and the overhead of spawning worker threads will outweigh any gains. But give it a go and it will be a learning experience. :)

Tarp answered 24/5, 2021 at 12:0 Comment(0)
It looks like your biggest problem is navigating the nested directory structure and keeping individual per-file and per-dir promises organized. My suggestion would be to do that in a simpler way.

Have a function that takes a directory path and returns a flat list of all files it can find, no matter how deep, in a manner similar to the find program. This function can be like this:

import * as fs from 'fs/promises'
import * as path from 'path'
    
async function fileList(dir: string): Promise<string[]> {
    let entries = await fs.readdir(dir, {withFileTypes: true})

    let files = entries
        .filter(e => e.isFile())
        .map(e => path.join(dir, e.name))
    
    let dirs = entries
        .filter(e => e.isDirectory())
        .map(e => path.join(dir, e.name))

    let subLists = await Promise.all(dirs.map(d => fileList(d)))

    return files.concat(subLists.flat())
}

Basically, obtain directory entries, find (sub)directories and iterate them recursively in parallel. Once the iteration is complete, flatten, merge and return the list.

Using this function, you can apply your heavy task to all files at once by simply using map + Promise.all:

let allFiles = await fileList(dir)

let results = await Promise.all(allFiles.map(f => heavyTask(f)))
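A quick self-contained check of fileList against a throwaway directory tree (built in the OS temp dir) might look like this; the file names and layout are arbitrary:

```typescript
import * as fs from 'fs/promises';
import * as os from 'os';
import * as path from 'path';

// fileList as defined in the answer above.
async function fileList(dir: string): Promise<string[]> {
  let entries = await fs.readdir(dir, { withFileTypes: true });
  let files = entries.filter(e => e.isFile()).map(e => path.join(dir, e.name));
  let dirs = entries.filter(e => e.isDirectory()).map(e => path.join(dir, e.name));
  let subLists = await Promise.all(dirs.map(d => fileList(d)));
  return files.concat(subLists.flat());
}

async function demo(): Promise<string[]> {
  // Build a small nested tree: root/a.txt and root/sub/b.txt.
  const root = await fs.mkdtemp(path.join(os.tmpdir(), 'filelist-'));
  await fs.mkdir(path.join(root, 'sub'));
  await fs.writeFile(path.join(root, 'a.txt'), 'a');
  await fs.writeFile(path.join(root, 'sub', 'b.txt'), 'b');

  const files = await fileList(root);
  return files.map(f => path.basename(f)).sort(); // [ 'a.txt', 'b.txt' ]
}

demo().then(names => console.log(names));
```

The nested file shows up in the flat list, confirming the recursion reaches arbitrary depth.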
Mitre answered 27/5, 2021 at 2:7 Comment(0)