Python script to search PII

I would like to write a script that can search a file system for Personally Identifiable Information (PII), such as card numbers, and report on what it finds. It should cover plain-text files as well as Excel (xls), Word, and PDF files.

Any starting tips or suggestions on which library to use are welcome.

I'd also like advice on an efficient way to scan large files for patterns such as credit card numbers.

Pharmaceutical answered 16/5, 2012 at 18:43 Comment(4)
How, pray tell, can something like this be used in an ethical manner? – Amazon
Well, it is ethical when you are working to protect the data. Unattended PII, especially card data, poses a greater risk, and standards like PCI DSS now require you to scan the environment and protect such data before it is misused. So my reasons are purely ethical. – Pharmaceutical
Too much imagination, guys. I have explained my intentions once and am not going to explain them again. If someone has something constructive to add here, she is welcome; otherwise, thank you very much, we don't need any more self-styled paranoid Captain Internet. So be positive or stay away. – Pharmaceutical
If I had something else in mind, I would have written it like this: "I would like to write a script which can search for and report on a specific search string in a file system. I would like to find it in txt as well as xls, Word and PDF files. Any starting tips or which lib to use are welcome. I'd also like advice on an efficient way to scan large files for certain patterns." – Pharmaceutical

Give piianalyzer a shot: https://pypi.python.org/pypi/piianalyzer/0.1.0

Alternatively, you can write your own scanner and build it on a collection of common regular expressions, such as https://github.com/madisonmay/CommonRegex
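
If you go the write-your-own route, a minimal sketch could look like the following. It uses only the standard `re` module rather than either library above, reads the input line by line so large files never have to fit in memory, and filters regex hits with the Luhn checksum to cut false positives. The names `CARD_CANDIDATE`, `luhn_valid` and `scan_stream` are made up for this example, and the sketch assumes the text has already been extracted from any xls, Word or PDF files.

```python
import re
from typing import Iterator, TextIO, Tuple

# Candidate card numbers: 13 to 19 digits, optionally separated by
# single spaces or hyphens, and not embedded in a longer digit run.
CARD_CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_stream(stream: TextIO) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, masked_number) for each likely card number.

    Iterating over the stream reads one line at a time, so memory use
    stays flat regardless of file size.
    """
    for line_no, line in enumerate(stream, start=1):
        for match in CARD_CANDIDATE.finditer(line):
            digits = re.sub(r"\D", "", match.group())
            if luhn_valid(digits):
                # Report only the last four digits so the report itself
                # does not become another copy of the card number.
                yield line_no, "*" * (len(digits) - 4) + digits[-4:]
```

Typical use would be `with open(path, errors="ignore") as fh: hits = list(scan_stream(fh))`. A number split across two lines would be missed by this line-based approach.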

Darlenedarline answered 17/10, 2015 at 2:11 Comment(0)

If you're working for a company, you could consider buying a packaged solution. One I've seen advertised is Nuix. Also, Oracle has an end-to-end solution for GDPR (the new EU privacy law), which includes the kind of functionality you describe. See http://www.oracle.com/technetwork/database/security/wp-security-dbsec-gdpr-3073228.pdf.

If you have the Oracle RDBMS, there is a package called CTXSYS (now called Oracle Text) which has amazing search capabilities across documents, including PDFs, the entire Office suite, and many more. CTXSYS is included in the regular license. If you're a home user, you can download Oracle server (the Express version is fine for this function).

If you're using regexes as suggested above, one simple approach would be to search for words that are capitalized in mid-sentence, but that only helps with documents (not so much with XLS, for example). You could also build a dictionary of common names (first/last names, streets, towns). The credit cards and SSNs should be readily regex-able.
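
As a rough illustration of those three ideas (an SSN regex, the capitalized-mid-sentence heuristic, and a name dictionary), here is a small sketch. Everything in it is an assumption made for the example: the `KNOWN_NAMES` set is a tiny stand-in for a real dictionary, the SSN pattern only covers the dashed `AAA-GG-SSSS` layout, and the capitalization check is deliberately naive.

```python
import re

# US SSN in the dashed AAA-GG-SSSS layout. Area 000, 666 and 900-999,
# group 00 and serial 0000 are never issued, so they are excluded.
SSN = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")

# A capitalized word preceded by a lowercase letter or comma plus a space,
# i.e. a word that is capitalized without starting a sentence.
MID_SENTENCE_CAP = re.compile(r"(?<=[a-z,] )[A-Z][a-z]+")

# Hypothetical stand-in for a real dictionary of first/last names.
KNOWN_NAMES = {"alice", "smith"}


def find_candidates(text: str) -> dict:
    """Return SSN hits, mid-sentence capitalized words, and dictionary names."""
    caps = MID_SENTENCE_CAP.findall(text)
    return {
        "ssn": SSN.findall(text),
        "capitalized": caps,
        "names": [word for word in caps if word.lower() in KNOWN_NAMES],
    }
```

The capitalization heuristic also picks up place names and product names, so it is better treated as a way to shortlist text for review than as a verdict.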

Erastian answered 5/11, 2017 at 14:4 Comment(0)

We are implementing a similar system that accepts data entry from dynamic forms and CSV imports. Fields are classified as list, numeric range, or free text, and the data ends up in one field in a DB table. The data is entered via a website and stored in SQL Server, and we scan the free-text entries to find PHI.

When a new import batch arrives, we fire off a command that adds the batch id to a RabbitMQ queue and flags all free-text fields in the batch as pending examination, which prevents them from being displayed or exported. Fields considered "safe", such as those generated from dropdowns or based on number ranges, are ready for export or display in charts straight away. Only free-text fields are locked temporarily.

A Python Windows service then pulls from the RabbitMQ queue, scans each text field for PHI, and flags it accordingly. If any fields look suspect, I get a report and check the entire text import batch manually. I am currently using spaCy for entity recognition, and aspects of Deduce to find other PHI types.

As the analysis is carried out asynchronously, I am able to put the data through multiple scan approaches without impacting performance.
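
To make the lock, scan, flag flow concrete, here is a simplified in-process sketch. It is not our production code: `queue.Queue` stands in for RabbitMQ, a plain dict stands in for the SQL Server table, and the `PLACEHOLDER_PHI` regex stands in for the spaCy and Deduce checks. All names are invented for the illustration.

```python
import queue
import re
from dataclasses import dataclass
from typing import Dict, List

# Placeholder detector (SSN-like numbers and email addresses). The real
# system uses spaCy and Deduce here; this regex is only a stand-in.
PLACEHOLDER_PHI = re.compile(r"\b\d{3}-\d{2}-\d{4}\b|\b[\w.+-]+@[\w-]+\.[\w.]+\b")


@dataclass
class FreeTextField:
    batch_id: int
    text: str
    status: str = "pending"  # pending -> clean | suspect


def submit_batch(batch_id: int, texts: List[str],
                 store: Dict[int, List[FreeTextField]], jobs: queue.Queue) -> None:
    """Store the free-text fields as pending and enqueue the batch id."""
    store[batch_id] = [FreeTextField(batch_id, t) for t in texts]
    jobs.put(batch_id)


def worker_drain(store: Dict[int, List[FreeTextField]], jobs: queue.Queue) -> List[int]:
    """Scan every queued batch; return ids of batches needing manual review."""
    suspect_batches = []
    while True:
        try:
            batch_id = jobs.get_nowait()
        except queue.Empty:
            break
        for fld in store[batch_id]:
            fld.status = "suspect" if PLACEHOLDER_PHI.search(fld.text) else "clean"
        if any(fld.status == "suspect" for fld in store[batch_id]):
            suspect_batches.append(batch_id)
    return suspect_batches


def exportable(store: Dict[int, List[FreeTextField]]) -> List[str]:
    """Only fields that were scanned and found clean may be shown or exported."""
    return [fld.text for fields in store.values() for fld in fields
            if fld.status == "clean"]
```

Because pending fields are excluded from `exportable`, the scan can take as long as it needs, and further detectors can be added to the worker without slowing down the import itself.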

Johnathanjohnathon answered 9/6, 2018 at 9:52 Comment(0)
