So the context is this: a zip file is uploaded into a web service and Python then needs to extract it and analyze and deal with each file within. In this particular application, what it does is look at each file's individual name and size, compare that to what has already been uploaded in AWS S3, and if the file is believed to be different or new, upload it to AWS S3.

The challenge is that these zip files that come in are huge. The average is 560MB but some are as much as 1GB. Within them, there are mostly plain text files, but there are some binary files in there too that are huge. It's not unusual that each zip file contains 100 files and that 1-3 of those make up 95% of the zip file size.

At first I tried unzipping the file, in memory, and dealing with one file at a time. That failed spectacularly with various memory explosions and EC2 running out of memory. First you have the 1GB file in RAM, then you unzip each file and now you have possibly 2-3GB all in memory.

So the solution, after much testing, was to dump the zip file to disk (in a temporary directory in /tmp) and then iterate over the files. This worked much better, but I still noticed the whole unzipping was taking up a huge amount of time. Is there perhaps a way to optimize that?

Baseline function

First, these are the common functions that simulate actually doing something with the files in the zip file:

```python
import concurrent.futures
import os
import zipfile

# _count_file(fn) opens the extracted file and "does something" with it;
# its definition is not included in this excerpt.


def unzip_member_f3(zip_filepath, filename, dest):
    with open(zip_filepath, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extract(filename, dest)
    fn = os.path.join(dest, filename)
    return _count_file(fn)


def f3(fn, dest):
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        futures = []
        with concurrent.futures.ProcessPoolExecutor() as executor:
            for member in zf.infolist():
                futures.append(
                    executor.submit(
                        unzip_member_f3,
                        fn,
                        member.filename,
                        dest,
                    )
                )
            total = 0
            for future in concurrent.futures.as_completed(futures):
                total += future.result()
    return total
```

The problem with using a pool of processors is that it requires that the original zip file exists on disk. So in my web server, to use this solution, I'd first have to save the in-memory ZIP file to disk, then invoke this function. I'm not sure what the cost of that is, but it's not likely to be cheap. But remember: this optimization depends on using up as many CPUs as it possibly can.

Conclusion

Doing it serially turns out to be quite nice. You're bound to one CPU but the performance is still pretty good. Also, just look at the difference in the code between f1 and f2! With the concurrent.futures pool classes you can cap the number of CPUs it's allowed to use, but that doesn't feel great either. Since there are other things going on in this server, I'm not sure I'm willing to let one process take over all the other CPUs. What if some of those other CPUs are needed for something else going on in gunicorn? Those other processes would have to patiently wait till there's a CPU available. Perhaps it could be worth it if the extraction was significantly faster.
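The serial version compared against the pool version above isn't shown in this excerpt. A minimal sketch of what an `f1`-style serial extractor might look like — the function name, and the `_count_file` helper that simulates doing work on each extracted file, are assumptions here, not code from the original post:

```python
import os
import zipfile


def _count_file(fn):
    # Hypothetical helper: sum the byte length of every line in the file,
    # simulating "doing something" with each extracted member.
    total = 0
    with open(fn, 'rb') as f:
        for line in f:
            total += len(line)
    return total


def f1(fn, dest):
    # Serial baseline: extract every member one after another on one CPU,
    # then process each extracted file in turn.
    with open(fn, 'rb') as f:
        zf = zipfile.ZipFile(f)
        zf.extractall(dest)
    total = 0
    for root, _dirs, files in os.walk(dest):
        for filename in files:
            total += _count_file(os.path.join(root, filename))
    return total
```

The appeal of this shape is exactly what the conclusion describes: it's short, has no inter-process plumbing, and never needs the archive re-opened per worker.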
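The pool-based approach requires the uploaded archive to exist on disk so each worker process can re-open it by path. A sketch of that preliminary step, assuming the upload is available as a bytes object — the function name here is illustrative, not from the original post:

```python
import os
import tempfile


def save_payload_to_disk(payload):
    # Write the uploaded ZIP bytes to a temporary file and return its path,
    # so a process pool can re-open the archive by filename in each worker.
    fd, path = tempfile.mkstemp(suffix='.zip')
    with os.fdopen(fd, 'wb') as f:
        f.write(payload)
    return path
```

This is the extra cost the post worries about: a full write of up to ~1GB to disk before any parallel extraction can even begin.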