File Name Encoding Problems in File Compression

  • Problem: especially across platforms, file names are encoded differently by different compression tools, such as Zip, Gz, Bz2.
    • Solution 1: use tar to create the archive
    • Solution 2, on Linux: unzip -O CP936 non_english_name.zip (CP936 is the GBK Chinese code page; GB18030 is its superset)
    • Solution 3, using Java: jar xvf non_english_name.zip
  • Problem: when zip or WinRAR extracts an archive with non-English file names, it sometimes requires changing the system locale language and rebooting before the names are extracted correctly.
    • Solution 1 (Windows method): use the ready-to-use prebuilt zip and unzip tools from the DotNetZip library, which support options for the encoding and decoding code page
      Unzip.exe -cp 936 chinese_name_content.zip
    • Solution 2 (cross-platform), using Python (sometimes it works, sometimes it fails with an encoding error):
      python xZip.py non_english.zip decode_language

      (decode_language is a codec name, such as gbk for Chinese; for the list of codec names see https://docs.python.org/2/library/codecs.html )

      • Here is the Python 2 code for xZip.py:
        xZip.py
        # Full list of codecs: https://docs.python.org/2/library/codecs.html
        # Notes:
        # - Command-line arguments arrive in the system default locale encoding.
        # - Each file path stored in the zip is decoded to unicode with the given codec.
        # - Printing those unicode paths in a Windows command window may raise an
        #   error when the system default locale codec cannot represent the characters.
        import zipfile   
        import os.path   
        import os
        import sys
         
        class ZFile(object):   
            def __init__(self, filename, mode='r', basedir=''):   
                self.filename = filename   
                self.mode = mode   
                if self.mode in ('w', 'a'):   
                    self.zfile = zipfile.ZipFile(filename, self.mode, compression=zipfile.ZIP_DEFLATED)   
                else:   
                    self.zfile = zipfile.ZipFile(filename, self.mode)   
                self.basedir = basedir   
                if not self.basedir:   
                    self.basedir = os.path.dirname(filename)   
         
            def addfile(self, path, arcname=None):   
                path = path.replace('//', '/')   
                if not arcname:   
                    if path.startswith(self.basedir):   
                        arcname = path[len(self.basedir):]   
                    else:   
                        arcname = ''   
                self.zfile.write(path, arcname)   
         
            def addfiles(self, paths):   
                for path in paths:   
                    if isinstance(path, tuple):   
                        self.addfile(*path)   
                    else:   
                        self.addfile(path)   
         
            def close(self):   
                self.zfile.close()   
         
            def extract_to(self, path, decode):   
                for p in self.zfile.namelist():   
                    self.extract(p, path, decode)   
         
            def extract(self, filename, path, decode):   
                if not filename.endswith('/'):   
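                    # zipfile already returns unicode for entries flagged as UTF-8;
                    # only byte-string names need decoding (gbk, gb18030, gb2312, utf-8, ...)
                    name = filename.decode(decode) if isinstance(filename, str) else filename
                    f = os.path.join(path, name)
                    dirname = os.path.dirname(f)
                    if dirname and not os.path.exists(dirname):
                        os.makedirs(dirname)
                    # open() with a context manager closes the output file promptly
                    with open(f, 'wb') as out:
                        out.write(self.zfile.read(filename))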
         
         
        def create(zfile, files):   
            z = ZFile(zfile, 'w')   
            z.addfiles(files)   
            z.close()   
         
        def extract(zfile, path, decode):   
            z = ZFile(zfile)   
            z.extract_to(path, decode)   
            z.close() 
         
        if __name__=="__main__":
            extract(unicode(sys.argv[1]), u'.', sys.argv[2])
  • Alternative solution: extract normally, accepting the wrongly encoded names, then fix those names using Python's encode and decode
  • Side notes:
    • In the Windows command prompt, chcp changes the active code page, which affects how file names are displayed (see the code page list)
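
The alternative solution above (extract normally, then repair the names) can be sketched in Python 3 as follows. This is a minimal sketch, not part of xZip.py: repair_name and repair_tree are hypothetical helper names. It assumes the wrong names came from decoding the stored bytes as CP437, which is what Python 3's zipfile module does for entries not flagged as UTF-8. Other tools may assume a different code page, in which case assumed_encoding needs to change.

```python
import os


def repair_name(garbled, real_encoding, assumed_encoding="cp437"):
    """Undo a wrong decode.

    Re-encode the name with the codec that was wrongly assumed to get the
    original bytes back, then decode those bytes with the real codec.
    """
    return garbled.encode(assumed_encoding).decode(real_encoding)


def repair_tree(root, real_encoding, assumed_encoding="cp437"):
    """Rename every file and directory under root whose name can be repaired."""
    # Walk bottom-up so children are renamed before their parent directory.
    for parent, dirs, files in os.walk(root, topdown=False):
        for name in dirs + files:
            try:
                fixed = repair_name(name, real_encoding, assumed_encoding)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue  # name was not garbled this way; leave it alone
            if fixed != name:
                os.rename(os.path.join(parent, name),
                          os.path.join(parent, fixed))
```

For example, after extracting a GBK-named archive into a directory called out (a placeholder name), repair_tree("out", "gbk") would rename the garbled entries in place. Pure ASCII names pass through unchanged, because ASCII bytes mean the same thing in both code pages.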

Common Problems with Compressed Files and Solutions

  • Problem: WinRAR has updated its version recently, and older WinRAR cannot open some archives created by the new WinRAR.