development

Python에서 파일 해싱

big-blog 2020. 11. 23. 19:36
반응형

Python에서 파일 해싱


나는 파이썬이 EOF를 읽어서 sha1이든 md5이든 적절한 해시를 얻을 수 있기를 원합니다. 도와주세요. 지금까지 내가 가진 것은 다음과 같습니다.

import hashlib

inputFile = raw_input("Enter the name of the file:")
openedFile = open(inputFile)
readFile = openedFile.read()

md5Hash = hashlib.md5(readFile)
md5Hashed = md5Hash.hexdigest()

sha1Hash = hashlib.sha1(readFile)
sha1Hashed = sha1Hash.hexdigest()

print "File Name: %s" % inputFile
print "MD5: %r" % md5Hashed
print "SHA1: %r" % sha1Hashed

TL; DR은 버퍼를 사용하여 메모리를 많이 사용하지 않습니다.

우리는 매우 큰 파일 작업의 메모리 의미를 고려할 때 문제의 핵심에 도달 합니다 . 우리는이 나쁜 소년이 2 기가 바이트 파일을 위해 2 기가 바이트의 램을 뒤 흔드는 것을 원하지 않습니다. 그래서 pasztorpisti가 지적했듯이 우리는 그 큰 파일을 덩어리로 처리해야합니다!

import sys
import hashlib

# BUF_SIZE is totally arbitrary, change for your app!
BUF_SIZE = 65536  # lets read stuff in 64kb chunks!

md5 = hashlib.md5()
sha1 = hashlib.sha1()

with open(sys.argv[1], 'rb') as f:
    while True:
        data = f.read(BUF_SIZE)
        if not data:
            break
        md5.update(data)
        sha1.update(data)

print("MD5: {0}".format(md5.hexdigest()))
print("SHA1: {0}".format(sha1.hexdigest()))

우리가 한 것은 hashlib의 편리한 멋쟁이 업데이트 방법 과 함께 진행하면서이 나쁜 소년의 해시를 64kb 청크로 업데이트하는 것 입니다. 이렇게하면 한 번에 해시하는 데 걸리는 2GB보다 훨씬 적은 메모리를 사용합니다!

다음과 같이 테스트 할 수 있습니다.

$ mkfile 2g bigfile
$ python hashes.py bigfile
MD5: a981130cf2b7e09f4686dc273cf7187e
SHA1: 91d50642dd930e9542c39d36f0516d45f4e1af0d
$ md5 bigfile
MD5 (bigfile) = a981130cf2b7e09f4686dc273cf7187e
$ shasum bigfile
91d50642dd930e9542c39d36f0516d45f4e1af0d  bigfile

도움이 되었기를 바랍니다.

Also all of this is outlined in the linked question on the right hand side: Get MD5 hash of big files in Python


Addendum!

In general when writing python it helps to get into the habit of following pep-8. For example, in python variables are typically underscore separated not camelCased. But that's just style and no one really cares about those things except people who have to read bad style... which might be you reading this code years from now.


For the correct and efficient computation of the hash value of a file (in Python 3):

  • Open the file in binary mode (i.e. add 'b' to the filemode) to avoid character encoding and line-ending conversion issues.
  • Don't read the complete file into memory, since that is a waste of memory. Instead, sequentially read it block by block and update the hash for each block.
  • Eliminate double buffering, i.e. don't use buffered IO, because we already use an optimal block size.
  • Use readinto() to avoid buffer churning.

Example:

import hashlib

def sha256sum(filename):
    h  = hashlib.sha256()
    b  = bytearray(128*1024)
    mv = memoryview(b)
    with open(filename, 'rb', buffering=0) as f:
        for n in iter(lambda : f.readinto(mv), 0):
            h.update(mv[:n])
    return h.hexdigest()

I would propose simply:

def get_digest(file_path):
    h = hashlib.sha256()

    with open(file_path, 'rb') as file:
        while True:
            # Reading is buffered, so we can read smaller chunks.
            chunk = file.read(h.block_size)
            if not chunk:
                break
            h.update(chunk)

    return h.hexdigest()

All other answers here seem to complicate too much. Python is already buffering when reading (in ideal manner, or you configure that buffering if you have more information about underlying storage) and so it is better to read in chunks the hash function finds ideal which makes it faster or at lest less CPU intensive to compute the hash function. So instead of disabling buffering and trying to emulate it yourself, you use Python buffering and control what you should be controlling: what the consumer of your data finds ideal, hash block size.


I have programmed a module wich is able to hash big files with different algorithms.

pip3 install py_essentials

Use the module like this:

from py_essentials import hashing as hs
hash = hs.fileChecksum("path/to/the/file.txt", "sha256")

import hashlib
user = input("Enter ")
h = hashlib.md5(user.encode())
h2 = h.hexdigest()
with open("encrypted.txt","w") as e:
    print(h2,file=e)


with open("encrypted.txt","r") as e:
    p = e.readline().strip()
    print(p)

참고URL : https://stackoverflow.com/questions/22058048/hashing-a-file-in-python

반응형