KEMBAR78
Rework cache to key on hash of file contents instead of mtime by pkch · Pull Request #3437 · python/mypy · GitHub
Skip to content

Conversation

@pkch
Copy link
Contributor

@pkch pkch commented May 24, 2017

Attempt to fix #3403

At present, I'm only adding the ability to validate the cache using a hash even though the module file mtime doesn't match the meta mtime.

However, I'm NOT getting rid of the use of data_mtime (modified time of the cache file) for the purpose of finding dependencies that tell us that a given module needs to be reparsed. This is much more complex than I thought, and it may also be possible to work around that by using @JukkaL suggestion of putting the enitre cache into a tarball to preserve cache mtimes.

To verify that the approach works, I first create a failing test by changing the tests runner to touch all the source files before running mypy for incremental tests. Then in the second commit, I fix the broken tests.

How this whole thing should be tested is something that I'd like feedback on (running the tests twice is too time consuming I think?). As it stands, this PR only tests the new approach, and loses the tests for the old approach (with mtime/size). Maybe some kind of combination of the two would be good?

@pkch pkch changed the title [WIP] [WIP] Rework cache to key on hash of file contents instead of mtime May 24, 2017
@pkch pkch force-pushed the movable-cache branch from b373296 to 9242830 Compare May 24, 2017 10:35
@pkch pkch force-pushed the movable-cache branch from ee319c6 to 1a3e059 Compare May 24, 2017 17:56
@pkch pkch changed the title [WIP] Rework cache to key on hash of file contents instead of mtime Rework cache to key on hash of file contents instead of mtime May 25, 2017
@pkch
Copy link
Contributor Author

pkch commented May 25, 2017

To summarize: this PR eliminates the reliance on mtime (modified time of source files) but leaves untouched the reliance on data_mtime (modified time of cache files).

mypy/build.py Outdated

def compute_module_hash(path: str) -> str:
with open(path, 'r') as f:
return compute_hash(f.read())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's pretty inefficient -- we read the source code in text mode (using the default encoding) and then compute_hash() encodes it back to bytes (using utf-8) before hashing.

I also believe that we have the module source code already as an attribute in State in most cases so there's not even a need to read it from the file (it will be GC'ed by parse_file()).

Also the source should probably be read using read_with_python_encoding() which does a few tricks.

Finally we don't need this hash to be compatible with the interface hash (which is why compute_hash() exists currently).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then perhaps we should hash the binary, and not even bother with decoding to a string in case when we need to read the file from disk?

And yes, sometimes .source is already populated; but it's populated with a string. So I'm plannnig to split read_with_python_encoding into reading the binary, and then decoding, so that in between the two steps we can calculate and store the hash.

If that's too troublesome, I can instead standardize the hash to always use the string.

Any preferences?

@pkch
Copy link
Contributor Author

pkch commented May 25, 2017

@gvanrossum I added the optimization along the lines you suggested. Now whenever the source code is parsed, we also calculate the source hash (based on the byte representation). I don't think there's any need to ever calculate the source hash separately from parsing: we only need the source hash inside write_cache, and by then we of course must have parsed the source.

@pkch pkch force-pushed the movable-cache branch from 2035e3d to de0a794 Compare May 25, 2017 08:47
@pkch
Copy link
Contributor Author

pkch commented May 25, 2017

Actually, we need to calculate source hash in two places: once when parsing the source file, and once when checking if the cache is in sync with the source. In the first case, we can do it without an extra file read; in the second, we do need to read the file again (assuming mtime/size didn't give us the positive answer).

The new commit implements this optimization.

@pkch pkch force-pushed the movable-cache branch from b94f880 to e37ae6a Compare May 25, 2017 10:19
@gvanrossum gvanrossum self-assigned this Jun 5, 2017
Copy link
Member

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this! But there are a few issues still... Let me know if you have time to work on those, else I will take over the development of this PR.

mypy/build.py Outdated
path = manager.maybe_swap_for_shadow_path(self.path)
source = read_with_python_encoding(path, self.options.python_version)
source, self.source_hash = read_with_python_encoding(path,
self.options.python_version)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix indent.

mypy/build.py Outdated
except LookupError as lookuperr:
raise DecodeError(str(lookuperr))
return source_bytearray.decode(encoding)
return source_bytearray.decode(encoding), hashlib.md5(source_bytearray).hexdigest()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect if a BOM marker is present, since on line 705 above the BOM marker is removed from source_bytearray.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in c115f13. I'm not sure why we make decision about encoding and BOM after just 2 lines rather than after reading the entire file; is it just because source_bytearray = source_bytearray[3:] is expensive when source_bytearray contains the entire file? Anyway, I assume that's the reason, so to avoid calling f.read() twice, I use hashlib.md5.update().

if st.st_mtime != meta.mtime or st.st_size != meta.size:
manager.log('Metadata abandoned for {}: file {} is modified'.format(id, path))
return False
with open(path, 'rb') as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You only need to compare the hash when the sizes are equal.

There's another subtle issue here: if the mtime differs but the size and hash match, we don't rewrite the meta.json file (AFAICT), so that means that from then on we always hash the file. Now, the hashing seems so fast that this barely matters, but I want to use this for a huge codebase, so I'm still worried about this (else why bother with the mtime check).

Also I'd like to see a log message if the mtime differs, but the size and hash are the same (this helps validating that it works).

Copy link
Contributor Author

@pkch pkch Jun 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You only need to compare the hash when the sizes are equal.

Ah right, I left it as a placeholder but then forgot about it. Originally, I was thinking that maybe we should use a slightly more intelligent hash that ignores comments and indentation non-semantic whitespace (in which case size change doesn't necessarily imply that the file is invalid). Do you think it's worth doing that, or I should just leave the simple hash in place?

Edit: given that interface hash is not too expensive and will take care of simple modifications in the file; and given that removing comments is not super fast (need to use tokenize module); I suppose the "smart hash" isn't worth it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's another subtle issue here: if the mtime differs but the size and hash match, we don't rewrite the meta.json file (AFAICT), so that means that from then on we always hash the file.

Ah that is very subtle. We should either completely remove mtime from the cache logic, or we should update it in this scenario. I prefer the latter because writing meta.json is faster than reading and hashing the source file, and I'm guessing on average we can expect at least one additional read of the source file before it's completely invalidated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attempted to do the mtime optimization in 52a5888.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW "smart hash" sounds like a terrible idea, since it essentially comes down to parsing. People are used to the concept of hashing the contents of a file, and understand it to be some hash of the bytes.

@pkch
Copy link
Contributor Author

pkch commented Jun 10, 2017

Yup, I have time, will work on it this weekend ~

@gvanrossum
Copy link
Member

gvanrossum commented Jun 10, 2017 via email

Copy link
Member

@gvanrossum gvanrossum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there! I really just have very small refactoring wishes, and they are optional.

if st.st_mtime != meta.mtime or st.st_size != meta.size:
manager.log('Metadata abandoned for {}: file {} is modified'.format(id, path))
return False
with open(path, 'rb') as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW "smart hash" sounds like a terrible idea, since it essentially comes down to parsing. People are used to the concept of hashing the contents of a file, and understand it to be some hash of the bytes.

mypy/build.py Outdated
return hashlib.md5(text.encode('utf-8')).hexdigest()


def atomic_write(filename: str, s: str) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more consistent if atomic_write() preceded the first function that uses it (i.e. before validate_meta()). That would mean random_string() also needs to move up there. It's optional to move these.

mypy/build.py Outdated
# This requires two steps. The first is obvious: we check that the module source file
# contents is the same as it was when the cache file was created. The second is not
# obvious: we need to check that the dependencies we relied on when creating that
# cache file have not changed. We use cache file mtime as a way to propagate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache file --> cache data file (because the rest of this function is about the cache meta file).

mypy/build.py Outdated
else:
manager.log('Metadata ok for {}: file {} (match on size, hash)'.format(id, path))
# Optimization: update meta.mtime (otherwise, this mismatch will not disappear).
meta = meta._replace(mtime = st.st_mtime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whitespace nit: no spaces around = for keyword args.

mypy/build.py Outdated
else:
meta_str = json.dumps(meta)
meta_json, _ = get_cache_names(id, os.path.abspath(path), manager)
manager.log('Updating mtime {} {} {} {}'.format(id, path, meta_json, meta.mtime))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update this to follow the format of other similar messages, e.g. "Updating mtime for {}: file {}, meta {}, mtime {}" ? Or use trace() so it only shows up for double -v.

mypy/build.py Outdated
os.replace(data_json_tmp, data_json)
data_mtime = os.path.getmtime(data_json)
except os.error as err:
data_str += '\n' # Fast in CPython (it does this in-place if len(data_str) under 10^6)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice bit of research! Alternatively you could make atomic_write() varargs; it could use f.writelines(*args)`. Though in our internal code base, there's only one cache file greater than 1M, so changing that is totally optional.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored that, because I found out that it's very unpredictable when this optimization kicks in (what I thought was a ~10**6 cutoff turned out to be OS-dependent zone where realloc starts to fail more and more often).

mypy/build.py Outdated
'''
# This requires two steps. The first is obvious: we check that the module source file
# contents is the same as it was when the cache file was created. The second is not
# obvious: we need to check that the dependencies we relied on when creating that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this big comment, but there is still some confusion possible on this line: the second step is not checking the dependencies (which happens after this function returns), but checking data_mtime.

@pkch pkch force-pushed the movable-cache branch 2 times, most recently from 395e156 to 52a5888 Compare June 13, 2017 03:31
@gvanrossum gvanrossum merged commit cbeaeb4 into python:master Jun 13, 2017
@gvanrossum
Copy link
Member

Thanks! I'm going to deploy this ASAP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rework cache to key on hash of file contents instead of mtime

2 participants