KEMBAR78
feat: add blob.uploadfrom(inputstream) by dmitry-fa · Pull Request #162 · googleapis/java-storage · GitHub
Skip to content

Conversation

@dmitry-fa
Copy link
Contributor

@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Mar 4, 2020
@codecov
Copy link

codecov bot commented Mar 4, 2020

Codecov Report

Merging #162 into master will increase coverage by 0.27%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     googleapis/java-core#162      +/-   ##
============================================
+ Coverage     63.32%   63.60%   +0.27%     
- Complexity      529      543      +14     
============================================
  Files            30       30              
  Lines          4723     4770      +47     
  Branches        451      428      -23     
============================================
+ Hits           2991     3034      +43     
- Misses         1571     1576       +5     
+ Partials        161      160       -1     
Impacted Files Coverage Δ Complexity Δ
...e/src/main/java/com/google/cloud/storage/Blob.java 83.95% <100.00%> (+1.60%) 35.00 <6.00> (+5.00)
...in/java/com/google/cloud/storage/PolicyHelper.java 64.86% <0.00%> (-30.38%) 5.00% <0.00%> (-2.00%)
...in/java/com/google/cloud/storage/StorageBatch.java 88.00% <0.00%> (-4.00%) 13.00% <0.00%> (ø%)
...gle/cloud/storage/testing/RemoteStorageHelper.java 64.46% <0.00%> (-0.58%) 9.00% <0.00%> (ø%)
...main/java/com/google/cloud/storage/BucketInfo.java 80.85% <0.00%> (-0.20%) 84.00% <0.00%> (ø%)
...om/google/cloud/storage/spi/v1/HttpStorageRpc.java 1.63% <0.00%> (-0.01%) 1.00% <0.00%> (ø%)
...rc/main/java/com/google/cloud/storage/Storage.java 80.57% <0.00%> (+0.46%) 0.00% <0.00%> (ø%)
...va/com/google/cloud/storage/spi/v1/StorageRpc.java 64.86% <0.00%> (+0.48%) 0.00% <0.00%> (ø%)
...n/java/com/google/cloud/storage/SignatureInfo.java 81.50% <0.00%> (+0.81%) 19.00% <0.00%> (+2.00%)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 53abe4a...53463e3. Read the comment docs.

@dmitry-fa
Copy link
Contributor Author

This is the first part of the feature.

I'm going to add similar functionality to the Storage interface as a separate pull request:

void upload(BlobInfo blobInfo, Path path, BlobWriteOption... options);
void upload(BlobInfo blobInfo, InputStream content, BlobWriteOption... options);
public Blob create(BlobInfo blobInfo, Path path, BlobWriteOption... options);

try (InputStream input = Files.newInputStream(path)) {
return uploadFrom(input, options);
} catch (IOException e) {
throw new StorageException(e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IOException should be thrown directly here. It should not be converted to a runtime exception (and StorageException should probably not be a runtime exception)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking at this PR.
I agree with all your comments, except ones related to IOException handling.
There are many places in the storage code where it is wraping to StorageException, that allows user to have a single handler for all kind of exceptions. I'm not sure that IOException deserves a dedicated handler.

@frankyn recommended extending Storage interface with upload functionality. I'm going to implement this in a separate PR.
Storage already has create(BlobInfo, InputStream) method which doesn't throw IOException, so adding create(BlobInfo, Path) which throws IOException in addition to StorageException will not look logically. I'm afraid people will write their own utility methods to wrap unnecessary exceptions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StorageException is a huge mistake. It should never have been a runtime exception. Consistency is not a sufficient reason to propagate this broken design to new code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Throwing IOException for a few methods will not fix the problem with existing design but make it only worse.
I agree, BaseServiceException should be checked for better design. Does it mean that we cannot extend functionality unless https://github.com/googleapis/java-core/issues/57 is fixed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't fix the existing design, but it won't make the problem worse. Adding new methods that also make this mistake will make the problem worse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

/**
* Uploads the given content to the storage using specified write channel and the given
* buffer size. The default buffer size is 15 MiB. Larger buffer sizes may improve the upload
* performance but require more memory. It could cause OutOfMemoryError or add significant garbage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to give the user the option to set the buffer size? Do we provide sufficient tools and documentation to enable someone to do this?

It would be better to simply pick the best buffer size directly. If we're uncertain, lets leave this method out until we are, since it's easier to add it in later than take it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we really need to give the user the option to play with the buffer size. There is no optimal value for the buffer size. By default, the buffer size is equal to the default Write Channel chunk size (15 MiB), but for uploading large files increasing the buffer will give significant performance benefit. Some statistics of uploading 1.5GB file with various buffer sizes:

15MiB:  162 seconds
30MiB:  149 seconds
75MiB:  141 seconds
100MiB: 140 seconds
150MiB: 139 seconds
300MiB: 138 seconds

On the other hand, one may need to upload hundreds file in parallel, this may require reducing the buffer size to avoid running out of memory. So, giving such flexibility will be certainly a plus.

Do we provide sufficient tools and documentation to enable someone to do this?

I'm afraid no. There was a request to provide a sort of such documentation: #40 which is closed as duplicate of the current one (add Blob.uploadFrom())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case this PR needs to include such documentation. I've seen from experience that tunable I/O buffer size parameters are very effective foot-seeking missiles. I'd estimate that somewhere above 80% of invocations of this method will decrease performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll add notice, that this method is for those who want to experiment with the upload time and the amount of the consumed memory.

* collection overhead. Buffer sizes which are less than 256 KiB are not allowed, they will be
* treated as 256 KiB.
*
* <p>This method does not close either InputStream or WriterChannel</p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These details can be moved to the previous method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous one is intentionally made package private, it will be invoked from the StorageImpl class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about the difference between upload and uploadFrom, so probably users will be too. Stocking with the public methods, I don't see any particular difference between the names upload and uploadFrom. I suggest making both upload.

Also, why does one return a blob and one doesn't. Both should or neither.

Since I am confused, you might have good reasons for this distinction, but in that case let's make sure the docs and the method names make these reasons obvious.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning a Blob instance after upload has completed requires an extra GET request which is not always needed. So, upload works faster than uploadFrom, especially on small files.
I'm going to document this difference in the new method: Storage.upload(BlobInfo, Path)

uploadFrom is named after downloadTo (as requested in the issue).
upload is static and its name is different. Would it be more clear to use writeContnet instead?

/**
  * Writes the given content to the storage using specified write channel and the given
  * buffer size.
  */
public static void Blob.writeContent(InputStream content, WriterChannel writer, int bufferSize)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure.The distinction between the names uploadFrom and upload seems very hazy. Honestly, if I had to guess I'd expect uploadFrom was static rather than the other way around.

Do we actually need both of these? Can we delete the one that makes an extra GET request?

Consider this case:

  1. First request to upload a blob succeeds.
  2. GET request fails.
  3. Exception thrown

It won't be obvious to the user that the content was successfully upload to the server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a good example, quite rare but still valid.
To inform user of successful upload I re-throw an exception with the attendant message.

Do we actually need both of these? Can we delete the one that makes an extra GET request?

Yes, I believe we need both. If blob.uploadFrom() returns void, then blob instance will become useless right after upload. As #149 demonstrates, it's not obvious how to get the updated blob object, so returning the updated updated would be helpful.

* collection overhead. Buffer sizes which are less than 256 KiB are not allowed, they will be
* treated as 256 KiB.
*
* <p>This method does not close either InputStream or WriterChannel</p>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure.The distinction between the names uploadFrom and upload seems very hazy. Honestly, if I had to guess I'd expect uploadFrom was static rather than the other way around.

Do we actually need both of these? Can we delete the one that makes an extra GET request?

Consider this case:

  1. First request to upload a blob succeeds.
  2. GET request fails.
  3. Exception thrown

It won't be obvious to the user that the content was successfully upload to the server.

}

/**
* Uploads the given file path to this blob using specified blob write options.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably just don't understand how this API works, but this sounds off to me. I'd expect a static method that uploads the path to a new Blob and returns it. What does it mean to "upload to this blob"? I'm missing something here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The storage hierarchy is the following:
Class BlobId: A BlobId object includes the name of the containing bucket, the blob's name and possibly the blob's generation. If getGeneration() is null the identifier refers to the latest blob's generation.

Class BlobInfo: encapsulate BlobId and set of properties.

Class Blob extends BlobInfo and adds a number of methods to modify the blob like update, copyTo, downloadTo, etc. Objects of this class are immutable. Operations that modify the blob return a new object.

New Blob.uploadFrom() methods will allow those who already have a blob instance to upload some content to it as simply as blob = blob.uploadFrom(myFile);

I'd expect a static method that uploads the path to a new Blob and returns it.

A static method would require an extra parameter storage, which is not very convenient.
As I said previously I'm going to extend the Storage interface with new methods:

void upload(BlobInfo blobInfo, Path path, BlobWriteOption... options);
void upload(BlobInfo blobInfo, InputStream content, BlobWriteOption... options);
public Blob create(BlobInfo blobInfo, Path path, BlobWriteOption... options);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what's throwing me: the object is immutable but has methods to modify it? Why?

This seems to be the existing API, but it's still extremely wonky. After you call uploadFrom (or downloadTo) how does the new Blob bject returned differ from the original Blob object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can consider a Blob instance like a pointer to {bucket, name, generation} and methods to operate with that pointer. Each modification of the blob causes the generation change, so existing pointer becomes outdated and the Blob instance useless(Immutability does not allow to modify the generation in place). A new object returned by uploadFrom has the new generation value pointing to the latest blob version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if the docs said that, or the classes were named accurately. but c'est la vie.

#176

My immediate question is what exactly has changed in the object? What field has been modified and how?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new Blob objects are constructed from data received from GCS.
In case of uploadFrom only 'generation' and 'size' will be changed. (These fields are inherited from the BlobInfo class.)
uploadFrom uses WriteChannel, so it has to perform a separate GET request to obtain the fresh data.
upload method performs the RPC request and obtain the data to construct new object from the response.

}
}

static void uploadFrom(InputStream input, WriteChannel writer) throws IOException {
Copy link
Contributor

@elharo elharo Mar 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method will be invoked from the StorageImpl class

@googlebot
Copy link

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: no This human has *not* signed the Contributor License Agreement. and removed cla: yes This human has signed the Contributor License Agreement. labels Mar 12, 2020
@googlebot
Copy link

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: yes This human has signed the Contributor License Agreement. and removed cla: no This human has *not* signed the Contributor License Agreement. labels Mar 13, 2020
@dmitry-fa dmitry-fa merged commit 1f53baa into googleapis:master Mar 15, 2020
frankyn added a commit that referenced this pull request Mar 17, 2020
gcf-merge-on-green bot pushed a commit that referenced this pull request Mar 17, 2020
Reverts #162

Need to revert this for now. The issue is that the [Blob constructor is private](https://github.com/googleapis/java-storage/blob/1f53baa969331a94b5a73319df59711157ef2307/google-cloud-storage/src/main/java/com/google/cloud/storage/Blob.java#L545) and it requires creating the object as an [empty object before uploading with an InputStream](https://github.com/googleapis/java-storage/pull/162/files#diff-8c18a40b13cc5c5d90547c273470f5eeR2965-R2966).

The implementation should be derived from the Storage class instead of Blob class to reduce the number of requests.

cc: @dmitry-fa, could you recreate the PR? I haven't prioritized this, apologies for the delay.
@dmitry-fa dmitry-fa deleted the uploadBlob branch June 26, 2020 07:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes This human has signed the Contributor License Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Blob.uploadFrom(InputStream)

3 participants