KEMBAR78
gh-110235: Raise `TypeError` for duplicate/unknown fields in PyStructSequence constructor by XuehaiPan · Pull Request #110258 · python/cpython · GitHub
Skip to content

Conversation

@XuehaiPan
Copy link
Contributor

@XuehaiPan XuehaiPan commented Oct 3, 2023

Fixes #110235

In this PR, we iterate over dict keys and check if there are duplicate fields provided by sequence or there are any unexpected field names.

@XuehaiPan XuehaiPan marked this pull request as ready for review October 3, 2023 06:34
Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code in the constructor is too complicated, and it slow down normal case, because it tests fields already passed in the sequence argument.

Simply count the number of fields found in the dict argument. If it less than the size of the dict, then raise a TypeError.

I think that it is not needed to make the error message more specific, test whether some of keys match the fields already specified in sequence argument or are not field names at all -- it only adds a lot of complicated code for a little gain. If users came with requests for more details, then we may provide them, but currently it is a waste of efforts.

@XuehaiPan
Copy link
Contributor Author

I updated the implementation to only count the number of unexpected fields rather than list them in the error message.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still too complicated. No need to check for multiple values for field.

@XuehaiPan
Copy link
Contributor Author

XuehaiPan commented Oct 4, 2023

It is still too complicated. No need to check for multiple values for field.

The signature for structseq.__new__() is:

SomeStructSeq(sequence: Iterable[Any], dict: Dict[str, Any]) -> SomeStructSeq

In this PR, I raise an error for values that come from both sequence and dict. If we don't do this, in the result tuple, the value from dict will override the value from sequence.

IMO, it should raise an error as we did for namedtuple.__new__(*args, **kwargs). If a value is provided in both args and kwargs, it will raise a TypeError.

mport collections
MyTuple = collections.namedtuple('MyTuple', ['a', 'b', 'c', 'd'])

MyTuple(1, 2, 3, 4, a=5, b=6)  # TypeError: MyTuple.__new__() got multiple values for argument 'a'

But namedtuple and structseq are slightly different. In structseq, the arguments sequence and dict do not have * and ** unpacking. Any thoughts?

def func(a, **kwargs):
    pass

func(1, a=2)  # TypeError: func() got multiple values for argument 'a'

@serhiy-storchaka
Copy link
Member

I meant that it will automatically raise TypeError if dict contains extra keys. No need to check it specially, and no need to generate more specific error message. Simply:

for i in range(len, max_len):
    name = fields[i].name
    if name in dict:
        items[i] = dict[name]
        count++
    else:
        items[i] = None
    i++
if count < len(dict):
    raise TypeError("simple generic error message")

In C it will be just few additional lines of code.

@XuehaiPan
Copy link
Contributor Author

XuehaiPan commented Oct 4, 2023

I meant that it will automatically raise TypeError if dict contains extra keys.

Thanks for the code snippet. My point is #110258 (comment):

The signature for structseq.__new__() is:

SomeStructSeq(sequence: Iterable[Any], dict: Dict[str, Any]) -> SomeStructSeq

In this PR, I raise an error for values that come from both sequence and dict. If we don't do this, in the result tuple, the value from dict will override the value from sequence.

For example:

time.struct_time
fields: [
    "tm_year",    #  0, visiable
    "tm_mon",     #  1, visiable
    "tm_mday",    #  2, visiable
    "tm_hour",    #  3, visiable
    "tm_min",     #  4, visiable
    "tm_sec",     #  5, visiable
    "tm_wday",    #  6, visiable
    "tm_yday",    #  7, visiable
    "tm_isdst",   #  8, visiable
    "tm_zone",    #  9, invisible
    "tm_gmtoff",  # 10, invisible
]
n_sequence_fields: 9
n_unnamed_fields: 0
n_fields: 11

In the code snippet (#110258 (comment)), it will raise an error if there is a visible field name or an unknown name in dict (already part of this PR). But it will silently override the value for invisible fields when the value came from both sequence and dict.

time.struct_time(seqeunce=list(range(10)), dict={'tm_zone': -1})
# tm_zone is firstly set from sequence[9], then it will be override by dict['tm_zone'] without raising an error

The implementation of this PR is roughly equivalent to:

  len = sequence.__len__()
  min_len = n_sequence_fields
  max_len = n_fields
  
  for i, value in enumerate(sequence):
      items[i] = value  # for i >= min_len, it will set invisible fields
  
  count = 0
- for i in range(len, max_len):
+ for i in range(min_len, max_len):
      name = fields[i].name
+     if i < len:
+         if name in dict:
+             raise TypeError("duplicate invisible field values")
+         continue
      if name in dict:
          items[i] = dict[name]
          count += 1
      else:
          items[i] = None
  if count < dict.__len__():
      raise TypeError("simple generic error message")

@serhiy-storchaka
Copy link
Member

The following code:

diff --git a/Objects/structseq.c b/Objects/structseq.c
index 0ca622edc2..08af71a88d 100644
--- a/Objects/structseq.c
+++ b/Objects/structseq.c
@@ -216,6 +216,7 @@ structseq_new_impl(PyTypeObject *type, PyObject *arg, PyObject *dict)
         res->ob_item[i] = Py_NewRef(v);
     }
     Py_DECREF(arg);
+    Py_ssize_t count = 0;
     for (; i < max_len; ++i) {
         PyObject *ob = NULL;
         if (dict != NULL) {
@@ -228,8 +229,18 @@ structseq_new_impl(PyTypeObject *type, PyObject *arg, PyObject *dict)
         if (ob == NULL) {
             ob = Py_NewRef(Py_None);
         }
+        else {
+            count++;
+        }
         res->ob_item[i] = ob;
     }
+    if (dict != NULL && PyDict_GET_SIZE(dict) > count) {
+        PyErr_Format(PyExc_TypeError,
+                     "%.500s() got unexpected field name(s).",
+                     type->tp_name);
+        Py_DECREF(res);
+        return NULL;
+    }
 
     _PyObject_GC_TRACK(res);
     return (PyObject*) res;

passes your tests (ignoring error message). It raises error for your example

>>> time.struct_time(sequence=list(range(10)), dict={'tm_zone': -1})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: time.struct_time() got unexpected field name(s).

Look, your loop can be split on two, range(min_len, len) and range(len, max_len). In the latter loop i < len is always false, so this can be removed. In the former loop it is always true, and the code in this "if" block is the only one that executed in the loop. It raises a TypeError if name is in dict. But if any name is found in dict in the former loop, it cannot be counted in the later loop and you get a TypeError in any case.

@XuehaiPan
Copy link
Contributor Author

passes your tests (ignoring error message). It raises error for your example

I check this and it works. I have updated my implementation to raise a single generic error.

PyObject *ob = NULL;
if (dict != NULL) {
const char *name = type->tp_members[i-n_unnamed_fields].name;
if (dict != NULL && PyDict_GET_SIZE(dict) > 0) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add this guard check before the for-loop to avoid unnecessary dictionary lookup for empty dict.

@XuehaiPan
Copy link
Contributor Author

Thanks for the code review and suggestions!

@serhiy-storchaka serhiy-storchaka merged commit 9561648 into python:main Oct 4, 2023
@XuehaiPan XuehaiPan deleted the structseq-new branch October 4, 2023 20:43
Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PyStructSequence constructor ignores unknown field names

2 participants