Python3 and strings

Hi everyone,

I am currently writing some python2/python3 libraries to work with ROS messages, and I am in need of some information.

How should the ‘string’ message field be handled in Python 3?
There is no info about that in http://wiki.ros.org/msg, but in Python 3 we need to specify an encoder/decoder whenever we convert a string into a sequence of bytes and vice versa.
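To make the concern concrete, here is a minimal Python 3 sketch of that explicit conversion. The sample text is an arbitrary example of mine, not something taken from the ROS documentation.

```python
# Python 3: str (text) and bytes are distinct types, so every conversion
# between them goes through a codec.
text = "héllo"                   # str, 5 characters
raw = text.encode("utf-8")       # str -> bytes
back = raw.decode("utf-8")       # bytes -> str

# 'é' needs two bytes in UTF-8, so the byte length differs from the text length.
print(len(text), len(raw), back == text)
```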

Is there any information about this that I missed somewhere? Thanks!

That page does mention:

unicode strings are currently not supported as a ROS data type. utf-8 should be used to be compatible with ROS string serialization. In python 2, this encoding is automatic for unicode objects, but decoding must be done manually. In python 3, both encoding and decoding are automatic.

It also says :

  • Primitive Type: string
  • Serialization: ascii string (4)
  • C++: std::string
  • Python: str

and

  • Primitive Type: uint8[]
  • Serialization: uint32 length prefix
  • C++: std::vector
  • Python: bytes

Also in python3 both encoding and decoding are automatic, based on the platform you are running on, provided you use the right type (bytes or str).

If two communicating nodes run on platforms that use different encodings, then a message that is meant to carry a string will probably arrive garbled.
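As a rough illustration of that garbling (the two codecs here are just assumptions picked for the example), this is what happens when the sender encodes with one codec and the receiver decodes with another:

```python
# Sender side: text encoded as UTF-8.
text = "café"
wire = text.encode("utf-8")

# Receiver side: the same bytes decoded with a different codec (Latin-1 here).
garbled = wire.decode("latin-1")

# Pure-ASCII text survives the mismatch, anything else does not.
ascii_ok = "cafe".encode("utf-8").decode("latin-1")
print(repr(garbled), repr(ascii_ok))
```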

On the other hand, if we do not send a string with an encoding, then we are sending bytes, just like for a uint8[] field.

  • Is string the same as uint8[] ? (and Python type should be bytes)
  • OR should ROS enforce some unicode encoding for string ? (and Python type can be str)

In any case, it seems the wiki page should list Python2 and Python3 separately to avoid confusion.

From the generated python code for a msg, when serializing the message into a buffer to send it, ROS encodes the string field as a utf-8 string (x is a string field in the ROS msg):

_x = self.x
if python3 or type(_x) == unicode:
  _x = _x.encode('utf-8')

And similarly, when deserializing the received buffer, the field is decoded from utf-8 into a Python str (in Python 3; in Python 2 the raw bytes are kept as-is):

if python3:
  self.x = str[start:end].decode('utf-8')
else:
  self.x = str[start:end]

So on the user side, you just need to make sure that the encoding for the string you’re sending is utf-8.
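For anyone who wants to try the round trip outside of ROS, here is a small standalone sketch. It assumes, as far as I understand the ROS 1 wire format, that a string field is written as a little-endian uint32 length followed by the encoded bytes. The function names are hypothetical and are not part of genpy or rospy.

```python
import struct


def serialize_string(value):
    """Hypothetical helper: pack text as uint32 length + UTF-8 bytes."""
    data = value.encode("utf-8")
    return struct.pack("<I%ds" % len(data), len(data), data)


def deserialize_string(buf, offset=0):
    """Hypothetical helper: read one length-prefixed UTF-8 string.

    Returns the decoded text and the offset just past the field.
    """
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    end = start + length
    return buf[start:end].decode("utf-8"), end


wire = serialize_string("héllo")
text, end = deserialize_string(wire)
print(repr(wire), repr(text), end)
```

Note that the length prefix counts bytes, not characters, which is why it has to be computed after encoding.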

With that in mind, that block from the msg wiki page seems sufficient to me:

unicode strings are currently not supported as a ROS data type. utf-8 should be used to be compatible with ROS string serialization. In python 2, this encoding is automatic for unicode objects, but decoding must be done manually. In python 3, both encoding and decoding are automatic.

Interestingly, from this code, I understand the exact opposite of

unicode strings are currently not supported

We are obviously using the UTF-8 unicode codec to encode and decode it, and the matching Python type is a unicode string. So looking at this code, I would say:
‘A string field in a ROS message is a unicode string, and will be encoded/decoded using UTF-8 for serialization/deserialization.’

And in that case the wiki should state :

  • Primitive Type: string
  • Serialization: utf-8 string (4)
  • C++: std::string
  • Python3: str
  • Python2: unicode

On the other hand, if this is not true and the ROS serialization is only supporting ASCII, then the python matching type should be bytes, and the wiki should say :

  • Primitive Type: string
  • Serialization: ascii string (4)
  • C++: std::string
  • Python3: bytes
  • Python2: str

and the serialization code needs to be fixed (no need to encode/decode, unicode is not supported).

Yes you’re right, that statement doesn’t seem to be correct.

As per your recommendation, I think mentioning utf-8 string as the serialization type would be fine (though not sure if that is the right thing with the C++ client library), but it would be better to use/recommend the str type for Python2 since there is no automatic decoding into a unicode string for Python 2. So it would just be type str for both Python 2 and 3.

But str in Python3 is unicode in Python2, and serializing data differently between versions of Python will break a few things in many places (“why is my message garbled on this node and not that one?”).
We could do that, but it would require a “big warning” everywhere we mention this topic.

=> I could not find any REP specification regarding message serialization, how to match the types of the supported languages, and how to integrate deserialization with them. It seems it’s something we need in order to drive implementation (especially given ROS supports multiple languages) and prevent “incomplete features” as much as possible.

The current serialization code breaks :

  • when we pass a bytes in python 3 (no encode method) fix attempt here
  • when we pass a unicode in python2 (the receiving end loses the encoding)
  • when we pass a str in python3 (a receiving python2 end loses the encoding)
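The first case is easy to reproduce in plain Python 3. The sketch below also shows one possible way to tolerate both types; `to_wire_bytes` is a hypothetical helper of mine and is not necessarily what the linked fix attempt does.

```python
def to_wire_bytes(value):
    """Hypothetical helper: accept text or bytes, always return bytes."""
    if isinstance(value, bytes):
        return value                 # already bytes, pass through untouched
    return value.encode("utf-8")     # text, encode as the generated code does


# In Python 3, bytes objects have no encode() method, so the generated
# `_x.encode('utf-8')` line raises AttributeError when given bytes.
try:
    b"abc".encode("utf-8")
except AttributeError:
    bytes_encode_failed = True
else:
    bytes_encode_failed = False

print(bytes_encode_failed, to_wire_bytes(b"abc"), to_wire_bytes("abc"))
```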

=> we need a solution (design fix) that integrates properly for all supported languages…

I think fully supporting unicode strings would require a lot of effort, more so on the C++ client libraries.
Sadly not a solution, but for now the recommendation of just sticking to ascii strings would prevent the issues mentioned in the second and third bullets from affecting user code.
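If you want to enforce that recommendation in user code, a check along these lines could be run before filling a string field. This is only a sketch, and `is_ascii` is a hypothetical helper name, not an existing ROS function.

```python
def is_ascii(text):
    """Hypothetical helper: True if text contains only ASCII characters."""
    try:
        text.encode("ascii")
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return False
    return True


print(is_ascii("hello"), is_ascii("héllo"))
```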

Agreed. That means that the advised/documented python3 type should be bytes.
I went ahead and worked on an update to the wiki, to try to remove the confusion when talking about py2/py3.

Well, you can make both Python 2 & 3 string msg types be bytes with a note that bytes is the same as str in Python2.

$ python2
Python 2.7.13 (default, Jul 21 2017, 03:24:34) 
>>> a = str('123')
>>> type(a)
<type 'str'>
>>> a = bytes('123')
>>> type(a)
<type 'str'>
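For comparison, here is a sketch of the equivalent check under Python 3, where bytes and str are separate types and constructing bytes from text requires an explicit encoding:

```python
import sys

# This sketch targets Python 3 only.
assert sys.version_info[0] >= 3

a = str("123")
b = bytes("123", "ascii")        # the encoding argument is mandatory here

print(type(a), type(b))
print(a == b)                    # text and bytes never compare equal
print(b.decode("ascii") == a)    # they match once decoded
```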