OK Let's talk about chars

bgrommes

Problem

The lack of a character type means that reading text data either has to happen as strings or as raw bytes.
When parsing delimited text files (e.g. csv) in C# I typically call an external open source lib that does a pretty good job of determining the encoding of the file. Then by default I read a line at a time and parse that into columns (not as simple as splitting on commas, due to quoted strings / embedded commas within column values). Here the ability to iterate the char[] underlying the string is helpful at times for performance reasons but in principle I can do what I need to do in Objo. However ...
Sometimes text values have embedded line endings in them (could be either LF or CR/LF, independent of OS). In this case reading line by line would have you picking up partial records / rows. Since you don't know how many columns consist of multiline text data, and each value could have many end-of-lines, there's no good way to work around this.
In that case you have to read the file a char at a time and run a finite state machine against it until you've assembled an entire record / row. It's slower, but necessary.
Problem is that in Objo you can't read a file a char at a time -- only a byte at a time (BinaryStream). Then you're faced with decoding whatever encoding scheme the file consists of on top of sorting out whatever gnarly problems the file has to begin with (don't get me started).
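To make the character-at-a-time state machine concrete, here is a minimal sketch in Python (used only as a stand-in; it is not Objo code and not the parser from the post). The state names and the `read_records` function are made up for illustration. It assumes the text is already decoded, that records end with LF or CR/LF, and that quotes inside quoted fields are doubled.

```python
import io

# Hypothetical state names for this sketch.
FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED = range(4)

def read_records(stream):
    """Yield one record (a list of fields) at a time from a text stream.

    Reads a single character per step so that line endings inside quoted
    fields stay part of the field instead of ending the record.
    """
    state = FIELD_START
    field, record = [], []
    while True:
        ch = stream.read(1)
        if ch == "":  # end of input: flush a final record without a newline
            if state != FIELD_START or field or record:
                record.append("".join(field))
                yield record
            return
        if state == QUOTED:
            if ch == '"':
                state = QUOTE_IN_QUOTED  # closing quote, or first half of ""
            else:
                field.append(ch)  # embedded LF / CR is kept here
            continue
        if state == QUOTE_IN_QUOTED and ch == '"':
            field.append('"')  # doubled quote means one literal quote
            state = QUOTED
            continue
        # From here on we are outside quotes.
        if ch == '"' and state == FIELD_START:
            state = QUOTED
        elif ch == ",":
            record.append("".join(field))
            field = []
            state = FIELD_START
        elif ch == "\n":
            record.append("".join(field))
            yield record
            record, field = [], []
            state = FIELD_START
        elif ch == "\r":
            pass  # assumed to be the CR of a CR/LF pair; the LF ends the record
        else:
            field.append(ch)
            state = UNQUOTED

data = 'id,note\r\n1,"line one\nline two"\n2,"say ""hi"", ok"\n'
for rec in read_records(io.StringIO(data)):
    print(rec)
```

The second record keeps its embedded LF inside one field, which is exactly the case that line-by-line reading gets wrong.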

Proposed Solution

I'm not going to bother advocating at this point for a char type because even for you this is likely out of scope for an initial release, and I sense that you probably consider it of low enough value for an "approachable" language that you have left it out of at least an initial cut. However what I could live with is if BinaryStream had a ReadChar(Encoding) method (default = UTF8). This could return the code point as an integer. That is then convertible to a string if desired using Chr(). Checking code points would generally be faster for parsing; I can append code points I am not discarding to an Array(Of Integer) and convert the lot to a string later. (I shudder to think of the string allocations involved, but I'm trying to keep an open mind, lol).

You'd need a corresponding WriteChar(Integer,Encoding) method. It would be just as important; I just haven't covered the use case in this writeup. The point is sometimes you need to read / write at the character level. Right now unless I'm missing something, you can't, at least not without encoding / decoding raw bytes yourself, which would probably be really slow even if it weren't a total pain.
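For a sense of what "decoding raw bytes yourself" involves, here is a rough Python sketch of reading one UTF-8 code point from a byte stream. It is a stand-in, not Objo, and `read_code_point` is a made-up name for the proposed `ReadChar` behaviour. Returning -1 at end of stream is an assumption of this sketch. The validation is deliberately incomplete, and other encodings would each need their own decoder.

```python
import io

def read_code_point(stream):
    """Read one UTF-8 encoded code point from a binary stream.

    Returns the code point as an int, or -1 at end of stream (a choice made
    for this sketch). Validation is minimal: it checks lead and continuation
    bytes but does not reject every overlong or surrogate form.
    """
    first = stream.read(1)
    if not first:
        return -1
    b0 = first[0]
    if b0 < 0x80:  # single-byte (ASCII) sequence
        return b0
    if 0xC2 <= b0 <= 0xDF:  # lead byte of a 2-byte sequence
        extra, cp = 1, b0 & 0x1F
    elif 0xE0 <= b0 <= 0xEF:  # lead byte of a 3-byte sequence
        extra, cp = 2, b0 & 0x0F
    elif 0xF0 <= b0 <= 0xF4:  # lead byte of a 4-byte sequence
        extra, cp = 3, b0 & 0x07
    else:
        raise ValueError(f"invalid UTF-8 lead byte: 0x{b0:02X}")
    tail = stream.read(extra)
    if len(tail) != extra:
        raise ValueError("truncated UTF-8 sequence")
    for b in tail:
        if b & 0xC0 != 0x80:
            raise ValueError(f"invalid UTF-8 continuation byte: 0x{b:02X}")
        cp = (cp << 6) | (b & 0x3F)  # append 6 payload bits per byte
    return cp

stream = io.BytesIO("a,é€".encode("utf-8"))
code_points = []
cp = read_code_point(stream)
while cp != -1:
    code_points.append(cp)
    cp = read_code_point(stream)
print(code_points)  # [97, 44, 233, 8364]
```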

Who Would This Help?

Anyone who has to deal with characters rather than bytes.

jalih

How about adding getc, ungetc, getb and ungetb methods to BinaryStream? That would make parsing data character by character or byte by byte easy.

bgrommes

jalih Well, after I wrote this I see that I was also using Peek() to look ahead at the next char without actually consuming it. That is also needed at times, and I think it's a better metaphor than "ungetting", although maybe you could tell me if there's a use case that requires that instead (where peek would not be a substitute). The .NET byte stream doesn't have an API for "ungetting", but it does provide seeking and positioning, which can achieve the same thing and also covers things like, I guess, reading fixed-length records out of order, skipping sections, or starting operations at some place other than the beginning of the file.

Since not all BinaryStreams are seekable, Seek() would also require a CanSeek property. Technically you shouldn't Peek(), much less Seek(), if CanSeek is false, because it's possible the next byte isn't yet in the buffer.

.NET also has MemoryStream and FileStream with additional functionality over Stream (including Peek and Seek), but it seems Objo is going for more of a Swiss army knife with BinaryStream covering all those use cases?

jalih

bgrommes Say you are writing a scanner for an interpreter. I think using getc/ungetc is more natural than using peek/read. You also need to use ungetc only once, when the scanned token is complete.
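The two styles being compared can be put side by side in a small Python sketch (a stand-in, not Objo). The `Source` class and both scan functions are hypothetical names for illustration. Both versions scan a run of digits and leave the stream positioned at the first character after the token.

```python
class Source:
    """Code point reader over a string with peek and a one-slot pushback."""

    def __init__(self, text):
        self._cps = [ord(ch) for ch in text]
        self._pos = 0
        self._pushed = None  # at most one code point waiting to be re-read

    def getc(self):
        """Consume and return the next code point, or -1 at end of input."""
        if self._pushed is not None:
            cp, self._pushed = self._pushed, None
            return cp
        if self._pos >= len(self._cps):
            return -1
        cp = self._cps[self._pos]
        self._pos += 1
        return cp

    def peek(self):
        """Return the next code point without consuming it, or -1 at end."""
        if self._pushed is not None:
            return self._pushed
        return self._cps[self._pos] if self._pos < len(self._cps) else -1

    def ungetc(self, cp):
        """Push one code point back so the next getc() returns it again."""
        if self._pushed is not None:
            raise RuntimeError("only one code point of pushback is supported")
        self._pushed = cp


def is_digit(cp):
    return ord("0") <= cp <= ord("9")


def scan_number_with_ungetc(src):
    """Read until a non-digit is consumed, then push that one back."""
    digits = []
    cp = src.getc()
    while is_digit(cp):
        digits.append(chr(cp))
        cp = src.getc()
    if cp != -1:
        src.ungetc(cp)  # single pushback once the token is complete
    return "".join(digits)


def scan_number_with_peek(src):
    """Look before consuming, so nothing ever needs to be pushed back."""
    digits = []
    while is_digit(src.peek()):
        digits.append(chr(src.getc()))
    return "".join(digits)
```

For one character of lookahead the two are interchangeable. The pushback version needs the stream to hold a small buffer, while the peek version costs one extra call per character.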

Garry

Thanks both. I think this is a gap, but I’m leaning away from adding a Char type for now.

The awkward part is that “character” is not one clear thing. It could mean a UTF-16 code unit, a Unicode code point, a Unicode scalar value, or a user-perceived grapheme. I’d rather not bake that ambiguity into the language before v1.
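A quick way to see the ambiguity is to count the same text three ways. The snippet below is Python rather than Objo, and `sizes` is just an illustrative helper. Each sample string is a single user-perceived character. The Python standard library has no grapheme segmentation, so that count is stated in the comments rather than computed.

```python
def sizes(text):
    """Count one string three ways; the counts need not agree."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),  # Python strings are indexed by code point
    }

samples = [
    ("e with combining acute", "e\u0301"),   # one grapheme, two code points
    ("grinning face emoji", "\U0001F600"),   # one grapheme, one code point
]
for label, text in samples:
    print(label, sizes(text))
# e with combining acute {'utf8_bytes': 3, 'utf16_code_units': 2, 'code_points': 2}
# grinning face emoji {'utf8_bytes': 4, 'utf16_code_units': 2, 'code_points': 1}
```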

Instead, I’m considering adding a text-oriented stream API that sits above byte-level BinaryStream.

Proposed shape:

Class TextStream
  Constructor(file As FileSystemItem)
  Constructor(file As FileSystemItem, encoding As TextEncoding)

  Property Encoding As TextEncoding (readonly)
  Property EndOfFile As Boolean (readonly)

  Function ReadCodePoint() As Integer
  Function PeekCodePoint() As Integer

  Function ReadCharacter() As String
  Function PeekCharacter() As String

  Sub WriteCodePoint(codePoint As Integer)
  Sub WriteCharacter(value As String)

  Function ReadLine() As String
  Function ReadAll() As String

  Sub Write(value As String)
  Sub WriteLine(value As String)

  Sub Close()
End Class

Default encoding would be TextEncoding.UTF8.

ReadCodePoint() would return the next Unicode code point as an Integer, matching the existing Chr(), Asc(), and String.FromCodePoints() APIs. ReadCharacter() would be a convenience wrapper returning Chr(ReadCodePoint()).

For parsers and scanners:

Var stream As New TextStream(file, TextEncoding.UTF8)
Var inQuotes As Boolean = False

While Not stream.EndOfFile
  Var cp As Integer = stream.ReadCodePoint()

  Select Case cp
  Case Asc("""")
    inQuotes = Not inQuotes
  Case Asc(",")
    If Not inQuotes Then
      # End of field
    End If
  Case 10, 13
    If Not inQuotes Then
      # End of record
    End If
  End Select
End While

I’m proposing PeekCodePoint() rather than UngetCodePoint() initially because it covers common one-character lookahead without introducing a pushback buffer. If people feel strongly that scanner-style pushback is needed, maybe this should include one of:

Sub PushBackCodePoint(codePoint As Integer)

or:

Sub UnreadCodePoint()

Questions I’d like opinions on before implementing:

1) Is TextStream the right name, or would TextReader / TextWriter be clearer?
2) Should this be a new class, or should these methods live directly on BinaryStream?
3) Is PeekCodePoint() enough, or do scanner authors really want pushback?
4) Should ReadCodePoint() return -1 at EOF, or should users rely on EndOfFile and let reading past EOF throw an IOException like BinaryStream?
5) Are ReadCharacter() / WriteCharacter() useful, or is code point I/O enough?

My current preference is a separate TextStream class with code point methods, EOF checked via EndOfFile, and no Char type for the initial release.

bgrommes

Garry A separate stream class is fine.

What I use in .NET is StreamReader / StreamWriter, which inherit from TextReader / TextWriter and wrap an underlying Stream. Mostly I'm dealing with strings, and dipping into the char[] that the string wraps as needed (exposed via a char indexer).

To be clear -- my main concern is that if I'm trying to iterate character by character on a hot path I have two choices in Objo to represent that: strings or integers (code points). An Objo integer is 8 bytes. A string (in .NET anyway) is an object on the heap, not a value type (despite the value semantics of strings). Apart from the size of the backing char[] array, there's an instance header, a sync block and a pointer, for a minimum of 16 bytes of overhead for ANY object on a 64-bit system (not counting private fields, properties, etc). For a string it's then basically 2 more bytes per char plus 4 bytes for the Length property. So for a one-character string it comes down to an 8 byte code point in integer form versus a 22 byte string instance (16 + 2 + 4). The latter means a lot more memory and, probably more relevant in practice, more allocation pressure. (I'm assuming that the Objo objects involved are either backed by .NET objects or, when they are not, have similar overhead.)

For my use case it's easier and almost certainly way more performant to just treat incoming file characters as an integer array as I read it in and convert some or all of it to a string when I'm done parsing. This also opens up the possibility of having code that can do some string operations in code point form before going back to string form, which could be an opportunity for further optimizations (although one could get into the weeds pretty quickly there, lol).
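The "collect code points, convert once at the end" pattern can be sketched as follows. This is Python, not Objo, and `keep_printable` is a made-up example filter. The point is only that the per-character work happens on plain integers, with a single string built at the end.

```python
from array import array

TAB, LF = 9, 10

def keep_printable(code_points):
    """Drop control characters (except TAB and LF) and build one string.

    The per-character work is done on integers; the only string created
    is the final result.
    """
    kept = array("L")  # compact array of unsigned integers
    for cp in code_points:
        if cp >= 32 or cp in (TAB, LF):
            kept.append(cp)
    return "".join(map(chr, kept))

raw = [ord(ch) for ch in "a\x00b\tc\x07\n"]
print(repr(keep_printable(raw)))  # 'ab\tc\n'
```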

Is this overkill for a lot of cases? Sure. But on a hot path where you're doing involved parsing and transformations, it's worth the extra effort.

To your specific questions:

1) StreamReader/StreamWriter would make sense to me.

2) I lean toward a separate class because it involves a level of abstraction above bytes.

3) IMO Peek is fine. I don't have an opinion about pushback other than that I've never had a need for it. I'd want to see Seek() on BinaryStream if it's not already there, but aside from that I have not had to read out of sequence. I'd use Seek() for things like jumping past data I don't care about, random record reads, etc.

4) IMO one should be in a while not eof() loop, and reading past eof() should throw an exception. Peek() is the one case that should not throw, because it's a look-ahead. It should return -1 if there is no next character.

5) I think ReadCharacter() / WriteCharacter() make sense, I guess, for completeness and flexibility. Would I use them much? Probably not. But there's a symmetry to having them. It's hard to anticipate every possible need.

Garry

Thanks for the feedback on this. I’ve implemented the text-stream approach we discussed.

I’ve added a new TextStream class rather than adding character methods directly to BinaryStream, and I’ve avoided adding a separate Char type for now. The API works in Unicode code points, represented as Integer, which keeps it aligned with Asc(), Chr(), and String.FromCodePoints().

The added API is:

Class TextStream
  Constructor(file As FileSystemItem)
  Constructor(file As FileSystemItem, encoding As TextEncoding)

  Property Encoding As TextEncoding
  Property EndOfFile As Boolean

  Function ReadCodePoint() As Integer
  Function PeekCodePoint() As Integer

  Function ReadCharacter() As String
  Function PeekCharacter() As String

  Sub WriteCodePoint(codePoint As Integer)
  Sub WriteCharacter(value As String)

  Function ReadLine() As String
  Function ReadAll() As String

  Sub Write(value As String)
  Sub WriteLine(value As String)

  Sub Close()
End Class

Default encoding is UTF-8, but you can pass an explicit TextEncoding.

For parser/scanner-style code, the intended pattern is:

Var stream As TextStream = New TextStream(file, TextEncoding.UTF8)

While Not stream.EndOfFile
  Var cp As Integer = stream.ReadCodePoint()

  Select Case cp
  Case Asc(",")
    # End of field
  Case 10, 13
    # End of line, unless inside quoted text
  End Select
Wend

stream.Close()

PeekCodePoint() gives one-code-point lookahead and returns -1 at EOF. ReadCodePoint(), ReadCharacter(), and ReadLine() throw IOException if you read past EOF, so normal code should use EndOfFile.
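The EOF rules above can be modelled in a few lines of Python to show how the lookahead is meant to be used. This is an illustrative model only, not the Objo implementation. `CodePointStream` and `split_lines` are hypothetical names, and Python's `EOFError` stands in for the `IOException`. The example treats LF, CR/LF and a lone CR as line endings, which is an assumption of this sketch.

```python
CR, LF = 13, 10

class CodePointStream:
    """Tiny in-memory model: peek returns -1 at EOF, read raises past EOF."""

    def __init__(self, text):
        self._cps = [ord(ch) for ch in text]
        self._pos = 0

    @property
    def end_of_file(self):
        return self._pos >= len(self._cps)

    def peek_code_point(self):
        return -1 if self.end_of_file else self._cps[self._pos]

    def read_code_point(self):
        if self.end_of_file:
            raise EOFError("read past end of stream")
        cp = self._cps[self._pos]
        self._pos += 1
        return cp


def split_lines(stream):
    """Split on LF, CR/LF or lone CR using one code point of lookahead."""
    lines, current = [], []
    while not stream.end_of_file:
        cp = stream.read_code_point()
        if cp == CR:
            # Peek is safe at EOF: it returns -1 instead of raising.
            if stream.peek_code_point() == LF:
                stream.read_code_point()  # consume the LF of a CR/LF pair
            lines.append("".join(map(chr, current)))
            current = []
        elif cp == LF:
            lines.append("".join(map(chr, current)))
            current = []
        else:
            current.append(cp)
    if current:  # final line without a trailing line ending
        lines.append("".join(map(chr, current)))
    return lines

print(split_lines(CodePointStream("a\r\nb\nc\r")))  # ['a', 'b', 'c']
```

Because peeking at EOF returns -1 rather than throwing, the CR check needs no extra end-of-file guard, while the main loop still relies on the end-of-file flag.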

I also added a bundled Studio example called TextStream Demo. It writes a small UTF-8 CSV file, reads it back with ReadAll() and ReadLine(), then parses it with ReadCodePoint() / PeekCodePoint().

Will be in the next release: https://feedback.objo.dev/feature/499