Discussion:
Major performance problem with std.array.front()
Walter Bright
2014-03-07 02:37:13 UTC
Permalink
In "Lots of low hanging fruit in Phobos" the issue came up about the automatic
encoding and decoding of char ranges.

Throughout D's history, there have been regular and repeated proposals to
redesign D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e. so D will
automatically generate code to decode and encode on every attempt to index char[].

I have strongly objected to these proposals on the grounds that:

1. It is a MAJOR performance problem to do this.

2. Very, very few manipulations of strings ever actually need decoded values.

3. D is a systems/native programming language, and systems/native programming
languages must not hide the underlying representation (I make similar arguments
about proposals to make ints issue errors on overflow, etc.).

4. Users should choose when decode/encode happens, not the language.

and I have been successful at heading these off. But one slipped by me. See this
in std.array:

@property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
{
    assert(a.length, "Attempting to fetch the front of an empty array of " ~
        T.stringof);
    size_t i = 0;
    return decode(a, i);
}

What that means is that if I implement an algorithm that accepts, as input, an
InputRange of chars, it will ALWAYS try to decode it. This means that even:

from.copy(to)

will decode 'from', and then re-encode it for 'to'. And it will do it SILENTLY.
The user won't notice, and he'll just assume that D performance sux. Even if he
does notice, his options to make his code run faster are poor.

If the user wants decoding, it should be explicit, as in:

from.decode.copy(encode!to)

The USER should decide where and when the decoding goes. 'decode' should be just
another algorithm.

(Yes, I know that std.algorithm.copy() has some specializations to take care of
this. But these specializations would have to be written for EVERY algorithm,
which is thoroughly unreasonable. Furthermore, copy()'s specializations only
apply if BOTH source and destination are arrays. If just one is, the
decode/encode penalty applies.)
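
(To make "decode as just another algorithm" concrete: a minimal sketch of an
explicit decoding adaptor, using only std.utf; "decodeRange" is a hypothetical
name, not an existing Phobos symbol.)

import std.utf : decode, stride;

auto decodeRange(const(char)[] s)
{
    static struct Decoder
    {
        const(char)[] src;
        @property bool empty() const { return src.length == 0; }
        @property dchar front()
        {
            size_t i = 0;
            return decode(src, i);          // decode one code point at the front
        }
        void popFront()
        {
            src = src[stride(src, 0) .. $]; // advance past one code point
        }
    }
    return Decoder(s);
}

// usage: foreach (dchar c; "héllo".decodeRange) { ... }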

Is there any hope of fixing this?
bearophile
2014-03-07 02:54:50 UTC
Permalink
systems/native programming languages must not hide the
underlying representation (I make similar arguments about
proposals to make ints issue errors on overflow, etc.).
But it's good to have in Phobos a compiler-intrinsics-based
efficient overflow detection on a user-defined struct type that
behaves like built-in ints in all other aspects.
Is there any hope of fixing this?
I don't think we can change that in D2. You can change it in D3.

Bye,
bearophile
Walter Bright
2014-03-07 02:57:40 UTC
Permalink
systems/native programming languages must not hide the underlying
representation (I make similar arguments about proposals to make ints issue
errors on overflow, etc.).
But it's good to have in Phobos a compiler-intrinsics-based efficient overflow
detection on a user-defined struct type that behaves like built-in ints in all
other aspects.
Yes, so that the user selects it, rather than having it wired in everywhere and
the user has to figure out how to defeat it.
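
(A minimal sketch of the kind of opt-in wrapper being discussed here; a plain
widening check stands in for the compiler intrinsic, and "CheckedInt" is an
illustrative name, not a Phobos type.)

struct CheckedInt
{
    int value;

    CheckedInt opBinary(string op : "+")(CheckedInt rhs) const
    {
        immutable long wide = cast(long) value + rhs.value; // widen so overflow is visible
        assert(wide >= int.min && wide <= int.max, "integer overflow");
        return CheckedInt(cast(int) wide);
    }

    alias value this; // behaves like a built-in int in all other aspects
}

// usage: auto c = CheckedInt(int.max) + CheckedInt(1); // fails the check at runtime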
bearophile
2014-03-07 03:26:51 UTC
Permalink
Post by Walter Bright
Post by bearophile
But it's good to have in Phobos a compiler-intrinsics-based
efficient overflow
detection on a user-defined struct type that behaves like
built-in ints in all
other aspects.
Yes, so that the user selects it, rather than having it wired
in everywhere and the user has to figure out how to defeat it.
I don't think people have ever suggested that.

In a recent discussion you seemed to be against the idea of special
compiler support for that user-defined type.

Bye,
bearophile
Adam D. Ruppe
2014-03-07 04:01:14 UTC
Permalink
Post by Walter Bright
Yes, so that the user selects it, rather than having it wired
in everywhere and the user has to figure out how to defeat it.
BTW you know what would help this? A pragma we can attach to a
struct which makes it a very thin value type.

pragma(thin_struct)
struct A {
    int a;
    int foo() { return a; }
    static A get() { return A(10); }
}

void test() {
    A a = A.get();
    printf("%d", a.foo());
}

With the pragma, A would be completely indistinguishable from int
in all ways.

What do I mean?
$ dmd -release -O -inline test56 -c

Let's look at A.foo:

A.foo:
  0: 55        push ebp
  1: 8b ec     mov ebp,esp
  3: 50        push eax
  4: 8b 00     mov eax,DWORD PTR [eax]   ; waste!
  6: 8b e5     mov esp,ebp
  8: 5d        pop ebp
  9: c3        ret


It is line four that bugs me: the struct is passed as a
*pointer*, but its only contents are an int, which could just as
well be passed as a value. Let's compare it to an identical
function in operation:

int identity(int a) { return a; }

00000000 <_D6test568identityFiZi>:
0: 55 push ebp
1: 8b ec mov ebp,esp
3: 83 ec 04 sub esp,0x4
6: c9 leave
7: c3 ret

lol it *still* wastes time, setting up a stack frame for nothing.
But we could just as well write asm { naked; ret; } and it would
work as expected: the argument is passed in EAX and the return
value is expected in EAX. The function doesn't actually have to
do anything.


Anywho, the struct could work the same way. Now, I understand
that we can't just change this unilaterally since it would break
interaction with the C ABI, but we could opt in to some thinner
stuff with a pragma.


Ideally, the thin struct would generate this code:

A A.get() {
    naked {          // no need for stack frame here
        mov EAX, 10;
        ret;
    }
}

return A(10); when A is thin should be equal to return 10;. No
need for NRVO, the object is super thin.

int A.foo() {
    naked {   // no locals, no stack frame
        ret;  // the last argument (this) is passed in EAX
              // and the return value goes in EAX
              // so we don't have to do anything
    }
}

Without the thin_struct thing, this would minimally look like

mov EAX, [EAX];
ret;

Having to load the value from the this pointer. But since it is
thin, it is generated identically to an int, like the identity
function above, so the value is already in the register!

Then, test:

void test() {
    naked {   // don't need a stack frame here either!
        call A.get;
        // a is now in EAX, the value loaded right up
        call A.foo;   // the this is an int and already
                      // where it needs to be, so just go
        // and finally, go ahead and call printf
        push EAX;
        push "%d".ptr;
        call printf;
        ret;
    }
}


Then, naturally, inlining A.get and A.foo might be possible
(though I'd love to write them in assembly myself* and the
compiler prolly can't inline them) but call/ret is fairly cheap,
especially when compared to push/pop, so just keeping all the
relevant stuff right in registers with no need to reference can
really help us.

pragma(thin_struct)
struct RangedInt {
    int a;
    RangedInt opBinary(string op : "+")(int rhs) {
        asm {
            naked;
            add EAX, [rhs];   // or RDI on 64 bit! Don't even need to touch the stack! **
            jo throw_exception;
            ret;
        }
    }
}


Might still not be as perfect as intrinsics like bearophile is
thinking of... but we'd be getting pretty close. And this kind of
thing would be good for other thin wrappers too, we could
magically make smart pointers too! (This can't be done now since
returning a struct is done via hidden pointer argument instead of
by register like a naked pointer).

** i'd kinda love it if we had an all-register calling convention
on 32 bit too.... but eh oh well
Walter Bright
2014-03-07 04:19:18 UTC
Permalink
BTW you know what would help this? A pragma we can attach to a struct which
makes it a very thin value type.
I'd rather fix the compiler's codegen than add a pragma.
bearophile
2014-03-07 04:22:29 UTC
Permalink
Post by Walter Bright
I'd rather fix the compiler's codegen than add a pragma.
But a standard common intrinsic to detect the overflow
efficiently could be useful.

Bye,
bearophile
H. S. Teoh
2014-03-07 06:12:38 UTC
Permalink
Post by Walter Bright
BTW you know what would help this? A pragma we can attach to a struct
which makes it a very thin value type.
I'd rather fix the compiler's codegen than add a pragma.
[...]
From what I understand, structs are *supposed* to be thin value types. I
would say that if a struct is under a certain size (determined by the
compiler), and doesn't have complicated semantics like dtors and stuff
like that, then it should be treated like a POD (passed in registers,
etc).


T
--
Ruby is essentially Perl minus Wall.
Walter Bright
2014-03-07 06:18:14 UTC
Permalink
Post by H. S. Teoh
From what I understand, structs are *supposed* to be thin value types. I
would say that if a struct is under a certain size (determined by the
compiler), and doesn't have complicated semantics like dtors and stuff
like that, then it should be treated like a POD (passed in registers,
etc).
Yes, that's right.
Adam D. Ruppe
2014-03-07 13:56:46 UTC
Permalink
Post by Walter Bright
I'd rather fix the compiler's codegen than add a pragma.
The codegen isn't broken; the current this-pointer behavior is
needed for full compatibility with the C ABI. It would be an opt-in
ABI tweak that the caller needs to be aware of, rather than
a traditional optimization where the outside world would never
know.
Dicebot
2014-03-07 14:04:51 UTC
Permalink
Post by Adam D. Ruppe
Post by Walter Bright
I'd rather fix the compiler's codegen than add a pragma.
The codegen isn't broken; the current this-pointer behavior is
needed for full compatibility with the C ABI. It would be an
opt-in ABI tweak that the caller needs to be aware of, rather
than a traditional optimization where the outside world would
never know.
We don't need C ABI compatibility for stuff that is not
extern(C), do we?
Adam D. Ruppe
2014-03-07 14:17:17 UTC
Permalink
Post by Dicebot
We don't need C ABI compatibility for stuff that is not
extern(C), do we?
That's a good point, though personally I'd still like some way to
magic it up, even in extern(C).

Consider the example of library typedef. If C did:

typedef void* HANDLE;

and D did

struct HANDLE { void* foo; alias foo this; }

it is almost the same, but then when you declare

HANDLE OpenFile(...);

it won't work since the compiler will pass a hidden struct
pointer (which is exactly what C would expect if it was a typedef
struct { void* } on its side too) instead of expecting the value
in the accumulator as it would with the void*.
Walter Bright
2014-03-07 19:16:21 UTC
Permalink
Post by Walter Bright
I'd rather fix the compiler's codegen than add a pragma.
The codegen isn't broken; the current this-pointer behavior is needed for full
compatibility with the C ABI. It would be an opt-in ABI tweak that the caller
needs to be aware of, rather than a traditional optimization where the outside
world would never know.
Oh, I see what you mean. But I think it does generate the same code, if you use
it the same way. There is no 'get' function for ints; you aren't using it the
same way.
Kagamin
2014-03-07 10:44:45 UTC
Permalink
Post by Adam D. Ruppe
BTW you know what would help this? A pragma we can attach to a
struct which makes it a very thin value type.
pragma(thin_struct)
struct A {
    int a;
    int foo() { return a; }
    static A get() { return A(10); }
}
void test() {
    A a = A.get();
    printf("%d", a.foo());
}
With the pragma, A would be completely indistinguishable from
int in all ways.
What do I mean?
$ dmd -release -O -inline test56 -c
0: 55        push ebp
1: 8b ec     mov ebp,esp
3: 50        push eax
4: 8b 00     mov eax,DWORD PTR [eax]   ; waste!
6: 8b e5     mov esp,ebp
8: 5d        pop ebp
9: c3        ret
It is line four that bugs me: the struct is passed as a
*pointer*, but its only contents are an int, which could just
as well be passed as a value. Let's compare it to an identical
int identity(int a) { return a; }
0: 55 push ebp
1: 8b ec mov ebp,esp
3: 83 ec 04 sub esp,0x4
6: c9 leave
7: c3 ret
lol it *still* wastes time, setting up a stack frame for
nothing. But we could just as well write asm { naked; ret; }
and it would work as expected: the argument is passed in EAX
and the return value is expected in EAX. The function doesn't
actually have to do anything.
struct A {
    int a;
    //int foo() { return a; }
    static A get() { return A(10); }
}

int foo(A a) { return a.a; }

printf("%d", a.foo());

Now it's passed by value.

Though, I have needed checked arithmetic only twice: for a cast from
long to int and for a cast from double to long. If you expect your
number type to overflow, you probably chose the wrong type.
Adam D. Ruppe
2014-03-07 14:13:53 UTC
Permalink
Post by Kagamin
Now it's passed by value.
That won't work for operator overloading though (which is the
really interesting case here).
Post by Kagamin
Though, I needed checked arithmetic only twice: for cast from
long to int and for cast from double to long. If you expect
your number type to overflow, you probably chose wrong type.
I very rarely need it too, but it is nice to have in a convenient
package that is fairly efficient at the same time.
Kagamin
2014-03-07 14:44:41 UTC
Permalink
Post by Adam D. Ruppe
Post by Kagamin
Now it's passed by value.
That won't work for operator overloading though (which is the
really interesting case here).
Alternatively, for small methods you can rely on inlining, which
takes care of the argument dereference. If the method is big, the
cost of the reference is probably unimportant.
Adam D. Ruppe
2014-03-07 15:24:47 UTC
Permalink
Post by Kagamin
Alternatively for small methods you can rely on inlining, which
dereferences the argument.
Yeah, that's usually the way to go; inlining can also avoid
pushing other arguments to the stack on 32 bit, which is a big win
too. But you can't inline an asm function, and checking the overflow
flag needs asm (or a compiler intrinsic).

For the library typedef case, this also means wrapping any
function that returns a struct, which is annoying if nothing
else.
Walter Bright
2014-03-07 19:19:16 UTC
Permalink
Post by Adam D. Ruppe
But you can't inline an asm function,
I intend to fix that for dmd, but haven't had the time.
Post by Adam D. Ruppe
and checking the overflow flag needs asm. (or a compiler intrinsic.)
For that, I was thinking of having the compiler recognize one of the common
coding patterns for detecting overflow, and then generating efficient overflow
checks. Then documenting the pattern as being specially detected.

This means the code will still work correctly with compilers that don't detect
the pattern, and no language changes would be required.
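
(One common coding pattern of the kind meant here, for unsigned addition: the
wrap-around compare is exactly what a back end can turn into a jump-on-carry.
The function name is illustrative.)

uint addDetectOverflow(uint a, uint b)
{
    immutable uint sum = a + b;
    if (sum < a)                      // unsigned wrap-around: overflow happened
        assert(0, "unsigned overflow");
    return sum;
}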
Walter Bright
2014-03-07 02:59:36 UTC
Permalink
Post by bearophile
Post by Walter Bright
Is there any hope of fixing this?
I don't think we can change that in D2. You can change it in D3.
You use ranges a lot. Would it break any of your code?
bearophile
2014-03-07 03:22:12 UTC
Permalink
Post by Walter Bright
You use ranges a lot. Would it break any of your code?
I need to try the changes to be sure. But the magnitude of this
change is so large that I guess some code will surely break.

One advantage of your change is that this code will work:

auto s = "hello".dup;
s.sort();

Bye,
bearophile
Walter Bright
2014-03-07 03:55:53 UTC
Permalink
Post by bearophile
auto s = "hello".dup;
s.sort();
Yes, I hadn't thought of that.

The auto-decoding front() introduces all kinds of asymmetry in how ranges work,
and asymmetry is bad as it negatively impacts composability.
Andrei Alexandrescu
2014-03-07 19:59:54 UTC
Permalink
Post by Walter Bright
Post by bearophile
auto s = "hello".dup;
s.sort();
Yes, I hadn't thought of that.
The auto-decoding front() introduces all kinds of asymmetry in how
ranges work, and asymmetry is bad as it negatively impacts composability.
There's no asymmetry, and decoding helps composability as I demonstrated.

Andrei
Walter Bright
2014-03-08 00:18:18 UTC
Permalink
Post by Andrei Alexandrescu
Post by Walter Bright
Post by bearophile
auto s = "hello".dup;
s.sort();
Yes, I hadn't thought of that.
The auto-decoding front() introduces all kinds of asymmetry in how
ranges work, and asymmetry is bad as it negatively impacts composability.
There's no asymmetry, and decoding helps composability as I demonstrated.
Here's one asymmetry:
-----------------------------
alias int T; // compiles
//alias char T; // fails to compile

struct Input(T) { T front(); bool empty(); void popFront(); }
struct Output(T) { void put(T); }

import std.array;

void copy(F,T)(F f, T t) {
    while (!f.empty) {
        t.put(f.front);
        f.popFront();
    }
}

void main() {
    immutable(T)[] from;
    Output!T to;
    from.copy(to);
}
-------------------------------
Dmitry Olshansky
2014-03-07 09:56:28 UTC
Permalink
Post by Walter Bright
You use ranges a lot. Would it break any of your code?
I need to try the changes to be sure. But the magnitude of this change
is so large that I guess some code will surely break.
auto s = "hello".dup;
s.sort();
Which it shouldn't unless there is an ascii type or some such.
--
Dmitry Olshansky
Andrei Alexandrescu
2014-03-07 19:03:04 UTC
Permalink
Post by Dmitry Olshansky
Post by Walter Bright
You use ranges a lot. Would it break any of your code?
I need to try the changes to be sure. But the magnitude of this change
is so large that I guess some code will surely break.
auto s = "hello".dup;
s.sort();
Which it shouldn't unless there is an ascii type or some such.
Correct. This is a win, not a failure, of the current approach. To sort
the bytes in "hello" write:

s.representation.sort();

which is indicative to the human and technically correct.


Andrei
H. S. Teoh
2014-03-07 03:31:17 UTC
Permalink
Post by Walter Bright
Post by bearophile
Post by Walter Bright
Is there any hope of fixing this?
I don't think we can change that in D2. You can change it in D3.
You use ranges a lot. Would it break any of your code?
Whoa. You're not serious about changing this now, are you? Because even
though I would support such a change, you have to realize the magnitude
of code breakage that will happen. A lot of code that iterates over
narrow strings will break, and worse yet, they will break *silently*.
Calling count() on a narrow string will not return the expected value,
for example. And existing code that iterates over narrow strings
expecting dchars to come out of it will suddenly silently convert to
char, and may pass by unnoticed until somebody runs the program with a
multibyte character in the input.

This is a very high-risk change IMO.

You're welcome to create a (temporary) Phobos fork that reverts narrow
string auto-decoding, of course, and people can try it out to see how
much actual breakage is happening. If you really want to push for this,
that might be the safest way to test the waters before committing to
such a major change. Silent breakage is not easy to test for,
unfortunately. :(


T
--
Truth, Sir, is a cow which will give [skeptics] no more milk, and so
they are gone to milk the bull. -- Sam. Johnson
Walter Bright
2014-03-07 03:57:49 UTC
Permalink
Post by H. S. Teoh
Whoa. You're not serious about changing this now, are you? Because even
though I would support such a change, you have to realize the magnitude
of code breakage that will happen. A lot of code that iterates over
narrow strings will break, and worse yet, they will break *silently*.
Calling count() on a narrow string will not return the expected value,
for example. And existing code that iterates over narrow strings
expecting dchars to come out of it will suddenly silently convert to
char, and may pass by unnoticed until somebody runs the program with a
multibyte character in the input.
I understand this all too well. (Note that we currently have a different silent
problem: unnoticed large performance problems.)
Post by H. S. Teoh
This is a very high-risk change IMO.
You're welcome to create a (temporary) Phobos fork that reverts narrow
string auto-decoding, of course, and people can try it out to see how
much actual breakage is happening. If you really want to push for this,
that might be the safest way to test the waters before committing to
such a major change. Silent breakage is not easy to test for,
unfortunately. :(
I posted a plan in another message in this thread. It'll be a long process, but
I think it's doable.
bearophile
2014-03-07 03:59:55 UTC
Permalink
Post by Walter Bright
I understand this all too well. (Note that we currently have a
different silent problem: unnoticed large performance problems.)
On the other hand your change could introduce Unicode-related
bugs in future code (that the current Phobos avoids) (and here I
am not talking about code breakage).

Bye,
bearophile
Walter Bright
2014-03-07 04:17:34 UTC
Permalink
Post by Walter Bright
I understand this all too well. (Note that we currently have a different
silent problem: unnoticed large performance problems.)
On the other hand your change could introduce Unicode-related bugs in future
code (that the current Phobos avoids) (and here I am not talking about code
breakage).
This comes up repeatedly as justification for D trying to hide the UTF-8 nature
of strings that I discussed upthread.

To my mind it's like trying to pretend that floating point doesn't have roundoff
issues, integers have infinite range, memory is infinite, etc. That has a place
in other languages, but not in a systems/native language.
Shammah Chancellor
2014-03-07 12:09:27 UTC
Permalink
Post by Walter Bright
Post by Walter Bright
I understand this all too well. (Note that we currently have a different
silent problem: unnoticed large performance problems.)
On the other hand your change could introduce Unicode-related bugs in future
code (that the current Phobos avoids) (and here I am not talking about code
breakage).
This comes up repeatedly as justification for D trying to hide the
UTF-8 nature of strings that I discussed upthread.
To my mind it's like trying to pretend that floating point doesn't have
roundoff issues, integers have infinite range, memory is infinite, etc.
That has a place in other languages, but not in a systems/native
language.
Is it possible to add a warning notice when .front() is used on char
arrays? I would say fix it now, add a warning, and then remove the
warning later.

-S.
Michel Fortin
2014-03-07 13:40:31 UTC
Permalink
Post by Walter Bright
I understand this all too well. (Note that we currently have a
different silent problem: unnoticed large performance problems.)
On the other hand your change could introduce Unicode-related bugs in
future code (that the current Phobos avoids) (and here I am not talking
about code breakage).
The way Phobos works isn't any more correct than dealing with code
units. Many graphemes span multiple code points -- because of
combining diacritics or character variant modifiers -- and decoding at
the code-point level is thus often insufficient for correctness.

The problem with Unicode strings is that the representation you must
work with depends on the things you want to do. If you want to count
the characters then you need graphemes; if you want to parse XML then
you'll need to work with code points (in theory, in practice you might
still want direct access to code units for performance reasons); and if
you want to slice or copy a string then you need to deal with code
units. Because of this multiple-representation-for-different-purpose
thing, generic algorithms for arrays don't map very well to string.
From my experience, I'd suggest these basic operations for a "string
range" instead of the regular range interface:

.empty
.frontCodeUnit
.frontCodePoint
.frontGrapheme
.popFrontCodeUnit
.popFrontCodePoint
.popFrontGrapheme
.codeUnitLength (aka length)
.codePointLength (for dchar[] only)
.codePointLengthLinear
.graphemeLengthLinear

Someone should be able to mix all the three 'front' and 'pop' function
variants above in any code dealing with a string type. In my XML parser
for instance I regularly use frontCodeUnit to avoid the decoding
penalty when matching the next character with an ASCII one such as '<'
or '&'. An API like the one above forces you to be aware of the level
you're working on, making bugs and inefficiencies stand out (as long as
you're familiar with each representation).

If someone wants to use a generic array/range algorithm with a string,
my opinion is that he should have to wrap it in a range type that maps
front and popFront to one of the above variant. Having to do that
should make it obvious that there's an inefficiency there, as you're
using an algorithm that wasn't tailored to work with strings and that
more decoding than strictly necessary is being done.
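
(A tiny illustration of the code-unit-level matching described above, written
against a plain string rather than the proposed interface; "skipChar" is a
hypothetical helper.)

bool skipChar(ref string s, char c)
{
    // compare the raw code unit; ASCII delimiters such as '<' or '&' need no decoding
    if (s.length && s[0] == c)
    {
        s = s[1 .. $];
        return true;
    }
    return false;
}

// e.g. in a scanner: if (skipChar(input, '<')) { /* parse a tag */ }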
--
Michel Fortin
michel.fortin at michelf.ca
http://michelf.ca
Vladimir Panteleev
2014-03-07 13:51:48 UTC
Permalink
if you want to parse XML then you'll need to work with code
points
Why is this?
Kagamin
2014-03-07 14:47:26 UTC
Permalink
if you want to parse XML then you'll need to work with code
points (in theory, in practice you might still want direct
access to code units for performance reasons)
AFAIK, XML control characters are all ASCII, and what's between
them you can slice or dup without consideration, so code units
should be more than enough.
Michel Fortin
2014-03-07 19:13:13 UTC
Permalink
if you want to parse XML then you'll need to work with code points (in
theory, in practice you might still want direct access to code units
for performance reasons)
AFAIK, xml control characters are all ascii, and what's between them
you can slice or dup without consideration, so code units should be
more than enough.
If you don't fully check for well-formedness (as XML parsers ought to do
according to the XML spec) then sure, you can limit yourself to ASCII.
You'll let through illegal characters in element and attribute names
though.
--
Michel Fortin
michel.fortin at michelf.ca
http://michelf.ca
Peter Alexander
2014-03-07 12:03:41 UTC
Permalink
Post by H. S. Teoh
Post by Walter Bright
Post by bearophile
Post by Walter Bright
Is there any hope of fixing this?
I don't think we can change that in D2. You can change it in
D3.
You use ranges a lot. Would it break any of your code?
This is a very high-risk change IMO.
+1

This will be the most disruptive change in D's history...
Vladimir Panteleev
2014-03-07 12:32:19 UTC
Permalink
Post by H. S. Teoh
Calling count() on a narrow string will not return the expected
value, for example.
I would argue that, unless it's been made clear that the program
is expected to work only for certain languages, code that relied
on this was wrong in the first place.
Walter Bright
2014-03-07 03:06:43 UTC
Permalink
Post by Walter Bright
Is there any hope of fixing this?
Is there any way we can provide an upgrade path for this? Silent breakage is
terrible. Any ideas?
Walter Bright
2014-03-07 03:52:44 UTC
Permalink
Post by Walter Bright
Post by Walter Bright
Is there any hope of fixing this?
Is there any way we can provide an upgrade path for this? Silent breakage is
terrible. Any ideas?
Ok, I have a plan. Each step will be separated by at least one version:

1. implement decode() as an algorithm for string types, so one can write:

string s;
s.decode.algorithm...

suggest that people start doing that instead of:

s.algorithm...

2. Emit warning when people use std.array.front(s) with strings.

3. Deprecate std.array.front for strings.

4. Error for std.array.front for strings.

5. Implement new std.array.front for strings that doesn't decode.
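
(A sketch of what step 1 could look like at a call site; "decode" here is the
proposed algorithm, not an existing Phobos symbol.)

void example()
{
    import std.algorithm : count;
    string s = "résumé";
    auto n = s.decode.count('é'); // decoding is explicit, chosen by the user
    // versus today's implicit: s.count('é')
}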
Dmitry Olshansky
2014-03-07 10:11:27 UTC
Permalink
Post by Walter Bright
Post by Walter Bright
Post by Walter Bright
Is there any hope of fixing this?
Is there any way we can provide an upgrade path for this? Silent breakage is
terrible. Any ideas?
string s;
s.decode.algorithm...
s.algorithm...
This would also be a great fit in cases where 'decode' is decoding some
other encoding.
Post by Walter Bright
2. Emit warning when people use std.array.front(s) with strings.
3. Deprecate std.array.front for strings.
4. Error for std.array.front for strings.
This sounds fine to me. I would even prefer to only offer explicit wrappers:
.raw - ubyte/ushort for UTF-8/UTF-16 etc.
.decode - dchars
as Nick suggests.

Then there is also the horrible ElementEncodingType vs ElementType.
I would love to see ElementEncodingType die.
Post by Walter Bright
5. Implement new std.array.front for strings that doesn't decode.
It would make it easy again to think that strings are arrays of characters.
That illusion was broken (and a good thing it was); there is no point in
reestablishing it to save a couple of keystrokes for those "who really
know what they are doing".
--
Dmitry Olshansky
Walter Bright
2014-03-07 10:33:23 UTC
Permalink
Post by Dmitry Olshansky
Then there is also the horrible ElementEncodingType vs ElementType.
I would love to see ElementEncodingType die.
I agree. ElementEncodingType is a giant red flag saying we screwed things up.
Vladimir Panteleev
2014-03-07 17:24:59 UTC
Permalink
Post by Walter Bright
Ok, I have a plan. Each step will be separated by at least one
1. implement decode() as an algorithm for string types, so one
string s;
s.decode.algorithm...
s.algorithm...
I think .decode should be something more explicit (byCodePoint
OSLT), just so it's clear that it's not magical and does not
solve all problems.
Post by Walter Bright
2. Emit warning when people use std.array.front(s) with strings.
3. Deprecate std.array.front for strings.
4. Error for std.array.front for strings.
5. Implement new std.array.front for strings that doesn't
decode.
Until then, how will people use strings with algorithms when they
mean to use them per-byte? A .raw property which casts to ubyte[]?
H. S. Teoh
2014-03-07 17:34:05 UTC
Permalink
Post by Vladimir Panteleev
Post by Walter Bright
string s;
s.decode.algorithm...
s.algorithm...
I think .decode should be something more explicit (byCodePoint
OSLT), just so it's clear that it's not magical and does not solve
all problems.
+1. I think "byCodePoint" is far more self-documenting and less
misleading than "decode".

string s;
s.byCodePoint.algorithm...

I'm already starting to like it.


T
--
It always amuses me that Windows has a Safe Mode during bootup. Does
that mean that Windows is normally unsafe?
Andrei Alexandrescu
2014-03-07 19:11:49 UTC
Permalink
Post by Walter Bright
5. Implement new std.array.front for strings that doesn't decode.
Until then, how will people use strings with algorithms when they mean
to use them per-byte? A .raw property which casts to ubyte[]?
There's no "until then".

A current ".representation" property already exists that casts all
string types appropriately.
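
(For example, per std.string.representation:)

unittest
{
    import std.string : representation;
    immutable(ubyte)[] b = "hello".representation; // string -> immutable(ubyte)[]
    // wstring -> immutable(ushort)[], dstring -> immutable(uint)[]
}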

Andrei
Dmitry Olshansky
2014-03-07 19:28:09 UTC
Permalink
Post by Andrei Alexandrescu
Post by Walter Bright
5. Implement new std.array.front for strings that doesn't decode.
Until then, how will people use strings with algorithms when they mean
to use them per-byte? A .raw property which casts to ubyte[]?
There's no "until then".
A current ".representation" property already exists that casts all
string types appropriately.
There is, however, a big glaring failure: std.algorithm is specialized for
char[] and wchar[], but not for any RandomAccessRange!char or
RandomAccessRange!wchar.

So if I for instance get a custom slice type (e.g. a ring buffer), then
I'm out of luck w/o both "auto-magic dchar range" and special code in
std.algo that works with chars as code units.

If there is a way to exploit the duality of a random-access range of code
units also being a bidirectional range of code points, we have certainly
failed to make it work (first of all by doing a horrible job at generic-ness,
as mentioned).
--
Dmitry Olshansky
Andrei Alexandrescu
2014-03-07 21:53:09 UTC
Permalink
Post by Dmitry Olshansky
Post by Andrei Alexandrescu
Post by Walter Bright
5. Implement new std.array.front for strings that doesn't decode.
Until then, how will people use strings with algorithms when they mean
to use them per-byte? A .raw property which casts to ubyte[]?
There's no "until then".
A current ".representation" property already exists that casts all
string types appropriately.
There is however a big glaring failure: std.algorithm specialized for
char[], wchar[] but not for any RandomAccessRange!char or
RandomAccessRange!wchar.
I agree that's an issue. Back in the day when this was a choice I
decided to consider only char[] and friends "UTF strings". There was
room for more generality but I didn't know of any use cases that would
ask for them. It's possible I was wrong, but the option to generalize is
still open today.

Andrei
Walter Bright
2014-03-07 19:32:11 UTC
Permalink
I think .decode should be something more explicit (byCodePoint OSLT), just so
it's clear that it's not magical and does not solve all problems.
Good point. Perhaps "decodeUTF". "decode" is too generic.
Until then, how will people use strings with algorithms when they mean to use
them per-byte?
The way they do it now, i.e. they can't. That's the whole problem.
Dmitry Olshansky
2014-03-07 19:52:13 UTC
Permalink
Post by H. S. Teoh
Post by Vladimir Panteleev
Post by Walter Bright
string s;
s.decode.algorithm...
s.algorithm...
I think .decode should be something more explicit (byCodePoint
OSLT), just so it's clear that it's not magical and does not solve
all problems.
+1. I think "byCodePoint" is far more self-documenting and less
misleading than "decode".
string s;
s.byCodePoint.algorithm...
I'm already starting to like it.
And there is precedent, see std.uni.byCodepoint ;)
--
Dmitry Olshansky
Andrei Alexandrescu
2014-03-07 19:59:23 UTC
Permalink
Post by Walter Bright
Post by Walter Bright
Post by Walter Bright
Is there any hope of fixing this?
Is there any way we can provide an upgrade path for this? Silent breakage is
terrible. Any ideas?
string s;
s.decode.algorithm...
s.algorithm...
2. Emit warning when people use std.array.front(s) with strings.
3. Deprecate std.array.front for strings.
4. Error for std.array.front for strings.
5. Implement new std.array.front for strings that doesn't decode.
This would kill D. I am not exaggerating.

Andrei
Nick Sabalausky
2014-03-07 04:11:06 UTC
Permalink
Post by Walter Bright
@property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
{
    assert(a.length, "Attempting to fetch the front of an empty array of " ~
        T.stringof);
    size_t i = 0;
    return decode(a, i);
}
We rip out that front() entirely. The result is *not* technically a
range...yet! We could call it a protorange.

Then we provide two functions:

auto decode(someStringProtoRange) {...}
auto raw(someStringProtoRange) {...}

These convert the protoranges into actual ranges by adding the missing
front() function. The 'decode' adds a front() which decodes into dchar,
while the 'raw' adds a front() which simply returns the raw underlying type.

I imagine the decode/raw would probably also handle any "length"
property (if it exists in the protorange) accordingly.

This way, the user is forced to specify "myStringRange.decode" or
"myStringRange.raw" as appropriate, otherwise myStringRange can't be
used since it isn't technically a range, only a protorange.

(Naturally, ranges of dchar would always have front, since no decoding
is ever needed for them anyway. For these ranges, the decode/raw funcs
above would simply be no-ops.)
Nick Sabalausky
2014-03-07 04:44:30 UTC
Permalink
Post by Nick Sabalausky
Post by Walter Bright
@property dchar front(T)(T[] a) @safe pure if (isNarrowString!(T[]))
{
    assert(a.length, "Attempting to fetch the front of an empty array of " ~
        T.stringof);
    size_t i = 0;
    return decode(a, i);
}
We rip out that front() entirely. The result is *not* technically a
range...yet! We could call it a protorange.
auto decode(someStringProtoRange) {...}
auto raw(someStringProtoRange) {...}
These convert the protoranges into actual ranges by adding the missing
front() function. The 'decode' adds a front() which decodes into dchar,
while the 'raw' adds a front() which simply returns the raw underlying type.
I imagine the decode/raw would probably also handle any "length"
property (if it exists in the protorange) accordingly.
This way, the user is forced to specify "myStringRange.decode" or
"myStringRange.raw" as appropriate, otherwise myStringRange can't be
used since it isn't technically a range, only a protorange.
(Naturally, ranges of dchar would always have front, since no decoding
is ever needed for them anyway. For these ranges, the decode/raw funcs
above would simply be no-ops.)
Of course, I just realized that these front()s can't be added unless
there's already a front to be called in the first place...

So instead of ripping out the current front() functions entirely, we
replace "front" with some sort of "rawFront" which the raw/decode
versions of front() can query in order to provide actual
decoding/non-decoding ranges.
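
(A rough sketch of how that 'raw' could then build a real range on top of the
protorange; "rawFront" is the hypothetical primitive from the post.)

auto raw(R)(R proto)
{
    static struct Raw
    {
        R src;
        @property bool empty() { return src.empty; }
        @property auto front() { return src.rawFront; } // raw code unit, no decoding
        void popFront() { src.popFront(); }
    }
    return Raw(proto);
}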
Dmitry Olshansky
2014-03-07 10:27:57 UTC
Permalink
Post by Walter Bright
In "Lots of low hanging fruit in Phobos" the issue came up about the
automatic encoding and decoding of char ranges.
Throughout D's history, there are regular and repeated proposals to
redesign D's view of char[] to pretend it is not UTF-8, but UTF-32. I.e.
so D will automatically generate code to decode and encode on every
attempt to index char[].
...
Post by Walter Bright
Is there any hope of fixing this?
Where have you been when it was introduced? :)
--
Dmitry Olshansky
Walter Bright
2014-03-07 10:41:18 UTC
Permalink
Post by Dmitry Olshansky
Where have you been when it was introduced? :)
It slipped by me. What can I say? I'm not the only committer :-)

But after spending non-trivial time suffering as auto-decode blasted my kingdom,
I've concluded that it needs to die. Working around it is not easy.

I know that auto-decode has negatively impacted your regex, too. Basically,
auto-decode is like booking a flight from Seattle to San Francisco with a plane
change in Atlanta.
Steven Schveighoffer
2014-03-07 13:32:17 UTC
Permalink
On Fri, 07 Mar 2014 05:41:18 -0500, Walter Bright
Post by Walter Bright
Post by Dmitry Olshansky
Where have you been when it was introduced? :)
It slipped by me. What can I say? I'm not the only committer :-)
No, this is intrinsic in the problem of treating strings as ranges of
dchar. This one function is a symptom, not the problem.

-Steve
Dmitry Olshansky
2014-03-07 19:43:51 UTC
Permalink
Post by Walter Bright
Post by Dmitry Olshansky
Where have you been when it was introduced? :)
It slipped by me. What can I say? I'm not the only committer :-)
But after spending non-trivial time suffering as auto-decode blasted my
kingdom, I've concluded that it needs to die.
Working around it is not
easy.
That seems to be the biggest problem: it's an overriding default that is
very hard to "turn off" while retaining a nice and clear generic view of stuff.
Post by Walter Bright
I know that auto-decode has negatively impacted your regex, too.
No, technically, I knew what I was doing and that decode call was
explicit. It just turned out to set a bar on the minimum time budget to
do X with a string, and that bar is too high.

What really got nasty is multiple re-decoding of the same piece as
engine backtracks to try earlier alternatives.
--
Dmitry Olshansky
Walter Bright
2014-03-07 19:51:50 UTC
Permalink
No, technically, I knew what I was doing and that decode call was explicit.
Ah right, I misremembered. Thanks for the correction.
Vladimir Panteleev
2014-03-07 11:56:56 UTC
Permalink
Post by Walter Bright
In "Lots of low hanging fruit in Phobos" the issue came up
about the automatic encoding and decoding of char ranges.
Throughout D's history, there are regular and repeated
proposals to redesign D's view of char[] to pretend it is not
UTF-8, but UTF-32. I.e. so D will automatically generate code
to decode and encode on every attempt to index char[].
I'm glad I'm not the only one who feels this way. Implicit
decoding must die.

I strongly believe that implicit decoding of character points in
std.range has been a mistake.

- Algorithms such as "countUntil" will count code points. These
numbers are useless for slicing, and can introduce hard-to-find
bugs.

- In lots of places, I've discovered that Phobos did UTF decoding
(thus murdering performance) when it didn't need to. Such cases
included format (now fixed), appender (now fixed), startsWith
(now fixed - recently), skipOver (still unfixed). These have
caused latent bugs in my programs that happened to be fed non-UTF
data. There's no reason for why D should fail on non-UTF data if
it has no reason to decode it in the first place! These failures
have only served to identify places in Phobos where redundant
decoding was occurring.

Furthermore, it doesn't actually solve anything completely! The
only thing it solves is a subset of cases for a subset of
languages!

People want to look at a string "character by character". If a
Unicode code point is a character in your language and alphabet,
I'm really happy for you, but that's not how it is for everyone.
Combining marks, complex scripts etc. make this point just a
fallacy that in the end will cause programmers to make mistakes
that will affect certain users somewhere.
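
(A concrete instance of the combining-mark problem: "é" can be one code point
or two, so code points still are not "characters".)

unittest
{
    string a = "\u00E9";  // 'é' as a single precomposed code point
    string b = "e\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
    assert(a != b);       // different code units and code points,
                          // yet both render as the same grapheme
}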

Why do people want to look at individual characters? There are a
lot of misconceptions about Unicode, and I think some of that
applies here.

- Do you want to split a string by whitespace? Some languages
have no notion of whitespace. What do you need it for? Line
wrapping? Employ the Unicode line-breaking algorithm instead.

- Do you want to uppercase the first letter of a string? Some
language have no notion of letter case, and some use it for
different reasons. Furthermore, even languages with a Latin-based
alphabet may not have 1:1 mapping for case, e.g. the German ß
letter.

- Do you want to count how wide a string will be in a fixed-point
font? Wrong... Combining and control characters, zero-width
whitespace, etc. will render this approach futile.

- Do you want to split or flush a stream to a character device at
a point so that there's no garbage? I believe, this is the case
in TDPL's mention of the subject. Again, combining characters or
complex scripts will still be broken by this approach.

You need to either go all-out and provide complete
implementations of the relevant Unicode algorithms to perform
tasks such as the above that will work in all locales, or you
need to draw a line somewhere for which languages, alphabets, and
locales you want to support in your program. D's line is drawn
at the point where it considers that code points == characters;
however, this outcome is made clear nowhere in its documentation,
and for such an arbitrary decision (from a cultural point of
view) it is embedded too deeply into the language itself. With
std.ascii, at least, it's clear to the user that the functions
there will only work with English or languages using the same
alphabet.

This doesn't apply universally. There are still cases like, e.g.,
regular expression ranges. [a-z] makes sense in English, and
[а-я] makes sense in Russian, but I don't think that makes sense
for all languages. However, for the most part, I think implicit
decoding must be axed, and instead we need implementations of
Unicode algorithms and the documentation to instruct users why
and how to use them.
Andrej Mitrovic
2014-03-07 12:07:54 UTC
Permalink
Post by Vladimir Panteleev
- Do you want to split a string by whitespace?
- Do you want to uppercase the first letter of a string?
- Do you want to count how wide a string will be in a fixed-point
font?
- Do you want to split or flush a stream to a character device at
a point so that there's no garbage?
We could later make a page on dlang (or the wiki) describing how to do
these common things.
Robert Schadek
2014-03-07 13:11:26 UTC
Permalink
I'm glad I'm not the only one who feels this way. Implicit decoding
must die.
I strongly believe that implicit decoding of character points in
std.range has been a mistake.
- Algorithms such as "countUntil" will count code points. These
numbers are useless for slicing, and can introduce hard-to-find bugs.
+1 see my pull requests for std.string:
https://github.com/D-Programming-Language/phobos/pull/1952
https://github.com/D-Programming-Language/phobos/pull/1977
Steven Schveighoffer
2014-03-07 13:40:33 UTC
Permalink
On Thu, 06 Mar 2014 21:37:13 -0500, Walter Bright
Post by Walter Bright
Is there any hope of fixing this?
Yes, make d strings not char arrays, but a library-defined struct with an
array as backing.

auto x = "..."; compiles to => auto x =
string(cast(immutable(char)[])"...");

Then define string to be whatever kind of range you want in the library,
with whatever functionality you want.

Then if you want by-char traversal, explicitly use immutable(char)[] as
x's type. And in the string range's members, we can provide whatever
access we want.

Note, this also fixes foreach, and many other problems we have. Most
likely code that works today will continue to work, since it's much more
of a bear to type immutable(char)[] instead of string :)
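
(A bare-bones sketch of such a library-defined string; the names and the set
of views are illustrative only.)

struct String
{
    immutable(char)[] data; // backing array of code units

    // explicit views instead of a single implicit element type:
    @property auto byCodeUnit() { return data; }
    // byCodePoint, byGrapheme, etc. would be further library-defined ranges
}

// auto x = "..."; could then lower to:
// auto x = String(cast(immutable(char)[]) "...");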

-Steve
Dicebot
2014-03-07 15:03:23 UTC
Permalink
I don't like it at all.

1) It is a huge breakage, and you have been refusing to do one
even for more important problems. What is with this sudden
change of mind?

2) It is regression back to C++ days of
no-one-cares-about-Unicode pain. Thinking about strings as
character arrays is so natural and convenient that if
language/Phobos won't punish you for that, it will be extremely
widespread.

Rendering correctness is very application-specific but providing
basic guarantees that string is not completely broken is useful.

Now real problems I see:

1) stuff like readText() returns char[] instead of requiring
explicit default encoding

2) lack of convenient .raw property which will effectively do
cast(ubyte[])

3) the fact that std.string always assumes unicode and never
forwards to std.ascii for
http://dlang.org/phobos/std_encoding.html#.AsciiString / ubyte[]
Vladimir Panteleev
2014-03-07 16:18:04 UTC
Permalink
Post by Dicebot
I don't like it at all.
1) It is a huge breakage
Can we look at some example situations that this will break?
Post by Dicebot
and you have been refusing to do one even for more important
problems.
This is a fallacy.
Post by Dicebot
2) It is regression back to C++ days of
no-one-cares-about-Unicode pain. Thinking about strings as
character arrays is so natural and convenient that if
language/Phobos won't punish you for that, it will be extremely
widespread.
Thinking about dstrings as character arrays is less flawed only
to a certain extent.
Post by Dicebot
1) stuff like readText() returns char[] instead of requiring
explicit default encoding
2) lack of convenient .raw property which will effectively do
cast(ubyte[])
3) the fact that std.string always assumes unicode and never
forwards to std.ascii for
http://dlang.org/phobos/std_encoding.html#.AsciiString / ubyte[]
I think these are fixable without breaking anything? So why not
go for it? The first two sound trivial (.raw can be an UFCS
property).
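
(The UFCS property mentioned could indeed be this small; a sketch:)

@property inout(ubyte)[] raw(inout(char)[] s) @trusted pure nothrow
{
    return cast(inout(ubyte)[]) s; // reinterpret code units as bytes; no copy
}

// usage: immutable(ubyte)[] bytes = "abc".raw;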
Dicebot
2014-03-07 16:43:29 UTC
Permalink
Post by Vladimir Panteleev
Post by Dicebot
I don't like it at all.
1) It is a huge breakage
Can we look at some example situations that this will break?
Any code that relies on countUntil to count dchars? Or, to
generalize, almost any code that uses std.algorithm functions
with string?
Post by Vladimir Panteleev
Post by Dicebot
and you have been refusing to do one even for more important
problems.
This is a fallacy.
Ok :)
Post by Vladimir Panteleev
Post by Dicebot
2) It is regression back to C++ days of
no-one-cares-about-Unicode pain. Thinking about strings as
character arrays is so natural and convenient that if
language/Phobos won't punish you for that, it will be
extremely widespread.
Thinking about dstrings as character arrays is less flawed only
to a certain extent.
Sure. But I find this extent practical enough to make the
difference. It is a good compromise between perfectly correct (and
very slow) string processing and having your program unusable
with anything but the basic Latin symbol set.
Post by Vladimir Panteleev
Post by Dicebot
1) stuff like readText() returns char[] instead of requiring
explicit default encoding
2) lack of convenient .raw property which will effectively do
cast(ubyte[])
3) the fact that std.string always assumes unicode and never
forwards to std.ascii for
http://dlang.org/phobos/std_encoding.html#.AsciiString /
ubyte[]
I think these are fixable without breaking anything? So why not
go for it? The first two sound trivial (.raw can be an UFCS
property).
(1) will likely require deprecation (== breakage) of the old
interface, but yes, those are relatively trivial. It just has
not been important enough to me to spend time on pushing it.
Still struggling to finish my template argument list proposal :(
Vladimir Panteleev
2014-03-07 17:04:29 UTC
Permalink
On Friday, 7 March 2014 at 16:18:06 UTC, Vladimir Panteleev
Post by Vladimir Panteleev
Can we look at some example situations that this will break?
Any code that relies on countUntil to count dchar's? Or, to
generalize, almost any code that uses std.algorithm functions
with string?
This is a pretty fragile design in the first place, since we use
the same basic type (integers) to count two different things
(code units / code points). Code that relies on this behavior
would need to be explicitly tested with Unicode data to be sure
that it works correctly - otherwise, it will only appear at a
glance that it works right if it's only tested with ASCII.

Correct code where these indices never left the equation will not
be affected, e.g.:

auto s = "日本語";
auto x = s.countUntil("本語"); // was 1, will be 3
s = s.drop(x);
assert(s == "本語"); // still OK
Post by Vladimir Panteleev
Thinking about dstrings as character arrays is less flawed
only to a certain extent.
Sure. But I find this extent practical enough to make the
difference. It is good compromise between perfectly correct
(and very slow) string processing and having your program
unusable with anything but basic latin symbol set.
I think that if we are to draw a line somewhere on what to
support and not, the decision should not be embedded so deeply into
the language. Ideally, it would be clearly visible in the code
that you are counting code points.
Dicebot
2014-03-07 17:08:02 UTC
Permalink
Post by Vladimir Panteleev
I think that if we are to draw a line somewhere on what to
support and not, the decision should not be embedded as deep
into the language. Ideally, it would be clearly visible in the
code that you are counting code points.
Well, if you consider really breaking changes, simply prohibiting
plain random access to char[] and forcing the use of either .raw or
.decode is one thing I'd love to see (with .byGrapheme as a library
cherry on top)
H. S. Teoh
2014-03-07 17:38:12 UTC
Permalink
Post by Dicebot
I think that if we are to draw a line somewhere on what to support
and not, the decision should not be embedded as deep into the
language. Ideally, it would be clearly visible in the code that
you are counting code points.
Well if you consider really breaking changes, simply prohibiting
plain random access to char[] and forcing to use either .raw or
.decode is one thing I'd love to see (with .byGrapheme as library
cherry on top)
I don't understand what advantage this would bring.


T
--
Frank disagreement binds closer than feigned agreement.
Dicebot
2014-03-07 18:03:19 UTC
Permalink
Post by H. S. Teoh
Post by Dicebot
Well if you consider really breaking changes, simply
prohibiting
plain random access to char[] and forcing to use either .raw or
.decode is one thing I'd love to see (with .byGrapheme as
library
cherry on top)
I don't understand what advantage this would bring.
Making sure that whatever interpretation the programmer chooses is
actually a conscious choice, and that he does not hold any false
illusions.
Walter Bright
2014-03-07 19:46:10 UTC
Permalink
Post by Vladimir Panteleev
Ideally, it would be
clearly visible in the code that you are counting code points.
Yes.
Walter Bright
2014-03-07 19:44:01 UTC
Permalink
1) It is a huge breakage, and you have been refusing to do one even for more
important problems. What is with this sudden change of mind?
1. Performance Performance Performance

2. The current behavior is surprising (it sure surprised me, I didn't notice it
until I looked at the assembler to figure out why the performance sucked)

3. Weirdnesses like ElementEncodingType

4. Strange behavior differences between char[], char*, and InputRange!char types

5. Funky anomalous issues with writing OutputRange!char (the put(T) must take a
dchar)
2) lack of convenient .raw property which will effectively do cast(ubyte[])
I've done the cast as a workaround, but when working with generic code it turns
out the ubyte type becomes viral - you have to use it everywhere. So you end up
with ubyte <=> char casts all over the place, in unexpected spots. You also
wind up with ugly ubyte <=> dchar casts, with the commensurate risk that
you goofed and have a truncation bug.

Essentially, the auto-decode makes trivial code look better, but if you're
writing a more comprehensive string processing program, and care about
performance, it makes a regular ugly mess of things.
Sean Kelly
2014-03-07 15:51:48 UTC
Permalink
I'm with Walter on this, and it's why I don't use char ranges.
Though converting to ubyte feels weird.
Chris
2014-03-07 18:03:18 UTC
Permalink
I only hope it won't break my code. It mainly deals with string /
character processing and our project in D is now almost ready for
take off (at least for a beta flight). It deals with characters
like "?", it is not dealing with English input. Hope the landing
will be soft!
Andrei Alexandrescu
2014-03-07 19:57:23 UTC
Permalink
Post by Walter Bright
In "Lots of low hanging fruit in Phobos" the issue came up about the
automatic encoding and decoding of char ranges.
[snip]
Post by Walter Bright
Is there any hope of fixing this?
There's nothing to fix.

Allow me to enumerate the functions of std.algorithm and how they work
today and how they'd work with the proposed change. Let s be a variable
of some string type.

1.

s.all!(x => x == 'é') currently works as expected. Proposed: fails silently.

2.

s.any!(x => x == 'é') currently works as expected. Proposed: fails silently.

3.

s.canFind!(x => x == 'é') currently works as expected. Proposed: fails
silently.

4.

s.canFind('é') currently works as expected. Proposed: fails silently.

5.

s.count() currently works as expected. Proposed: fails silently.

6.

s.count!((a, b) => std.uni.toLower(a) == std.uni.toLower(b))("é")
currently works as expected (with the known issues of lowercase
conversion). Proposed: fails silently.

7.

s.count('é') currently works as expected. Proposed: fails silently.

8.

s.countUntil("a") currently works as expected. Proposed: fails silently.
This applies to all variations of countUntil.

9.

s.endsWith('é') currently works as expected. Proposed: fails silently.

10.

s.find('é') currently works as expected. Proposed: fails silently. This
applies to other variations of find that include custom predicates.

11.

...

I went down std.algorithm in the order listed in its documentation and
found pernicious issues with almost every single algorithm.

I designed the range behavior of strings after much thinking and
consideration back in the day when I designed std.algorithm. It was
painfully obvious (but it seems to have been forgotten now that it's
working so well) that approaching strings as plain arrays of char would
break almost every single algorithm, leaving us essentially in the
pre-UTF C++aveman era.

Making strings bidirectional ranges has been a very good choice within
the constraints. There was already a string type, and that was
immutable(char)[], and a bunch of code depended on that definition.

Clearly one might argue that their app has no business dealing with
diacriticals or Asian characters. But that's the typical provincial view
that marred many languages' approach to UTF and internationalization. If
you know your string is ASCII, the remedy is simple - don't use char[]
and friends. From day 1, the type "char" was meant to mean "code unit of
UTF characters".

So please ponder the above before going to do surgery on the patient
that's going to kill him.


Andrei
H. S. Teoh
2014-03-07 20:26:00 UTC
Permalink
Post by Andrei Alexandrescu
Post by Walter Bright
In "Lots of low hanging fruit in Phobos" the issue came up about the
automatic encoding and decoding of char ranges.
[snip]
Post by Walter Bright
Is there any hope of fixing this?
There's nothing to fix.
:D I knew this was going to happen.
Post by Andrei Alexandrescu
Allow me to enumerate the functions of std.algorithm and how they
work today and how they'd work with the proposed change. Let s be a
variable of some string type.
1.
s.all!(x => x == 'é') currently works as expected. Proposed: fails silently.
2.
s.any!(x => x == 'é') currently works as expected. Proposed: fails silently.
3.
s.canFind!(x => x == 'é') currently works as expected. Proposed: fails silently.
4.
s.canFind('é') currently works as expected. Proposed: fails silently.
The problem is that the current implementation of this correct behaviour
leaves a lot to be desired in terms of performance. Ideally, you should
not need to decode every single character in s just to see if it happens
to contain é. Rather, canFind, et al should convert the dchar literal
'é' into a UTF-8 (resp. UTF-16) sequence and do a substring search
instead. Decoding every character in s, while correct, is also
needlessly inefficient.
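
A minimal sketch of that suggestion (hypothetical helper, not Phobos
API; it assumes only std.utf.encode and std.string.representation):

import std.algorithm : canFind;
import std.string : representation;
import std.utf : encode;

// Encode the needle once, then do a plain code-unit substring search
// instead of decoding every character of the haystack.
bool canFindNoDecode(string haystack, dchar needle)
{
    char[4] buf;
    immutable len = encode(buf, needle); // 1..4 UTF-8 code units
    return haystack.representation.canFind(buf[0 .. len].representation);
}

(This still has the normalization caveat raised below: a precomposed
'é' won't match the decomposed form.)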
Post by Andrei Alexandrescu
5.
s.count() currently works as expected. Proposed: fails silently.
Wrong. The current behaviour of s.count() does not work as expected; it
only gives an illusion that it does. Its return value is misleading when
combining diacritics and other such Unicode "niceness" are involved.
Arguably, such things should be prohibited altogether, and more
semantically transparent algorithms used, namely s.countCodePoints,
s.countGraphemes, etc..
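
A sketch of what those could look like (the names are hypothetical,
per the suggestion above; assumes std.uni.byGrapheme):

import std.range : walkLength;
import std.string : representation;
import std.uni : byGrapheme;

// Each name states exactly what is being counted.
size_t countCodeUnits(string s)  { return s.representation.length; }
size_t countCodePoints(string s) { return s.walkLength; }            // decodes
size_t countGraphemes(string s)  { return s.byGrapheme.walkLength; } // slowest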
Post by Andrei Alexandrescu
6.
s.count!((a, b) => std.uni.toLower(a) == std.uni.toLower(b))("é")
currently works as expected (with the known issues of lowercase
conversion). Proposed: fails silently.
Again, I don't like this. It sweeps the issues of comparing unicode
strings under the carpet and gives the programmer a false sense of code
correctness. Users instead should be encouraged to use proper Unicode
collation functions that are actually correct, instead of giving an
illusion of correctness.
Post by Andrei Alexandrescu
7.
s.count('é') currently works as expected. Proposed: fails silently.
This is a repetition of #5. :)
Post by Andrei Alexandrescu
8.
s.countUntil("a") currently works as expected. Proposed: fails
silently. This applies to all variations of countUntil.
Whether this is correct or not depends on what the intention is. If
you're looking to slice a string, this most definitely does NOT work as
expected. If you're looking to count graphemes, this doesn't work as
expected either. This only works if you just so happen to be counting
code points. The correct approach, IMO, is to help the user make a
conscious choice between these different semantics:

s.indexOf("a"); // for slicing
s.byCodepoint.countUntil("a"); // count code points
s.byGrapheme.countUntil("a"); // count graphemes

Things like s.countUntil("a") are misleading and lead to subtle Unicode
bugs.
Post by Andrei Alexandrescu
9.
s.endsWith('é') currently works as expected. Proposed: fails silently.
Arguable, because it imposes a performance hit by needless decoding.
Ideally, you should have 3 overloads:

bool endsWith(string s, char asciiChar);
bool endsWith(string s, wchar wideChar);
bool endsWith(string s, dchar codepoint);

In the wchar and dchar overloads you'd do substring search. There is no
need to decode.
Post by Andrei Alexandrescu
10.
s.find('é') currently works as expected. Proposed: fails silently.
This applies to other variations of find that include custom
predicates.
Not necessarily. Arguably we should be overloading on needle type to
eliminate needless decoding:

string find(string s, char c); // ubyte search
string find(string s, wchar c); // substring search with char[2]
string find(string s, dchar c); // substring search with char[4]

This makes sense to me because string is immutable(char)[], so from the
point of view of being an array, searching for wchar is not something
that is obvious (how do you search for a value of type T in an array of
elements of type U?), so explicit overloads for handling those cases
make sense.

Decoding every single character in s is a lot of needless work.
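
A sketch of the dchar overload under that scheme (hypothetical, not
Phobos code): encode the needle once, search code units, then recover
the slice of the original string.

import std.algorithm : find;
import std.string : representation;
import std.utf : encode;

string findNoDecode(string s, dchar c)
{
    char[4] buf;
    immutable len = encode(buf, c);  // UTF-8 encode the needle once
    auto tail = s.representation.find(buf[0 .. len].representation);
    return s[$ - tail.length .. $];  // empty slice if not found
}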


[...]
Post by Andrei Alexandrescu
I designed the range behavior of strings after much thinking and
consideration back in the day when I designed std.algorithm. It was
painfully obvious (but it seems to have been forgotten now that it's
working so well) that approaching strings as arrays of char[] would
break almost every single algorithm leaving us essentially in the
pre-UTF C++aveman era.
I agree, but it is also painfully obvious that the current
implementation is lackluster in terms of performance.
Post by Andrei Alexandrescu
Making strings bidirectional ranges has been a very good choice
within the constraints. There was already a string type, and that
was immutable(char)[], and a bunch of code depended on that
definition.
Clearly one might argue that their app has no business dealing with
diacriticals or Asian characters. But that's the typical provincial
view that marred many languages' approach to UTF and
internationalization. If you know your string is ASCII, the remedy
is simple - don't use char[] and friends. From day 1, the type
"char" was meant to mean "code unit of UTF characters".
Yes, but currently Phobos support for non-UTF strings is rather poor,
and requires many explicit casts to/from ubyte[].
Post by Andrei Alexandrescu
So please ponder the above before going to do surgery on the patient
that's going to kill him.
[...]

Yeah I was surprised Walter was actually seriously going to pursue this.
It's a change of a far vaster magnitude than many of the other DIPs and
other proposals that have been rejected because they were deemed to
cause too much breakage of existing code.


T
--
Having a smoking section in a restaurant is like having a peeing section
in a swimming pool. -- Edward Burr
Vladimir Panteleev
2014-03-07 20:43:44 UTC
Permalink
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu
Post by Andrei Alexandrescu
Allow me to enumerate the functions of std.algorithm and how
they work today and how they'd work with the proposed change.
Let s be a variable of some string type.
s.canFind('é') currently works as expected.
No, it doesn't.

import std.algorithm;

void main()
{
auto s = "cassé";
assert(s.canFind('é'));
}

That's the whole problem - all this hot steam and it still does
not work properly. Because it can't - not without pulling in all
of the Unicode algorithms implicitly, and that would be much
worse.
Post by Andrei Alexandrescu
I went down std.algorithm in the order listed in its
documentation and found pernicious issues with almost every
single algorithm.
All of your examples are variations of one and the same case:
searching for a non-ASCII dchar or dchar literal.

How often does this pattern occur in real programs? I think the
only real metric is to try the change and find out.
Post by Andrei Alexandrescu
Clearly one might argue that their app has no business dealing
with diacriticals or Asian characters. But that's the typical
provincial view that marred many languages' approach to UTF and
internationalization.
So is yours, if you think that making everything magically a
dchar is going to solve all problems.

The TDPL example only showcases the problem. Yes, it works with
Swedish. Now try it again with Sanskrit.
Eyrk
2014-03-07 21:56:43 UTC
Permalink
Post by Vladimir Panteleev
No, it doesn't.
import std.algorithm;
void main()
{
auto s = "cassé";
assert(s.canFind('é'));
}
Hm, I'm not following? Works perfectly fine on my system?
Vladimir Panteleev
2014-03-07 21:58:39 UTC
Permalink
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev
Post by Vladimir Panteleev
No, it doesn't.
import std.algorithm;
void main()
{
auto s = "cassé";
assert(s.canFind('é'));
}
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and
compiling this file:
http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
TC
2014-03-07 22:16:57 UTC
Permalink
Post by Vladimir Panteleev
Post by Eyrk
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and
http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Used hex view on referenced file and it does not seem to be the
same symbol.

Works for me with same ones.
Vladimir Panteleev
2014-03-07 22:18:16 UTC
Permalink
Post by TC
Post by Vladimir Panteleev
Post by Eyrk
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and
http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Used hex view on referenced file and it does not seem to be the
same symbol.
Define "symbol". :)
TC
2014-03-07 22:23:43 UTC
Permalink
Post by Vladimir Panteleev
Post by TC
Post by Vladimir Panteleev
Post by Eyrk
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and
http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Used hex view on referenced file and it does not seem to be
the same symbol.
Define "symbol". :)
"cass?" - 22 63 61 73 73 65 cc 81 22

vs

'é' - 27 c3 a9 27
Eyrk
2014-03-07 22:19:08 UTC
Permalink
Post by Vladimir Panteleev
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev
Post by Vladimir Panteleev
No, it doesn't.
import std.algorithm;
void main()
{
auto s = "cassé";
assert(s.canFind('é'));
}
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and
http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
ah right, missing normalization, I get your point, thanks.
TC
2014-03-07 22:25:30 UTC
Permalink
Post by Eyrk
ah right, missing normalization, I get your point, thanks.
Oops :)
H. S. Teoh
2014-03-07 22:26:04 UTC
Permalink
Post by Eyrk
Post by Vladimir Panteleev
No, it doesn't.
import std.algorithm;
void main()
{
auto s = "cassé";
assert(s.canFind('é'));
}
Hm, I'm not following? Works perfectly fine on my system?
Probably because your browser is normalizing the unicode string when you
Something's messing with your Unicode. Try downloading and compiling
http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
I downloaded the file and looked at it through `od -ctx1`: the first é
is encoded as the byte sequence 65 cc 81, that is, [U+65, U+301] (small
letter e + combining diacritic acute accent), whereas the second é is
encoded as c3 a9, that is, U+E9 (precomposed small letter e with acute
accent).

This illustrates one of my objections to Andrei's post: by auto-decoding
behind the user's back and hiding the intricacies of unicode from him,
it has masked the fact that codepoint-for-codepoint comparison of a
unicode string is not guaranteed to always return the correct results,
due to the possibility of non-normalized strings.
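
A minimal demonstration of that pitfall, using escapes to pin down the
two forms (assumes std.uni.normalize):

import std.algorithm : canFind;
import std.uni : NFC, normalize;

void main()
{
    auto s = "casse\u0301";                    // 'e' + combining acute accent
    assert(!s.canFind('\u00E9'));              // precomposed 'é' is not found
    assert(s.normalize!NFC.canFind('\u00E9')); // found after NFC normalization
}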

Basically, to have correct behaviour in all cases, the user must be
aware of, and use, the Unicode collation / normalization algorithms
prescribed by the Unicode standard. What we have in std.algorithm right
now is an incomplete implementation with non-working edge cases (like
Vladimir's example) that has poor performance to start with. Its only
redeeming factor is that the auto-decoding hack has given it the
illusion of being correct, when actually it's not correct according to
the Unicode standard. I don't see how this is necessarily superior to
Walter's proposal.


T
--
Just because you survived after you did it, doesn't mean it wasn't stupid!
Eyrk
2014-03-07 22:55:45 UTC
Permalink
Post by H. S. Teoh
This illustrates one of my objections to Andrei's post: by auto-decoding
behind the user's back and hiding the intricacies of unicode from him,
it has masked the fact that codepoint-for-codepoint comparison of a
unicode string is not guaranteed to always return the correct results,
due to the possibility of non-normalized strings.
Basically, to have correct behaviour in all cases, the user must be
aware of, and use, the Unicode collation / normalization algorithms
prescribed by the Unicode standard. What we have in std.algorithm right
now is an incomplete implementation with non-working edge cases (like
Vladimir's example) that has poor performance to start with. Its only
redeeming factor is that the auto-decoding hack has given it the
illusion of being correct, when actually it's not correct according to
the Unicode standard. I don't see how this is necessarily superior to
Walter's proposal.
T
Yes, I realised too late.

Would it not be beneficial to have different types of literals,
one type which is implicitly normalized and one which is
"raw"(like today)? Since typically you'd want to normalize most
string literals at compile-time, then you only have to normalize
external input at run-time.
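
A sketch of that idea (assuming std.uni.normalize; whether it runs in
CTFE is release-dependent, so the compile-time variant is left as a
comment):

import std.uni : NFC, normalize;

// If normalize were CTFE-capable, a literal could be normalized at
// compile time:
//   enum lit = "casse\u0301".normalize!NFC;
// Failing that, normalize external input once at the program boundary:
string fromOutside(string raw)
{
    return raw.normalize!NFC;
}

void main()
{
    assert(fromOutside("casse\u0301") == "cass\u00E9");
}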
TC
2014-03-07 23:06:19 UTC
Permalink
Post by H. S. Teoh
Probably because your browser is normalizing the unicode string when you
Just for curiosity I tried it with C# to see how it is handled
there and it works like this:

using System;
using System.Diagnostics;

namespace Test
{
class Program
{
static void Main()
{
var s = "cassé";
Debug.Assert(s.IndexOf('é') < 0);
s = s.Normalize();
Debug.Assert(s.IndexOf('é') == 4);
}
}
}

So it doesn't work by default there either; Normalize has to be used first.
Brad Anderson
2014-03-07 23:50:51 UTC
Permalink
On Fri, Mar 07, 2014 at 09:58:39PM +0000, Vladimir Panteleev
Post by Vladimir Panteleev
On Friday, 7 March 2014 at 20:43:45 UTC, Vladimir Panteleev
Post by Vladimir Panteleev
No, it doesn't.
import std.algorithm;
void main()
{
auto s = "cassé";
assert(s.canFind('é'));
}
Hm, I'm not following? Works perfectly fine on my system?
Probably because your browser is normalizing the unicode string when you
Post by Vladimir Panteleev
Something's messing with your Unicode. Try downloading and compiling
http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
I downloaded the file and looked at it through `od -ctx1`: the first é
is encoded as the byte sequence 65 cc 81, that is, [U+65, U+301] (small
letter e + combining diacritic acute accent), whereas the second é is
encoded as c3 a9, that is, U+E9 (precomposed small letter e with acute
accent).
This illustrates one of my objections to Andrei's post: by auto-decoding
behind the user's back and hiding the intricacies of unicode from him,
it has masked the fact that codepoint-for-codepoint comparison of a
unicode string is not guaranteed to always return the correct results,
due to the possibility of non-normalized strings.
Basically, to have correct behaviour in all cases, the user must be
aware of, and use, the Unicode collation / normalization algorithms
prescribed by the Unicode standard. What we have in std.algorithm right
now is an incomplete implementation with non-working edge cases (like
Vladimir's example) that has poor performance to start with. Its only
redeeming factor is that the auto-decoding hack has given it the
illusion of being correct, when actually it's not correct according to
the Unicode standard. I don't see how this is necessarily superior to
Walter's proposal.
T
To me, the status quo feels like an ok compromise between
performance and correctness. Everyone is pointing out that
working at the code point level is bad because it's not correct,
but working at the code unit level, as Walter proposes, is
correct even less often, so that's not really an argument for
moving to that. It is, however, an argument for forcing the user
to decide what level of correctness and performance they need.

Walter's idea (code unit level) would be fastest but least correct.
The current is somewhat fast and is somewhat correct.
The next level, graphemes, would be slowest of all but most correct.

It seems like there is just no way to avoid the tradeoff between
speed and correctness, so we shouldn't try; we should instead force
the user to make a decision.

Maybe some more string types are in order (hrm). In order of
performance to correctness:

string, wstring (code units)
dstring (code points)
+gstring (graphemes)

(do graphemes completely normalize? If not, we probably need another
level, say, nstring)

Then if a user needs correctness over performance they just work
with gstrings. If they need performance over correctness they
work with strings (assuming some of Walter's idea happens,
otherwise they'd work with string.representation).
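
The three levels on one concrete input, showing the cost/answer
tradeoff (a sketch; assumes std.uni.byGrapheme):

import std.range : walkLength;
import std.string : representation;
import std.uni : byGrapheme;

void main()
{
    auto s = "casse\u0301";               // combining form of "cassé"
    assert(s.representation.length == 7); // code units: fastest, least correct
    assert(s.walkLength == 6);            // code points: today's default
    assert(s.byGrapheme.walkLength == 5); // graphemes: slowest, most correct
}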
Andrei Alexandrescu
2014-03-08 01:23:32 UTC
Permalink
Post by Eyrk
Post by Vladimir Panteleev
No, it doesn't.
import std.algorithm;
void main()
{
auto s = "cassé";
assert(s.canFind('é'));
}
Hm, I'm not following? Works perfectly fine on my system?
Something's messing with your Unicode. Try downloading and compiling
http://dump.thecybershadow.net/6f82ea151c1a00835cbcf5baaace2801/test.d
Yup, the grapheme issue. This should work.

import std.algorithm, std.uni;

void main()
{
auto s = "cassé";
assert(s.byGrapheme.canFind('é'));
}

It doesn't compile, seems like a library bug.

Graphemes are the next level of Nirvana above code points, but that
doesn't mean it's graphemes or nothing.


Andrei

Sarath Kodali
2014-03-07 22:35:46 UTC
Permalink
Post by Vladimir Panteleev
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu
Post by Andrei Alexandrescu
Allow me to enumerate the functions of std.algorithm and how
they work today and how they'd work with the proposed change.
Let s be a variable of some string type.
s.canFind('é') currently works as expected.
No, it doesn't.
import std.algorithm;
void main()
{
auto s = "cassé";
assert(s.canFind('é'));
}
That's the whole problem - all this hot steam and it still does
not work properly. Because it can't - not without pulling in
all of the Unicode algorithms implicitly, and that would be
much worse.
Post by Andrei Alexandrescu
I went down std.algorithm in the order listed in its
documentation and found pernicious issues with almost every
single algorithm.
searching for a non-ASCII dchar or dchar literal.
How often does this pattern occur in real programs? I think the
only real metric is to try the change and find out.
Post by Andrei Alexandrescu
Clearly one might argue that their app has no business dealing
with diacriticals or Asian characters. But that's the typical
provincial view that marred many languages' approach to UTF
and internationalization.
So is yours, if you think that making everything magically a
dchar is going to solve all problems.
The TDPL example only showcases the problem. Yes, it works with
Swedish. Now try it again with Sanskrit.
+1
In Indian languages, a character consists of one or more UNICODE
code points. For example, in Sanskrit "ddhrya"
http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
consists of 7 UNICODE code points. So to search for this char I
have to use string search.

- Sarath
H. S. Teoh
2014-03-07 23:12:16 UTC
Permalink
Post by Sarath Kodali
Post by Vladimir Panteleev
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu
[...]
Post by Sarath Kodali
Post by Vladimir Panteleev
Post by Andrei Alexandrescu
Clearly one might argue that their app has no business dealing
with diacriticals or Asian characters. But that's the typical
provincial view that marred many languages' approach to UTF and
internationalization.
So is yours, if you think that making everything magically a dchar
is going to solve all problems.
The TDPL example only showcases the problem. Yes, it works with
Swedish. Now try it again with Sanskrit.
+1
In Indian languages, a character consists of one or more UNICODE
code points. For example, in Sanskrit "ddhrya"
http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
consists of 7 UNICODE code points. So to search for this char I have
to use string search.
[...]

That's what I've been arguing for. The most general form of character
searching in Unicode requires substring searching, and similarly many
character-based operations on Unicode strings are effectively
substring-based operations, because said "character" may be a multibyte
code point, or, in your case, multiple code points. Since that's the
case, we might as well just forget about the distinction between
"character" and "string", and treat all such operations as substring
operations (even if the operand is supposedly "just 1 character long").

This would allow us to get rid of the hackish auto-decoding of narrow
strings, and thus eliminate the needless overhead of always decoding.


T
--
All men are mortal. Socrates is mortal. Therefore all men are Socrates.
Sarath Kodali
2014-03-07 23:13:50 UTC
Permalink
Post by Sarath Kodali
+1
In Indian languages, a character consists of one or more
UNICODE code points. For example, in Sanskrit "ddhrya"
http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
consists of 7 UNICODE code points. So to search for this char I
have to use string search.
- Sarath
Oops, incomplete reply ...

Since a single "alphabet" in Indian languages can contain
multiple code-points, iterating over single code-points is like
iterating over char[] for non English European languages. So
decode is of no use other than decreasing the performance. A raw
char[] comparison is much faster.

And then there is this "unicode normalization" that makes it very
difficult for string searches or comparisons.

- Sarath
H. S. Teoh
2014-03-07 23:33:35 UTC
Permalink
Post by Sarath Kodali
Post by Sarath Kodali
+1
In Indian languages, a character consists of one or more UNICODE
code points. For example, in Sanskrit "ddhrya"
http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
consists of 7 UNICODE code points. So to search for this char I
have to use string search.
- Sarath
Oops, incomplete reply ...
Since a single "alphabet" in Indian languages can contain multiple
code-points, iterating over single code-points is like iterating
over char[] for non English European languages. So decode is of no
use other than decreasing the performance. A raw char[] comparison
is much faster.
Yes. The more I think about it, the more auto-decoding sounds like a
wrong decision. The question, though, is whether it's worth the massive
code breakage needed to undo it. :-(
Post by Sarath Kodali
And then there is this "unicode normalization" that makes it very
difficult for string searches or comparisons.
[...]

I believe the convention is to always normalize strings before
performing operations on them, in order to prevent these sorts of
problems. I think many of the unicode prescribed algorithms have
normalization as a prerequisite, since otherwise there's no guarantee
that the algorithm will produce the correct results.


T
--
"I'm not childish; I'm just in touch with the child within!" - RL
Andrei Alexandrescu
2014-03-08 00:44:58 UTC
Permalink
Post by Vladimir Panteleev
Post by Andrei Alexandrescu
Allow me to enumerate the functions of std.algorithm and how they work
today and how they'd work with the proposed change. Let s be a
variable of some string type.
s.canFind('é') currently works as expected.
No, it doesn't.
import std.algorithm;
void main()
{
auto s = "cassé";
assert(s.canFind('é'));
}
worksforme
Vladimir Panteleev
2014-03-08 00:45:51 UTC
Permalink
On Saturday, 8 March 2014 at 00:44:53 UTC, Andrei Alexandrescu
Post by Andrei Alexandrescu
worksforme
http://forum.dlang.org/post/fhqradggtvwnpqpuehgg@forum.dlang.org
Dmitry Olshansky
2014-03-07 20:48:19 UTC
Permalink
Post by Andrei Alexandrescu
Post by Walter Bright
In "Lots of low hanging fruit in Phobos" the issue came up about the
automatic encoding and decoding of char ranges.
[snip]
Post by Walter Bright
Is there any hope of fixing this?
There's nothing to fix.
There is, all right. ElementEncodingType for starters.
Post by Andrei Alexandrescu
Allow me to enumerate the functions of std.algorithm and how they work
today and how they'd work with the proposed change. Let s be a variable
of some string type.
Special case was wrong though - special casing arrays of char[] and
throwing all other ranges of char out the window. The amount of code to
support this schizophrenia is enormous.
Post by Andrei Alexandrescu
Making strings bidirectional ranges has been a very good choice within
the constraints. There was already a string type, and that was
immutable(char)[], and a bunch of code depended on that definition.
Trying to make it work by blowing a hole in the generic range concept
now seems like it wasn't worth it.
--
Dmitry Olshansky
Andrei Alexandrescu
2014-03-08 01:18:55 UTC
Permalink
Post by Dmitry Olshansky
Post by Andrei Alexandrescu
Post by Walter Bright
In "Lots of low hanging fruit in Phobos" the issue came up about the
automatic encoding and decoding of char ranges.
[snip]
Post by Walter Bright
Is there any hope of fixing this?
There's nothing to fix.
There is, all right. ElementEncodingType for starters.
Post by Andrei Alexandrescu
Allow me to enumerate the functions of std.algorithm and how they work
today and how they'd work with the proposed change. Let s be a variable
of some string type.
Special case was wrong though - special casing arrays of char[] and
throwing all other ranges of char out the window. The amount of code to
support this schizophrenia is enormous.
I think this is a confusion. The code in e.g. std.algorithm is
specialized for efficiency of stuff that already works.
Post by Dmitry Olshansky
Post by Andrei Alexandrescu
Making strings bidirectional ranges has been a very good choice within
the constraints. There was already a string type, and that was
immutable(char)[], and a bunch of code depended on that definition.
Trying to make it work by blowing a hole in the generic range concept
now seems like it wasn't worth it.
I disagree. Also what hole?


Andrei
Vladimir Panteleev
2014-03-08 00:39:11 UTC
Permalink
On Friday, 7 March 2014 at 19:57:38 UTC, Andrei Alexandrescu
Post by Andrei Alexandrescu
s.all!(x => x == 'é')
s.any!(x => x == 'é')
s.canFind!(x => x == 'é')
These are a variation of the following:

ubyte b = ...;
if (b == 1000) { ... }

The compiler could emit a warning here, and indeed some
languages/compilers do. It might not be in the vein of D
metaprogramming, though, as the compiler will not emit a warning
for "if (false) { ... }".
Post by Andrei Alexandrescu
s.canFind('é')
s.endsWith('é')
s.find('é')
s.count('é')
s.countUntil('é')
These should not compile post-change, because the sought element
(dchar) is not of the same type as the string. So they will not
fail silently.
Post by Andrei Alexandrescu
s.count()
s.count!((a, b) => std.uni.toLower(a) ==
std.uni.toLower(b))("é")
s.countUntil('é')
As has already been mentioned, counting code points is borderline
useless.
Post by Andrei Alexandrescu
s.count!((a, b) => std.uni.toLower(a) ==
std.uni.toLower(b))("é")
And this is just wrong on many levels. I hope you know better
than to actually use this for case-insensitive comparisons in
production software.
Andrei Alexandrescu
2014-03-08 00:44:41 UTC
Permalink
Post by H. S. Teoh
Post by Andrei Alexandrescu
s.canFind('é') currently works as expected. Proposed: fails silently.
The problem is that the current implementation of this correct behaviour
leaves a lot to be desired in terms of performance. Ideally, you should
not need to decode every single character in s just to see if it happens
to contain é. Rather, canFind, et al should convert the dchar literal
'é' into a UTF-8 (resp. UTF-16) sequence and do a substring search
instead. Decoding every character in s, while correct, is also
needlessly inefficient.
That's an optimization that fits the current design and goes in the
library transparently, i.e. the good stuff.
Post by H. S. Teoh
Post by Andrei Alexandrescu
5.
s.count() currently works as expected. Proposed: fails silently.
Wrong. The current behaviour of s.count() does not work as expected, it
only gives an illusion that it does.
Depends on what one expects :o).
Post by H. S. Teoh
Its return value is misleading when
combining diacritics and other such Unicode "niceness" are involved.
Arguably, such things should be prohibited altogether, and more
semantically transparent algorithms used, namely s.countCodePoints,
s.countGraphemes, etc..
I think s.byGrapheme.count is the right way instead of specializing a
bunch of algorithms to work with graphemes.
Post by H. S. Teoh
Post by Andrei Alexandrescu
s.endsWith('é') currently works as expected. Proposed: fails silently.
Arguable, because it imposes a performance hit by needless decoding.
bool endsWith(string s, char asciiChar);
bool endsWith(string s, wchar wideChar);
bool endsWith(string s, dchar codepoint);
Nice idea. Fits current design. Then interesting complications arise
with things like bool endsWith(string, wstring) etc.
Post by H. S. Teoh
[...]
Post by Andrei Alexandrescu
I designed the range behavior of strings after much thinking and
consideration back in the day when I designed std.algorithm. It was
painfully obvious (but it seems to have been forgotten now that it's
working so well) that approaching strings as arrays of char[] would
break almost every single algorithm leaving us essentially in the
pre-UTF C++aveman era.
I agree, but it is also painfully obvious that the current
implementation is lackluster in terms of performance.
It's not painfully obvious to me at all. What is obvious to me is people
are happy campers with the way D's strings work, including UTF support
and performance. I don't remember people bringing this up in forums and
here at Facebook "yeah, just look at the crappy way they handle
strings..." Silent approval is easy to forget about.

Walter has been working on an application in which anything slower than
2x baseline would have been a failure. In that app (which I know very
well) the right option from day 1 would have been ubyte[], which he
discovered the hard way. His incomplete understanding of how D strings
work is the single largest problem there, and indicates an issue with
the documentation.

He discovered that, was surprised, and overreacted. No need to amplify
that into mass hysteria. There are improvements that can be made, in the
form of additions, not breaking changes that would inflict massive
breakage on the community. This is the way in which this discussion can
have a positive outcome. (I've shared in fact a few ideas with Walter.)
Post by H. S. Teoh
Post by Andrei Alexandrescu
Clearly one might argue that their app has no business dealing with
diacriticals or Asian characters. But that's the typical provincial
view that marred many languages' approach to UTF and
internationalization. If you know your string is ASCII, the remedy
is simple - don't use char[] and friends. From day 1, the type
"char" was meant to mean "code unit of UTF characters".
Yes, but currently Phobos support for non-UTF strings is rather poor,
and requires many explicit casts to/from ubyte[].
Non-UTF strings are currently modeled as ubyte[], so I don't see what
you'd be casting to and fro. You have absolutely no business
representing anything non-UTF with char and char[] etc.
Post by H. S. Teoh
Post by Andrei Alexandrescu
So please ponder the above before going to do surgery on the patient
that's going to kill him.
[...]
Yeah I was surprised Walter was actually seriously going to pursue this.
It's a change of a far vaster magnitude than many of the other DIPs and
other proposals that have been rejected because they were deemed to
cause too much breakage of existing code.
Compared with what's going on now with D at Facebook, this agitation is
but a little side show. We have way bigger fish to fry.


Andrei
Timon Gehr
2014-03-07 23:40:41 UTC
Permalink
Post by Walter Bright
In "Lots of low hanging fruit in Phobos" the issue came up about the
automatic encoding and decoding of char ranges.
...
I think this is among the most annoying aspects of Phobos.
Walter Bright
2014-03-08 00:22:10 UTC
Permalink
Andrei suggests that this change would destroy D by breaking too much existing
code. He might be right. Can we afford the risk that he is right?

We should think about a way to have our cake and eat it, too.

Keep in mind that this issue is a Phobos one, not a core language issue.
Peter Alexander
2014-03-08 00:46:21 UTC
Permalink
Post by Walter Bright
Andrei suggests that this change would destroy D by breaking
too much existing code. He might be right. Can we afford the
risk that he is right?
We should think about a way to have our cake and eat it, too.
Keep in mind that this issue is a Phobos one, not a core
language issue.
Before we discuss risk in the change, we need to agree that it is
even a desirable change. I don't think we have reached that point.

It's worth pointing out that all the performance issues can be
resolved in Phobos through specialisation with no disruption to
the users.
H. S. Teoh
2014-03-08 01:15:33 UTC
Permalink
Post by Peter Alexander
Post by Walter Bright
Andrei suggests that this change would destroy D by breaking too
much existing code. He might be right. Can we afford the risk that
he is right?
We should think about a way to have our cake and eat it, too.
Keep in mind that this issue is a Phobos one, not a core language issue.
Before we discuss risk in the change, we need to agree that it is
even a desirable change. I don't think we have reached that point.
It's worth pointing out that all the performance issues can be
resolved in Phobos through specialisation with no disruption to the
users.
Regardless of which way we decide in the end, I hope the one good
thing that will come out of this thread is improved performance of
string algorithms in Phobos. Things like efficient substring searching
to implement multibyte-character (or multi-code-point "character")
operations are quite needed, IMO.


T
--
If a person can't communicate, the very least he could do is to shut up.
-- Tom Lehrer, on people who bemoan their communication woes with their
loved ones.
Vladimir Panteleev
2014-03-08 01:18:42 UTC
Permalink
Post by Walter Bright
We should think about a way to have our cake and eat it, too.
I think a good place to start would be to have a draft
implementation of the proposal. This will allow people to try it
with their projects and see how much code it will really affect.
As I mentioned here[1], I suspect that certain valid code that
used the range primitives will continue to work unaffected even
after a sudden switch, so perhaps the "deprecation" and "error"
stage can be replaced with a longer "warning" stage instead.

This is similar to how git changed the meaning of the "push"
command: it just nagged users for a long time, and included the
instructions to switch to the new behavior early (thus squelching
the warning) or permanently accepting the old behavior. (For our
case it is adding .representation or .byCodepoint depending on
the intent.)
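
Concretely, the warning stage might nudge user code from the first
line below toward one of the explicit forms (.byCodepoint is the
proposed spelling from above, not an existing symbol; .representation
exists today):

import std.algorithm : count;
import std.string : representation;

void main()
{
    auto s = "cass\u00E9";               // precomposed 'é'
    assert(s.count == 5);                // old: implicit decoding (would warn)
    assert(s.representation.count == 6); // new: explicitly count code units
    // s.byCodepoint.count;              // new: explicitly count code points
}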

[1]:
http://forum.dlang.org/post/dlpmchtaqzrxxylpmiwh@forum.dlang.org