Discussion:
RFC: std.json successor
Sönke Ludwig via Digitalmars-d
2014-08-21 22:35:19 UTC
Permalink
Following up on the recent "std.jgrandson" thread [1], I've picked up
the work (a lot earlier than anticipated) and finished a first version
of a loose blend of said std.jgrandson, vibe.data.json and some changes
that I had planned for vibe.data.json for a while. I'm quite pleased by
the results so far, although without a serialization framework it still
misses a very important building block.

Code: https://github.com/s-ludwig/std_data_json
Docs: http://s-ludwig.github.io/std_data_json/
DUB: http://code.dlang.org/packages/std_data_json

The new code contains:
- Lazy lexer in the form of a token input range (using slices of the
input if possible)
- Lazy streaming parser (StAX style) in the form of a node input range
- Eager DOM style parser returning a JSONValue
- Range based JSON string generator taking either a token range, a
node range, or a JSONValue
- Opt-out location tracking (line/column) for tokens, nodes and values
- No opDispatch() for JSONValue - this has proven to do more harm than
good in vibe.data.json
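
To give a quick feel for the different levels, here is a rough usage sketch
(function names follow this thread and the linked docs; exact signatures are
assumptions and may differ from the current code):

import stdx.data.json;

void overview()
{
    auto text = `{"name": "D", "valid": true}`;

    // token level: lazy lexer, returns a token input range
    auto tokens = lexJSON(text);

    // node level: lazy StAX-style parser, returns a node input range
    auto nodes = parseJSONStream(text);

    // DOM level: eager parser, returns a JSONValue
    JSONValue value = toJSONValue(text);

    // generator: turn a JSONValue (or a token/node range) back into a string
    string output = toJSONString(value);
}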

The DOM style JSONValue type is based on std.variant.Algebraic. This
currently has a few usability issues that can be solved by
upgrading/fixing Algebraic:

- Operator overloading only works sporadically
- No "tag" enum is supported, so that switch()ing on the type of a
value doesn't work and an if-else cascade is required
- Operations and conversions between different Algebraic types are not
conveniently supported, which becomes important when other, similar
formats (e.g. BSON) get supported

Assuming that those points are solved, I'd like to get some early
feedback before going for an official review. One open issue is how to
handle unescaping of string literals. Currently it always unescapes
immediately, which is more efficient for general input ranges when the
unescaped result is needed, but less efficient for string inputs when
the unescaped result is not needed. Maybe a flag could be used to
conditionally switch behavior depending on the input range type.

Destroy away! ;)

[1]: http://forum.dlang.org/thread/lrknjl$co7$1@digitalmars.com
Brian Schott via Digitalmars-d
2014-08-21 22:48:28 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Destroy away! ;)
source/stdx/data/json/lexer.d(263:8)[warn]: 'JSONToken' has
method 'opEquals', but not 'toHash'.
source/stdx/data/json/lexer.d(499:65)[warn]: Use parenthesis to
clarify this expression.
source/stdx/data/json/parser.d(516:8)[warn]: 'JSONParserNode' has
method 'opEquals', but not 'toHash'.
source/stdx/data/json/value.d(95:10)[warn]: Variable c is never
used.
source/stdx/data/json/value.d(99:10)[warn]: Variable d is never
used.
source/stdx/data/json/package.d(942:14)[warn]: Variable val is
never used.

It's likely that you can ignore these, but I thought I'd post
them anyways. (The last three are in unittest blocks, for
example.)
Justin Whear via Digitalmars-d
2014-08-21 23:27:28 UTC
Permalink
Someone needs to make a "showbrianmycode" bot: mention a D github repo
and it runs static analysis for you.
Idan Arye via Digitalmars-d
2014-08-21 23:33:34 UTC
Permalink
Post by Justin Whear via Digitalmars-d
Someone needs to make a "showbrianmycode" bot: mention a D
github repo
and it runs static analysis for you.
Why bother with mentioning a GitHub repo? Just make the bot
periodically scan the DUB registry.
Brian Schott via Digitalmars-d
2014-08-21 23:54:10 UTC
Permalink
Post by Idan Arye via Digitalmars-d
Why bother with mentioning a GitHub repo? Just make the bot
periodically scan the DUB registry.
It's kind of picky.
Sönke Ludwig via Digitalmars-d
2014-08-22 07:34:16 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Destroy away! ;)
source/stdx/data/json/lexer.d(263:8)[warn]: 'JSONToken' has method
'opEquals', but not 'toHash'.
source/stdx/data/json/lexer.d(499:65)[warn]: Use parenthesis to clarify
this expression.
source/stdx/data/json/parser.d(516:8)[warn]: 'JSONParserNode' has method
'opEquals', but not 'toHash'.
source/stdx/data/json/value.d(95:10)[warn]: Variable c is never used.
source/stdx/data/json/value.d(99:10)[warn]: Variable d is never used.
source/stdx/data/json/package.d(942:14)[warn]: Variable val is never used.
It's likely that you can ignore these, but I thought I'd post them
anyways. (The last three are in unittest blocks, for example.)
Fixed all of them (none of them was causing harm, but it's still nicer
that way). Also added @safe and nothrow where possible.

BTW, does anyone know what's holding back formattedWrite() from being
@safe for simple types?
Ary Borenszweig via Digitalmars-d
2014-08-22 00:42:21 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Following up on the recent "std.jgrandson" thread [1], I've picked up
the work (a lot earlier than anticipated) and finished a first version
of a loose blend of said std.jgrandson, vibe.data.json and some changes
that I had planned for vibe.data.json for a while. I'm quite pleased by
the results so far, although without a serialization framework it still
misses a very important building block.
Code: https://github.com/s-ludwig/std_data_json
Docs: http://s-ludwig.github.io/std_data_json/
DUB: http://code.dlang.org/packages/std_data_json
Say I have a class Person with name (string) and age (int) with a
constructor that receives both. How would I create an instance of a
Person from that JSON using the JSON stream?

Suppose the json is this:

{"age": 10, "name": "John"}

And the class is this:

class Person {
    this(string name, int age) {
        // ...
    }
}
Sönke Ludwig via Digitalmars-d
2014-08-22 06:33:26 UTC
Permalink
Post by Ary Borenszweig via Digitalmars-d
Say I have a class Person with name (string) and age (int) with a
constructor that receives both. How would I create an instance of a
Person from a json with the json stream?
{"age": 10, "name": "John"}
class Person {
this(string name, int age) {
// ...
}
}
Without a serialization framework it would in theory work like this:

JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
auto p = new Person(v["name"].get!string, v["age"].get!int);

Unfortunately the operator overloading doesn't work like this currently,
so this is needed:

JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
auto p = new Person(
    v.get!(Json[string])["name"].get!string,
    v.get!(Json[string])["age"].get!int);

That should be solved together with the new module (it could of course
also easily be added to JSONValue itself instead of Algebraic, but the
value of having it in Algebraic would be much higher).
Ary Borenszweig via Digitalmars-d
2014-08-22 14:53:08 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by Ary Borenszweig via Digitalmars-d
Say I have a class Person with name (string) and age (int) with a
constructor that receives both. How would I create an instance of a
Person from a json with the json stream?
{"age": 10, "name": "John"}
class Person {
this(string name, int age) {
// ...
}
}
JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
auto p = new Person(v["name"].get!string, v["age"].get!int);
unfortunately the operator overloading doesn't work like this currently,
JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
auto p = new Person(
v.get!(Json[string])["name"].get!string,
v.get!(Json[string])["age"].get!int);
But does this parse the whole json into JSONValue? I want to create a
Person without creating an intermediate JSONValue for the whole json.
Can this be done?
Sönke Ludwig via Digitalmars-d
2014-08-22 16:24:19 UTC
Permalink
Post by Ary Borenszweig via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
auto p = new Person(v["name"].get!string, v["age"].get!int);
unfortunately the operator overloading doesn't work like this currently,
JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
auto p = new Person(
v.get!(Json[string])["name"].get!string,
v.get!(Json[string])["age"].get!int);
But does this parse the whole json into JSONValue? I want to create a
Person without creating an intermediate JSONValue for the whole json.
Can this be done?
That would be done by the serialization framework. Instead of using
parseJSON(), it could use parseJSONStream() to populate the Person
instance on the fly, without putting the whole JSON into memory. But I'd
like to leave that for a later addition, because we'd otherwise end up
with duplicate functionality once std.serialization gets finalized.

Manually it would work similarly to this:

auto nodes = parseJSONStream(`{"age": 10, "name": "John"}`);
with (JSONParserNode.Kind) {
    enforce(nodes.front == objectStart);
    nodes.popFront();
    while (nodes.front != objectEnd) {
        auto key = nodes.front.key;
        nodes.popFront();
        if (key == "name")
            person.name = nodes.front.literal.string;
        else if (key == "age")
            person.age = nodes.front.literal.number;
        nodes.popFront(); // advance past the value node
    }
}
Ary Borenszweig via Digitalmars-d
2014-08-22 17:08:00 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by Ary Borenszweig via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
auto p = new Person(v["name"].get!string, v["age"].get!int);
unfortunately the operator overloading doesn't work like this currently,
JSONValue v = parseJSON(`{"age": 10, "name": "John"}`);
auto p = new Person(
v.get!(Json[string])["name"].get!string,
v.get!(Json[string])["age"].get!int);
But does this parse the whole json into JSONValue? I want to create a
Person without creating an intermediate JSONValue for the whole json.
Can this be done?
That would be done by the serialization framework. Instead of using
parseJSON(), it could use parseJSONStream() to populate the Person
instance on the fly, without putting the whole JSON into memory. But I'd
like to leave that for a later addition, because we'd otherwise end up
with duplicate functionality once std.serialization gets finalized.
auto nodes = parseJSONStream(`{"age": 10, "name": "John"}`);
with (JSONParserNode.Kind) {
    enforce(nodes.front == objectStart);
    nodes.popFront();
    while (nodes.front != objectEnd) {
        auto key = nodes.front.key;
        nodes.popFront();
        if (key == "name")
            person.name = nodes.front.literal.string;
        else if (key == "age")
            person.age = nodes.front.literal.number;
    }
}
Cool, that looks good :-)
Colden Cullen via Digitalmars-d
2014-08-22 02:35:39 UTC
Permalink
I notice in the docs there are several references to a
`parseJSON` and `parseJson`, but I can't seem to find where
either of these is defined. Is this just a typo?

Hope this helps:
https://github.com/s-ludwig/std_data_json/search?q=parseJson&type=Code
Sönke Ludwig via Digitalmars-d
2014-08-22 06:34:34 UTC
Permalink
I notice in the docs there are several references to a `parseJSON` and
`parseJson`, but I can't seem to find where either of these are defined.
Is this just a typo?
https://github.com/s-ludwig/std_data_json/search?q=parseJson&type=Code
Seems like I forgot to replace a few mentions. They are called
parseJSONValue and toJSONValue now for clarity.
Sönke Ludwig via Digitalmars-d
2014-08-22 10:49:02 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
The DOM style JSONValue type is based on std.variant.Algebraic. This
currently has a few usability issues that can be solved by
- Operator overloading only works sporadically
- (...)
- Operations and conversions between different Algebraic types is not
conveniently supported, which gets important when other similar
formats get supported (e.g. BSON)
https://github.com/D-Programming-Language/phobos/pull/2452
https://github.com/D-Programming-Language/phobos/pull/2453

Those fix the most important operators: index access and binary arithmetic.
matovitch via Digitalmars-d
2014-08-22 12:17:06 UTC
Permalink
Very nice! I had started (and dropped) a JSON module based on
Algebraic too. So without opDispatch you plan to use a syntax
like jPerson["age"] = 10? You didn't use stdx.d.lexer - any
reason why? (I'm asking even though I've never used that module;
I haven't coded much in D, in fact.)
Sönke Ludwig via Digitalmars-d
2014-08-22 12:39:08 UTC
Permalink
Very nice ! I had started (and dropped) a json module based on Algebraic
too. So without opDispatch you plan to use a syntax like jPerson["age"]
= 10 ? You didn't use stdx.d.lexer. Any reason why ? (I am asking even
if I never used this module.(never coded much in D in fact))
Exactly, that's the syntax you'd use for JSONValue. But my favorite way
to work with most JSON data is actually to directly read the JSON string
into a D struct using a serialization framework and then access the
struct in a strongly typed way. This has both less syntactic and less
runtime overhead, and also greatly reduces the chance of field
name/type related bugs.
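
For illustration, with vibe.data.json's existing serialization support that
workflow looks roughly like this (a sketch; deserializeJson and its exact
signature are the vibe.d API, not part of this proposal):

import vibe.data.json : deserializeJson;

struct Person
{
    string name;
    int age;
}

void example()
{
    auto p = deserializeJson!Person(`{"age": 10, "name": "John"}`);
    assert(p.name == "John" && p.age == 10);
}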

The module is written against current Phobos, which is why stdx.d.lexer
wasn't really an option. I'm also unsure if std.lexer would be able to
handle the parsing required for JSON numbers and strings. But it would
certainly be nice already if at least the token structure could be
reused. However, it should also be possible to find a painless migration
path later, when std.lexer is actually part of Phobos.
matovitch via Digitalmars-d
2014-08-22 12:47:31 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Very nice ! I had started (and dropped) a json module based on Algebraic
too. So without opDispatch you plan to use a syntax like
jPerson["age"]
= 10 ? You didn't use stdx.d.lexer. Any reason why ? (I am
asking even
if I never used this module.(never coded much in D in fact))
Exactly, that's the syntax you'd use for JSONValue. But my
favorite way to work with most JSON data is actually to
directly read the JSON string into a D struct using a
serialization framework and then access the struct in a
strongly typed way. This has both, less syntactic and less
runtime overhead, and also greatly reduces the chance for field
name/type related bugs.
Completely agree, I'm waiting for a serializer too. I would love
to see something like Cap'n Proto in D.
Post by Sönke Ludwig via Digitalmars-d
The module is written against current Phobos, which is why
stdx.d.lexer wasn't really an option. I'm also unsure if
std.lexer would be able to handle the parsing required for JSON
numbers and strings. But it would certainly be nice already if
at least the token structure could be reused. However, it
should also be possible to find a painless migration path
later, when std.lexer is actually part of Phobos.
OK. I think I remember there was a JSON parser for stdx.d.lexer
provided as a sample.
Sönke Ludwig via Digitalmars-d
2014-08-22 13:00:20 UTC
Permalink
Ok. I think I remember there was a stdx.d.lexer's Json parser provided
as sample.
I see, so you just have to write your own number/string parsing routines:
https://github.com/Hackerpilot/lexer-demo/blob/master/jsonlexer.d
matovitch via Digitalmars-d
2014-08-22 13:20:18 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by matovitch via Digitalmars-d
Ok. I think I remember there was a stdx.d.lexer's Json parser
provided
as sample.
https://github.com/Hackerpilot/lexer-demo/blob/master/jsonlexer.d
It's kind of "low level" indeed...I don't know what kind of back
magic are doing all these template mixins but the code looks
quite clean.

Confusing:

// Therefore, this always returns false.
bool isSeparating(size_t offset) pure nothrow @safe
{
    return true;
}
Jacob Carlborg via Digitalmars-d
2014-08-22 15:47:50 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Following up on the recent "std.jgrandson" thread [1], I've picked up
the work (a lot earlier than anticipated) and finished a first version
of a loose blend of said std.jgrandson, vibe.data.json and some changes
that I had planned for vibe.data.json for a while. I'm quite pleased by
the results so far, although without a serialization framework it still
misses a very important building block.
Code: https://github.com/s-ludwig/std_data_json
Docs: http://s-ludwig.github.io/std_data_json/
DUB: http://code.dlang.org/packages/std_data_json
* Opening braces should be put on their own line to follow Phobos style
guides

* I'm wondering about the assert in lexer.d, line 160. What happens if
two invalid tokens occur after each other?

* I think we have talked about this before, when reviewing D lexers. I'm
thinking of how to handle invalid data. Is it the best solution to throw
an exception? Would it be possible to return an error token and have the
client decide what to do about it? Shouldn't it be possible to build a JSON
validator on this?

* The lexer seems to always convert JSON types to their native D types.
Is that wise to do? That's unnecessary if you're implementing syntax
highlighting.
--
/Jacob Carlborg
via Digitalmars-d
2014-08-22 15:56:33 UTC
Permalink
Post by Jacob Carlborg via Digitalmars-d
* I think we have talked about this before, when reviewing D
lexers. I'm thinking of how to handle invalid data. Is it the
best solution to throw an exception? Would it be possible to
return an error token and have the client decide what to do
about?
Hmm... my initial reaction was "not as default - it should throw
on error, otherwise no one will check for errors". But if it's
returning an error token, maybe it would be sufficient if that
token throws when its value is accessed?
Sönke Ludwig via Digitalmars-d
2014-08-22 16:13:24 UTC
Permalink
Post by Jacob Carlborg via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
Following up on the recent "std.jgrandson" thread [1], I've picked up
the work (a lot earlier than anticipated) and finished a first version
of a loose blend of said std.jgrandson, vibe.data.json and some changes
that I had planned for vibe.data.json for a while. I'm quite pleased by
the results so far, although without a serialization framework it still
misses a very important building block.
Code: https://github.com/s-ludwig/std_data_json
Docs: http://s-ludwig.github.io/std_data_json/
DUB: http://code.dlang.org/packages/std_data_json
* Opening braces should be put on their own line to follow Phobos style
guides
Will do.
Post by Jacob Carlborg via Digitalmars-d
* I'm wondering about the assert in lexer.d, line 160. What happens if
two invalid tokens after each other occur?
There are actually no invalid tokens at all; the "invalid" enum value is
only used to denote that no token is currently stored in _front. If
readToken() doesn't throw, there will always be a valid token.
Post by Jacob Carlborg via Digitalmars-d
* I think we have talked about this before, when reviewing D lexers. I'm
thinking of how to handle invalid data. Is it the best solution to throw
an exception? Would it be possible to return an error token and have the
client decide what to do about? Shouldn't it be possible to build a JSON
validator on this?
That would indeed be a possibility; it's how I used to handle it in my
private version of std.lexer, too. It could also be made a compile-time
option.
Post by Jacob Carlborg via Digitalmars-d
* The lexer seems to always convert JSON types to their native D types,
is that wise to do? That's unnecessary if you're implementing syntax
highlighting
It's basically the same trade-off as for unescaping string literals. For
"string" inputs, it would be more efficient to just store a slice, but
for generic input ranges it avoids the otherwise needed allocation. The
proposed flag could make an improvement here, too.
Sönke Ludwig via Digitalmars-d
2014-08-22 21:31:17 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by Jacob Carlborg via Digitalmars-d
* Opening braces should be put on their own line to follow Phobos style
guides
Will do.
Post by Jacob Carlborg via Digitalmars-d
* I'm wondering about the assert in lexer.d, line 160. What happens if
two invalid tokens after each other occur?
There are actually no invalid tokens at all, the "invalid" enum value is
only used to denote that no token is currently stored in _front. If
readToken() doesn't throw, there will always be a valid token.
Renamed from "invalid" to "none" now to avoid confusion ->
Post by Sönke Ludwig via Digitalmars-d
Post by Jacob Carlborg via Digitalmars-d
* I think we have talked about this before, when reviewing D lexers. I'm
thinking of how to handle invalid data. Is it the best solution to throw
an exception? Would it be possible to return an error token and have the
client decide what to do about? Shouldn't it be possible to build a JSON
validator on this?
That would indeed be a possibility, it's how I used to handle it in my
private version of std.lexer, too. It could also be made a compile time
option.
and an additional "error" kind has been added, which implements the
above. Enabled using LexOptions.noThrow.
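
For illustration, the non-throwing mode could then be used roughly like this
(a sketch only; how the options are passed and the exact member names are
assumptions on top of the names mentioned in this thread):

// Sketch: LexOptions.noThrow and the "error" token kind are from this thread;
// passing the options as a template argument is an assumption.
auto tokens = lexJSON!(LexOptions.noThrow)(`{"broken": `);
foreach (t; tokens)
{
    if (t.kind == JSONToken.Kind.error)
    {
        // handle the error here instead of catching an exception;
        // per the thread, the range is empty after popping the error token
        break;
    }
}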
Post by Sönke Ludwig via Digitalmars-d
Post by Jacob Carlborg via Digitalmars-d
* The lexer seems to always convert JSON types to their native D types,
is that wise to do? That's unnecessary if you're implementing syntax
highlighting
It's basically the same trade-off as for unescaping string literals. For
"string" inputs, it would be more efficient to just store a slice, but
for generic input ranges it avoids the otherwise needed allocation. The
proposed flag could make an improvement here, too.
via Digitalmars-d
2014-08-22 16:15:03 UTC
Permalink
Some thoughts about the API:

1) Instead of `parseJSONValue` and `lexJSON`, how about static
methods `JSON.parse` and `JSON.lex`, or even module-level
functions `std.data.json.parse` etc.? The "JSON" part of the name
is redundant.

2) Also, `parseJSONValue` and `parseJSONStream` probably don't
need to have different names. They can be distinguished by their
parameter types.

3) `toJSONString` shouldn't just take a boolean as a flag for
pretty-printing. It should either use something like
`Pretty.YES`, or the function should be called
`toPrettyJSONString` (I believe I have seen this latter
convention elsewhere).
We should also think about whether we can just call the functions
`toString` and `toPrettyString`. Alternatively, `toJSON` and
`toPrettyJSON` should be considered.
Sönke Ludwig via Digitalmars-d
2014-08-22 16:48:42 UTC
Permalink
1) Instead of `parseJSONValue` and `lexJSON`, how about static methods
`JSON.parse` and `JSON.lex`, or even a module level functions
`std.data.json.parse` etc.? The "JSON" part of the name is redundant.
For those functions it may be acceptable, although I really dislike that
style, because it makes the code harder to read (what exactly does this
parse?) and the functions are rarely used, so typing that additional
"JSON" should be no issue at all. On the other hand, if you always type
"JSON.lex", it's more to type than just "lexJSON".

But for "[JSON]Value" it gets ugly really quick, because "Value"s are
such a common thing and quickly occur in multiple kinds in the same
source file.
2) Also, `parseJSONValue` and `parseJSONStream` probably don't need to
have different names. They can be distinguished by their parameter types.
Actually they take exactly the same parameters and just differ in their
return value. It would be more descriptive to name them parseAsJSONValue
and parseAsJSONStream - or maybe parseJSONAsValue or parseJSONToValue?
The current naming is somewhat modeled after std.conv's "to!T" and
"parse!T".
3) `toJSONString` shouldn't just take a boolean as flag for
pretty-printing. It should either use something like `Pretty.YES`, or
the function should be called `toPrettyJSONString` (I believe I have
seen this latter convention elsewhere).
We should also think about whether we can just call the functions
`toString` and `toPrettyString`. Alternatively, `toJSON` and
`toPrettyJSON` should be considered.
Agreed, a boolean isn't good for a public interface; renaming the
current writeAsString to private writeAsStringImpl and then adding
"(writeAs/to)[Pretty]String" sounds reasonable. Actually, I've done it
that way for vibe.data.json.
via Digitalmars-d
2014-08-22 17:24:46 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
1) Instead of `parseJSONValue` and `lexJSON`, how about static methods
`JSON.parse` and `JSON.lex`, or even a module level functions
`std.data.json.parse` etc.? The "JSON" part of the name is
redundant.
For those functions it may be acceptable, although I really
dislike that style, because it makes the code harder to read
(what exactly does this parse?) and the functions are rarely
used, so that that typing that additional "JSON" should be no
issue at all. On the other hand, if you always type "JSON.lex"
it's more to type than just "lexJSON".
I'm not really concerned about the amount of typing; it just
seemed a bit odd to have the redundant JSON in there, as we have
module names for namespacing. Your argument about readability is
true nevertheless. But...
Post by Sönke Ludwig via Digitalmars-d
But for "[JSON]Value" it gets ugly really quick, because
"Value"s are such a common thing and quickly occur in multiple
kinds in the same source file.
2) Also, `parseJSONValue` and `parseJSONStream` probably don't need to
have different names. They can be distinguished by their
parameter types.
Actually they take exactly the same parameters and just differ
in their return value. It would be more descriptive to name
them parseAsJSONValue and parseAsJSONStream - or maybe
parseJSONAsValue or parseJSONToValue? The current naming is
somewhat modeled after std.conv's "to!T" and "parse!T".
... why not use exactly the same convention then? =>
`parse!JSONValue`

Would be nice to have a "pluggable" API where you just need to
specify the type in a factory method to choose the input format.
Then there could be `parse!BSON`, `parse!YAML`, with the same
style as `parse!(int[])`.

I know this sounds a bit like bike-shedding, but the API shouldn't
stand by itself; it should fit into the "big picture", especially as
there will probably be other parsers (you already named the
module std._data_.json).
Sönke Ludwig via Digitalmars-d
2014-08-22 17:35:19 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Actually they take exactly the same parameters and just differ in
their return value. It would be more descriptive to name them
parseAsJSONValue and parseAsJSONStream - or maybe parseJSONAsValue or
parseJSONToValue? The current naming is somewhat modeled after
std.conv's "to!T" and "parse!T".
... why not use exactly the same convention then? => `parse!JSONValue`
Would be nice to have a "pluggable" API where you just need to specify
the type in a factory method to choose the input format. Then there
could be `parse!BSON`, `parse!YAML`, with the same style as
`parse!(int[])`.
I know this sound a bit like bike-shedding, but the API shouldn't stand
by itself, but fit into the "big picture", especially as there will
probably be other parsers (you already named the module std._data_.json).
That would be nice, but then it should also work together with std.conv,
which basically is exactly this pluggable API. As it is, it would
result in an ambiguity error if both std.data.json and std.conv are
imported at the same time.

Is there a way to make std.conv work properly with JSONValue? I guess
the only theoretical way would be to put something in JSONValue, but
that would result in a slightly ugly cyclic dependency between parser.d
and value.d.
via Digitalmars-d
2014-08-22 17:57:32 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by via Digitalmars-d
... why not use exactly the same convention then? =>
`parse!JSONValue`
Would be nice to have a "pluggable" API where you just need to specify
the type in a factory method to choose the input format. Then
there
could be `parse!BSON`, `parse!YAML`, with the same style as
`parse!(int[])`.
I know this sound a bit like bike-shedding, but the API
shouldn't stand
by itself, but fit into the "big picture", especially as there will
probably be other parsers (you already named the module
std._data_.json).
That would be nice, but then it should also work together with
std.conv, which basically is exactly this pluggable API. Just
like this it would result in an ambiguity error if both
std.data.json and std.conv are imported at the same time.
Is there a way to make std.conv work properly with JSONValue? I
guess the only theoretical way would be to put something in
JSONValue, but that would result in a slightly ugly cyclic
dependency between parser.d and value.d.
The easiest and cleanest way would be to add a function in
std.data.json:

auto parse(Target, Source)(Source input)
    if (is(Target == JSONValue))
{
    return ...;
}

The various overloads of `std.conv.parse` already have mutually
exclusive template constraints; they will not collide with our
function.
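
If such an overload existed, usage could look roughly like this (hypothetical
sketch; neither the overload nor the final std.data.json module exists yet):

import std.conv;       // the existing parse overloads
import std.data.json;  // would provide the JSONValue overload sketched above

void usage()
{
    string json = `{"age": 10, "name": "John"}`;
    auto v = json.parse!JSONValue;  // only the std.data.json overload matches

    string num = "42";
    auto n = num.parse!int;         // only std.conv.parse matches
}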
Sönke Ludwig via Digitalmars-d
2014-08-22 18:08:32 UTC
Permalink
Post by via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
... why not use exactly the same convention then? => `parse!JSONValue`
Would be nice to have a "pluggable" API where you just need to specify
the type in a factory method to choose the input format. Then there
could be `parse!BSON`, `parse!YAML`, with the same style as
`parse!(int[])`.
I know this sound a bit like bike-shedding, but the API shouldn't stand
by itself, but fit into the "big picture", especially as there will
probably be other parsers (you already named the module
std._data_.json).
That would be nice, but then it should also work together with
std.conv, which basically is exactly this pluggable API. Just like
this it would result in an ambiguity error if both std.data.json and
std.conv are imported at the same time.
Is there a way to make std.conv work properly with JSONValue? I guess
the only theoretical way would be to put something in JSONValue, but
that would result in a slightly ugly cyclic dependency between
parser.d and value.d.
auto parse(Target, Source)(Source input)
if(is(Target == JSONValue))
{
return ...;
}
The various overloads of `std.conv.parse` already have mutually
exclusive template constraints, they will not collide with our function.
Okay, for parse that may work, but what about to!()?
via Digitalmars-d
2014-08-22 19:00:14 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by via Digitalmars-d
The easiest and cleanest way would be to add a function in
auto parse(Target, Source)(Source input)
if(is(Target == JSONValue))
{
return ...;
}
The various overloads of `std.conv.parse` already have mutually
exclusive template constraints, they will not collide with our function.
Okay, for parse that may work, but what about to!()?
What's the problem with to!()?
Sönke Ludwig via Digitalmars-d
2014-08-23 16:49:24 UTC
Permalink
Post by via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
Post by via Digitalmars-d
The easiest and cleanest way would be to add a function in
auto parse(Target, Source)(Source input)
if(is(Target == JSONValue))
{
return ...;
}
The various overloads of `std.conv.parse` already have mutually
exclusive template constraints, they will not collide with our function.
Okay, for parse that may work, but what about to!()?
What's the problem with to!()?
to!() definitely doesn't have a template constraint that excludes
JSONValue. Instead, it will convert any struct type that doesn't define
toString() to a D-like representation.
via Digitalmars-d
2014-08-23 17:25:02 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by via Digitalmars-d
On 22.08.2014 19:57, "Marc Schütz" wrote:
Post by via Digitalmars-d
The easiest and cleanest way would be to add a function in
auto parse(Target, Source)(Source input)
if(is(Target == JSONValue))
{
return ...;
}
The various overloads of `std.conv.parse` already have
mutually
exclusive template constraints, they will not collide with
our function.
Okay, for parse that may work, but what about to!()?
What's the problem with to!()?
to!() definitely doesn't have a template constraint that
excludes JSONValue. Instead, it will convert any struct type
that doesn't define toString() to a D-like representation.
For converting a JSONValue to a different type, JSONValue can
implement `opCast`, which is the regular interface that
std.conv.to uses if it's available.

For converting something _to_ a JSONValue, std.conv.to will
simply create an instance of it by calling the constructor.
Sönke Ludwig via Digitalmars-d
2014-08-23 17:32:02 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
Post by via Digitalmars-d
The easiest and cleanest way would be to add a function in
auto parse(Target, Source)(Source input)
if(is(Target == JSONValue))
{
return ...;
}
The various overloads of `std.conv.parse` already have mutually
exclusive template constraints, they will not collide with our function.
Okay, for parse that may work, but what about to!()?
What's the problem with to!()?
to!() definitely doesn't have a template constraint that excludes
JSONValue. Instead, it will convert any struct type that doesn't
define toString() to a D-like representation.
For converting a JSONValue to a different type, JSONValue can implement
`opCast`, which is the regular interface that std.conv.to uses if it's
available.
For converting something _to_ a JSONValue, std.conv.to will simply
create an instance of it by calling the constructor.
That would just introduce the aforementioned dependency cycle between JSONValue,
the parser and the lexer. Possible, but not particularly pretty. Also,
using the JSONValue constructor to parse an input string would
contradict the intuitive behavior of just storing the string value.
via Digitalmars-d
2014-08-23 18:31:18 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
On Saturday, 23 August 2014 at 16:49:23 UTC, Sönke Ludwig
On 22.08.2014 21:00, "Marc Schütz" wrote:
On Friday, 22 August 2014 at 18:08:34 UTC, Sönke Ludwig
On 22.08.2014 19:57, "Marc Schütz" wrote:
Post by via Digitalmars-d
auto parse(Target, Source)(Source input)
if(is(Target == JSONValue))
{
return ...;
}
The various overloads of `std.conv.parse` already have
mutually
exclusive template constraints, they will not collide with our
function.
Okay, for parse that may work, but what about to!()?
What's the problem with to!()?
to!() definitely doesn't have a template constraint that
excludes
JSONValue. Instead, it will convert any struct type that
doesn't
define toString() to a D-like representation.
For converting a JSONValue to a different type, JSONValue can
implement
`opCast`, which is the regular interface that std.conv.to uses if it's
available.
For converting something _to_ a JSONValue, std.conv.to will
simply
create an instance of it by calling the constructor.
That would just introduce the said dependency cycle between
JSONValue, the parser and the lexer. Possible, but not
particularly pretty. Also, using the JSONValue constructor to
parse an input string would contradict the intuitive behavior
to just store the string value.
That's what I expect it to do anyway. For parsing, there are
already other functions. "mystring".to!JSONValue should just wrap
"mystring".
Sönke Ludwig via Digitalmars-d
2014-08-23 18:52:40 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
Post by via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
Post by via Digitalmars-d
auto parse(Target, Source)(Source input)
if(is(Target == JSONValue))
{
return ...;
}
The various overloads of `std.conv.parse` already have mutually
exclusive template constraints, they will not collide with our function.
Okay, for parse that may work, but what about to!()?
What's the problem with to!()?
to!() definitely doesn't have a template constraint that excludes
JSONValue. Instead, it will convert any struct type that doesn't
define toString() to a D-like representation.
For converting a JSONValue to a different type, JSONValue can implement
`opCast`, which is the regular interface that std.conv.to uses if it's
available.
For converting something _to_ a JSONValue, std.conv.to will simply
create an instance of it by calling the constructor.
That would just introduce the said dependency cycle between JSONValue,
the parser and the lexer. Possible, but not particularly pretty. Also,
using the JSONValue constructor to parse an input string would
contradict the intuitive behavior to just store the string value.
That's what I expect it to do anyway. For parsing, there are already
other functions. "mystring".to!JSONValue should just wrap "mystring".
Probably, but then to!() is inconsistent with parse!(). Usually they are
both the same apart from how the tail of the input string is handled.
Christian Manning via Digitalmars-d
2014-08-22 16:31:05 UTC
Permalink
It would be nice to have integers treated separately from doubles.
I know it makes the number parsing simpler to just treat
everything as double, but still, it could be annoying when you
expect an integer type.

I'd also like to see some benchmarks, particularly against some
of the high-performance C++ parsers, e.g. rapidjson, gason,
sajson. Or even some of the "not bad" performance parsers with
better APIs, e.g. QJsonDocument, jsoncpp and jsoncons (slow but
perhaps a comparable interface to this proposal?).
Sönke Ludwig via Digitalmars-d
2014-08-22 16:56:26 UTC
Permalink
It would be nice to have integers treated separately to doubles. I know
it makes the number parsing simpler to just treat everything as double,
but still, it could be annoying when you expect an integer type.
That's how I've done it for vibe.data.json, too. For the new
implementation, I've just used the number parsing routine from Andrei's
std.jgrandson module. Does anybody have reservations about representing
integers as "long" instead?
I'd also like to see some benchmarks, particularly against some of the
high performance C++ parsers, i.e. rapidjson, gason, sajson. Or even
some of the "not bad" performance parsers with better APIs, i.e.
QJsonDocument, jsoncpp and jsoncons (slow but perhaps comparable
interface to this proposal?).
That would indeed be nice to have, but I'm not sure if I can manage to
squeeze that in besides finishing the module itself. My time frame for
working on this is quite limited.
via Digitalmars-d
2014-08-22 17:27:32 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by Christian Manning via Digitalmars-d
It would be nice to have integers treated separately to
doubles. I know
it makes the number parsing simpler to just treat everything
as double,
but still, it could be annoying when you expect an integer
type.
That's how I've done it for vibe.data.json, too. For the new
implementation, I've just used the number parsing routine from
Andrei's std.jgrandson module. Does anybody have reservations
about representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe
even use BigInt if applicable?
Sönke Ludwig via Digitalmars-d
2014-08-22 17:45:01 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
It would be nice to have integers treated separately to doubles. I know
it makes the number parsing simpler to just treat everything as double,
but still, it could be annoying when you expect an integer type.
That's how I've done it for vibe.data.json, too. For the new
implementation, I've just used the number parsing routine from
Andrei's std.jgrandson module. Does anybody have reservations about
representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use
BigInt if applicable?
I guess BigInt + exponent would be the only lossless way to represent
any JSON number. That could then be converted to any desired smaller
type as required.

But checking for overflow during number parsing would definitely have an
impact on parsing speed, as would using a BigInt of course, so the
question is how we want to set up the trade-off here (or if there is
another way that is overhead-free).
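
For reference, such a lossless "Decimal" representation could be as simple as
this (purely illustrative, not part of the proposed module):

import std.bigint : BigInt;

/// value = mantissa * 10 ^^ exponent
struct Decimal
{
    BigInt mantissa;
    int exponent;
}

unittest
{
    // 314.159 (i.e. 3.14159e2) would be stored losslessly as (314159, -3)
    auto d = Decimal(BigInt(314159), -3);
}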
via Digitalmars-d
2014-08-22 18:01:12 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
Post by Christian Manning via Digitalmars-d
It would be nice to have integers treated separately to
doubles. I know
it makes the number parsing simpler to just treat everything as double,
but still, it could be annoying when you expect an integer
type.
That's how I've done it for vibe.data.json, too. For the new
implementation, I've just used the number parsing routine from
Andrei's std.jgrandson module. Does anybody have reservations about
representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use
BigInt if applicable?
I guess BigInt + exponent would be the only lossless way to
represent any JSON number. That could then be converted to any
desired smaller type as required.
But checking for overflow during number parsing would
definitely have an impact on parsing speed, as well as using a
BigInt of course, so the question is how we want set up the
trade off here (or if there is another way that is
overhead-free).
As the functions will be templatized anyway, they should include a
flags parameter. These and possible future extensions can then be
selected by the user.
Sönke Ludwig via Digitalmars-d
2014-08-22 18:11:04 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
It would be nice to have integers treated separately to doubles. I know
it makes the number parsing simpler to just treat everything as double,
but still, it could be annoying when you expect an integer type.
That's how I've done it for vibe.data.json, too. For the new
implementation, I've just used the number parsing routine from
Andrei's std.jgrandson module. Does anybody have reservations about
representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use
BigInt if applicable?
I guess BigInt + exponent would be the only lossless way to represent
any JSON number. That could then be converted to any desired smaller
type as required.
But checking for overflow during number parsing would definitely have
an impact on parsing speed, as well as using a BigInt of course, so
the question is how we want set up the trade off here (or if there is
another way that is overhead-free).
As the functions will be templatized anyway, it should include a flags
parameter. These and possible future extensions can then be selected by
the user.
I'm actually in the process of converting the "track_location" parameter
to a flags enum and adding support for an error token, so this would fit
right in.
Christian Manning via Digitalmars-d
2014-08-22 19:48:49 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
Post by Christian Manning via Digitalmars-d
It would be nice to have integers treated separately to
doubles. I know
it makes the number parsing simpler to just treat everything as double,
but still, it could be annoying when you expect an integer
type.
That's how I've done it for vibe.data.json, too. For the new
implementation, I've just used the number parsing routine from
Andrei's std.jgrandson module. Does anybody have reservations about
representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use
BigInt if applicable?
I guess BigInt + exponent would be the only lossless way to
represent any JSON number. That could then be converted to any
desired smaller type as required.
But checking for overflow during number parsing would
definitely have an impact on parsing speed, as well as using a
BigInt of course, so the question is how we want set up the
trade off here (or if there is another way that is
overhead-free).
You could check for a decimal point and a 0 at the front
(excluding a possible - sign); either would indicate a double,
making the reasonable assumption that anything else will fit in a
long.
Sönke Ludwig via Digitalmars-d
2014-08-22 20:02:41 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
It would be nice to have integers treated separately to doubles. I know
it makes the number parsing simpler to just treat everything as double,
but still, it could be annoying when you expect an integer type.
That's how I've done it for vibe.data.json, too. For the new
implementation, I've just used the number parsing routine from
Andrei's std.jgrandson module. Does anybody have reservations about
representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use
BigInt if applicable?
I guess BigInt + exponent would be the only lossless way to represent
any JSON number. That could then be converted to any desired smaller
type as required.
But checking for overflow during number parsing would definitely have
an impact on parsing speed, as well as using a BigInt of course, so
the question is how we want set up the trade off here (or if there is
another way that is overhead-free).
You could check for a decimal point and a 0 at the front (excluding
possible - sign), either would indicate a double, making the reasonable
assumption that anything else will fit in a long.
Yes, no decimal point + no exponent would work without overhead to
detect integers, but that wouldn't solve the proposed automatic
long->double overflow, which is what I meant. My current idea is to
default to double and optionally support any of long, BigInt and
"Decimal" (BigInt+exponent), where integer overflow only works for
long->BigInt.
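
As a rough illustration of that check (not the module's actual lexer code),
classifying a number literal before deciding how to parse it:

import std.algorithm : canFind;

/// True if the JSON number literal has neither a decimal point nor an
/// exponent, i.e. it can be handled as an integer (modulo overflow).
bool isIntegerLiteral(const(char)[] num)
{
    return !num.canFind('.') && !num.canFind('e') && !num.canFind('E');
}

unittest
{
    assert(isIntegerLiteral("42"));
    assert(!isIntegerLiteral("4.2"));
    assert(!isIntegerLiteral("1e10"));
}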
John Colvin via Digitalmars-d
2014-08-22 20:33:32 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by Christian Manning via Digitalmars-d
On 22.08.2014 19:27, "Marc Schütz" wrote:
On Friday, 22 August 2014 at 16:56:26 UTC, Sönke Ludwig
Post by Sönke Ludwig via Digitalmars-d
Post by Christian Manning via Digitalmars-d
It would be nice to have integers treated separately to doubles. I know
it makes the number parsing simpler to just treat everything as double,
but still, it could be annoying when you expect an integer type.
That's how I've done it for vibe.data.json, too. For the new
implementation, I've just used the number parsing routine from
Andrei's std.jgrandson module. Does anybody have reservations about
representing integers as "long" instead?
It should automatically fall back to double on overflow. Maybe even use
BigInt if applicable?
I guess BigInt + exponent would be the only lossless way to represent
any JSON number. That could then be converted to any desired smaller
type as required.
But checking for overflow during number parsing would definitely have
an impact on parsing speed, as well as using a BigInt of course, so
the question is how we want set up the trade off here (or if there is
another way that is overhead-free).
You could check for a decimal point and a 0 at the front (excluding
possible - sign), either would indicate a double, making the reasonable
assumption that anything else will fit in a long.
Yes, no decimal point + no exponent would work without overhead
to detect integers, but that wouldn't solve the proposed
automatic long->double overflow, which is what I meant. My
current idea is to default to double and optionally support any
of long, BigInt and "Decimal" (BigInt+exponent), where integer
overflow only works for long->BigInt.
It might be the right choice anyway (seeing as json/js do
overflow to double), but fwiw it's still atrocious.

import std.range : iota, walkLength;
import std.algorithm : map, until;

double a = long.max;
assert(iota(1, 1000000).map!(d => (a+d)-a).until!"a != 0".walkLength == 1024);

Yuk.

Floating point numbers and integers are so completely different
in behaviour that it's just dishonest to transparently switch
between the two. This especially the case for overflow from long
-> double, where by definition you're 10 bits past being able to
reliably accurately represent the integer in question.
Christian Manning via Digitalmars-d
2014-08-22 21:06:13 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Yes, no decimal point + no exponent would work without overhead
to detect integers, but that wouldn't solve the proposed
automatic long->double overflow, which is what I meant. My
current idea is to default to double and optionally support any
of long, BigInt and "Decimal" (BigInt+exponent), where integer
overflow only works for long->BigInt.
Ah I see.

I have to say, if you are going to treat integers and floating
point numbers differently, then you should store them
differently. long should be used to store integers, double for
floating point numbers. A 64-bit signed integer (long) is a totally
reasonable limitation for integers, but even that would lose
precision when stored as a double, as you are proposing (if I'm
understanding right). I don't think BigInt needs to be brought
into this at all really.

In the case of integers encountered by the parser which are too
large/small to fit in a long, give an error IMO. Such integers
should be (and are by other libs IIRC) serialised in the form
"1.234e-123" to force double parsing, perhaps losing precision at
that stage rather than invisibly inside the library. The size of JSON
numbers is implementation-defined, and the whole thing shouldn't
be degraded in both performance and usability to cover JSON
serialisers that go beyond common native number types.

Of course, you are free to do whatever you like :)
Walter Bright via Digitalmars-d
2014-08-22 18:08:40 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Destroy away! ;)
Thanks for taking this on! This is valuable work. On to destruction!

I'm looking at:

http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/lexJSON.html

I anticipate this will be used a LOT and in applications with very
demanding speed requirements. With that in mind:


1. There's no mention of what will happen if it is passed malformed JSON
strings. I presume an exception is thrown. Exceptions are slow and consume
GC memory. I suggest an alternative would be to emit an "Error" token instead;
this would be much like how the UTF decoding algorithms emit a "replacement
char" for invalid UTF sequences.

2. The escape sequenced strings presumably consume GC memory. This will be a
problem for high performance code. I suggest either leaving them undecoded in
the token stream, and letting higher level code decide what to do about them, or
provide a hook that the user can override with his own allocation scheme.


If we don't make it possible to use std.json without invoking the GC, I believe
the module will fail in the long term.
Sönke Ludwig via Digitalmars-d
2014-08-22 21:27:12 UTC
Permalink
Post by Walter Bright via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
Destroy away! ;)
Thanks for taking this on! This is valuable work. On to destruction!
http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/lexJSON.html
I anticipate this will be used a LOT and in very high speed demanding
applications. With that in mind,
1. There's no mention of what will happen if it is passed malformed JSON
strings. I presume an exception is thrown. Exceptions are both slow and
consume GC memory. I suggest an alternative would be to emit an "Error"
token instead; this would be much like how the UTF decoding algorithms
emit a "replacement char" for invalid UTF sequences.
The latest version now features a LexOptions.noThrow option which causes
an error token to be emitted instead. After popping the error token, the
range is always empty.
Post by Walter Bright via Digitalmars-d
2. The escape sequenced strings presumably consume GC memory. This will
be a problem for high performance code. I suggest either leaving them
undecoded in the token stream, and letting higher level code decide what
to do about them, or provide a hook that the user can override with his
own allocation scheme.
The problem is that it really depends on the use case and on the type of
input stream which approach is more efficient (storing the escaped
version of a string might require *two* allocations if the input range
cannot be sliced and if the decoded string is then requested by the
parser). My current idea therefore is to simply make this configurable, too.

Enabling the use of custom allocators should be easily possible as an
add-on functionality later on. At least my suggestion would be to hold off
on this until we have a finished std.allocator module.
Walter Bright via Digitalmars-d
2014-08-23 01:05:33 UTC
Permalink
Post by Walter Bright via Digitalmars-d
1. There's no mention of what will happen if it is passed malformed JSON
strings. I presume an exception is thrown. Exceptions are both slow and
consume GC memory. I suggest an alternative would be to emit an "Error"
token instead; this would be much like how the UTF decoding algorithms
emit a "replacement char" for invalid UTF sequences.
The latest version now features a LexOptions.noThrow option which causes an
error token to be emitted instead. After popping the error token, the range is
always empty.
Having a nothrow option may prevent the functions from being attributed as
"nothrow".

But in any case, to worship at the Altar Of Composability, the error token could
always be emitted, and then another algorithm could be provided which passes
through all non-error tokens and throws if it sees an error token.
Post by Walter Bright via Digitalmars-d
2. The escape sequenced strings presumably consume GC memory. This will
be a problem for high performance code. I suggest either leaving them
undecoded in the token stream, and letting higher level code decide what
to do about them, or provide a hook that the user can override with his
own allocation scheme.
The problem is that it really depends on the use case and on the type of input
stream which approach is more efficient (storing the escaped version of a string
might require *two* allocations if the input range cannot be sliced and if the
decoded string is then requested by the parser). My current idea therefore is to
simply make this configurable, too.
Enabling the use of custom allocators should be easily possible as an add-on
functionality later on. At least my suggestion would be to wait with this until
we have a finished std.allocator module.
I'm worried that std.allocator is stalled and we'll be digging ourselves deeper
into needing to revise things later to remove GC usage. I'd really like to find
a way to abstract the allocation away from the algorithm.
Walter Bright via Digitalmars-d
2014-08-23 02:30:21 UTC
Permalink
The problem is that it really depends on the use case and on the type of input
stream which approach is more efficient (storing the escaped version of a string
might require *two* allocations if the input range cannot be sliced and if the
decoded string is then requested by the parser). My current idea therefore is to
simply make this configurable, too.
Enabling the use of custom allocators should be easily possible as an add-on
functionality later on. At least my suggestion would be to wait with this until
we have a finished std.allocator module.
Another possibility is to have the user pass in a resizeable buffer which
will then be used to store the strings as necessary.

One example is std.internal.scopebuffer. The nice thing about that is the user
can use the stack for the storage, which works out to be very, very fast.
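
For context, this is roughly how a caller uses std.internal.scopebuffer
(a from-memory sketch of that API; details may differ):

import std.internal.scopebuffer;

void example()
{
    char[128] stack;                       // storage lives on the stack
    auto buf = ScopeBuffer!char(stack[]);  // spills to the C heap only if it outgrows it
    scope (exit) buf.free();

    buf.put("some unescaped string data");
    const(char)[] result = buf[];          // slice of the current contents
}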
Ola Fosheim Gr via Digitalmars-d
2014-08-23 04:01:59 UTC
Permalink
Post by Walter Bright via Digitalmars-d
Another possibility is to have the user pass in a resizeable
buffer which then will be used to store the strings in as
necessary.
One example is std.internal.scopebuffer. The nice thing about
that is the user can use the stack for the storage, which works
out to be very, very fast.
Does this mean that D is getting resizable stack allocations in
lower stack frames? That has a lot of implications for code gen.
Walter Bright via Digitalmars-d
2014-08-23 04:36:32 UTC
Permalink
Post by Walter Bright via Digitalmars-d
One example is std.internal.scopebuffer. The nice thing about that is the user
can use the stack for the storage, which works out to be very, very fast.
Does this mean that D is getting resizable stack allocations in lower stack
frames? That has a lot of implications for code gen.
scopebuffer does not require resizeable stack allocations.
Ola Fosheim Gr via Digitalmars-d
2014-08-23 04:48:14 UTC
Permalink
Post by Walter Bright via Digitalmars-d
Post by Ola Fosheim Gr via Digitalmars-d
Does this mean that D is getting resizable stack allocations
in lower stack
frames? That has a lot of implications for code gen.
scopebuffer does not require resizeable stack allocations.
So you cannot use the stack for resizable allocations.

That would however be a nice optimization. If an algorithm has only
one alloca, can be inlined in a way that does not extend
the stack, and uses a resizable buffer that grows downwards in
memory, then you can have a resizable buffer on the stack:

HIMEM
...
Algorithm stack frame vars
Inlined vars
Buffer head/book keeping vars
Buffer end
Buffer front
...add to front here...
End of stack
LOMEM
Walter Bright via Digitalmars-d
2014-08-23 05:28:54 UTC
Permalink
Post by Ola Fosheim Gr via Digitalmars-d
Post by Walter Bright via Digitalmars-d
Does this mean that D is getting resizable stack allocations in lower stack
frames? That has a lot of implications for code gen.
scopebuffer does not require resizeable stack allocations.
So you cannot use the stack for resizable allocations.
Please, take a look at how scopebuffer works.
Ola Fosheim Grøstad via Digitalmars-d
2014-08-23 06:25:33 UTC
Permalink
Post by Walter Bright via Digitalmars-d
On Saturday, 23 August 2014 at 04:36:34 UTC, Walter Bright
Post by Walter Bright via Digitalmars-d
Does this mean that D is getting resizable stack allocations in lower stack
frames? That has a lot of implications for code gen.
scopebuffer does not require resizeable stack allocations.
So you cannot use the stack for resizable allocations.
Please, take a look at how scopebuffer works.
I have? It requires an upper bound to stay on the stack, which
creates a big hole in the stack. I don't think wasting the stack
or moving to the heap is a nice predictable solution. It would be
better to just have a couple of regions that do "reverse" stack
allocations, but the most efficient solution is the one I
outlined.

With JSON you might be able to create an upper bound of, say, 4-8
times the size of the source iff you know the file size. You
don't if you are streaming.

(scopebuffer is too unpredictable for real-time use; a pure stack
solution is predictable)
Walter Bright via Digitalmars-d
2014-08-23 06:41:09 UTC
Permalink
Post by Walter Bright via Digitalmars-d
Post by Ola Fosheim Gr via Digitalmars-d
Post by Walter Bright via Digitalmars-d
Does this mean that D is getting resizable stack allocations in lower stack
frames? That has a lot of implications for code gen.
scopebuffer does not require resizeable stack allocations.
So you cannot use the stack for resizable allocations.
Please, take a look at how scopebuffer works.
I have? It requires an upperbound to stay on the stack, that creates a big hole
in the stack. I don't think wasting the stack or moving to the heap is a nice
predictable solution. It would be better to just have a couple of regions that
do "reverse" stack allocations, but the most efficient solution is the one I
outlined.
Scopebuffer is extensively used in Warp, and works very well. The "hole" in the
stack is not a significant problem.
With json you might be able to create an upperbound of say 4-8 times the size of
the source iff you know the file size. You don't if you are streaming.
(scopebuffer is too unpredictable for real time, a pure stack solution is
predictable)
You can always implement your own buffering system and pass it in - that's the
point, it's under user control.
Ola Fosheim Grøstad via Digitalmars-d
2014-08-23 06:53:36 UTC
Permalink
Post by Walter Bright via Digitalmars-d
Scopebuffer is extensively used in Warp, and works very well.
The "hole" in the stack is not a significant problem.
Well, on a webserver you don't want to push out the caches for no
good reason.
Post by Walter Bright via Digitalmars-d
You can always implement your own buffering system and pass it
in - that's the point, it's under user control.
My point is that you need compiler support to get good buffering
options on the stack. Something like an @alloca_inline:

auto buffer = @alloca_inline getstuff();
process(buffer);

I think all memory allocation should be under compiler control;
library solutions are bound to be suboptimal, i.e. slower.
Sönke Ludwig via Digitalmars-d
2014-08-23 09:13:21 UTC
Permalink
Post by Walter Bright via Digitalmars-d
Post by Walter Bright via Digitalmars-d
1. There's no mention of what will happen if it is passed malformed JSON
strings. I presume an exception is thrown. Exceptions are both slow and
consume GC memory. I suggest an alternative would be to emit an "Error"
token instead; this would be much like how the UTF decoding algorithms
emit a "replacement char" for invalid UTF sequences.
The latest version now features a LexOptions.noThrow option which causes an
error token to be emitted instead. After popping the error token, the range is
always empty.
Having a nothrow option may prevent the functions from being attributed
as "nothrow".
It's a compile time option, so that shouldn't be an issue. There is also
just a single "throw" statement in the source, so it's easy to isolate.
Sönke Ludwig via Digitalmars-d
2014-08-23 16:36:10 UTC
Permalink
Post by Walter Bright via Digitalmars-d
(...)
2. The escape sequenced strings presumably consume GC memory. This will
be a problem for high performance code. I suggest either leaving them
undecoded in the token stream, and letting higher level code decide what
to do about them, or provide a hook that the user can override with his
own allocation scheme.
If we don't make it possible to use std.json without invoking the GC, I
believe the module will fail in the long term.
I've added two new types now to abstract away how strings and numbers
are represented in memory. For string literals this means that for input
types "string" and "immutable(ubyte)[]" they will always be stored as
slices of the input buffer. JSONValue has a .rawValue property to access
them, as well as an "alias this"ed .value property that transparently
unescapes.

At that place it would also be easy to provide a method that takes an
arbitrary output range to unescape without allocations.

Documentation and code are both updated (also added a note about
exception behavior).
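 
A rough usage sketch of that (.rawValue and the alias-this'd .value are the
names mentioned above; the parser entry point and the indexing are assumed):

import stdx.data.json;

void inspect(string json)
{
    auto name = parseJSONValue(json)["name"];   // assumed DOM entry point

    auto raw = name.rawValue;   // slice of the input buffer, escapes intact
    auto txt = name.value;      // transparently unescaped on access
}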
Walter Bright via Digitalmars-d
2014-08-23 17:38:04 UTC
Permalink
input types "string" and "immutable(ubyte)[]"
Why the immutable(ubyte)[] ?
Sönke Ludwig via Digitalmars-d
2014-08-23 17:42:24 UTC
Permalink
Post by Walter Bright via Digitalmars-d
input types "string" and "immutable(ubyte)[]"
Why the immutable(ubyte)[] ?
I've adopted that basically from Andrei's module. The idea is to allow
processing data with arbitrary character encoding. However, the output
will always be Unicode and JSON is defined to be encoded as Unicode,
too, so that could probably be dropped...
Walter Bright via Digitalmars-d
2014-08-23 17:46:04 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by Walter Bright via Digitalmars-d
input types "string" and "immutable(ubyte)[]"
Why the immutable(ubyte)[] ?
I've adopted that basically from Andrei's module. The idea is to allow
processing data with arbitrary character encoding. However, the output will
always be Unicode and JSON is defined to be encoded as Unicode, too, so that
could probably be dropped...
I feel that non-UTF encodings should be handled by adapter algorithms, not
embedded into the JSON lexer, so yes, I'd drop that.
Brad Roberts via Digitalmars-d
2014-08-23 19:00:37 UTC
Permalink
Post by Walter Bright via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
Post by Walter Bright via Digitalmars-d
input types "string" and "immutable(ubyte)[]"
Why the immutable(ubyte)[] ?
I've adopted that basically from Andrei's module. The idea is to allow
processing data with arbitrary character encoding. However, the output will
always be Unicode and JSON is defined to be encoded as Unicode, too, so that
could probably be dropped...
I feel that non-UTF encodings should be handled by adapter algorithms,
not embedded into the JSON lexer, so yes, I'd drop that.
For performance purposes, determining the encoding during lexing is useful.
You can avoid any conversion costs when you know that the original
string is ASCII or UTF-8 or something else. The cost during lexing is
essentially zero. The cost of storing that state might be a concern, or
it might be free in otherwise unused padding space. The avoidable cost of
re-scanning strings is non-trivial.
 
My past experience with this was in an HTTP parser, where there's even
more complex logic than in JSON parsing, but the concepts still apply.
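 
As an illustration of how cheap such a check can be (this is only a sketch of
the idea, not how the module does it): a pure-ASCII flag can be computed as a
side effect of a scan the lexer performs anyway, so a later decoding or
validation step can be skipped in the common case:

// Fold the high bits of every byte into one accumulator; if no byte had the
// high bit set, the input is pure ASCII and needs no further UTF handling.
bool scanIsAscii(const(ubyte)[] input)
{
    ubyte acc = 0;
    foreach (b; input)
        acc |= b;
    return (acc & 0x80) == 0;
}

unittest
{
    assert(scanIsAscii(cast(const(ubyte)[]) `{"age": 42}`));
    assert(!scanIsAscii(cast(const(ubyte)[]) "{\"name\": \"Sönke\"}"));
}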
via Digitalmars-d
2014-08-23 19:57:34 UTC
Permalink
On Saturday, 23 August 2014 at 19:01:13 UTC, Brad Roberts via Digitalmars-d wrote:
original string is ascii or utf-8 or other. The cost during
lexing is essentially zero.
I am not so sure when it comes to SIMD lexing. I think the
specified behaviour should be defined in a way which encourages later
optimizations.
via Digitalmars-d
2014-08-23 20:23:24 UTC
Permalink
Some baselines for performance:

https://github.com/mloskot/json_benchmark

http://chadaustin.me/2013/01/json-parser-benchmarking/
Andrei Alexandrescu via Digitalmars-d
2014-08-23 21:36:52 UTC
Permalink
Post by Walter Bright via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
Post by Walter Bright via Digitalmars-d
input types "string" and "immutable(ubyte)[]"
Why the immutable(ubyte)[] ?
I've adopted that basically from Andrei's module. The idea is to allow
processing data with arbitrary character encoding. However, the output will
always be Unicode and JSON is defined to be encoded as Unicode, too, so that
could probably be dropped...
I feel that non-UTF encodings should be handled by adapter algorithms,
not embedded into the JSON lexer, so yes, I'd drop that.
I think accepting ubyte is a good idea. It means "got this stream of
bytes off of the wire and it hasn't been validated as a UTF string". It
also means (which is true) that the lexer does enough validation to
constrain arbitrary bytes into text, and saves the caller from either a
check (expensive) or a cast (unpleasant).

Reality is the JSON lexer takes ubytes and produces tokens.


Andrei
Walter Bright via Digitalmars-d
2014-08-23 22:24:02 UTC
Permalink
I think accepting ubyte it's a good idea. It means "got this stream of bytes off
of the wire and it hasn't been validated as a UTF string". It also means (which
is true) that the lexer does enough validation to constrain arbitrary bytes into
text, and saves caller from either a check (expensive) or a cast (unpleasant).
Reality is the JSON lexer takes ubytes and produces tokens.
Using an adapter still makes sense (a rough sketch follows the list), because:

1. The adapter should be just as fast as wiring it in internally

2. The adapter then becomes a general purpose tool that can be used elsewhere
where the encoding is unknown or suspect

3. The scope of the adapter is small, so it is easier to get it right, and being
reusable means every user benefits from it

4. If we can't make adapters efficient, we've failed at the ranges+algorithms
model, and I'm very unwilling to fail at that
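 
A rough sketch of such an adapter (Latin-1 is used here only because every
byte maps directly to the code point of the same value; nothing below is from
the actual module):

import std.algorithm : map;

// Lazily transcode Latin-1 bytes to Unicode code points; the JSON lexer then
// only ever sees Unicode, and the transcoding concern stays reusable.
auto latin1ToUnicode(const(ubyte)[] bytes)
{
    return bytes.map!(b => cast(dchar) b);
}

unittest
{
    import std.algorithm : equal;
    const(ubyte)[] latin1 = [0x22, 0xE9, 0x22];   // '"', 'é', '"' in Latin-1
    assert(latin1ToUnicode(latin1).equal("\"\u00E9\""d));
}

Whether lexJSON accepts such a range directly depends on its input range
support, but composing the two is exactly the point being made here.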
Andrei Alexandrescu via Digitalmars-d
2014-08-23 22:51:31 UTC
Permalink
Post by Walter Bright via Digitalmars-d
I think accepting ubyte it's a good idea. It means "got this stream of bytes off
of the wire and it hasn't been validated as a UTF string". It also means (which
is true) that the lexer does enough validation to constrain arbitrary bytes into
text, and saves caller from either a check (expensive) or a cast (unpleasant).
Reality is the JSON lexer takes ubytes and produces tokens.
1. The adapter should be just as fast as wiring it in internally
2. The adapter then becomes a general purpose tool that can be used
elsewhere where the encoding is unknown or suspect
3. The scope of the adapter is small, so it is easier to get it right,
and being reusable means every user benefits from it
4. If we can't make adapters efficient, we've failed at the
ranges+algorithms model, and I'm very unwilling to fail at that
An adapter would solve the wrong problem here. There's nothing to adapt
from and to.

An adapter would be good if e.g. the stream uses UTF-16 or some Windows
encoding. Bytes are the natural input for a JSON parser.


Andrei
Walter Bright via Digitalmars-d
2014-08-25 19:38:04 UTC
Permalink
An adapter would solve the wrong problem here. There's nothing to adapt from and
to.
An adapter would be good if e.g. the stream uses UTF-16 or some Windows
encoding. Bytes are the natural input for a json parser.
The adaptation is to take arbitrary byte input in an unknown encoding and
produce valid UTF.

Note that many HTML readers scan the bytes to see whether the input is ASCII, UTF, some code
page encoding, Shift-JIS, etc., and translate accordingly. I do not see why doing that
inside the JSON lexer would be less costly than doing it in an adapter.
via Digitalmars-d
2014-08-25 19:50:19 UTC
Permalink
Post by Walter Bright via Digitalmars-d
The adaptation is to take arbitrary byte input in an unknown
encoding and produce valid UTF.
I agree.

For a RESTful HTTP service the encoding should be specified in
the HTTP header and the input rejected if it isn't UTF
compatible. For that use scenario you only want validation, not
conversion. However, some validation is free: if you only
accept numbers, you could just turn off parsing of strings in the
template.


If files are read from storage then you can reread the file if it
fails validation on the first pass.

I wonder in which use scenario both of these
conditions fail:
 
1. unspecified character set, so UTF cannot be assumed for the JSON
2. unable to re-parse
Sönke Ludwig via Digitalmars-d
2014-08-25 20:35:33 UTC
Permalink
On 25.08.2014 21:50, "Ola Fosheim Grøstad" wrote:
Post by via Digitalmars-d
Post by Walter Bright via Digitalmars-d
The adaptation is to take arbitrary byte input in an unknown encoding
and produce valid UTF.
I agree.
For a restful http service the encoding should be specified in the http
header and the input rejected if it isn't UTF compatible. For that use
scenario you only want validation, not conversion. However some
validation is free, like if you only accept numbers you could just turn
off parsing of strings in the template

If files are read from storage then you can reread the file if it fails
validation on the first pass.
I wonder, in which use scenario it is that both of these conditions fail?
1. unspecified character-set and cannot assume UTF for JSON
3. unable to re-parse
BTW, JSON is *required* to be UTF encoded anyway as per RFC-7159, which
is another argument for just letting the lexer assume valid UTF.
via Digitalmars-d
2014-08-25 20:51:16 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
BTW, JSON is *required* to be UTF encoded anyway as per
RFC-7159, which is another argument for just letting the lexer
assume valid UTF.
The lexer cannot assume valid UTF since the client might be a
rogue, but it can just bail out if the lookahead isn't JSON? So
UTF validation is limited to strings.
 
You have to parse the strings because of the \uXXXX escapes of
course, so some basic validation is unavoidable. But I guess full
validation of string content could be another useful option along
with "ignore escapes" for the case where you want to avoid
decode-encode scenarios (like for a proxy, or if you store
pre-escaped Unicode in a database).

Walter Bright via Digitalmars-d
2014-08-23 22:20:45 UTC
Permalink
Post by Brad Roberts via Digitalmars-d
Post by Walter Bright via Digitalmars-d
I feel that non-UTF encodings should be handled by adapter algorithms,
not embedded into the JSON lexer, so yes, I'd drop that.
For performance purposes, determining encoding during lexing is useful.
I'm not convinced that using an adapter algorithm won't be just as fast.
Brad Roberts via Digitalmars-d
2014-08-24 01:32:47 UTC
Permalink
Post by Walter Bright via Digitalmars-d
Post by Brad Roberts via Digitalmars-d
Post by Walter Bright via Digitalmars-d
I feel that non-UTF encodings should be handled by adapter algorithms,
not embedded into the JSON lexer, so yes, I'd drop that.
For performance purposes, determining encoding during lexing is useful.
I'm not convinced that using an adapter algorithm won't be just as fast.
Consider your own talks on optimizing the existing dmd lexer. In those
talks you've talked about the evils of additional processing on every
byte. That's what you're talking about here. While it's possible that
the inliner and other optimizer steps might be able to integrate the two
phases and remove some overhead, I'll believe it when I see the
resulting assembly code.
Walter Bright via Digitalmars-d
2014-08-25 19:35:16 UTC
Permalink
Post by Walter Bright via Digitalmars-d
I'm not convinced that using an adapter algorithm won't be just as fast.
Consider your own talks on optimizing the existing dmd lexer. In those talks
you've talked about the evils of additional processing on every byte. That's
what you're talking about here. While it's possible that the inliner and other
optimizer steps might be able to integrate the two phases and remove some
overhead, I'll believe it when I see the resulting assembly code.
On the other hand, deadalnix demonstrated that the ldc optimizer was able to
remove the extra code.

I have a reasonable faith that optimization can be improved where necessary to
cover this.
simendsjo via Digitalmars-d
2014-08-25 19:49:01 UTC
Permalink
Post by Walter Bright via Digitalmars-d
Post by Walter Bright via Digitalmars-d
I'm not convinced that using an adapter algorithm won't be just as fast.
Consider your own talks on optimizing the existing dmd lexer. In those talks
you've talked about the evils of additional processing on every byte.
That's
what you're talking about here. While it's possible that the inliner and other
optimizer steps might be able to integrate the two phases and remove some
overhead, I'll believe it when I see the resulting assembly code.
On the other hand, deadalnix demonstrated that the ldc optimizer was
able to remove the extra code.
I have a reasonable faith that optimization can be improved where
necessary to cover this.
I just happened to write a very small script yesterday and tested with
the compilers (with dub --build=release).

dmd: 2.8 MB
gdc: 3.3 MB
ldc: 0.5 MB

So ldc can remove quite a substantial amount of code in some cases.
Andrej Mitrovic via Digitalmars-d
2014-08-22 19:15:20 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Docs: http://s-ludwig.github.io/std_data_json/
This confused me for a solid minute:

// Lex a JSON string into a lazy range of tokens
auto tokens = lexJSON(`{"name": "Peter", "age": 42}`);

with (JSONToken.Kind) {
assert(tokens.map!(t => t.kind).equal(
[objectStart, string, colon, string, comma,
string, colon, number, objectEnd]));
}

Generally I'd avoid using de-facto reserved names as enum member names
(e.g. string).
Sönke Ludwig via Digitalmars-d
2014-08-22 19:30:37 UTC
Permalink
Post by Andrej Mitrovic via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
Docs: http://s-ludwig.github.io/std_data_json/
// Lex a JSON string into a lazy range of tokens
auto tokens = lexJSON(`{"name": "Peter", "age": 42}`);
with (JSONToken.Kind) {
assert(tokens.map!(t => t.kind).equal(
[objectStart, string, colon, string, comma,
string, colon, number, objectEnd]));
}
Generally I'd avoid using de-facto reserved names as enum member names
(e.g. string).
Hmmm, but it *is* a string. Isn't the problem more the use of with in
this case? Maybe the example should just use with(JSONToken) and then
Kind.string?
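 
For reference, that variant of the documentation example would read (same
imports as the docs example assumed):

import std.algorithm : equal, map;
import stdx.data.json;

// Narrower `with` scope: Kind.string cannot be mistaken for the type name.
auto tokens = lexJSON(`{"name": "Peter", "age": 42}`);

with (JSONToken) {
    assert(tokens.map!(t => t.kind).equal(
        [Kind.objectStart, Kind.string, Kind.colon, Kind.string, Kind.comma,
         Kind.string, Kind.colon, Kind.number, Kind.objectEnd]));
}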
Andrej Mitrovic via Digitalmars-d
2014-08-23 08:16:22 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Hmmm, but it *is* a string. Isn't the problem more the use of with in
this case?
Yeah, maybe so. I thought for a second it was a tuple, but then I saw
the square brackets and was left scratching my head. :)
deadalnix via Digitalmars-d
2014-08-23 02:23:25 UTC
Permalink
First, thank you for your work. std.json is horrible to use right
now, so a replacement is more than welcome.

I haven't played with your code yet, so I may be asking for
something that already exists, but did you have a look at jsvar by
Adam?

You can find it here:
https://github.com/adamdruppe/arsd/blob/master/jsvar.d

One of the big pains when one works with a format like JSON is that
you go from the untyped world to the typed world (the same
problem occurs with XML and various config formats as well).

I think Adam got the right balance in jsvar. It behaves closely
enough to JavaScript that it is convenient to manipulate, while
removing the most dangerous behavior (concatenation is still done
using ~ and not + as in JS).

If that is not already the case, I'd love for the elements I get
out of my JSON to behave that way. If you can do that, you have a
user.
ketmar via Digitalmars-d
2014-08-23 02:38:46 UTC
Permalink
On Sat, 23 Aug 2014 02:23:25 +0000
Post by deadalnix via Digitalmars-d
I haven't played with your code yet, so I may be asking for
somethign that already exists, but did you had a look to jsvar by
Adam ?
- No opDispatch() for JSONValue - this has shown to do more harm than
good in vibe.data.json
Sönke Ludwig via Digitalmars-d
2014-08-23 09:22:02 UTC
Permalink
First thank you for your work. std.json is horrible to use right now, so
a replacement is more than welcome.
I haven't played with your code yet, so I may be asking for somethign
that already exists, but did you had a look to jsvar by Adam ?
https://github.com/adamdruppe/arsd/blob/master/jsvar.d
One of the big pain when one work with format like JSON is that you go
from the untyped world to the typed world (the same problem occurs with
XML and various config format as well).
I think Adam got the right balance in jsvar. It behave closely enough to
javascript so it is convenient to manipulate, while removing the most
dangerous behavior (concatenation is still done using ~and not + as in JS).
If that is not already the case, I'd love that the element I get out of
my JSON behave that way. If you can do that, you have a user.
Setting the issue of opDispatch aside, one of the goals was to use
Algebraic to store values. It is probably not completely as flexible as
jsvar, but still transparently enables a lot of operations (with those
pull requests merged at least). But it has another big advantage, which
is that we can later define other types based on Algebraic, such as
BSONValue, and those can be transparently converted between each
other at runtime in a generic way. A special-case type, on the other hand,
produces nasty dependencies between the formats.

Main issues of using opDispatch (a small illustration follows the list):

- Prone to bugs where a normal field/method of the JSONValue struct is
accessed instead of a JSON field
- On top of that the var.field syntax gives the wrong impression that
you are working with static typing, while var["field"] makes it clear
that runtime indexing is going on
- Every interface change of JSONValue would be a silent breaking
change, because the whole string domain is used up for opDispatch
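 
A hypothetical miniature of the first point (not vibe.data.json code, just an
illustration of why the dot syntax is error-prone):

struct DispatchValue
{
    string[string] fields;
    string opDispatch(string name)() { return fields.get(name, null); }
    size_t length() { return fields.length; }   // a "real" struct member
}

unittest
{
    DispatchValue v;
    v.fields["length"] = "170cm";

    // Reads like JSON field access, but resolves to the struct member:
    assert(v.length == 1);

    // Explicit indexing has no such ambiguity:
    assert(v.fields["length"] == "170cm");
}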
w0rp via Digitalmars-d
2014-08-23 12:19:07 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
- Prone to bugs where a normal field/method of the JSONValue
struct is accessed instead of a JSON field
- On top of that the var.field syntax gives the wrong
impression that you are working with static typing, while
var["field"] makes it clear that runtime indexing is going on
- Every interface change of JSONValue would be a silent
breaking change, because the whole string domain is used up for
opDispatch
I have seen similar issues to these with simplexml in PHP. Using
opDispatch to match all possible names except a few doesn't work
so well.

I'm not sure if you've changed it already, but I agree with the
earlier comment about changing the flag for pretty printing from
a boolean to an enum value. Booleans in interfaces are one of my
pet peeves.
Sönke Ludwig via Digitalmars-d
2014-08-23 12:30:20 UTC
Permalink
I'm not sure if you've changed it already, but I agree with the earlier
comment about changing the flag for pretty printing from a boolean to an
enum value. Booleans in interfaces is one of my pet peeves.
It's split into two separate functions now. Having to type out a full
enum value I guess would be too distracting in this case, since they
will be pretty frequently used.
deadalnix via Digitalmars-d
2014-08-23 20:45:05 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
- Prone to bugs where a normal field/method of the JSONValue
struct is accessed instead of a JSON field
- On top of that the var.field syntax gives the wrong
impression that you are working with static typing, while
var["field"] makes it clear that runtime indexing is going on
- Every interface change of JSONValue would be a silent
breaking change, because the whole string domain is used up for
opDispatch
Yes, I don't mind missing that one. It looks like a false good
idea.
Sönke Ludwig via Digitalmars-d
2014-08-25 11:30:16 UTC
Permalink
I've added support (compile time option [1]) for long and BigInt in the
lexer (and parser), see [2]. JSONValue currently still only stores
double for numbers. There are two options for extending JSONValue:

1. Add long and BigInt to the set of supported types for JSONValue. This
preserves all features of Algebraic and would later still allow
transparent conversion to other similar value types (e.g. BSONValue). On
the other hand it would be necessary to always check the actual type
before accessing a number, or the Algebraic would throw.

2. Instead of double, store a JSONNumber in the Algebraic. This enables
all the transparent conversions of JSONNumber and would thus be more
convenient, but blocks the way for possible automatic conversions in the
future.

I'm leaning towards 1, because allowing generic conversion between
different JSONValue-like types was one of my prime goals for the new module.

[1]:
http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/LexOptions.html
[2]:
http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/JSONNumber.html
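 
To illustrate option 1 with plain std.variant (this is only a sketch of the
consequence described above, not the module's interface):

import std.bigint : BigInt;
import std.variant : Algebraic;

alias Number = Algebraic!(double, long, BigInt);

// Check the actual type before accessing the number, as described in
// option 1 above, then convert explicitly.
double asDouble(Number n)
{
    if (n.type == typeid(long))   return cast(double) n.get!long;
    if (n.type == typeid(BigInt)) return cast(double) n.get!BigInt.toLong(); // lossy
    return n.get!double;
}

unittest
{
    Number a = 42L;
    Number b = 3.14;
    assert(asDouble(a) == 42.0);
    assert(asDouble(b) == 3.14);
}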
via Digitalmars-d
2014-08-25 12:12:18 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
I've added support (compile time option [1]) for long and
BigInt in the lexer (and parser), see [2]. JSONValue currently
still only stores double for numbers.
It can be very useful to have a base-10 exponent representation
in certain situations where you need to have the exact same
results in two systems (like a third-party ERP server versus a
client-side application). Base-2 exponents are tricky (incorrect)
when you read ASCII.

E.g. I have resorted to using Decimal in Python just to avoid the
weird round off issues when calculating prices where the price is
given in fractions of the order unit.

Perhaps a marginal problem, but could be important for some
serious application areas where you need to integrate D with
existing systems (for which you don't have the source code).
Sönke Ludwig via Digitalmars-d
2014-08-25 13:51:03 UTC
Permalink
On 25.08.2014 14:12, "Ola Fosheim Grøstad" wrote:
Post by Sönke Ludwig via Digitalmars-d
I've added support (compile time option [1]) for long and BigInt in
the lexer (and parser), see [2]. JSONValue currently still only stores
double for numbers.
It can be very useful to have a base 10 exponent representation in
certain situations where you need to have the exact same results in two
systems (like a third party ERP server versus a client side
application). Base 2 exponents are tricky (incorrect) when you read ascii.
E.g. I have resorted to using Decimal in Python just to avoid the weird
round off issues when calculating prices where the price is given in
fractions of the order unit.
Perhaps a marginal problem, but could be important for some serious
application areas where you need to integrate D with existing systems
(for which you don't have the source code).
In fact, I've already prepared the code for that, but commented it out
for now, because I wanted to have an efficient algorithm for converting
double to Decimal and because we should probably first add a Decimal
type to Phobos instead of adding it to the JSON module.
Don via Digitalmars-d
2014-08-25 13:07:07 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
(...)
One missing feature (which is also missing from the existing
std.json) is support for NaN and Infinity as JSON values.
Although they are not part of the formal JSON spec (which is a
ridiculous omission, the argument given for excluding them is
fallacious), they do get generated if you use JavaScript's
toString to create the JSON. Many JSON libraries (e.g. Google's)
also generate them, so they are frequently encountered in
practice. So a JSON parser should at least be able to lex them.

i.e. this should be parsable:

{"foo": NaN, "bar": Infinity, "baz": -Infinity}

You should also put tests in for what happens when you pass NaN
or infinity to toJSON. It shouldn't silently generate invalid
JSON.
via Digitalmars-d
2014-08-25 13:23:44 UTC
Permalink
Post by Don via Digitalmars-d
practice. So a JSON parser should at least be able to lex them.
{"foo": NaN, "bar": Infinity, "baz": -Infinity}
You should also put tests in for what happens when you pass NaN
or infinity to toJSON. It shouldn't silently generate invalid
JSON.
I believe you are allowed to use very high exponents, though,
like 1E999. So you need to decide whether those should be mapped to
+Infinity or to the max value.


NaN also comes in two forms with differing semantics:
signalling (NaNs) and quiet (NaN). NaN is used for 0/0 and
sqrt(-1), but NaNs is used for illegal values and failure.

For some reason D does not seem to support this aspect of
IEEE754? I cannot find ".nans" listed on the page
http://dlang.org/property.html

The distinction is important when you do conditional branching.
With NaNs you might not be able to figure out which branch to
take since you might have missed out on a real value; with NaN
you got the value (which is known not to be real) and you might
be able to branch.
Walter Bright via Digitalmars-d
2014-08-25 19:42:06 UTC
Permalink
On 8/25/2014 6:23 AM, "Ola Fosheim Grøstad" wrote:
Post by Don via Digitalmars-d
practice. So a JSON parser should at least be able to lex them.
{"foo": NaN, "bar": Infinity, "baz": -Infinity}
You should also put tests in for what happens when you pass NaN or infinity to
toJSON. It shouldn't silently generate invalid JSON.
I believe you are allowed to use very high exponents, though. Like: 1E999 . So
you need to decide if those should be mapped to +Infinity or to the max value

Infinity. Mapping to max value would be a horrible bug.
NaN also come in two forms with differing semantics: signalling(NaNs) and quiet
(NaN). NaN is used for 0/0 and sqrt(-1), but NaNs is used for illegal values
and failure.
For some reason D does not seem to support this aspect of IEEE754? I cannot find
".nans" listed on the page http://dlang.org/property.html
Because I tried supporting them in C++. It doesn't work for various reasons.
Nobody else supports them, either.
via Digitalmars-d
2014-08-25 20:04:09 UTC
Permalink
Post by Walter Bright via Digitalmars-d
Infinity. Mapping to max value would be a horrible bug.
Yes... but then you are reading an illegal value that JSON does not
support.

Post by Walter Bright via Digitalmars-d
Post by via Digitalmars-d
For some reason D does not seem to support this aspect of
IEEE754? I cannot find
".nans" listed on the page http://dlang.org/property.html
Because I tried supporting them in C++. It doesn't work for
various reasons. Nobody else supports them, either.
I haven't tested, but Python is supposed to throw on NaNs.

gcc has support for nans in their documentation:
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

IBM Fortran supports it


I think supporting signaling NaN is important for correctness.
via Digitalmars-d
2014-08-25 20:21:24 UTC
Permalink
On Monday, 25 August 2014 at 20:04:10 UTC, Ola Fosheim Grøstad wrote:
Post by via Digitalmars-d
I think supporting signaling NaN is important for correctness.
It is defined in C++11:

http://en.cppreference.com/w/cpp/types/numeric_limits/signaling_NaN
Sönke Ludwig via Digitalmars-d
2014-08-25 14:04:13 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
(...)
One missing feature (which is also missing from the existing std.json)
is support for NaN and Infinity as JSON values. Although they are not
part of the formal JSON spec (which is a ridiculous omission, the
argument given for excluding them is fallacious), they do get generated
if you use Javascript's toString to create the JSON. Many JSON libraries
(eg Google's) also generate them, so they are frequently encountered in
practice. So a JSON parser should at least be able to lex them.
{"foo": NaN, "bar": Infinity, "baz": -Infinity}
This would probably best be added as another (CT) optional feature. I think
the default should strictly adhere to the JSON specification, though.
You should also put tests in for what happens when you pass NaN or
infinity to toJSON. It shouldn't silently generate invalid JSON.
Good point. The current solution to just use formattedWrite("%.16g") is
also not ideal.
Sönke Ludwig via Digitalmars-d
2014-08-25 15:34:30 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
(...)
One missing feature (which is also missing from the existing std.json)
is support for NaN and Infinity as JSON values. Although they are not
part of the formal JSON spec (which is a ridiculous omission, the
argument given for excluding them is fallacious), they do get generated
if you use Javascript's toString to create the JSON. Many JSON libraries
(eg Google's) also generate them, so they are frequently encountered in
practice. So a JSON parser should at least be able to lex them.
{"foo": NaN, "bar": Infinity, "baz": -Infinity}
This would probably best added as another (CT) optional feature. I think
the default should strictly adhere to the JSON specification, though.
http://s-ludwig.github.io/std_data_json/stdx/data/json/lexer/LexOptions.specialFloatLiterals.html
Post by Sönke Ludwig via Digitalmars-d
You should also put tests in for what happens when you pass NaN or
infinity to toJSON. It shouldn't silently generate invalid JSON.
Good point. The current solution to just use formattedWrite("%.16g") is
also not ideal.
By default, floating-point special values are now output as 'null',
according to the ECMAScript standard. Optionally, they will be emitted
as 'NaN' and 'Infinity':

http://s-ludwig.github.io/std_data_json/stdx/data/json/generator/GeneratorOptions.specialFloatLiterals.html
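 
A short sketch of how the lexer side of this is meant to be used (the option
name is taken from the linked documentation; passing it as a template argument
is an assumption here):

import stdx.data.json;

void example()
{
    // Accepted only because the option is given; the default rejects these
    // literals, in line with RFC 7159.
    auto tokens = lexJSON!(LexOptions.specialFloatLiterals)(
        `{"foo": NaN, "bar": Infinity, "baz": -Infinity}`);

    foreach (t; tokens)
    {
        // ... process tokens as usual ...
    }
}

GeneratorOptions.specialFloatLiterals plays the same role on the output side.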
via Digitalmars-d
2014-08-25 15:46:11 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
By default, floating-point special values are now output as
'null', according to the ECMA-script standard. Optionally, they
ECMAScript presumes double. I think one should base Phobos on
language-independent standards. I suggest:

http://tools.ietf.org/html/rfc7159

For a web server it would be most useful to get an exception,
since otherwise you risk ending up with web clients not working and no
logging. It is better to have an exception and log an error so
the problem can be fixed.
via Digitalmars-d
2014-08-25 19:20:56 UTC
Permalink
On Monday, 25 August 2014 at 15:46:12 UTC, Ola Fosheim Grøstad wrote:
Post by via Digitalmars-d
For a web server it would be most useful to get an exception
since you risk ending up with web-clients not working with no
logging. It is better to have an exception and log an error so
the problem can be fixed.
Let me expand a bit on the difference between web clients and
servers, assuming D is used on the server:

* Web servers have to check all input and log illegal activity.
It is either a bug or an attack.

* Web clients don't have to check input from the server (at most
a crypto check) and should not do double work if servers validate
anyway.

* Web servers detect errors and send the error as a response to
the client that displays it as a warning to the user. This is the
uncommon case so you don't want to burden the client with it.

From this we can infer:

- It makes more sense for ECMAScript to turn illegal values into
null since it runs on the client.

- The server needs efficient validation of input so that it can
have faster response.

- The more validation of typedness you can integrate into
the parser, the better.


Thus it would be an advantage to be able to configure the
validation done in the parser (through template mechanisms):


1. On write: throw an exception on all illegal values or values that
cannot be represented in the format. If the values are illegal
then the client should not receive them. They could cause legal
problems (like wrong prices).


2. On read: add the ability to configure the validation of
typedness on many parameters:

- no nulls, no dicts, only nesting arrays etc

- predetermined key-values and automatic mapping to structs on
exact match.

- require all leaf arrays to be uniform (array of strings, array
of numbers)

- match a predefined grammar

etc
Sönke Ludwig via Digitalmars-d
2014-08-25 20:29:01 UTC
Permalink
- It makes more sense for ECMAScript to turn illegal values into null
since it runs on the client.
Like... node.js?

Sorry, just kidding.

I don't think it makes sense for clients to be less strict about such
things, but I do agree with your assessment about being as strict as
possible on the server. I also do think that exceptions are a perfect
tool, especially for server applications, and that instead of avoiding
them because they are slow, they should rather be made fast enough
not to be an issue.
Sönke Ludwig via Digitalmars-d
2014-08-25 20:21:02 UTC
Permalink
On 25.08.2014 17:46, "Ola Fosheim Grøstad" wrote:
Post by via Digitalmars-d
Post by Sönke Ludwig via Digitalmars-d
By default, floating-point special values are now output as 'null',
according to the ECMA-script standard. Optionally, they will be
ECMAScript presumes double. I think one should base Phobos on
http://tools.ietf.org/html/rfc7159
Well, of course it's based on that RFC, did you seriously think
something else? However, that standard has no mention of infinity or
NaN, and since JSON is designed to be a subset of ECMAScript, it's
basically the only thing that comes close.
Post by via Digitalmars-d
For a web server it would be most useful to get an exception since you
risk ending up with web-clients not working with no logging. It is
better to have an exception and log an error so the problem can be fixed.
Although you have a point there of course, it's also highly unlikely
that those clients would work correctly if we presume that JSON
supported infinity/NaN. So it would really be just coincidence to detect
a bug like that.

But I generally agree, it's just that the anti-exception voices are
pretty loud these days (including Walter's), so I opted for a
non-throwing solution instead. I guess it wouldn't hurt, though, to
default to throwing an exception, while still providing the
GeneratorOptions.specialFloatLiterals option to handle those values
without exception overhead, but in a non-standard-conforming way.
Sönke Ludwig via Digitalmars-d
2014-08-25 20:37:43 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
that standard has no mention of infinity or
NaN
Sorry, to be precise, it has no suggestion of how to *handle* infinity
or NaN.
via Digitalmars-d
2014-08-25 20:39:53 UTC
Permalink
Post by Sönke Ludwig via Digitalmars-d
Well, of course it's based on that RFC, did you seriously think
something else?
I made no assumptions, just responded to what you wrote :-). It
would be reasonable in the context of vibe.d to assume the
ECMAScript spec.
Post by Sönke Ludwig via Digitalmars-d
But I generally agree, it's just that the anti-exception voices
are pretty loud these days (including Walter's), so that I
opted for a non-throwing solution instead.
Yes, the minimum requirement is to just get "did not validate"
directly as a single value. One can create a wrapper to get
exceptions.
Post by Sönke Ludwig via Digitalmars-d
I guess it wouldn't hurt though to default to throwing an
exception, while still providing the
GeneratorOptions.specialFloatLiterals option to handle those
values without exception overhead, but in a non
standard-conforming way.
What I care most about is getting all the free validation that
can be added with no extra cost.

That will make writing web services easier. Like if you can
define constraints like:

- root is array, values are strings.
- root is array, second level only arrays, third level is numbers
- root is dict, all arrays contain only numbers

What is a bit annoying about generic libs is that you have no
idea what you are getting so you have to spend time creating dull
validation code.

But maybe StructuredJSON should be a separate library. It would
be useful for REST services to specify the grammar and
auto-generate both JavaScript and D structures to hold it along
with validation code.

However, just turning off parsing of "true", "false", "null",
"[", "{" etc seems like a cheap addition that also can improve
parsing speed if the compiler can make do with two if statements
instead of a switch.

Ola.