hippy
Posts: 13793
Joined: Fri Sep 09, 2011 10:34 pm
Location: UK

Re: Is my fear of C++ justified?

Wed May 24, 2023 7:35 pm

dbrion06 wrote:
Wed May 24, 2023 7:02 pm
grep "$TO_BE_FOUND" ./* 2>/dev/null
But is that really programming or simply 'how to use someone else's tool' ? I would say it gives little insight into how the desired task is actually achieved if they wanted to implement the same themselves, but in getting the job done I would accept it as a valid solution.

What I'd like is an explanation of why it seems to hang when run in my /tmp, which is a RAM disk. I'm not expecting you to give me one unless you happen to know, and I'm not holding that against it being a solution.

ejolson
Posts: 10957
Joined: Tue Mar 18, 2014 11:47 am

Re: Is my fear of C++ justified?

Wed May 24, 2023 8:54 pm

bensimmo wrote:
Wed May 24, 2023 7:05 pm
So in data analysis, what makes C++ fearful or not fearful to a new user, compared to, say, the most common language used now, Python?
Ignoring speed, what makes C++ better to learn for this job?
Lack of an interactive read-evaluate-print loop coupled with capable graphics and statistics libraries is what makes C++ and even Rust fearful for visualization and the exploratory aspects of data analysis. One is better off with R, Julia or Python. All three of those languages are further supported by Jupyter notebooks which are even less fearful to many.

https://jupyter.org/

Apple's new Swift language includes a just-in-time compiled REPL. However, although Swift is open source and can run on a Raspberry Pi in 64-bit mode, I don't think the ecosystem is developed for statistics and data science. And then there's Matlab which doesn't run on a Pi but can use it as an interactive front end.

My favorite is currently Julia.
Last edited by ejolson on Thu May 25, 2023 1:08 pm, edited 1 time in total.

swampdog
Posts: 1036
Joined: Fri Dec 04, 2015 11:22 am

Re: Is my fear of C++ justified?

Wed May 24, 2023 11:00 pm

tttapa wrote:
Tue May 23, 2023 8:16 pm
swampdog wrote:
Tue May 23, 2023 8:02 pm
When you create your own classes/structs they'll need a ctor,cctor,dtor,ator.
You don't need them. In fact, I would go as far as saying that most of your own classes should not define custom copy/move constructors, destructors, or assignment operators. See “the Rule of Zero” (C++ Core Guidelines C.20: If you can avoid defining default operations, do).
That's a useful link.
tttapa wrote:
swampdog wrote:

Code: Select all

for (
    std::map<std::string,std::string>::const_iterator i =v.begin();
    i != v.end();
    ++i
     )
Messing around with iterators like this is error-prone; a range-based for loop can be useful here:

Code: Select all

    std::map<std::string, std::string> dictionary;
    dictionary["fred"] = "jim";
    dictionary["IP"] = "192.168.1.1";
    for (auto &[key, value] : dictionary)
        std::cout << '[' << key << "]=[" << value << "]\n";
My post did initially have an "auto" but I messed it up. Too many interruptions or I'm too tired most of the time, often both currently. I wrote something like..

Code: Select all

 {const std::deque<foo> &   cd  (d);
  for (auto i : cd) {          // 'auto i' copies each element of cd
    i.s += " ?uh?";            // so this changes the copy, not the const deque
    std::cout<<i.s<<'\n';
  }
 }
..after my initial attempt (code gone) seemed to be modifying a const. I do happen to be vaguely awake atm and can't reproduce what led me astray! Having "set(CMAKE_CXX_STANDARD 11)" didn't help. Somehow I got from the below to the above.

Code: Select all

for (auto const & [s,i] : d) //blah
..and concluded I'd found a clang bug. Hmm, silly me!!

Anyway. Occasionally multiple errors do lead to a reasonable conclusion - which was "don't post that auto code" because the moment the OP single-steps a debugger into the STL they are going to be faced with iterators.

Given what the OP has expanded on subsequently (getting others more understanding) I think python is the way to go. Sure I can effectively write..

Code: Select all

sdFileLoad(&object,"blah");
sdRegex rx(pattern);
for (whatever)
  //some rx action
;
..but it can be achieved with shell script. You only need the above in the case where shell script is too slow.

At work, sometimes non-programmers would want to know how we could figure things out so quickly. Typically there'd be some spreadsheet or database. We'd export stuff as text then grep/awk/sed it. Almost all would lose interest when faced with a regex. Of the ones who didn't, they could become dangerous in that we'd find shell script had been modified - it only takes one person to (say) issue "select * from foo >/tmp/z.txt" to bring a system down.

Heater
Posts: 19415
Joined: Tue Jul 17, 2012 3:02 pm

Re: Is my fear of C++ justified?

Thu May 25, 2023 8:33 am

ejolson wrote:
Wed May 24, 2023 3:33 pm
Was the goal to check whether all files in a particular directory contain valid utf-8 text? I wonder if there is a language in which such a task would be easy to express so a non-programmer could understand it.
Sure. Rust to the rescue.

Code: Select all

    let b = std::str::from_utf8(&bytes).is_ok();
Will set the "b" to true if the byte sequence "bytes" consists entirely of valid UTF-8 character code points, false otherwise. Note: "b" is a proper boolean type not some integer pretending to be one.

If you actually want to make a proper string (a "&str" slice borrowed from those bytes) then this will do it:

Code: Select all

let s = str::from_utf8(&bytes); 
Of course if "bytes" does not contain a valid utf8 sequence the Result "s" will be an error so we had better check for that:

Code: Select all

    let s = match str::from_utf8(&bytes) {
        Ok(v) => v,
        Err(e) => panic!("Invalid UTF-8 sequence: {e}"),
    };  
Where "panic!" will exit the program with a nice error message. Of course you can handle the error however you like.

Or if one is just messing around experimenting and does not care about errors yet then this:

Code: Select all

    let s = str::from_utf8(&bytes).unwrap();
Will unwrap the Result type produced by "from_utf8" and produce a string slice (&str), or exit the program with a panic if the byte sequence was not valid utf8.

Or if the function you are using this in returns the appropriate Result type you can just do this:

Code: Select all

    let s = str::from_utf8(&bytes)?;
Where the "?" will cause the function to return with the error if the byte sequence is not valid utf8.


NOTE: The String type in Rust is guaranteed to always be valid utf8.
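
The fragments above can be pulled together into one small program. This is only a sketch: the helper names "is_utf8" and "first_char" are made up for illustration. It also shows the difference between std::str::from_utf8, which borrows the bytes and yields a &str, and String::from_utf8, which takes ownership of a Vec<u8> and yields an owned String.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical helper: true if `bytes` is valid UTF-8.
fn is_utf8(bytes: &[u8]) -> bool {
    str::from_utf8(bytes).is_ok()
}

/// Hypothetical helper: `?` hands the Utf8Error back to the caller.
fn first_char(bytes: &[u8]) -> Result<Option<char>, Utf8Error> {
    let s = str::from_utf8(bytes)?; // s is a borrowed &str
    Ok(s.chars().next())
}

fn main() {
    let good = "Grüße".as_bytes();
    let bad = [0x66, 0x6f, 0xff]; // "fo" followed by a byte that never occurs in UTF-8

    println!("good is UTF-8: {}", is_utf8(good));
    println!("bad is UTF-8: {}", is_utf8(&bad));

    match first_char(good) {
        Ok(c) => println!("first char: {:?}", c),
        Err(e) => println!("error: {}", e),
    }

    // An owned String comes from String::from_utf8, which consumes a Vec<u8>.
    let owned: String = String::from_utf8(good.to_vec()).expect("validated above");
    println!("owned: {}", owned);
}
```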
Slava Ukrayini.

bensimmo
Posts: 6066
Joined: Sun Dec 28, 2014 3:02 pm
Location: East Yorkshire

Re: Is my fear of C++ justified?

Thu May 25, 2023 8:58 am

Speaking of Matlab, there is of course Mathematica, which you can use freely on the Raspberry Pi for your data analysis needs. It is now back with the 64-bit OS too.

Less fearful than C++

ejolson
Posts: 10957
Joined: Tue Mar 18, 2014 11:47 am

Re: Is my fear of C++ justified?

Thu May 25, 2023 4:01 pm

Heater wrote:
Thu May 25, 2023 8:33 am
ejolson wrote:
Wed May 24, 2023 3:33 pm
Was the goal to check whether all files in a particular directory contain valid utf-8 text? I wonder if there is a language in which such a task would be easy to express so a non-programmer could understand it.
Sure. Rust to the rescue.

Code: Select all

    let b = std::str::from_utf8(&bytes).is_ok();
Will set the "b" to true if the byte sequence "bytes" consists entirely of valid UTF-8 character code points, false otherwise. Note: "b" is a proper boolean type not some integer pretending to be one.

If you actually want to make a proper string (a "&str" slice borrowed from those bytes) then this will do it:

Code: Select all

let s = str::from_utf8(&bytes); 
Of course if "bytes" does not contain a valid utf8 sequence the Result "s" will be an error so we had better check for that:

Code: Select all

    let s = match str::from_utf8(&bytes) {
        Ok(v) => v,
        Err(e) => panic!("Invalid UTF-8 sequence: {e}"),
    };  
Where "panic!" will exit the program with a nice error message. Of course you can handle the error however you like.

Or if one is just messing around experimenting and does not care about errors yet then this:

Code: Select all

    let s = str::from_utf8(&bytes).unwrap();
Will unwrap the Result type produced by "from_utf8" and produce a string slice (&str), or exit the program with a panic if the byte sequence was not valid utf8.

Or if the function you are using this in returns the appropriate Result type you can just do this:

Code: Select all

    let s = str::from_utf8(&bytes)?;
Where the "?" will cause the function to return with the error if the byte sequence is not valid utf8.


NOTE: The String type in Rust is guaranteed to always be valid utf8.
Okay, suppose a person wants to check all files in, for example, the root partition of the May 3rd 2023 release of Raspberry Pi OS with Desktop to determine what percentage of files are utf-8 text files.

If a file is not utf-8 one can stop reading at the first unrecognizable code point, however, if it is utf-8 then one needs to read until the end to check every character. Since arbitrarily cutting a utf-8 file at a byte boundary can result in two pieces that are not utf-8 strings (when a code point is split in half), I'm not sure how checking whether a file is utf-8 can be done efficiently using the from_utf8 function. In particular, that call acts on a buffer and seems to require reading the whole file before checking.
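
One possible way around the split-code-point problem, using only the standard library, is to validate chunk by chunk and carry any unfinished tail over to the next read. Utf8Error::error_len() returns None exactly when the input ended part-way through a code point, and valid_up_to() says where that tail begins. The function name "is_utf8_stream" and the 8 KiB buffer size are arbitrary choices for this sketch, and it has not been benchmarked.

```rust
use std::io::{self, Read};
use std::str;

/// Hypothetical helper: Ok(true) if everything read from `reader` is valid UTF-8.
fn is_utf8_stream<R: Read>(mut reader: R) -> io::Result<bool> {
    let mut buf = [0u8; 8192];
    let mut pending = 0; // bytes of an unfinished code point kept from the last chunk
    loop {
        let n = match reader.read(&mut buf[pending..]) {
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if n == 0 {
            // End of input: leftover bytes mean the last code point was truncated.
            return Ok(pending == 0);
        }
        let filled = pending + n;
        match str::from_utf8(&buf[..filled]) {
            Ok(_) => pending = 0,
            Err(e) => match e.error_len() {
                // A definitely invalid sequence: stop reading early.
                Some(_) => return Ok(false),
                // The chunk ends inside a code point: move the tail to the front.
                None => {
                    let valid = e.valid_up_to();
                    buf.copy_within(valid..filled, 0);
                    pending = filled - valid;
                }
            },
        }
    }
}

fn main() -> io::Result<()> {
    // Check every file named on the command line.
    for path in std::env::args().skip(1) {
        let ok = is_utf8_stream(std::fs::File::open(&path)?)?;
        println!("{}: {}", path, if ok { "UTF-8" } else { "not UTF-8" });
    }
    Ok(())
}
```

Because at most three bytes are ever carried over, memory use stays fixed regardless of file size, and an invalid file is abandoned at the first bad chunk.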

Heater
Posts: 19415
Joined: Tue Jul 17, 2012 3:02 pm

Re: Is my fear of C++ justified?

Thu May 25, 2023 9:19 pm

ejolson wrote:
Thu May 25, 2023 4:01 pm
If a file is not utf-8 one can stop reading at the first unrecognizable code point, however, if it is utf-8 then one needs to read until the end to check every character. Since arbitrarily cutting a utf-8 file at a byte boundary can result in two pieces that are not utf-8 strings (when a code point is split in half), I'm not sure how checking whether a file is utf-8 can be done efficiently using the from_utf8 function. In particular, that call acts on a buffer and seems to require reading the whole file before checking.
Hmmm.... How about this:

Code: Select all

use std::fs::File;
use utf8_read::Reader;

fn main() -> std::io::Result<()> {
    let in_file = File::open("foo.txt")?;
    let mut file_reader = Reader::new(&in_file);

    for c in file_reader.into_iter() {
        if c.is_err() {
            println!("File contains non-utf8");
            break;
        }
    }

    Ok(())
}
Where:
The utf8-read module provides a streaming char Reader that converts any stream with the std::io::Read into a stream of char values, performing UTF8 decoding incrementally.

If the std::io::Read stream comes from a file then this is just a streaming version of (e.g.) std::fs::read_to_string, but if it comes from, e.g., a std::net::TcpStream then it has more value: iterating through the characters of the stream will terminate when the TCP stream has stalled mid-UTF8, and can restart when the TCP stream gets more data.

The Reader provided also allows for reading large UTF8 files piecewise; it only reads up to 2kB of data at a time from its stream.

ejolson
Posts: 10957
Joined: Tue Mar 18, 2014 11:47 am

Re: Is my fear of C++ justified?

Thu May 25, 2023 11:07 pm

Heater wrote:
Thu May 25, 2023 9:19 pm
ejolson wrote:
Thu May 25, 2023 4:01 pm
If a file is not utf-8 one can stop reading at the first unrecognizable code point, however, if it is utf-8 then one needs to read until the end to check every character. Since arbitrarily cutting a utf-8 file at a byte boundary can result in two pieces that are not utf-8 strings (when a code point is split in half), I'm not sure how checking whether a file is utf-8 can be done efficiently using the from_utf8 function. In particular, that call acts on a buffer and seems to require reading the whole file before checking.
Hmmm.... How about this:

Code: Select all

use std::fs::File;
use utf8_read::Reader;

fn main() -> std::io::Result<()> {
    let in_file = File::open("foo.txt")?;
    let mut file_reader = Reader::new(&in_file);

    for c in file_reader.into_iter() {
        if c.is_err() {
            println!("File contains non-utf8");
            break;
        }
    }

    Ok(())
}
Where:
The utf8-read module provides a streaming char Reader that converts any stream with the std::io::Read into a stream of char values, performing UTF8 decoding incrementally.

If the std::io::Read stream comes from a file then this is just a streaming version of (e.g.) std::fs::read_to_string, but if it comes from, e.g., a std::net::TcpStream then it has more value: iterating through the characters of the stream will terminate when the TCP stream has stalled mid-UTF8, and can restart when the TCP stream gets more data.

The Reader provided also allows for reading large UTF8 files piecewise; it only reads up to 2kB of data at a time from its stream.
Looks good. Here on the old but recently updated dual PIII server I obtained

Code: Select all

$ rustc u8scan.rs 
Illegal instruction
Maybe some P4 instructions slipped in to the Alpine Linux build of Rust. I'll try on the Pi later.

Heater
Posts: 19415
Joined: Tue Jul 17, 2012 3:02 pm

Re: Is my fear of C++ justified?

Fri May 26, 2023 12:40 am

If you want to build and run that I suggest you use Cargo.

Code: Select all

# cargo new check-utf8-file
# cd check-utf8-file
# cargo add utf8-read
Edit src/main.rs as required.

Code: Select all

# cargo build
# cargo run 
Better to install Rust as described here https://www.rust-lang.org/tools/install than use whatever apt installs.

I have no idea about Alpine Linux.

ejolson
Posts: 10957
Joined: Tue Mar 18, 2014 11:47 am

Re: Is my fear of C++ justified?

Fri May 26, 2023 1:37 am

ejolson wrote:
Thu May 25, 2023 11:07 pm
Heater wrote:
Thu May 25, 2023 9:19 pm
ejolson wrote:
Thu May 25, 2023 4:01 pm
If a file is not utf-8 one can stop reading at the first unrecognizable code point, however, if it is utf-8 then one needs to read until the end to check every character. Since arbitrarily cutting a utf-8 file at a byte boundary can result in two pieces that are not utf-8 strings (when a code point is split in half), I'm not sure how checking whether a file is utf-8 can be done efficiently using the from_utf8 function. In particular, that call acts on a buffer and seems to require reading the whole file before checking.
Hmmm.... How about this:

Code: Select all

use std::fs::File;
use utf8_read::Reader;

fn main() -> std::io::Result<()> {
    let in_file = File::open("foo.txt")?;
    let mut file_reader = Reader::new(&in_file);

    for c in file_reader.into_iter() {
        if c.is_err() {
            println!("File contains non-utf8");
            break;
        }
    }

    Ok(())
}
Where:
The utf8-read module provides a streaming char Reader that converts any stream with the std::io::Read into a stream of char values, performing UTF8 decoding incrementally.

If the std::io::Read stream comes from a file then this is just a streaming version of (e.g.) std::fs::read_to_string, but if it comes from, e.g., a std::net::TcpStream then it has more value: iterating through the characters of the stream will terminate when the TCP stream has stalled mid-UTF8, and can restart when the TCP stream gets more data.

The Reader provided also allows for reading large UTF8 files piecewise; it only reads up to 2kB of data at a time from its stream.
Looks good. Here on the old but recently updated dual PIII server I obtained

Code: Select all

$ rustc u8scan.rs 
Illegal instruction
Maybe some P4 instructions slipped in to the Alpine Linux build of Rust. I'll try on the Pi later.
Woohoo! It's working on the Pi 4B, but my feeble attempts to refactor the code have led to the error

Code: Select all

let in_file = File::open(s)?;
                           ^ cannot use the `?` operator in a function that returns `bool`
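
That message appears because "?" returns early with the error value, so the enclosing function has to return a Result (or Option) itself. Two possible ways out are sketched below. The function names are made up, and a whole-file read from the standard library stands in for the utf8_read crate to keep the example self-contained.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Option 1: return io::Result<bool> so that `?` is allowed inside.
fn check_u8(path: &str) -> io::Result<bool> {
    let mut file = File::open(path)?; // I/O errors go back to the caller
    let mut bytes = Vec::new();
    file.read_to_end(&mut bytes)?;
    Ok(std::str::from_utf8(&bytes).is_ok())
}

/// Option 2: keep the plain bool and decide what an I/O error means.
/// Here an unreadable file simply counts as "not UTF-8".
fn check_u8_or_false(path: &str) -> bool {
    check_u8(path).unwrap_or(false)
}

fn main() {
    for path in std::env::args().skip(1) {
        match check_u8(&path) {
            Ok(is_utf8) => println!("{}: UTF-8 = {}", path, is_utf8),
            Err(e) => println!("{}: could not read ({})", path, e),
        }
        println!("{}: as plain bool = {}", path, check_u8_or_false(&path));
    }
}
```

Option 1 keeps "could not open the file" distinct from "opened it, and it is not UTF-8", which matters when counting files across a whole filesystem.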
I was working on my car today and discovered my fear of rust might be greater than my fear of C++.
Last edited by ejolson on Fri May 26, 2023 3:49 am, edited 1 time in total.

ejolson
Posts: 10957
Joined: Tue Mar 18, 2014 11:47 am

Re: Is my fear of C++ justified?

Fri May 26, 2023 2:58 am

ejolson wrote:
Fri May 26, 2023 1:37 am
I was working on my car today and discovered my fear of rust might be greater than my fear of C++.
The cable broke and my main accomplishment was inserting some boards of wood inside the door under the window so it stays up before the rain came. After checking the rust on the car, I had some time to check Rust on the Pi.
Heater wrote:
Fri May 26, 2023 12:40 am
If you want to build and run that I suggest you use Cargo.
Thanks! Here are my results:

Code: Select all

$ cargo build --release
   Compiling u8scan v0.1.0 (/x/turb/ejolson/code/u8scan)
    Finished release [optimized] target(s) in 2.65s
$ wget https://downloads.raspberrypi.org/raspios_armhf/images/raspios_armhf-2023-05-03/2023-05-03-raspios-bullseye-armhf.img.xz
$ unxz 2023-05-03-raspios-bullseye-armhf.img.xz
$ su
# mkdir rootfs
# mount -o offset=272629760 2023-05-03-raspios-bullseye-armhf.img rootfs
# cd rootfs
# ../target/release/u8scan
Fido's UTF-8 popularity statistics:

    Total files: 112362
    UTF-8 files: 39217
    UTF-8/total: 34.902369128353%
# cd ..
# umount rootfs
# exit
I don't know if 34.902369128353 percent of the files in Raspberry Pi OS are really UTF-8, but it seems plausible.

For reference my code was

Code: Select all

use std::fs::File;
use utf8_read::Reader;
use walkdir::WalkDir;

fn checku8(s:&str) -> bool {
    let in_file = File::open(s).unwrap();
    let mut file_reader = Reader::new(&in_file);
    for c in file_reader.into_iter() {
        if c.is_err() {
            return false;
        }
    }
    return true;
}

fn main() -> std::io::Result<()> {
    let mut ucount=0; let mut tcount=0;
    for entry in WalkDir::new(".").into_iter().filter_map(|e| e.ok()) {
        if entry.path().is_file() {
            let s=entry.path().to_str().unwrap();
            tcount+=1;
            if checku8(s) {
                ucount+=1;
            }
        }
    }
    println!("Fido's UTF-8 popularity statistics:\n");
    println!("\tTotal files: {}\n\tUTF-8 files: {}",tcount,ucount);
    println!("\tUTF-8/total: {}%",100.0*ucount as f64/tcount as f64);
    Ok(())
}
I dedicated the program to the dog developer, who immediately complained about how few people use PETSCII encoding these days. I replied that UTF-8 was better because of the 🐶 emoji.

That almost worked, but then Fido looked at my code and started barking.

Defensively, I explained that I used a bit of cut and paste but not ChatGPT; however, the problem was lack of 🐶 in the code itself. I tried to fix that but got the message

Code: Select all

error: identifiers cannot contain emoji: `🐶`
More barking ensued. I'm fearful the program will have to be translated to some other language.

Heater
Posts: 19415
Joined: Tue Jul 17, 2012 3:02 pm

Re: Is my fear of C++ justified?

Fri May 26, 2023 6:41 am

We can put unicode into Rust identifiers. However it's restricted to the "Unicode Pattern and Identifier Syntax" : https://unicode.org/reports/tr31/ which apparently excludes emoji. Which is likely a good thing.

HermannSW
Posts: 5919
Joined: Fri Jul 22, 2016 9:09 pm
Location: Eberbach, Germany

Re: Is my fear of C++ justified?

Fri May 26, 2023 9:55 am

Heater wrote:
Fri May 26, 2023 6:41 am
... which apparently excludes emoji. Which is likely a good thing.
I agree on emoji, but not on unicode characters in general.
Yesterday I saw Python code for quadratic regression, and the variable names with "Σ..." nicely match the formulas implemented:
https://github.com/Bumpkin-Pi/Quadratic ... py#L26-L33
[Attachment: summation_variable_names.png (44.42 KiB)]

I just tested, and summation symbol is allowed for C++ variable names as well:

Code: Select all

$ g++ t.cc -o t && ./t
42
$ cat t.cc 
#include <iostream>

int main(){
  int Σx2y = 41;
  ++Σx2y;
  std::cout << Σx2y << "\n";
}
$ 

P.S:
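Strictly speaking, the "Σ" above is the Greek capital letter sigma (U+03A3), as the bytes below show; the actual summation symbol U+2211 (∑) generates a compile error: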

Code: Select all

$ echo -n "Σ"| od -tx1
0000000 ce a3
0000002
$ 
https://github.com/Hermann-SW/RSA_numbers_factored
https://stamm-wilbrandt.de/GS_cam_1152x192@304fps
https://hermann-sw.github.io/planar_graph_playground
https://github.com/Hermann-SW/Raspberry_v1_camera_global_external_shutter
https://stamm-wilbrandt.de/

Heater
Posts: 19415
Joined: Tue Jul 17, 2012 3:02 pm

Re: Is my fear of C++ justified?

Fri May 26, 2023 10:57 am

Rust allows unicode in identifiers. Just not emoji. Basically. So a Σ is OK. (Except it might warn about using upper case in what should be snake_case identifiers. But that warning can be suppressed).
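
As a minimal sketch, here is a Rust counterpart to the C++ test above. The function name "sigma_demo" is made up, and the allow attribute is only there to silence the naming-style warning mentioned above.

```rust
// Σ is U+03A3 GREEK CAPITAL LETTER SIGMA, which is permitted in identifiers.
#[allow(non_snake_case)]
fn sigma_demo() -> i32 {
    let mut Σx2y = 41;
    Σx2y += 1;
    Σx2y
}

fn main() {
    println!("{}", sigma_demo());
}
```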

Of course unicode in source can lead to some horrors. For example this compiles but looks very odd:

Code: Select all

struct ﻝ {
    a: u32,
    b: u32,
}

impl ﻝ {
    pub fn ﺍ(self) {
        println!("Hi!");
    }
}

fn feh() {
    let ﻝ = ﻝ {a: 1, b: 2};
    ﻝ.ﺍ();                                 // WTF !!!!
}

jahboater
Posts: 8687
Joined: Wed Feb 04, 2015 6:38 pm
Location: Wonderful West Dorset

Re: Is my fear of C++ justified?

Fri May 26, 2023 6:23 pm

And for C ...
https://www.open-std.org/jtc1/sc22/wg14 ... /n2836.pdf
which GCC 13 supports apparently.

lurk101
Posts: 2213
Joined: Mon Jan 27, 2020 2:35 pm
Location: Cumming, GA (US)

Re: Is my fear of C++ justified?

Fri May 26, 2023 6:42 pm

jahboater wrote:
Fri May 26, 2023 6:23 pm
And for C ...
https://www.open-std.org/jtc1/sc22/wg14 ... /n2836.pdf
which GCC 13 supports apparently.
Supported in gcc 11.3
History doesn’t repeat itself, it rarely even rhymes.

ejolson
Posts: 10957
Joined: Tue Mar 18, 2014 11:47 am

Re: Is my fear of C++ justified?

Sat May 27, 2023 4:28 am

lurk101 wrote:
Fri May 26, 2023 6:42 pm
jahboater wrote:
Fri May 26, 2023 6:23 pm
And for C ...
https://www.open-std.org/jtc1/sc22/wg14 ... /n2836.pdf
which GCC 13 supports apparently.
Supported in gcc 11.3
Unicode has been supported by gcc for much longer, just not parsing UTF-8.

I gave a work around for gcc 5.2 in

viewtopic.php?t=118161

However, it wasn't until gcc 10 in 2020 that a proper patch (not written by me) was accepted upstream.

viewtopic.php?t=273441

Ever since then my editor has been set to display non-ASCII characters green.

viewtopic.php?p=1676956#p1676956
Last edited by ejolson on Sat May 27, 2023 12:24 pm, edited 2 times in total.

HermannSW
Posts: 5919
Joined: Fri Jul 22, 2016 9:09 pm
Location: Eberbach, Germany

Re: Is my fear of C++ justified?

Sat May 27, 2023 7:57 am

Unicode adds freedom in naming stuff, but also creates problems with the "invisible" characters like U+2060 word joiner that I use on twitter to end twitter hashtags, eg for #3Dprinters after the "r" for having the word 3Dprinters, but only #3Dprinter hashtag.

hippy
Posts: 13793
Joined: Fri Sep 09, 2011 10:34 pm
Location: UK

Re: Is my fear of C++ justified?

Sat May 27, 2023 12:50 pm

ejolson wrote:
Fri May 26, 2023 2:58 am

Code: Select all

    Total files: 112362
    UTF-8 files: 39217
    UTF-8/total: 34.902369128353%
I don't know if 34.902369128353 percent of the files in Raspberry Pi OS are really UTF-8, but it seems plausible.
Seems to me you could be counting executable, binary and object files as UTF-8 when one would not usually count them as such.

And how are you dealing with text files which include Extended ASCII which may appear as malformed UTF-8 sequences ?

And what about the case of a text file which begins with a Byte Order Marker signalling it is UTF-8 when it uses no unicode within its contents, only ASCII ? Is that counted as UTF-8 or not ?

How are other unicode formats counted ? How do you know an apparent Byte Order Marker is that and not just the start of a binary file ?
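
Some of these cases can at least be told apart mechanically. The sketch below is one possible classification, not what the program above does: the names "Kind" and "classify" are invented here. It strips a leading UTF-8 byte order mark (EF BB BF) if present, then separates pure ASCII from UTF-8 with non-ASCII content and from everything else.

```rust
const BOM: &[u8] = &[0xEF, 0xBB, 0xBF]; // UTF-8 encoding of U+FEFF

#[derive(Debug, PartialEq)]
enum Kind {
    Ascii,   // only bytes 0x00..=0x7F
    Utf8,    // valid UTF-8 with at least one non-ASCII character
    NotUtf8, // anything else, e.g. Latin-1 text or most binaries
}

/// Hypothetical helper: returns the kind of content and whether a BOM was present.
fn classify(bytes: &[u8]) -> (Kind, bool) {
    let (body, had_bom) = match bytes.strip_prefix(BOM) {
        Some(rest) => (rest, true),
        None => (bytes, false),
    };
    let kind = if body.is_ascii() {
        Kind::Ascii
    } else if std::str::from_utf8(body).is_ok() {
        Kind::Utf8
    } else {
        Kind::NotUtf8
    };
    (kind, had_bom)
}

fn main() {
    println!("{:?}", classify(b"hello"));
    println!("{:?}", classify(b"\xEF\xBB\xBFhello"));
    println!("{:?}", classify("caf\u{e9}".as_bytes()));
    println!("{:?}", classify(&[0x63, 0x61, 0x66, 0xE9])); // "café" in Latin-1
}
```

Note that NUL and other control bytes are valid ASCII, so a byte-level check like this cannot by itself tell text from a binary file that happens to use only low bytes.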

Rather than do it in Rust I think it would be better done in C++ so we can better assess whether the OP's fear of C++ is justified or not.

ejolson
Posts: 10957
Joined: Tue Mar 18, 2014 11:47 am

Re: Is my fear of C++ justified?

Sat May 27, 2023 1:11 pm

HermannSW wrote:
Sat May 27, 2023 7:57 am
Unicode adds freedom in naming stuff, but also creates problems with the "invisible" characters like U+2060 word joiner that I use on twitter to end twitter hashtags, eg for #3Dprinters after the "r" for having the word 3Dprinters, but only #3Dprinter hashtag.
For me the problem is homoglyphs: letters that look the same but have distinct code points. Out of necessity, engineers spent some effort making 1 and l visually distinct, as well as 0 and O. Then the everything-for-everybody crowd added a bunch of characters that look exactly the same but are different code points.

According to the dog developer, emoji are unlikely to be confused with any existing letter in any alphabet and are therefore safe to use in any language. On the other hand "a" and "а" look the same but respectively come from the Latin and Cyrillic alphabets. Such letters should never be used together in the same program.
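
To make the "a" versus "а" point concrete, here is a small sketch that prints the two code points and flags an identifier mixing the two alphabets. The helper "mixes_latin_and_cyrillic" is hypothetical and deliberately rough: it only looks at ASCII letters and the basic Cyrillic block (U+0400 to U+04FF).

```rust
/// Rough, hypothetical check: does `ident` contain both ASCII letters
/// and characters from the Cyrillic block U+0400..=U+04FF?
fn mixes_latin_and_cyrillic(ident: &str) -> bool {
    let latin = ident.chars().any(|c| c.is_ascii_alphabetic());
    let cyrillic = ident.chars().any(|c| ('\u{0400}'..='\u{04FF}').contains(&c));
    latin && cyrillic
}

fn main() {
    let latin_a = 'a'; // U+0061
    let cyrillic_a = '\u{0430}'; // looks the same in most fonts
    println!(
        "U+{:04X} vs U+{:04X}, equal: {}",
        latin_a as u32,
        cyrillic_a as u32,
        latin_a == cyrillic_a
    );
    // "data" with its first 'a' replaced by the Cyrillic look-alike
    println!("mixed: {}", mixes_latin_and_cyrillic("d\u{0430}ta"));
}
```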

Apparently FidoBasic has a localisation option of the form

Code: Select all

10 IDENTIFIERS Greek, Emoji
which, for example, allows only Greek letters and emoji in identifiers.

From what I understand the default is that only emoji are allowed.

ejolson
Posts: 10957
Joined: Tue Mar 18, 2014 11:47 am

Re: Is my fear of C++ justified?

Sat May 27, 2023 1:29 pm

hippy wrote:
Sat May 27, 2023 12:50 pm
ejolson wrote:
Fri May 26, 2023 2:58 am

Code: Select all

    Total files: 112362
    UTF-8 files: 39217
    UTF-8/total: 34.902369128353%
I don't know if 34.902369128353 percent of the files in Raspberry Pi OS are really UTF-8, but it seems plausible.
Seems to me you could be counting executable, binary and object files as UTF-8 when one would not usually count them as such.

And how are you dealing with text files which include Extended ASCII which may appear as malformed UTF-8 sequences ?

And what about the case of a text file which begins with a Byte Order Marker signalling it is UTF-8 when it uses no unicode within its contents, only ASCII ? Is that counted as UTF-8 or not ?

How are other unicode formats counted ? How do you know an apparent Byte Order Marker is that and not just the start of a binary file ?

Rather than do it in Rust I think it would be better done in C++ so we can better assess whether the OP's fear of C++ is justified or not.
I find it an interesting question whether it's possible to construct an ELF binary consisting only of UTF-8 code points. I don't know the answer.

While I've given the Rust code, it's not clear to me whether it checks only the UTF-8 decoding or if the resulting code point is actually in the Unicode range. Since the library used is open source, it should be possible to read the code to find out. I wonder how easy that would be to do in this case.
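
For comparison, the standard library's std::str::from_utf8 does check more than the bit pattern: it rejects overlong encodings, surrogates and anything above U+10FFFF. The sketch below demonstrates that for the standard library only and says nothing about what utf8_read does. The helper name "accepted" is made up.

```rust
/// Hypothetical helper: does the standard library accept these bytes as UTF-8?
fn accepted(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // U+10FFFF, the highest Unicode code point: accepted
    println!("U+10FFFF: {}", accepted(&[0xF4, 0x8F, 0xBF, 0xBF]));
    // Same bit layout, but it would decode to U+110000: rejected
    println!("U+110000: {}", accepted(&[0xF4, 0x90, 0x80, 0x80]));
    // Would decode to U+D800, a UTF-16 surrogate: rejected
    println!("U+D800: {}", accepted(&[0xED, 0xA0, 0x80]));
    // Overlong two-byte encoding of NUL: rejected
    println!("overlong NUL: {}", accepted(&[0xC0, 0x80]));
    // A plain NUL byte: accepted
    println!("NUL: {}", accepted(&[0x00]));
}
```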

I think the main point of Rust is not that the code is easier to read or write, but that once done the code is less likely to have hidden errors than C++. Thus, the simplicity of the visual appearance may not reflect the reliability of a language when used at scale in large projects.

hippy
Posts: 13793
Joined: Fri Sep 09, 2011 10:34 pm
Location: UK

Re: Is my fear of C++ justified?

Sat May 27, 2023 2:04 pm

ejolson wrote:
Sat May 27, 2023 1:29 pm
I find it an interesting question whether it's possible to construct an ELF binary consisting only of UTF-8 code points. I don't know the answer.
Might be worth starting a separate thread on that.

tttapa
Posts: 120
Joined: Mon Apr 06, 2020 2:52 pm

Re: Is my fear of C++ justified?

Sat May 27, 2023 3:00 pm

hippy wrote:
Sat May 27, 2023 12:50 pm
Rather than do it in Rust I think it would be better done in C++ so we can better assess whether the OP's fear of C++ is justified or not.
Doesn't look too scary, IMO:

Code: Select all

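#include <algorithm>
#include <cstddef>   // std::ptrdiff_t
#include <cstdio>    // stderr
#include <exception> // std::exception
#include <filesystem>
#include <fstream>
#include <iterator>  // std::istreambuf_iterator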
namespace fs = std::filesystem;
#include <fmt/core.h> // https://github.com/fmtlib/fmt
#include <utf8.h>     // https://github.com/nemtrif/utfcpp

bool contains_valid_utf8(const fs::path &path) {
    std::basic_ifstream<char> file{path, std::ios::binary};
    if (!file)
        return false;
    return utf8::is_valid(std::istreambuf_iterator(file), {});
}

int main() try {
    std::ptrdiff_t total = 0;
    fs::recursive_directory_iterator dir_it{fs::current_path()};
    auto utf8_count = std::ranges::count_if(dir_it, [&](const auto &entry) {
        if (!entry.is_regular_file())
            return false;
        ++total;
        return contains_valid_utf8(entry.path());
    });
    auto ratio = static_cast<double>(utf8_count) / static_cast<double>(total);
    fmt::print("Fido's UTF-8 popularity statistics:\n\n");
    fmt::print("\tTotal files: {}\n\tUTF-8 files: {}\n", total, utf8_count);
    fmt::print("\tUTF-8/total: {}%\n", 100.0 * ratio);
} catch (std::exception &e) {
    fmt::print(stderr, "Uncaught exception: {}\n", e.what());
    return 1;
} catch (...) {
    fmt::print(stderr, "Uncaught error\n");
    return 1;
}

ejolson
Posts: 10957
Joined: Tue Mar 18, 2014 11:47 am

Re: Is my fear of C++ justified?

Sat May 27, 2023 3:12 pm

Heater wrote:
Fri May 26, 2023 12:40 am
If you want to build and run that I suggest you use Cargo.

Code: Select all

# cargo new check-utf8-file
# cd check-utf8-file
# cargo add utf8-read
Edit src/main.rs as required.

Code: Select all

# cargo build
# cargo run 
Better to install Rust as described here https://www.rust-lang.org/tools/install than use whatever apt installs.

I have no idea about Alpine Linux.
Alpine is based on the musl C library and busybox. It's like someone started with an initial RAM filesystem and then added the minimum needed to make a full Linux distribution. My impression is Alpine is popular to use in Docker and other types of containers.

I first used Alpine to make a Singularity container for an ARMv6 executable of the Julia language that runs on a Zero.

I thought it might work bare-metal on a 32-bit PIII system and indeed it does. While most i686 Alpine packages don't require a P4 processor or SSE2, some do and Rust is one of those. From what I can tell the binaries provided by rustup don't support the musl C library, so a build from source may be required.

The rust build system seems to rely on Python scripts. My fear is that Python build scripts often turn out so fragile that they break on any architecture the engineers didn't explicitly include in their testing. This isn't the fault of Python but of the software engineer that uses it for custom automation that works only on their own computer.

My guess is such build scripts are not intended for portability but for the convenience of people fixing bugs and adding features to existing platforms. At any rate, I haven't tried building from source, so maybe it would just work.

ejolson
Posts: 10957
Joined: Tue Mar 18, 2014 11:47 am

Re: Is my fear of C++ justified?

Sat May 27, 2023 4:22 pm

tttapa wrote:
Sat May 27, 2023 3:00 pm
hippy wrote:
Sat May 27, 2023 12:50 pm
Rather than do it in Rust I think it would be better done in C++ so we can better assess whether the OP's fear of C++ is justified or not.
Doesn't look too scary, IMO:

Code: Select all

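#include <algorithm>
#include <cstddef>   // std::ptrdiff_t
#include <cstdio>    // stderr
#include <exception> // std::exception
#include <filesystem>
#include <fstream>
#include <iterator>  // std::istreambuf_iterator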
namespace fs = std::filesystem;
#include <fmt/core.h> // https://github.com/fmtlib/fmt
#include <utf8.h>     // https://github.com/nemtrif/utfcpp

bool contains_valid_utf8(const fs::path &path) {
    std::basic_ifstream<char> file{path, std::ios::binary};
    if (!file)
        return false;
    return utf8::is_valid(std::istreambuf_iterator(file), {});
}

int main() try {
    std::ptrdiff_t total = 0;
    fs::recursive_directory_iterator dir_it{fs::current_path()};
    auto utf8_count = std::ranges::count_if(dir_it, [&](const auto &entry) {
        if (!entry.is_regular_file())
            return false;
        ++total;
        return contains_valid_utf8(entry.path());
    });
    auto ratio = static_cast<double>(utf8_count) / static_cast<double>(total);
    fmt::print("Fido's UTF-8 popularity statistics:\n\n");
    fmt::print("\tTotal files: {}\n\tUTF-8 files: {}\n", total, utf8_count);
    fmt::print("\tUTF-8/total: {}%\n", 100.0 * ratio);
} catch (std::exception &e) {
    fmt::print(stderr, "Uncaught exception: {}\n", e.what());
    return 1;
} catch (...) {
    fmt::print(stderr, "Uncaught error\n");
    return 1;
}
How hard would it be to use ICU instead of utfcpp?

https://icu.unicode.org/
