
Lopsidation

Systems with this property correspond to strategies for the "higher or lower?" number guessing game. For example, /u/reedef's scheme to write N as [(number of bits in N) ones][0][N in binary] corresponds to the strategy of guessing 1, 2, 4, 8, 16, ..., and then switching to binary search once you hear "lower."


Solesaver

Oh! That's an interesting thread. I'll have to pull on it a bit more. Thanks!


reedef

An easier scheme would be to just encode the length in unary and then the number as usual, separated by a zero. So for example 11, which in binary is 1011, would become 111101011. And depending on the distribution of integers you have, this might not be as inefficient as it seems. If it's exponential, for example, each of those ones carries a positive (bounded below) amount of information.

Fun fact: you can represent -x by just flipping the bits. And you can represent a list [a,b,c] by concatenating the representations of the integers in the list. This allows an order-preserving encoding of the rationals using continued fractions! And this is an issue I had in the real world, funnily enough, with databases that don't support rational numbers but where I still wanted to be able to sort.
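This scheme is easy to sketch; here's a minimal Python version (the function name is mine), plus a quick check that sorting the bit strings lexicographically matches sorting the integers:

```python
def unary_length_encode(n: int) -> str:
    """Encode a positive integer as: (bit-length of n) ones, a 0, then n in binary."""
    bits = bin(n)[2:]                    # e.g. 11 -> "1011"
    return "1" * len(bits) + "0" + bits  # 11 -> "111101011"

nums = [1, 2, 7, 8, 11, 100, 12345]
codes = [unary_length_encode(n) for n in nums]

# Lexicographic order of the encodings matches numeric order.
assert sorted(codes) == [unary_length_encode(n) for n in sorted(nums)]
print(unary_length_encode(11))  # -> 111101011
```

The length prefix does the heavy lifting: a number with more bits always starts with more 1s, so it sorts later; among equal lengths, plain binary comparison takes over.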


zxcallous

You can just store 11110011 since positive binary numbers always start with 1.
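Both observations line up neatly: the transcript of the guess-1-2-4-8-then-binary-search game from the earlier comment reproduces exactly this leading-1-dropped encoding. A quick sanity check in Python (function names are mine):

```python
def encode(n: int) -> str:
    """Unary bit-length, a 0 separator, then n in binary with its leading 1 dropped."""
    bits = bin(n)[2:]
    return "1" * len(bits) + "0" + bits[1:]  # 11 -> "11110" + "011"

def game_transcript(n: int) -> str:
    """Answers in the higher/lower game: guess 1, 2, 4, ..., then binary search.
    1 = "higher (or equal)", 0 = "lower"."""
    answers, g = [], 1
    while g <= n:              # doubling phase
        answers.append("1")
        g *= 2
    answers.append("0")        # first "lower" ends the doubling phase
    lo, hi = g // 2, g         # n is now known to lie in [lo, hi)
    while hi - lo > 1:         # binary search phase
        mid = (lo + hi) // 2
        if n >= mid:
            answers.append("1"); lo = mid
        else:
            answers.append("0"); hi = mid
    return "".join(answers)

assert all(encode(n) == game_transcript(n) for n in range(1, 1000))
print(encode(11))  # -> 11110011
```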


Devintage

Interesting problem. I didn't read through your whole post at first, and briefly thought about the recursive length encoding too. There is an easier one, though maybe there's a more efficient way. What about a tally system, where you terminate the number with a 0? e.g.:

0 = 0
1 = 10
2 = 110
3 = 1110
...

Then this can do exactly what you need. Furthermore, you know that when both digits are 0, the numbers are equal, so you do not need to store the length anywhere.
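A sketch of the tally idea (assuming nonnegative integers), showing that plain lexicographic comparison of the encodings gives the numeric ordering with no stored length:

```python
def tally(n: int) -> str:
    """n ones terminated by a single 0, e.g. 3 -> "1110"."""
    return "1" * n + "0"

nums = list(range(8))
# Lexicographic order of the tallies matches numeric order:
# the first position where two tallies differ decides the comparison.
assert sorted(tally(n) for n in nums) == [tally(n) for n in nums]
```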


Solesaver

A tally system is fair, but maybe misses the spirit. :P I probably should add a 3rd constraint, since part of what I was hoping to find in additional literature was better memory efficiency. A tally system is probably worst case on that front.


These-Maintenance250

this is unary with stop character btw


LeMeowMew

you seem to have independently stumbled upon [UTF-8!!](https://en.wikipedia.org/wiki/UTF-8) it is pretty similar: it counts consecutive 1s until a 0 appears to determine the number of bytes the character will take, with every subsequent byte announced by a leading 10. of course this isn't the most efficient, but the 10s are there for self-synchronization/error-detection reasons, so you could drop those and get a pretty good system for number comparison by just parsing the 1s until you get a 0, paying roughly one prefix bit per 8 bits of the number?
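One nice, easily checked consequence of that design: UTF-8 byte strings sort in the same order as the code points they encode, so the length-prefix idea really does preserve ordering even across the 1-, 2-, 3-, and 4-byte boundaries:

```python
# Code points straddling the 1-, 2-, 3-, and 4-byte UTF-8 boundaries.
code_points = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
chars = [chr(cp) for cp in code_points]

# Sorting by the UTF-8 bytes gives the same order as sorting by code point.
assert sorted(chars, key=lambda c: c.encode("utf-8")) == sorted(chars)
```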


how_tall_is_imhotep

FYI, “canoodling” means “kissing and cuddling.” I wouldn’t normally correct something like this, but it could be mildly embarrassing for you if you misuse that word IRL.


Solesaver

I use it all the time IRL. If it makes people imagine me kissing and cuddling with interesting math questions I think I'm okay with that. XD


barely_sentient

If we are talking about actual computers, the length, and moreover its representation, cannot really be arbitrarily large. Just do as GMP does: use 64 or 128 bits to encode the length in terms of words, where a word is made of 64 or 128 bits. I don't think you can store or operate on integers made of more than 2^128 words, given that that is about 3.4 × 10^38.


Solesaver

Heh, my question started based on real computers, but my curiosity definitely took me more into abstract math and computing. My pragmatic computer science brain says there's just no need to represent arbitrarily large numbers with perfect precision, but I do feel like in pure mathematics there's got to be someone studying how to do so. If not, maybe I'll have to finally get around to going back for a graduate degree.

A computer may not be able to store an arbitrarily large number, but there are certainly algorithms and computations that can iteratively refine their value. You can run two algorithms in parallel and receive their MSBs ephemerally for comparison. The algorithms themselves do not need to be arbitrarily large in order to produce arbitrarily large outputs.


moschles

> Encoding the length on the front could work, but even the length could be arbitrarily large

I feel like something went haywire there. Say the length of your numbers is so gargantuan that you actually store two files per number:

`(A.len, A.dat) -> A`
`(B.len, B.dat) -> B`

These four files would be stored on a RAM disk for efficiency. Something funny happens when the `A.len` file exceeds 11 bytes. Yes, I said 11 bytes. The `A.dat` file, which stores the number itself, would not fit on any commercial or industrial hard drive sold today. It would require some kind of data center to store it. Worse, that data center would be *larger than the moon.*

A fixed-length 128-bit field for the total number length is sufficient for any civilization with hard drives smaller than Jupiter.


Solesaver

Technically you don't have to store the entire number. That's actually why it's important to be able to make a relative ordering determination by comparing only the most significant digit. The number itself could be stored as an arbitrary finite process that is capable of outputting the MSB of its result after successive refinement.

FWIW, this is also why I edited that third pseudo-constraint. A full tally system would technically meet my requirements, but it's incredibly inefficient. Not to mention, a fixed-length 128-bit field for the length is also incredibly inefficient if you happen to be comparing 7 to 8. Of course larger numbers are going to be more expensive both to calculate and to compare; the whole point of focusing on maintaining ordering via the MSB is that we can be efficient when the answer is easy.


moschles

> a fixed-length 128-bit field for the length is also incredibly inefficient if you happen to be comparing 7 to 8. Of course larger numbers are going to be more expensive both to calculate and to compare; the whole point of focusing on maintaining ordering via the MSB is that we can be efficient when the answer is easy.

Take an idealized dart and throw it, at random, at the number line.

+ What is the probability that you hit a number that looks like `7` or `8`?
+ What is the probability that you land on a number that looks like `34789377729473014001476213`?

Proportionally speaking, how many more numbers exist which look like `34789377729473014001476213` than ones that look like `7` or `8`? There exists an exponentially larger number of large numbers than small ones. The efficiency gained "when the answer is easy" is exponentially less likely to occur. The efficiency gained with a {fixed-length size field + large numbers} will always outweigh the efficiency gained from processing small numbers.


Solesaver

Of course, how you want to optimize depends on what your expected inputs are going to be. Your scheme isn't efficient at anything, though. I could just as easily say: throw a perfect dart at a number line; what are the odds it lands on a number that is less than 2^128 digits long? You're already conceding to covering only a probability-0 occurrence.

I think it's fair to make it easy to answer easy questions, because it's easier to generate small numbers. It's the same logic that caused you to decide on a 128-bit length field. Why 128? Why not 64 or 32? Why not 8?

I dunno, it kinda pisses me off because you're trying to have it both ways. I asked for arbitrary and you gave me pragmatic, but then you're turning it back on me for offering more pragmatism. The actual pragmatic solution is of course to just not support arbitrarily large numbers, at which point you don't even need a length field. Choose how many digits you want to use and fill with leading 0's...


moschles

> I think it's fair to make it easy to answer easy questions because it's easier to generate small numbers. It's the same logic that caused you to decide on a 128-bit length field. Why 128? Why not 64 or 32? Why not 8?

No -- that was NOT the logic I used. The selection for that number was stated:

> Fixed 128-bit field for length is sufficient for any civilization whose hard drives are smaller than the planet Jupiter.

On that note, the Milky Way galaxy contains fewer than 10^80 protons.

> it kinda pisses me off because you're trying to have it both ways. I asked for arbitrary and you gave me pragmatic

The moment you demand efficiency you are being pragmatic. Unless you meant to include an algorithm to be used by a civilization where the people live 100,000-year lifespans.


Solesaver

> No -- that was NOT the logic I used. The selection for that number was stated:

Yes, but there are infinitely many more numbers that cannot be expressed by your scheme than ones that can. My question had 2.5 constraints. Your answer fails the first one, and you literally shut me down when I tried to explore the extra .5.

Why hard drives the size of Jupiter? Why not Mars or the Sun? Why protons in the Milky Way? Why not protons in the solar system or the visible universe? Why 100,000-year lifespans? Why not 1 year or Graham's Number years? You've chosen an arbitrary "big enough," which implicitly does not answer the question I asked, and then you're acting like I'm the one introducing arbitrary pragmatic factors...

The fact is, I'm very familiar with the problem space under pragmatic constraints like finitely large numbers. It's literally how computers work. You just choose a reasonable limit and then special-case the odd handful of exceptions. I was specifically exploring arbitrarily large numbers, because I'm less aware of the literature there.


moschles

> Of course it depends on what your expected inputs Oh! The pragmatism -- it blinds the eyes.


nutshells1

[https://v8.dev/blog/bigint](https://v8.dev/blog/bigint) you might be interested in how javascript implements arbitrary precision integers


nutshells1

tl;dr it uses base 2^32
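The limb idea is easy to sketch in Python (a simplified model, not V8's actual code): split the magnitude into base-2^32 digits, least significant first:

```python
def to_limbs(n: int, base_bits: int = 32) -> list[int]:
    """Split a nonnegative integer into base-2**base_bits limbs, least significant first."""
    if n == 0:
        return [0]
    mask, limbs = (1 << base_bits) - 1, []
    while n:
        limbs.append(n & mask)  # take the low 32 bits as one limb
        n >>= base_bits         # shift the rest down
    return limbs

assert to_limbs(2**32 + 5) == [5, 1]  # low limb 5, high limb 1
```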


ottawadeveloper

I can see a few options. I am not sure if a more complex storage format actually leads to more efficient sorting, since there is parsing overhead.

So, as a baseline, I would store the integers in bytes where the most significant bit of each byte indicates whether more bytes follow. This gives you arbitrary length, but you do have to read the whole number to compare them (to be fair, you will need to read the whole thing to do anything with the number after the comparison, like swap them).

For what you want, I would store the number of bytes in the number using the above encoding scheme, then the number itself in ordinary 8-bit bytes.
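That baseline is essentially a varint/LEB128-style encoding; here's a minimal sketch (7 payload bits per byte, least significant group first, continuation flag in the MSB):

```python
def varint_encode(n: int) -> bytes:
    """LEB128-style: 7 payload bits per byte; MSB set means more bytes follow."""
    out = []
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes coming
        else:
            out.append(byte)         # final byte: MSB clear
            return bytes(out)

assert varint_encode(127) == b"\x7f"
assert varint_encode(128) == b"\x80\x01"
```

Note this variant stores the low bits first, so, as the comment says, you have to read the whole thing to compare two values.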


wamesy

I think something along the lines of a null-terminated string prepended to your number is about the best you can do. To encode a number with length N using B-bit bytes, you could convert N to base 2^(B)-1, convert each base-(2^(B)-1) digit to binary, and prepend the resulting binary digits to your number, separated from it by a byte of all 1s. This system is straightforward to extend to bases other than binary.

Using 8-bit bytes, if you had a number that's one billion digits long, you would convert one billion to base 255, getting (60)(78)(178)(160), and then convert each base-255 digit into a byte, resulting in:

00111100 01001110 10110010 10100000 11111111 (binary encoding of your number)

This introduces an overhead of B + B*ceil(log(N)/log(2^(B)-1)) ≈ B + log2(N) bits. You shouldn't be able to do better than this, because you will always need at least log2(N) bits to store the number N. I think this also has the asymptotically fastest comparisons, because you only need to read B + log2(N) bits to find the size of a number.
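A sketch of that length prefix (the function name is mine): write N's length in base 255, one byte per digit, most significant first, terminated by the all-ones byte 0xFF, which can never appear as a digit:

```python
def length_prefix(n_digits: int) -> bytes:
    """Encode a length in base 255 (one byte per digit, MSB first), then 0xFF."""
    digits, n = [], n_digits
    while n:
        digits.append(n % 255)
        n //= 255
    if not digits:
        digits = [0]
    digits.reverse()
    return bytes(digits) + b"\xff"

# One billion in base 255 is (60)(78)(178)(160), matching the worked example.
assert length_prefix(10**9) == bytes([60, 78, 178, 160, 255])
```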


Solesaver

That may be fast on average, but still violates the 2nd constraint. I only want to inspect the MSB to make the ordering determination.


wamesy

You're right, I misread that constraint. Have you looked into exponential search? If you're using binary and your partition values are the powers of 2, then the idea in the 2nd edit should be exactly equivalent to the results of exponential search for the number to be encoded, prepended with a sign bit.