byte
Related Articles from SNS
Large Byte Model: Teaching Language Models About Compiled Code
Malware analysis starts with the raw bytes of an executable program, and tools to "lift" these to higher-level representations, such as assembly, are expensive and subject to error. Large Language Models (LLMs) cannot process raw byte representations and answer questions about them. To this end, we present the first byte-native LLM.
Every Byte Matters
Published 2026-06-01 on Farid Zakaria's Blog. I have spent a large portion of my career working in Java. In that time, you get used to huge classes. Just add a new method and field to the class.
Not Every Byte Gets a Vote
In a deterministic game engine, replay starts simple: record inputs, run the same ticks again, and compare the result. When I started wiring replay for the sim, my first instinct was simple: Easy. For the first few fields, that feels right.
Fast and Expressive Multi-Byte Prediction with Probabilistic Circuits
arXiv:2511.11346v2. Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs), especially in byte-level LLMs, which are tokeniser-free but prohibitively slow. However, many existing MTP methods either assume independence between future tokens, sacrificing expressiveness, or generate tokens one at a time within the window, increasing latency. In this work, we investigate the trade-off between...
Byte Pair Encoding for Efficient Time Series Forecasting
Existing time series tokenization methods predominantly encode a constant number of samples into individual tokens. This inflexible approach can generate excessive tokens for even simple patterns like extended constant values, resulting in substantial computational overhead. Inspired by the success of byte pair encoding, we propose the first pattern-centric tokenization scheme for time series analysis.
Incremental BPE Tokenization
arXiv:2605.30813v1. We propose a novel algorithm for incremental Byte Pair Encoding (BPE) tokenization. The algorithm processes each input byte in worst-case $\mathcal{O}(\log^2 t)$ time, leading to an overall complexity of $\mathcal{O}(n \log^2 t)$, where $n$ is the input length and $t$ is the maximum token length. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure...
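For context, the standard BPE merge procedure that this abstract refers to can be sketched as below. This is a minimal illustration of ordinary, non-incremental BPE encoding over bytes, not the paper's incremental algorithm. The function name `bpe_encode` and the toy merge table are assumptions made for this example only.

```python
def bpe_encode(data: bytes, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Apply ranked BPE merges to a byte string (naive, non-incremental)."""
    # A lower index in `merges` means higher priority (learned earlier).
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = [bytes([b]) for b in data]  # start from single-byte tokens
    while len(tokens) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        best = min(
            zip(tokens, tokens[1:]),
            key=lambda pair: rank.get(pair, float("inf")),
        )
        if best not in rank:
            break  # no applicable merge remains
        # Merge every non-overlapping occurrence of `best`, left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical toy merge table, for illustration only.
toy_merges = [(b"a", b"b"), (b"ab", b"c")]
print(bpe_encode(b"abcab", toy_merges))
```

This naive loop rescans the whole token sequence for every merge, so its cost grows quickly with input length. The per-byte bound quoted in the abstract concerns an incremental scheme that avoids re-encoding each prefix from scratch.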
When Entropy Is Not Enough: Multi-Modal Classification of Encrypted and Compressed Data Fragments
arXiv:2605.31337v1. Reliable identification of encrypted data fragments is essential in cybersecurity, with applications to ransomware detection, digital forensics, and large-scale data analysis. Distinguishing encrypted from compressed fragments is particularly challenging, as short fragments lack structural data and exhibit low statistical redundancy. Traditional statistical methods based on byte-level distributions show limited effectiveness on this task.
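As background, one common statistic over a byte-level distribution is the Shannon entropy of a fragment's byte histogram. The sketch below shows that calculation under the assumption that it is representative of the traditional baselines the abstract mentions; it is not the paper's method, and `byte_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def byte_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not fragment:
        return 0.0
    n = len(fragment)
    counts = Counter(fragment)  # maps byte value (0..255) to its count
    # H = sum over observed byte values of p * log2(1 / p), with p = count / n.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


print(byte_entropy(b"aaaa"))            # a single repeated byte value
print(byte_entropy(bytes(range(256))))  # every byte value exactly once
```

Encrypted data and well-compressed data both tend toward the 8 bits per byte upper limit, which is one reason a single entropy value separates them poorly.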
HalleluBERT: Let Every Token That Has Meaning Bear Its Weight
arXiv:2510.21372v2. Transformer-based models have advanced NLP, yet Hebrew still lacks a RoBERTa encoder that is trained at scale and released in both base and large variants. We present HalleluBERT, a RoBERTa-based encoder family trained from scratch on 49.1 GB of deduplicated Hebrew web text and Wikipedia using a Hebrew-specific byte-level BPE vocabulary. On native Hebrew benchmarks for named entity recognition (BMC, NEMO) and sentiment classification...
Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration
arXiv:2606.07316v1. Byzantine collaboration among large-language-model agents requires a finality-control primitive: given delivered stochastic, structured natural-language proposals, the protocol must decide whether the round supports a commit, what kind of commit, or a typed safe abort. Naive aggregation hides this choice behind a single verdict; classical Byzantine fault tolerance hides it behind byte-identity that LLM proposals do not satisfy. We introduce...
"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise
arXiv:2606.01811v1. Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the "Decan" metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $\theta$...