
byte


Related Articles from SNS

Large Byte Model: Teaching Language Models About Compiled Code

Malware analysis starts with the raw bytes of an executable program, and tools to "lift" these to higher-level representations, such as assembly, are expensive and subject to error. Large Language Models (LLMs) cannot process raw byte representations and answer questions about them. To this end, we present the first byte-native LLM.

arXiv CS 7d ago

Every Byte Matters

Published 2026-06-01 on Farid Zakaria's Blog. I have spent a large portion of my career working in Java. In that time, you get used to huge classes. Just add a new method and field to the class.

Hacker News 7d ago

Not Every Byte Gets a Vote

In a deterministic game engine, replay starts simple: record inputs, run the same ticks again, and compare the result. When I started wiring replay for the sim, my first instinct was simple: Easy. For the first few fields, that feels right.

Hacker News 8d ago

Fast and Expressive Multi-Byte Prediction with Probabilistic Circuits

arXiv:2511.11346v2. Multi-token prediction (MTP) is a prominent strategy to significantly speed up generation in large language models (LLMs), especially in byte-level LLMs, which are tokeniser-free but prohibitively slow. However, many existing MTP methods either assume independence between future tokens, sacrificing expressiveness, or generate tokens one at a time within the window, increasing latency. In this work, we investigate the trade-off between...

arXiv CS 7d ago

Byte Pair Encoding for Efficient Time Series Forecasting

Existing time series tokenization methods predominantly encode a constant number of samples into individual tokens. This inflexible approach can generate excessive tokens for even simple patterns like extended constant values, resulting in substantial computational overhead. Inspired by the success of byte pair encoding, we propose the first pattern-centric tokenization scheme for time series analysis.

arXiv CS 8d ago

Incremental BPE Tokenization

arXiv:2605.30813v1. We propose a novel algorithm for incremental Byte Pair Encoding (BPE) tokenization. The algorithm processes each input byte in worst-case $\mathcal{O}(\log^2 t)$ time, leading to an overall complexity of $\mathcal{O}(n \log^2 t)$, where $n$ is the input length and $t$ is the maximum token length. The algorithm incrementally maintains BPE tokenization results for every prefix of the input text, implementing the standard BPE merge procedure...

arXiv CS 9d ago
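
For readers unfamiliar with the "standard BPE merge procedure" mentioned in the abstract above, the following is a minimal Python sketch of a naive baseline, not the paper's incremental algorithm. The function names bpe_encode and prefix_tokenizations and the toy merge table are hypothetical, chosen for illustration only. The sketch re-tokenizes every prefix from scratch, which is the repeated work an incremental method would aim to avoid.

```python
def bpe_encode(data: bytes, merges: dict[tuple[bytes, bytes], int]) -> list[bytes]:
    """Naive BPE: repeatedly apply the best-ranked merge present in the sequence.

    `merges` maps an adjacent token pair to its rank (lower rank = learned
    earlier = applied first). This table is an assumed input, not a real
    vocabulary.
    """
    tokens = [bytes([b]) for b in data]  # start from single-byte tokens
    while len(tokens) > 1:
        # Pick the adjacent pair with the lowest rank, if any pair is mergeable.
        best = min(
            (pair for pair in zip(tokens, tokens[1:]) if pair in merges),
            key=merges.__getitem__,
            default=None,
        )
        if best is None:
            break
        # Merge every non-overlapping occurrence of that pair, left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def prefix_tokenizations(data: bytes, merges: dict[tuple[bytes, bytes], int]) -> list[list[bytes]]:
    """Tokenize every prefix independently (the non-incremental baseline)."""
    return [bpe_encode(data[:k], merges) for k in range(1, len(data) + 1)]


# Toy merge table (hypothetical): first learn "ab", then "abc".
toy_merges = {(b"a", b"b"): 0, (b"ab", b"c"): 1}
print(bpe_encode(b"abcab", toy_merges))
print(prefix_tokenizations(b"abc", toy_merges))
```

Each call to bpe_encode rescans the whole sequence once per merge, so tokenizing all prefixes this way repeats a great deal of work as the input grows.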

When Entropy Is Not Enough: Multi-Modal Classification of Encrypted and Compressed Data Fragments

arXiv:2605.31337v1. Reliable identification of encrypted data fragments is essential in cybersecurity, with applications to ransomware detection, digital forensics, and large-scale data analysis. Distinguishing encrypted from compressed fragments is particularly challenging, as short fragments lack structural data and exhibit low statistical redundancy. Traditional statistical methods based on byte-level distributions show limited effectiveness on this task.

arXiv CS 9d ago
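
As a rough illustration of the kind of byte-level statistic the abstract above refers to, here is a minimal Python sketch that computes the Shannon entropy of a fragment's byte histogram. It is an assumption that this resembles the "traditional" baselines the paper has in mind; the function name byte_entropy is hypothetical and nothing here reproduces the paper's method.

```python
import math
from collections import Counter


def byte_entropy(fragment: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte.

    The result ranges from 0.0 (a single repeated byte value) to 8.0
    (all 256 byte values equally frequent).
    """
    if not fragment:
        return 0.0
    counts = Counter(fragment)  # iterating bytes yields ints in 0..255
    n = len(fragment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Illustrative inputs only: a constant run, plain text, and a uniform spread.
print(byte_entropy(b"\x00" * 64))
print(byte_entropy(b"hello world, hello bytes"))
print(byte_entropy(bytes(range(256))))
```

Well-encrypted and well-compressed data both tend to score close to the 8-bit maximum, which is consistent with the abstract's point that such statistics alone separate the two poorly.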

HalleluBERT: Let Every Token That Has Meaning Bear Its Weight

arXiv:2510.21372v2. Transformer-based models have advanced NLP, yet Hebrew still lacks a RoBERTa encoder that is trained at scale and released in both base and large variants. We present HalleluBERT, a RoBERTa-based encoder family trained from scratch on 49.1 GB of deduplicated Hebrew web text and Wikipedia using a Hebrew-specific byte-level BPE vocabulary. On native Hebrew benchmarks for named entity recognition (BMC, NEMO) and sentiment classification...

arXiv CS 8d ago

Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration

arXiv:2606.07316v1. Byzantine collaboration among large-language-model agents requires a finality-control primitive: given delivered stochastic, structured natural-language proposals, the protocol must decide whether the round supports a commit, what kind of commit, or a typed safe abort. Naive aggregation hides this choice behind a single verdict; classical Byzantine fault tolerance hides it behind byte-identity that LLM proposals do not satisfy. We introduce...

arXiv CS 2d ago

"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

arXiv:2606.01811v1. Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the "Decan" metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $\theta$...

arXiv CS 8d ago