FUTO Swipe: Handling Keyboard Layout Peculiarities
Most swipe typing feels like a guessing game. You slide your finger across the screen and hope the algorithm correctly interprets your imprecise movements. Most keyboards treat the layout as a secondary detail, a set of coordinates that the decoder tries to map back to letters. It's an approximation that often misses the mark.
FUTO is doing something different. They're treating the keyboard layout as a first-class citizen in the decoding process. Instead of a generic approach, they've built a model that is specific to both the language and the layout. It learns the exact peculiarities of how we actually move our thumbs on a QWERTY board.
Because this requires specific swipe data for training, they only have English QWERTY sorted out for now. It's a narrow start, but the accuracy is actually impressive. They've moved the logic on-device for the Android app to kill the latency, though the web demo still relies on a server.
The real question is whether this layout-specific approach can actually scale to other languages, or if the sheer amount of training data needed makes it a niche luxury.
The problem with generic swipe decoders
Standard swipe decoders treat a gesture as a path of coordinates, and they often fail because they underuse the physical geometry of the keyboard. A gesture isn't just a line; it's evidence for a set of keys, each more or less likely depending on how close the path passes to it. When a model is too generic, it misses the nuance of near-misses, where a user intends to hit 'O' but swipes through 'P'. That gap between the raw input and the intended character is one that simple distance-based algorithms can't bridge.
The coordinate mapping is easy to overlook because most libraries hide it. You aren't just tracking X and Y; you're mapping those points to a grid of key centers, each with its own weight. If a key center is off by a few pixels, the decoder can guess a completely different word.
def get_nearest_key(point, key_map):
    # point: (x, y); key_map: {char: (x, y)} mapping each key to its center
    distances = {char: abs(point[0] - x) + abs(point[1] - y)
                 for char, (x, y) in key_map.items()}
    # Return the character with the smallest Manhattan distance
    return min(distances, key=distances.get)
The problem is that a per-point nearest-key lookup like the one above is too naive for real-world typing. It treats every touch point in isolation, so it can't account for the fact that people swipe in arcs, not straight lines, and pass over keys they never meant to hit. To fix this, you need something like a hidden Markov model or a transformer that weighs the linguistic probability of the next letter alongside the physical path.
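To make that idea concrete, here's a minimal sketch that combines a spatial score with a word prior. It is neither an HMM nor a transformer, just a toy scorer, and it is not FUTO's implementation. The key coordinates, the `sigma` value, the three-word vocabulary, and its probabilities are all made up for illustration, and it assumes the gesture has already been reduced to one touch point per letter, which a real decoder can't take for granted.

```python
import math

# Assumed geometry: unit-width keys, rows offset horizontally to mimic stagger.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.5), ("zxcvbnm", 1.5)]
KEY_CENTERS = {ch: (offset + col, float(row))
               for row, (keys, offset) in enumerate(ROWS)
               for col, ch in enumerate(keys)}

# Hypothetical word prior; a real decoder would use a language model.
WORD_PRIOR = {"top": 0.5, "tip": 0.3, "too": 0.2}


def nearest_key_string(points):
    """Baseline: pick the closest key for each point independently."""
    def closest(p):
        return min(KEY_CENTERS,
                   key=lambda ch: (KEY_CENTERS[ch][0] - p[0]) ** 2
                                  + (KEY_CENTERS[ch][1] - p[1]) ** 2)
    return "".join(closest(p) for p in points)


def spatial_log_likelihood(points, word, sigma=0.6):
    """Gaussian log-score (up to a constant) of the points given the word's keys."""
    if len(points) != len(word):
        return -math.inf
    total = 0.0
    for (px, py), ch in zip(points, word):
        kx, ky = KEY_CENTERS[ch]
        total -= ((px - kx) ** 2 + (py - ky) ** 2) / (2 * sigma ** 2)
    return total


def decode(points):
    """Return the word that maximizes spatial score plus log prior."""
    return max(WORD_PRIOR,
               key=lambda w: spatial_log_likelihood(points, w)
                             + math.log(WORD_PRIOR[w]))


# The second point lands between 'o' and 'p', slightly closer to 'p'.
points = [(4.0, 0.1), (8.6, 0.0), (9.0, 0.1)]
print(nearest_key_string(points))  # tpp
print(decode(points))              # top
```

The per-point lookup produces "tpp" because it has no idea that isn't a word. Adding even a crude prior pulls the near-miss back to "top".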
If you're building your own, start by installing a basic geometry library to handle the point-in-polygon calculations for the keycaps:
pip install shapely
Layout-specific decoding
FUTO trains models on specific language and layout pairs so the decoder understands the physical geometry of the keyboard. Most LLMs treat text as a linear stream of tokens, but a keyboard is a 2D grid. If a model doesn't know that 'S' is physically next to 'A' and 'D' on a QWERTY board, it can't accurately predict the specific types of "fat-finger" errors humans actually make.
The tricky part is that we're asking the model to tie a logical character to a physical position. It has to learn that a typo isn't just a random character substitution; it's usually a shift of one key to the left, right, up, or down.
To implement a basic version of this, you'd map each key to a (row, column) position on the grid. This ignores the horizontal stagger between rows on a real QWERTY board.
# Partial layout: the first four keys of each row, stored as (row, column).
# Extend it with the remaining keys to cover the full board.
layout_map = {
    'q': (0, 0), 'w': (0, 1), 'e': (0, 2), 'r': (0, 3),
    'a': (1, 0), 's': (1, 1), 'd': (1, 2), 'f': (1, 3),
    'z': (2, 0), 'x': (2, 1), 'c': (2, 2), 'v': (2, 3)
}

def get_distance(key1, key2):
    # Manhattan distance between two keys, measured in grid steps
    p1, p2 = layout_map[key1], layout_map[key2]
    return abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])

print(get_distance('s', 'a'))  # Output: 1 (Adjacent)
The decoder uses these spatial relationships to weight its predictions. If the user typed "Gello" instead of "Hello," the model checks whether 'G' is near 'H'. Since they are adjacent on QWERTY, the probability of that being the intended word increases. If the user typed "Xello," 'X' sits several keys away from 'H', so the model is less likely to treat it as a simple adjacent-key slip.
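Here's a rough sketch of that weighting over a full QWERTY grid. The function names are hypothetical, the grid ignores row stagger, and summing key distances is just one simple way to turn adjacency into a score. A trained model would learn these weights rather than hard-code them.

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
# Each letter mapped to (row, column); row stagger is ignored for simplicity.
GRID = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def key_distance(a, b):
    """Manhattan distance between two letter keys, in grid steps."""
    (r1, c1), (r2, c2) = GRID[a], GRID[b]
    return abs(r1 - r2) + abs(c1 - c2)


def substitution_cost(typed, candidate):
    """Sum of key distances over mismatched letters; None if lengths differ."""
    typed, candidate = typed.lower(), candidate.lower()
    if len(typed) != len(candidate):
        return None
    return sum(key_distance(t, c) for t, c in zip(typed, candidate) if t != c)


def rank_candidates(typed, vocabulary):
    """Order same-length candidates from most to least plausible slip."""
    scored = [(substitution_cost(typed, word), word) for word in vocabulary]
    return sorted((cost, word) for cost, word in scored if cost is not None)


print(substitution_cost("Gello", "Hello"))  # 1
print(substitution_cost("Xello", "Hello"))  # 5
print(rank_candidates("gello", ["cello", "hello", "help", "jello"]))
# [(1, 'hello'), (2, 'jello'), (3, 'cello')]
```

A lower cost means the typed word is a more plausible fat-finger version of the candidate. "help" drops out because this sketch only handles substitutions, not insertions or deletions.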
The trade-off of high accuracy
The current model is limited to English QWERTY because high accuracy requires layout-specific training data. You can't just tell a model "this is a keyboard" and expect it to map coordinates to characters across different languages or physical arrangements. It has to learn the specific spatial relationship between a key's position and its resulting character. Since the training set only contains QWERTY data, the model is effectively blind to Dvorak or AZERTY.
It's tempting to assume a swipe model is "universal," but the math doesn't support it. To support a new layout, you need thousands of labeled samples in which the touch coordinates $(x, y)$ are paired with the correct characters for that specific layout. Without that data, the model falls back on the pattern it knows: a swipe that starts on the AZERTY 'A' key, which sits where QWERTY puts 'Q', will likely be read as 'Q'.
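To picture what "labeled samples" means in practice, here's a hypothetical record format. The field names and the `coverage` helper are invented for illustration, and the touch coordinates are made-up values; this is not FUTO's actual data schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SwipeSample:
    """One labeled gesture. Hypothetical schema, for illustration only."""
    language: str  # e.g. "en"
    layout: str    # e.g. "qwerty"
    path: tuple    # raw touch points in pixels: ((x, y), ...)
    word: str      # the word the user meant to type


def normalize(sample, width_px, height_px):
    """Scale touch points into the unit square so different screens compare."""
    return tuple((x / width_px, y / height_px) for x, y in sample.path)


def coverage(samples):
    """Count how many samples exist for each (language, layout) pair."""
    return Counter((s.language, s.layout) for s in samples)


samples = [
    SwipeSample("en", "qwerty", ((100, 50), (500, 50)), "to"),
    SwipeSample("en", "qwerty", ((300, 150), (700, 50)), "hi"),
]

counts = coverage(samples)
print(counts[("en", "qwerty")])  # 2
print(counts[("fr", "azerty")])  # 0
print(normalize(samples[0], 1000, 400))  # ((0.1, 0.125), (0.5, 0.125))
```

The zero for French AZERTY is the whole problem in miniature: with no samples for a pair, there's nothing for the model to learn from.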
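If you're trying to implement a custom layout mapper, you'll need to handle the translation layer manually in your code. The example below covers only the four letter keys that trade places between QWERTY and AZERTY. It relabels individual keys and does nothing for the model's word-level predictions.
# QWERTY key position -> letter printed there on AZERTY (only the letters that swap)
QWERTY_TO_AZERTY = {"q": "a", "a": "q", "w": "z", "z": "w"}

def translate_key(detected_char):
    # Convert the model's QWERTY guess to the actual layout
    return QWERTY_TO_AZERTY.get(detected_char, detected_char)

print(translate_key("q"))  # Output: a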
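To see why the layout has to be part of the decoding step, here's a small sketch that reads the same touch points against two different key grids. The grids place keys on whole-number (column, row) positions with no stagger, which is an assumption for brevity, and `nearest_keys` is a hypothetical helper rather than anything from FUTO's code.

```python
LAYOUTS = {
    "qwerty": ["qwertyuiop", "asdfghjkl", "zxcvbnm"],
    "azerty": ["azertyuiop", "qsdfghjklm", "wxcvbn"],
}


def key_centers(layout):
    """Map each letter to an (x, y) center, assuming unstaggered unit keys."""
    return {ch: (float(col), float(row))
            for row, keys in enumerate(LAYOUTS[layout])
            for col, ch in enumerate(keys)}


def nearest_keys(points, layout):
    """Read each touch point as the closest key on the given layout."""
    centers = key_centers(layout)

    def closest(p):
        return min(centers,
                   key=lambda ch: (centers[ch][0] - p[0]) ** 2
                                  + (centers[ch][1] - p[1]) ** 2)

    return "".join(closest(p) for p in points)


# Touch points for the French word "sac" typed on an AZERTY board.
touches = [(1.1, 0.9), (0.1, 0.1), (2.0, 1.9)]
print(nearest_keys(touches, "azerty"))  # sac
print(nearest_keys(touches, "qwerty"))  # sqc
```

A per-key relabeling can turn that stray 'q' back into 'a' after the fact, but a decoder that also leans on word probabilities will already have been steered by the wrong letters, which is why the geometry needs to be right before decoding starts.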
Conclusion
FUTO Swipe proves that you can get better accuracy by actually caring about the keyboard layout, but it doesn't erase the fundamental friction of swipe typing. You're still trading a bit of precision for speed, and for some, that trade is never worth it.
I'm still not sure if most people actually want a "perfect" swipe decoder or if they've just accepted that they'll have to manually correct "the" every five sentences.
If you're building this, the question is: does the marginal gain in accuracy actually change the user experience, or are we just polishing a feature that people already tolerate?