At niy.ai we’re fans of the game Semantle, one of the rash of Wordle variants that appeared in 2022. It asks you to guess a word and uses distances in semantic vector space as clues to how close your previous guesses are. “Green” is similar to “Red”, partly because they often appear together, but also because either one could be replaced by the other in most sentences without too much trouble. “Green” might also be close to “Lawn”, but the range of contexts they share is more limited, so we might see a greater distance between them.
Finding words that are close to each other suggests the existence of a word that is furthest away: a lonely word with the greatest distance to its nearest neighbour. We could call this the most unique word in English, since nothing else can be substituted for it.
To find out, let’s use the Python transformers library to load BERT’s input embeddings for each English word into NumPy ndarrays:
import numpy as np
from transformers import AutoModel, AutoTokenizer

model_name = 'bert-base-uncased'
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# the input embedding layer maps each token in BERT's vocabulary to a vector
embeddings_layer = model.get_input_embeddings()
embeddings = embeddings_layer.weight.detach().cpu().numpy()

# returns an array of English dictionary words
english_words = get_english_words('words.csv')
# returns the embeddings (and words) that match english_words
filtered_embeddings, filtered_words = filter_embeddings_and_vocab(embeddings, tokenizer.get_vocab(), english_words)
Implementing get_english_words() and filter_embeddings_and_vocab() is left as an exercise for the reader.
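If you’d rather not do the exercise, here’s one possible sketch of those two helpers. It assumes words.csv holds one word per line, and it filters BERT’s vocabulary down to whole-word tokens that appear in the dictionary list (dropping the ‘##’ subword pieces); your own versions may well differ.

def get_english_words(path):
    # assumed format: one lowercase dictionary word per line
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]

def filter_embeddings_and_vocab(embeddings, vocab, english_words):
    # vocab maps token string -> row index into the embeddings matrix
    wanted = set(english_words)
    kept = [(token, idx) for token, idx in vocab.items()
            if token in wanted and not token.startswith('##')]
    indices = [idx for _, idx in kept]
    return embeddings[indices], [token for token, _ in kept]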
Now we can find our most unique word:
from sklearn.metrics.pairwise import cosine_distances

# Find the word that has the greatest cosine distance from its nearest neighbour.
def find_most_different_word(embeddings, words):
    # get the distance of every word to every other word as a 2D NumPy ndarray
    distances = cosine_distances(embeddings)
    # the diagonal is the distance of each word to itself, so 'inf' it out
    np.fill_diagonal(distances, np.inf)
    # take the smallest distance for each row, then get the index of the largest of those
    min_distances = distances.min(axis=1)
    most_different_index = np.argmax(min_distances)
    return words[most_different_index], most_different_index, min_distances[most_different_index]
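Calling it on the filtered vocabulary from earlier looks something like this:

most_different_word, most_different_index, min_distance = find_most_different_word(filtered_embeddings, filtered_words)
print(most_different_word, min_distance)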
And the result is: “than”, with a minimum cosine distance to its nearest word of 0.6315.
We can use scikit-learn’s NearestNeighbors class to find the closest friends of “than”.
from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=9, metric='cosine').fit(filtered_embeddings)
# query with the vector for "than"; the first hit is the query word itself, so skip it
distances, indices = nbrs.kneighbors([filtered_embeddings[most_different_index]])
similar_words = [filtered_words[idx] for idx in indices[0][1:]]
Running this we get:
- compared
- comparable
- influencing
- versus
- comparatively
- exceed
- yellowstone
- markedly
The first six make sense; “than” is used when comparing things, so words that also describe relations between things are semantically nearby. They’re not as close as other groups of friend words, though, partly because none of them can grammatically be used as a direct replacement for it in a sentence.
“Yellowstone”, however, is a bit of a puzzler. The only dictionary entry is for the river and National Park in the USA. My best guess is that it’s a rare enough word in the training corpus that its examples happened to be disproportionately used with comparatives, such as “Yellowstone is more than…”, “older than…”, “the biggest…” etc., and not consistently with much else.
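One quick way to poke at this theory, purely as a sketch reusing the NearestNeighbors index we already fitted, is to look at which words sit closest to “yellowstone” itself and see whether comparatives dominate there too:

yellowstone_index = filtered_words.index('yellowstone')
_, ys_indices = nbrs.kneighbors([filtered_embeddings[yellowstone_index]])
# comparative-heavy neighbours here would support the comparatives theory
print([filtered_words[idx] for idx in ys_indices[0][1:]])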
When I shared this with my local AI group, someone pointed out that Claude recently went off on a random Yellowstone bender, which could probably be explained by this phenomenon.
But perhaps someone out there can come up with a better theory yellowstone that?