Natural Language Processing

I have a ton of unpublished work on NLP, for the simple reason that I found absolutely no opportunities to make money from it, despite the fact that I think it’s correct, though untested. However, it just dawned on me that for the as-yet-unfinished version of Black Tree, Osmium, which will include basically my full A.I. library, a GUI will probably be unmanageable, for the simple reason that my library is enormous – you can’t have a button for everything. As a consequence, it would probably be more efficient to simply type what you want Black Tree to do, in English. This is non-trivial, but I’ve already done enough work on NLP to make it happen. As such, I thought it worthwhile to at least introduce the basic concepts.

Specifically, every sentence has a subject, a verb, and possibly an object. The subject, verb, and object can all be qualified by other words, specifically adjectives, adverbs, and quantifiers or articles (e.g., some, all, one, the). This sounds trivial and obvious, but it gives you an obvious algorithm for parsing a sentence: look for the verb, then the subject, then the object (if it exists), and then their respective qualifiers (if they exist). This reduces every sentence (ignoring multiple independent clauses for now) to a structure that contains three things, each of which could be associated with qualifiers. You can then compare a given sentence to a dataset of sentences, all stored in that format, which will give you meaning, if you simply return the set of sufficiently similar sentences. You can also give mechanical meaning to a sentence, which is my goal, by comparing a given sentence to a dataset of sentences that are associated with code.
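To make the three-slot structure concrete, here is a minimal sketch in Python. The `ParsedSentence` class and the slot-matching score are my own illustrative choices, not anything from Black Tree; a real similarity measure would also weigh the qualifiers.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ParsedSentence:
    # The three-slot structure described above; each slot can also
    # carry qualifiers (adjectives, adverbs, quantifiers).
    subject: str
    verb: str
    obj: Optional[str] = None
    qualifiers: dict = field(default_factory=dict)

def similarity(a: ParsedSentence, b: ParsedSentence) -> int:
    # Toy score: one point per matching slot.
    return sum(getattr(a, slot) == getattr(b, slot)
               for slot in ("subject", "verb", "obj"))

def most_similar(query, dataset, threshold=2):
    # Return all stored sentences that match on enough slots.
    return [s for s in dataset if similarity(query, s) >= threshold]
```

Storing the dataset’s sentences in this same format is what makes the comparison cheap: meaning is recovered by returning the stored sentences (or their associated code) whose slots sufficiently match the query’s.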

First you load the sentence into a matrix, where each row contains a word, and compare every row to your dictionary, which contains parts of speech, thereby finding the verb. As a consequence, finding the verb in a sentence can be done in constant time, in parallel. Once you find the verb, you search for its associated subject and object, which must be nouns. You’re looking for all the nouns in the sentence, which again requires looking up each word in a dictionary to obtain its part of speech. If you have an existing NLP dataset, you should be able to produce a best answer among all nouns in the sentence, for a given verb. For example, if the sentence is, “The drunk man ran to the dingy bar.”, then “man” is almost certainly going to be more frequently associated as a subject with the verb “ran” than “bar” or “drunk” (which can be a noun).
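The matrix representation and the dictionary lookup can be sketched as follows. The dictionary here is a toy stand-in for a full lexicon; note that each row’s lookup is independent of the others, which is what makes the tagging step parallelizable.

```python
# Toy part-of-speech dictionary; a real system would use a full lexicon
# (and handle ambiguous words like "drunk", tagged here as an adjective).
DICTIONARY = {
    "the": "det", "drunk": "adj", "man": "noun",
    "ran": "verb", "to": "prep", "dingy": "adj", "bar": "noun",
}

def load_matrix(sentence):
    # One word per row; each row is paired with its dictionary tag.
    words = sentence.lower().rstrip(".").split()
    return [(w, DICTIONARY.get(w, "unknown")) for w in words]

def find_verb(matrix):
    # Return the (row index, word) of the first verb found.
    return next((i, w) for i, (w, t) in enumerate(matrix) if t == "verb")

matrix = load_matrix("The drunk man ran to the dingy bar.")
# find_verb(matrix) → (3, "ran")
# nouns: [w for w, t in matrix if t == "noun"] → ["man", "bar"]
```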

You can already see that it should be easy to produce a score for every noun in a dataset, for every verb, which will allow you to quickly produce a best answer if you calculate the scores beforehand. If the scores are calculated beforehand and stored (this will obviously be a big dataset), then it’s a constant-runtime operation. In the worst case, you have to calculate the score for all noun-verb combinations in a new sentence, for a given verb. This can be done by finding all (or at least some) instances of that noun-verb combination in the dataset, and calculating the percentage of cases where the noun is the subject of the relevant verb (as opposed to the object). If the dataset is stored as a collection of matrices (i.e., each sentence in the dataset is a matrix, with one word per row), then this can be done in two steps for each noun, in parallel (i.e., find the sentences that contain both the noun and the verb in question). Then you apply the exact same process, this time looking for an object, which will obviously return “bar” rather than “drunk”. If you have multiple independent clauses, then you find all the verbs separately and apply an analogous process, which will still work, because you’re finding the best pairs of subject/object and verb – you’re just doing it over a longer sentence that contains multiple verbs. In all cases, parsing a single sentence for meaning is a constant-runtime search, and the total runtime will be a multiple of that constant, given by the number of subject-verb and object-verb combinations, which should be low for any reasonable sentence, especially in a business context (i.e., an instruction or query).
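The scoring step can be sketched as below. For brevity, this assumes the corpus has already been parsed into (subject, verb, object) triples; the tiny corpus here is invented for illustration, and in practice these scores would be precomputed and stored, as described above.

```python
# Hypothetical pre-parsed corpus of (subject, verb, object) triples.
corpus = [
    ("man", "ran", "race"),
    ("man", "ran", "bar"),
    ("dog", "ran", "field"),
    ("bar", "served", "man"),
]

def subject_score(noun, verb, corpus):
    # Among sentences pairing this noun with this verb, the fraction
    # where the noun is the subject (as opposed to the object).
    relevant = [(s, v, o) for s, v, o in corpus
                if v == verb and noun in (s, o)]
    if not relevant:
        return 0.0
    return sum(s == noun for s, _, _ in relevant) / len(relevant)

# "man" is the subject in both sentences pairing it with "ran":
# subject_score("man", "ran", corpus) → 1.0
# "bar" appears with "ran" only as an object:
# subject_score("bar", "ran", corpus) → 0.0
```

An analogous `object_score` (counting `o == noun` instead) would pick out “bar” as the object, mirroring the second pass described above.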

Ultimately, what I’m planning to do is have ML code generated from NLP instructions, using already-written ML snippets that are modified by the user’s instructions. This is literally what the GUI for Black Tree does: it generates tailored ML code from template code, as modified by the user’s selections. I would in this case be substituting a set of instructions taken from an English sentence for the GUI. This is not trivial, but it’s not that hard, at least using this methodology.
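As a final sketch, the template-filling step might look like the following. The verbs, templates, function names, and parameters here are all invented for illustration – they are not Black Tree’s actual templates – but they show the shape of the mapping: the parsed verb selects a snippet, and the qualifiers fill in its parameters.

```python
# Hypothetical verb-to-template table; the snippets and their
# parameters are placeholders, not real Black Tree code.
TEMPLATES = {
    "cluster":  "labels = kmeans(data, k={k})",
    "classify": "model = nearest_neighbor(train, labels, k={k})",
}

def generate_code(parsed_verb, params):
    # The parsed verb selects the snippet; the sentence's qualifiers
    # (here reduced to a params dict) fill in its blanks.
    template = TEMPLATES.get(parsed_verb)
    if template is None:
        raise ValueError(f"no template for verb {parsed_verb!r}")
    return template.format(**params)

# generate_code("cluster", {"k": 5}) → "labels = kmeans(data, k=5)"
```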


Discover more from Information Overload
