May 25, 2026
The Canadian Patent Database does not (really) supply language metadata for its patent descriptions. That kind of metadata could be useful for all sorts of patent-related analytics.
To generate this metadata, I used two Python scripts: one that does a first pass of automated language detection and another that facilitates manual annotation for particularly unclear patents. The first script was capable of processing 30.4 GB of patent text data (839,882 patents, 8.5 million CSV rows) in under 15 minutes on a laptop (i5-1135G7, 16 GB RAM).
I found 407 patents (0.05% of all patents) whose detected language differs from the one specified by the Canadian Patent Database's “Language of Filing” field. You can download those 407 patents here: mismatches.csv (8 KB).
In the rest of this post, I describe my foray into automated language detection with the Python packages langdetect, lingua, and duckdb, and some of the curious idiosyncrasies of the Canadian Patent Database.
Code: https://github.com/pvelayudhan/patent-language-detection
I have no expertise regarding the Canadian Patent Database or patents in general.
RAGing has become a popular bullet in contemporary data science job postings. In a nutshell, the process involves embedding pieces of useful information into some vector space, finding which of those embedded pieces are similar to a user’s prompt embedded in that space, and then tacking on those pieces to the prompt to help an LLM produce a higher quality answer. To try it out for myself, I set my sights on Canadian patent data. I assume the Canadian Patent Database is powered by lexical (literal text matching) rather than semantic search, so building a vector database for it might actually be useful for finding relevant patent data.
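To make the retrieve-then-augment loop described above concrete, here is a toy sketch in plain Python. The bag-of-words `embed` function is only a stand-in I am assuming for illustration (a real pipeline would use a learned embedding model), and the function names and sample texts are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (an assumption, not a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


# Illustrative sample texts, not real database rows
documents = [
    "liquid fractioning through hollow fibers",
    "liquid line shock absorber",
    "welded steel assemblies with high energy density",
]
question = "which patent uses hollow fibers"

# Retrieve the closest text and tack it on to the prompt
context = retrieve(question, documents)
prompt = "Context: " + " ".join(context) + "\nQuestion: " + question
print(prompt)
```

A lexical search would need the exact words to match as well, so this toy does not show the benefit of semantic embeddings; it only shows the shape of the retrieve-and-append step.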
Downloading Canadian patent data was very easy. “Disclosures” in particular seemed to represent thorough descriptions of the inventions being patented and thus seemed like the most useful thing to build a vector database from.
I downloaded all the patent disclosures I could in XML format, got scared of parsing XML data, realized that everything was also provided as a CSV (PT_disclosure.zip), and then downloaded all of that instead.
PT_disclosure.zip unzips from a spooky 8 GB zip file into a ghoulish 30 GB pipe-delimited CSV file. The Unix command tr makes it easy to swap those pipes with new lines and neatly read the long, bilingual column names of this CSV without opening it in a spreadsheet editor.
# IN:
head -1 PT_disclosure.csv | tr '|' '\n'
# OUT:
Patent Number - Numéro du brevet
Disclosure text sequence number - Texte de la divulgation numéro de séquence
Language of Filing Code - Langue du type de dépôt
Disclosure Text - Texte de la divulgation
At the start of my exploration, I realized that many of the descriptions were OCR jank. Here’s an excerpt of the description from patent 1023671 as an example:
“…de WES~ERN E~ECTRIC CO~PANY publi~ le ~ avril 1960). ~a~s ces der- niers appareils, lorsqu~on traite de~ l~quideæ, nota~ment en ultra~ filtration ou osmose inverse”
I dismissed the possibility that I was just worse at French than I thought upon seeing a few similarly funky English patents. ChatGPT assured me that for the purposes of creating vector embeddings, these sorts of artifacts shouldn’t mess with things too badly. I’m pretty sure OpenAI would go bankrupt if they tried to clean all of this text with LLMs, so I decided to just leave it alone for the time being.
When constructing a vector database with the Python package langchain, users have the option to specify metadata for each piece of information (or “document”) being embedded. I figured filing date, patent language, and of course patent number would all be reasonable pieces of metadata to keep associated with each patent description. But after some initial exploration, I noticed something unusual with the “Language of Filing” field that comes attached to each patent.
Consider Patent 1023671. It has an English title, “Liquid fractioning throught hallow fibers located between a wound waterproof strip”, and a French title, “Fractionnement de fluides par fibres creuses disposees entre les spires d’une bande etanche”. The inventors are “Jean Roget”, “Michel Salmon”, and “Bernard Vogt” of Rhône-Poulenc, a French chemical and pharmaceutical company from a hundred years ago that eventually got swallowed up into modern day Sanofi. The abstract is available only in French. The claims are available only in French. The description is available only in French. While all the signs point to it being très français, the metadata field of “Language of Filing” is listed as English. What happened? The “Language of Filing” field is officially described as follows:
“This field indicates whether the document is available in English or French. The LANGUAGE OF FILING field applies only to applications open to public inspection and patents granted on or after August 15, 1978.”
This patent was issued on January 3rd, 1978, which is of course earlier than the August 15 cutoff mentioned in the field’s description. But that cutoff date alone didn’t seem to explain everything; similar mismatches seemed to exist well after the 70s. For example, consider patent 1252295, which has a filing date of March 15th, 1985 and an issued date of April 11th, 1989. This patent also has a “Language of Filing” value of English, despite being available only in French in the Canadian Patent Database.
These mismatches may have instead had something to do with them having earlier associated English applications filed in other countries; patents 1023671 and 1252295 both have English applications filed in the US prior to their Canadian applications. Perhaps this field prioritizes the languages of earlier associated applications.
Another possible explanation can be found elsewhere in the bibliographic help page of the Canadian Patent Database:
“CIPO receives abstracts in both official languages from WIPO as they are filed under the Patent Cooperation Treaty (PCT). However, on occasion, modifications are made to applications after the initial filing. This could lead to discrepancies on the database as no updates are made.”
This disclaimer is only made for abstracts, but it’s possible that it applies to the rest of the patent, too.
Whatever the true cause of these mismatches may be, these observations led me to the main assumption I adopted to build out the rest of this post:
The “Language of Filing” field doesn’t reliably specify the language of a patent’s documents.
I would imagine that CIPO is aware of such language mismatches, but nobody has gotten around to resolving them yet. The publicly downloadable Canadian Patent Database has data for 839,882 patents. At 2 seconds per patent, verifying all the languages would require 467 hours of labour ($8,468.81 at the federal minimum wage). I could understand keeping a task like that low on the priority list.
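The labour estimate above is simple enough to check in a few lines. This sketch only uses the numbers quoted in the paragraph; the hourly wage is not stated in the post, so it is backed out from the quoted dollar figure rather than assumed.

```python
PATENT_COUNT = 839_882
SECONDS_PER_PATENT = 2
QUOTED_COST = 8_468.81  # dollar figure quoted above

# Total checking time in hours at 2 seconds per patent
hours = PATENT_COUNT * SECONDS_PER_PATENT / 3600

# Hourly wage implied by the quoted cost (derived, not an official figure)
implied_hourly_wage = QUOTED_COST / hours

print(f"{hours:.1f} hours")
print(f"${implied_hourly_wage:.2f} per hour")
```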
Using the power of Python scripting, I decided to try correcting these mismatches myself. How hard could it be?
It turns out that doing automated language detection in Python, even on OCR garbled text, is very easy in 2026. Doing it perfectly, well, that is a different story.
The package langdetect was the most popular option I could find for automating language detection.
from langdetect import detect_langs
# 35 characters
print(detect_langs("Bonjour, comment ça va aujourd'hui?"))
spits out this:
[fr:0.9999960826391974]
Over 99% confidence in the phrase being French.
Even with a little bit of jazz, langdetect gets the right answer:
# 35 characters
print(detect_langs("B~njour, c~mment ça va au~our!'hui?"))
[fr:0.9999974661548219]
But it isn’t perfect; strings that are too short can trip it up quite easily:
# 14 characters
print(detect_langs("comment ça va?"))
[ca:0.9999966352998761]
I suppose that phrase did feel pretty Catalan-y.
OK, no problem, I thought. I just need to give langdetect a big enough window of text, and this would not be an issue.
The first thing I had to be mindful of was that PT_disclosure.csv (the file containing all the patent descriptions provided by the Canadian Patent Database) has over 8.5 million rows. If I told Bane that 8.5 million was a lot of rows, I’m sure he would clarify that I was alone in holding that opinion. I would need a plan that wasn’t going to make my laptop burst into flames or take 600 years to finish.
Another thing I had to consider was that the description of each patent was broken up over multiple rows. You can tell what part out of the whole description a row corresponds to for a patent based on the “Disclosure text sequence number” / “Texte de la divulgation numéro de séquence” column.
With these two considerations at the front of my mind, I devised the following plan:
Use langdetect to check if a fixed-length window of text was English or French with over 95% certainty
Save the langdetect-detected languages as the output CSV
I am no DuckDB expert, but I figured it would make life easier by facilitating some data streaming wizardry that would prevent me from needing to load the entire 30.4 GB CSV into memory. Truth be told, I was originally thinking of doing everything with pandas, but ChatGPT convinced me that this was for slow people. I wanted to be fast people, so duckdb it was. At the moment of writing this, I must confess that I could not tell you much about what is going on under the hood for either of these packages. I’ve placed learning that neatly on the to-do list.
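The fixed-length window step of the plan can be sketched without any language detection library at all. In this sketch, `score` is a hypothetical callable that maps a piece of text to a dictionary of language probabilities (in the real script that role is played by langdetect), and `fake_score` is a made-up stand-in so the example runs on its own.

```python
def iter_letter_windows(text, size=50):
    """Yield consecutive fixed-length windows that contain at least one letter."""
    for start in range(0, len(text), size):
        window = text[start:start + size]
        if any(c.isalpha() for c in window):
            yield window


def first_confident_window(text, score, thresh=0.95, size=50):
    """Return (window, language) for the first window scoring at least `thresh`.

    `score` is a hypothetical callable: text -> {language: probability}.
    Falls back to the first `size` characters and 'UN' (unknown).
    """
    for window in iter_letter_windows(text, size):
        probs = score(window)
        if not probs:
            continue
        lang, prob = max(probs.items(), key=lambda kv: kv[1])
        if prob >= thresh:
            return window, lang
    return text[:size], "UN"


def fake_score(window):
    """Made-up scorer: pretends any window containing 'le ' is French."""
    if "le " in window:
        return {"FR": 0.99, "EN": 0.01}
    return {"EN": 0.6, "FR": 0.4}


print(first_confident_window("1234 5678 le brevet", fake_score, size=10))
```

Skipping letterless windows up front avoids handing the detector text it cannot score, and the threshold keeps low-confidence guesses from being reported as a language.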
I used a lot of free-tier AI chatbot help to get the code I used here working. That being said, I have verified and understood all the code that I’m sharing, so it is as certified non-slop as code I write with AI assistance can be. Yes, I am just copy-pasting into chatbot websites and have not yet gone full gastown or whatever. If you’d like, you can see my unorganized conversations here, but they are extremely unflattering to my technical and problem-solving abilities so I would prefer that you don’t click on them. When I still had to do math tests, I’d often reach for my calculator even for things like at times. I have no idea why. I think that behaviour still shows in these links.
Maybe all the code I generated could have been created all at once with just one good prompt, but where’s the fun and educational value in that?
I eventually built up a language detection script to the point where I was able to process and visualize language detection results for the first 10,000 rows of patent data in PT_disclosure.csv. While it looked like the language detection was working, things were still pretty slow. 100,000 rows took 131 seconds, meaning all 8.5 million rows would take about 3 hours.
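The three-hour figure is a straight linear extrapolation from the sample timing. A minimal sketch of that arithmetic, assuming runtime scales linearly with row count (the function name is mine):

```python
def estimate_total_seconds(sample_rows, sample_seconds, total_rows):
    """Linearly scale a timing measured on a sample up to the full row count."""
    return sample_seconds * total_rows / sample_rows


# 100,000 rows took 131 seconds; the full file has about 8.5 million rows
first_attempt = estimate_total_seconds(100_000, 131, 8_500_000)
print(f"{first_attempt / 3600:.1f} hours")
```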
Here’s where the script was at (don’t laugh):
import duckdb
import pandas as pd
import time
from langdetect import detect, detect_langs
from multiprocessing import Pool, cpu_count
from langdetect import DetectorFactory
DetectorFactory.seed = 0
input_file = '100000_row_sample.csv'
data_dir = 'csv-data/'
output_file = data_dir + 'lang_detected_' + input_file
input_file = data_dir + input_file
BATCH_SIZE = 200
def detect_language(text, thresh=0.95):
    """
    Detect whether text is in English or in French.

    Examines a 50-char long sliding window of text to see if EN or FR have
    at least a P = `thresh` match from the `detect_langs()` function. If a
    match is found, return the window that triggered the match and the lang.
    If no match is found, return the first 50 text chars and 'UN'.

    Text comes from the following column in PT_disclosure.csv:
    Disclosure Text - Texte de la divulgation
    """
    try:
        start, end = 0, 50
        while start < len(text):
            window = text[start:end]
            # If there are no letters, move the sliding window manually or
            # else detect_langs() will throw an exception
            if any(c.isalpha() for c in window):
                results = detect_langs(window)
                lang_dict = {r.lang.upper(): r.prob for r in results}
                p_en = lang_dict.get('EN', 0)
                p_fr = lang_dict.get('FR', 0)
                if p_en >= thresh or p_fr >= thresh:
                    return window, detect(text).upper()
            start += 50
            end += 50
        # exhausted all windows
        return text[0:50], 'UN'  # unknown
    except Exception:
        return text[0:50], 'EX'  # exception
def process_row(row):
    patent_num, filed_lang, full_text = row
    # detect_language() returns a (window, language) pair
    detected_text, detected_lang = detect_language(full_text)
    lang_mismatch = filed_lang.upper() != detected_lang.upper()
    return (patent_num,
            filed_lang,
            detected_text,
            detected_lang,
            lang_mismatch)
con = duckdb.connect()
con.execute(f"""
    CREATE VIEW patent_texts AS
    SELECT
        "Patent Number - Numéro du brevet"::INTEGER AS patent_num,
        ANY_VALUE(
            "Language of Filing Code - Langue du type de dépôt"
        ) AS filed_lang,
        string_agg(
            "Disclosure Text - Texte de la divulgation",
            ' '
            ORDER BY "Disclosure text sequence number - Texte de la divulgation numéro de séquence"
        ) AS full_text
    FROM read_csv('{input_file}',
        delim='|',
        columns={{
            'Patent Number - Numéro du brevet': 'INTEGER',
            'Disclosure text sequence number - Texte de la divulgation numéro de séquence': 'INTEGER',
            'Language of Filing Code - Langue du type de dépôt': 'VARCHAR',
            'Disclosure Text - Texte de la divulgation': 'VARCHAR'
        }},
        ignore_errors=true)
    GROUP BY patent_num
""")
start = time.time()
last_patent = None
with Pool(cpu_count() - 1) as pool:  # leave 1 core free
    while True:
        # Only fetch patents after the last one processed, so that each
        # iteration moves on to the next batch instead of repeating the first
        batch = con.execute(f"""
            SELECT patent_num, filed_lang, full_text
            FROM patent_texts
            WHERE patent_num > {-1 if last_patent is None else last_patent}
            ORDER BY patent_num
            LIMIT {BATCH_SIZE}
        """).df()
        if batch.empty:
            break
        rows = list(batch.itertuples(index=False, name=None))
        results = pool.map(process_row, rows)
        out = pd.DataFrame(results, columns=['patent_num',
                                             'filed_lang',
                                             'detected_text',
                                             'detected_lang',
                                             'lang_mismatch'])
        # Write the header once, then append later batches
        out.to_csv(output_file, sep=',', index=False,
                   mode='w' if last_patent is None else 'a',
                   header=last_patent is None)
        last_patent = batch['patent_num'].iloc[-1]
        print(f"Processed up to patent {last_patent}")
print(time.time() - start)
It had multiprocessing. It used duckdb. What else could I do to make it go faster? I was running out of buzzwords to throw at it.
Performance improved a little after I used cProfile to spot the unnecessary calling of both detect_langs() and detect() in my detect_language() function. Using cProfile to find that mistake felt a bit like having an Antiques Roadshow appraiser tell me that my crumpled Gengar from a Costco 2025 Halloween Pokemon booster pack was worth negative 3 cents. I probably didn’t need to go to Antiques Roadshow to hear that.
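For anyone who has not used cProfile before, here is a minimal sketch of how a duplicated call shows up in its report. The two functions are hypothetical stand-ins that mimic the mistake; they are not the functions from my script.

```python
import cProfile
import io
import pstats


def expensive_detect(text):
    """Stand-in for a costly detection call (hypothetical)."""
    return sum(ord(c) for c in text) % 2


def detect_language_twice(text):
    """Mimics the mistake: the same expensive work is done twice per text."""
    expensive_detect(text)
    return expensive_detect(text)


# Profile 1,000 calls of the wasteful function
profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    detect_language_twice("some patent text")
profiler.disable()

# Collect the report as text, sorted by cumulative time
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats()
report = buffer.getvalue()
print(report)
```

In the report, the call count for `expensive_detect` comes out at twice that of `detect_language_twice`, which is the kind of hint that gives the redundancy away.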
Performance improved a little bit more by non-systematically fiddling with the sliding window size and batch size. I was left with a script that took around 1 minute to do language detection for 100,000 rows, making the final estimated time for 8.5 million rows somewhere around 85 minutes. I decided that was good enough and let it rip, only to be confronted by
_duckdb.OutOfMemoryException: Out of Memory Error: failed to allocate data of size 16.0 MiB (12.2 GiB/12.2 GiB used)
Aw. If you have done this kind of thing before, you may have seen this coming as a result of this step:
string_agg(
    "Disclosure Text - Texte de la divulgation",
    ' '
    ORDER BY "Disclosure text sequence number - Texte de la divulgation numéro de séquence"
) AS full_text
I guess the memory required to aggregate the separated patent disclosure pieces back together was just too much for my laptop to handle.
…is a solution to this exact problem from a classic data science textbook [1]. I have no doubt that a grizzled veteran could figure out a brilliant strategy to stream this aggregation using nothing more than the power of a TI-86. That grizzled veteran is neither me nor me + all the free-tier AI chatbots in the world.
Instead, I just stopped trying to aggregate the patent disclosure sections.
From looking at a few rows of PT_disclosure.csv here and there, it seemed that each individual patent section was long enough to supply the necessary context to determine if the patent was in English or French. My new query became:
query_result = con.execute(
    f"""
    SELECT
        "Patent Number - Numéro du brevet"::INTEGER AS patent_num,
        "Language of Filing Code - Langue du type de dépôt" AS filed_lang,
        "Disclosure Text - Texte de la divulgation" AS full_text
    FROM read_csv('{input_file}',
        delim='|',
        parallel=True,
        ignore_errors=true)
    WHERE "Disclosure text sequence number - Texte de la divulgation numéro de séquence" = 1
    """
)
This query would simply filter PT_disclosure.csv down to only the first section of each patent. With that in place, it took approximately 15 minutes to complete the language detection for the entire 30.4 GB file (839,882 patents)!
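The reason this filter is so much lighter than the aggregation is that each row can be judged on its own, so nothing needs to be held in memory. Here is a rough standard-library sketch of the same idea, assuming the column names shown earlier; the sample rows are made up, and this is not how DuckDB does it internally.

```python
import csv
import io

SEQ_COL = "Disclosure text sequence number - Texte de la divulgation numéro de séquence"
NUM_COL = "Patent Number - Numéro du brevet"


def first_sections(lines):
    """Stream pipe-delimited rows and yield only those with sequence number 1."""
    reader = csv.DictReader(lines, delimiter="|")
    for row in reader:
        if row[SEQ_COL] == "1":
            yield row


# Made-up sample standing in for PT_disclosure.csv
sample = io.StringIO(
    "Patent Number - Numéro du brevet|"
    "Disclosure text sequence number - Texte de la divulgation numéro de séquence|"
    "Language of Filing Code - Langue du type de dépôt|"
    "Disclosure Text - Texte de la divulgation\n"
    "1023671|1|EN|premier morceau\n"
    "1023671|2|EN|deuxième morceau\n"
    "1252295|1|EN|autre brevet\n"
)

kept = [row[NUM_COL] for row in first_sections(sample)]
print(kept)
```

On the real file, the default field size limit and quote handling of Python's csv module may need adjusting for very long or quote-heavy description rows.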
Some patents were flagged as having an unknown language. For example, patent 2203704:
“ GIZII fZf.GOIL /w.‘ O’ i’O I LP.rsi ‘/’Pi CQI’lfG Ol C’l?II ‘I’r%72P./Z O CL/ OP.I %Si C ‘l?/ 2!I ZD GG f’O(:E’IY, Q,I’P ca r i a cclz stoic a/ ea t her tend to- sauk ‘ ‘ ze aitt G %s be creto- tfi sto co nf u z’In O a e r tlt Eo ar Q.CG 1YN 1’LP.6 172f h2(.‘11172P 5 tot,’ i ! i d .rr e to- reuen tl s b y9′ can, ea! c tlz G -‘’foivn: r o tl s ; c: i e i, a a c c ca I’r, ‘ tot,’ i, etrsy to- i tal.G aru nurr u G ,,,/, y .e u r o e a a a ux ar v u at tl o tl i zeh tio , 1 i are u reetW u eur u 2 i a e e e z t. o o cv si, 3 i a eAeu, rfar eru rf ‘tl . u c a sa cl’, I - I II-II ‘ 3 III-III ‘ , e ction a ‘t lug I V - I V ‘ y u 4 c ‘, u 8 i cv r etacG ‘tlz ‘ogb-a 2 W GIZl7 ‘ I GOfL Clef OtI I ‘ ll/ Pi T C’ / I’16P.6 l ‘I,G
Many unknown-language patents such as this one did look like they were written by a Bluetooth keyboard getting death-rolled by an alligator. Can’t do much about that.
What was much worse was that there were major mistakes in the language detection output.
For example, this text:
print(detect_langs("CA 02572869 2007-01-04, WO 2006/013242 PCT/FR2005/001543, 1, ASSEMBLAGES SOUDES A HAUTE DENSITE D'EMERGIE D'ACIERS DE, CONSTRUCTION METALLIQUE PRESENTANT UNE EXCELLENTE, TENACITE DANS LA ZONE FONDUE, ET METHODE, DE FABRICATION DE CES ASSEMBLAGES SOUDES, La présente invention concerne les constructio"))
was being detected as English with over 99.9% certainty. Why? Well, all the yelling, of course!
Lowercasing the text first:
print(detect_langs("ca 02572869 2007-01-04, wo 2006/013242 pct/fr2005/001543, 1, assemblages soudes a haute densite d'emergie d'aciers de, construction metallique presentant une excellente, tenacite dans la zone fondue, et methode, de fabrication de ces assemblages soudes, la présente invention concerne les constructio"))
switched the detected language to French, again with over 99.9% confidence. As a native English speaker, I was slightly offended, but ready to move on. It seems like others have encountered this problem before me and came up with a similar solution:
detect_langs(string.lower())
After ensuring every piece of text was lowercased first, I found a new challenge: bilingual boilerplate.
A manual glance at a new set of detected mismatches revealed hundreds (maybe thousands?) of English and French patents which started with this exact text:
DEMANDES OU BREVETS VOLUMINEUX LA PRESENTE PARTIE I)E CETTE DEMANDE OU CE BREVETS COMPRI :ND PLUS D’UN TOME. CECI EST .E TOME 1 DE 2 NOTE: Pour les tomes additionels, veillez contacter 1e Bureau Canadien des Brevets. JUMBO APPLICATIONS / PATENTS THIS SECTION OF THE APPLICATION / PATENT CONTAINS MORE THAN ONE VOLUME. THIS IS VOLUME 1 OF 2 NOTE: For additional vohxmes please contact the Canadian Patent Oi ice.
No problem; I’ll just skip the first half of each description. Surely that would fix all my problems.
SUBSTR(
    "Disclosure Text - Texte de la divulgation",
    LEN("Disclosure Text - Texte de la divulgation") / 2
) AS full_text
But this also failed, as there were many patents for which the second half of the first part of the description was similarly unhelpful text, like:
“AU00/00536,PCT/AU00/00537,PCT/AU00/00538, 25 PCT/AU00/00539, PCT/AU00/00540.PCT/AU00/00541,PCT/AU00/00542,PCT/AU00/00543, PCT/AU00/00544, PCT/AU00/00545,PCT/AU00/00547,PCT/AU00/00546,PCT/AU00/00554, PCT/AU00/00556. PCT/AU00/00557,PCT/AU00/00558,PCT/AU00/00559,PCT/AU00/00560, PCT/AU00/00561, PCT/AU0” - Patent 2414767
Or
“CA 02563058 2006-10-17 WO 2005/102353 PCT/AU2005/000561 ‘ 1″ - Patent 2563058
Darn.
langdetect is over, now lingua is my friend
It was disappointing to see langdetect achieve over 99.9% confidence (incorrectly) on snippets like the ones above. I decided to try a different package.
The most modern option I could find was lingua, which claimed to be both faster and more accurate than langdetect. The code to extract whether a text was English or French was classy:
from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.FRENCH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
lang = detector.detect_language_of("hi")
print(lang.iso_code_639_1.name)  # prints EN
I was impressed by how well it handled mixed-language text:
from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.FRENCH]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
def lang(text):
    return detector.detect_language_of(text).iso_code_639_1.name

lang("hello") # English
lang("hello bonjour") # French
lang("hello bonjour, how's it going?") # English
lang("hello bonjour, how's it going? Comment ça va?") # French
lang("Hello, how's it going? Ça va?") # English
It also seemed to be around 5 times faster and much less stochastic in its outputs than langdetect. I figured the best way to take advantage of all these new perks would be to ditch the sliding window entirely and let lingua tank the entire first part of each patent disclosure.
The lingua version of the script took only 8 minutes to run, but also revealed some new problems.
1. Descriptions that weren’t available
When patent descriptions were missing, they would be populated with a placeholder instead of nothing:
. )q La divulgation n’est pas disponible Disclosure Not Yet Available en ce moment
For example, patent 2196110. There was no abstract, no claims, and no description, which leads me to believe the most appropriate answer here is some kind of N/A. lingua gave these descriptions an “English” instead.
2. Descriptions that were mangled to oblivion
Patent descriptions such as 2439210:
CA 02439210 2003-09-05 : , v° a ‘ x ia :“ !7? 5 .r r’S ‘. f° ‘ …. fh’ ;::ø.. , ,. .p.,; f , ‘;.y, Y’:“i . ,t a .n, .S,f ,n i ‘;, v Ss;rv’ :a-. “,( ; ij. (“,r s “ s. :.h L.3v,..”.. .3(,,. …<.. s ;..; 44 a’r »‘: (.r:, 3!’..(, 5 a.”‘t ., >n, .,, a 3. “,r; ;.‘t.:,:v #, ‘.’sla l“ ii’ ?,ia va,v c .:‘. ,,.y .r.. 4 s.i ‘r ,,; ;( fr; r:r- ; “rn nr,”.tv: 5 ‘Y .r’ ., . ,. . ‘ :(f p% .a .i r - r5 p7″ $ , ‘. (o5 .,r yvt n, ‘,- .:r5.5.) ‘ > :a;“.,y:; ;. Slf fr u..5.4 lYr’ h J v .,ta.: #,tau.‘.>.r.
were getting detected seemingly arbitrarily as English or French, when an N/A would have probably been better. This was not a problem in my langdetect attempts due to setting a confidence threshold under which any text would get flagged as “unknown”. Based on my limited understanding of lingua, building a language detector object from English and French meant that the confidence over those two languages would need to sum to 1, with no room for an unknown language option.
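If the two confidences really do sum to 1, one way to bring back an "unknown" option is to require a minimum gap between them. This is only a sketch of that idea in plain Python: the function name and the 0.5 margin are my own assumptions, how the confidences are obtained is left out, and it would not help with garbled text that the detector happens to score lopsidedly.

```python
def label_with_unknown(p_en, p_fr, min_margin=0.5):
    """Return 'EN' or 'FR' only when one confidence clearly beats the other."""
    if abs(p_en - p_fr) < min_margin:
        return "UN"  # too close to call
    return "EN" if p_en > p_fr else "FR"


print(label_with_unknown(0.55, 0.45))
print(label_with_unknown(0.98, 0.02))
```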
One interesting thing about these messy descriptions was that they seemed to (at least for the ones I checked) have perfectly coherent claims and abstract sections. Switching to claim analysis might be helpful for an overall patent language field, but I was interested in creating a vector database specifically from the descriptions.
Overall, things were looking better. Many of the detected mismatches seemed to be accurate, including for patents from the 2000s like 2572485 (listed as French when the patent text is in English) or 2572654 (listed as English when the patent text is in French). Progress was being made.
Manual intervention is not cool, but occasionally necessary. To overcome challenges related to unavailable or excessively messy descriptions, I established a few hard-coded checks:
import numpy as np

def detect_language(text):
    try:
        # Unknown 1: over 50% non-alphanumeric text
        if np.mean([c.isalnum() for c in text]) < 0.5:
            return 'U1'
        # Unknown 2: The description is under 150 characters
        elif len(text) < 150:
            return 'U2'
        else:
            # `detector` is the lingua detector built earlier
            lang = detector.detect_language_of(text)
            if lang:
                return lang.iso_code_639_1.name
            else:
                # Unknown 3: lingua error?
                return 'U3'
    except Exception:
        return 'EX'
Unknown 1 and 2 would catch excessive OCR jank and missing patent description placeholders respectively. It’s important that I used .isalnum() rather than .isalpha() to minimize false hits from things like lengthy chemical formulae.
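To show how the two checks behave on their own, here is a standard-library sketch of just the Unknown 1 and Unknown 2 rules, with the thresholds from the function above. The name `flag_unknown`, the handling of empty text, and the sample strings are my own additions for illustration.

```python
def flag_unknown(text, min_alnum_fraction=0.5, min_length=150):
    """Return 'U1', 'U2', or None using the two hard-coded checks."""
    if not text:
        return "U2"  # nothing to analyse; treated as too short (assumption)
    alnum_fraction = sum(c.isalnum() for c in text) / len(text)
    if alnum_fraction < min_alnum_fraction:
        return "U1"  # mostly punctuation and whitespace
    if len(text) < min_length:
        return "U2"  # too short to trust
    return None  # pass the text on to the language detector


placeholder = ". )q La divulgation n'est pas disponible Disclosure Not Yet Available en ce moment"
jank = "~ ; , . ' " * 40
formula = "C6H12O6 " * 30

print(flag_unknown(placeholder))
print(flag_unknown(jank))
print(flag_unknown(formula))
```

The formula string is mostly digits, so it passes the .isalnum() check but would have been flagged had the check counted only letters.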
After updating and re-running the script, 638 out of 839,882 patents were flagged with an unknown language label due to either being too short or too dense with non-alphanumeric characters. Claude almost one-shotted a script to help me manually annotate these 638 patents, which was reasonably fun to work through:
I flagged 15 of those uncertain patents as French and 116 as English.
I merged the manually annotated uncertain patents with the automated detection results to get my final version of patent language data. I expect it could have been made even better by probing beyond the first part of each patent description, but as a fun thing to do for free, this is where my personal sense of work integrity permitted me to let it be.
At some point before patent 2,000,000, the entity behind Canadian patent numbers decided to start numbering patents with a 2. To avoid having a large flat gap caused by the change in patent numbering, I’ve plotted the cumulative distribution of detected mismatches here across two plots: one before the change in numbering and one after.
Patents from before the change:
Patents from after the change:
I originally wanted to include dates associated with the patent numbers in these plots. Unfortunately, the date metadata for these patents exists in some high-dimensional non-Euclidean plane of reality:
So just don’t ask about that.
Here are my observations:
In case you were wondering, those first two little steps in the English-patents-filed-as-English-patents line correspond to patents 1028923 and 1035200. Did something magical happen for those patents? I do not know. What I will say is that patent 1028923 (a “liquid line shock absorber” invented by Richard D. Dirks and Louis Blendermann of Josam Manufacturing Co.) shows up as having been filed in 1975 on Google Patents, which is 6 years after the Official Languages Act established English and French as the official languages of Canada and 5 years after the Patent Cooperation Treaty was established, enabling a unified patent application process across many different countries. But that is where I will leave that tangent.
I can’t report an accuracy or AUROC or confusion matrix or anything like that because the ground truth for all the labels is still a bit of a mystery. Ironically, there’s no way of knowing with certainty the accuracy of the automated process without doing the entire thing again manually. I suspect my results improve considerably on the “Language of Filing” field, but are still not perfect. Expanding the language detection beyond the first patent section could be a valuable next step for further improving the quality of detection.
Manual labeling revealed some sections of text that were, understandably, quite difficult to assign a language to. Should this count as French?
Patent: 2201035: 2 () i o , !!I s agit d’un système de boîte aux lettres facilitant le dépot et’le re’ r-aït du courrier, - -, et permettant de visualiser plus facilement son contenu. L’o jet est cons+itu-‘ de, trois par,ies. soit: la plaque arrière 1 ) s. fixant au mur. Ie bJ tiev ce r._cs-ptlol1 du, cGurriar (2j et ie capGt 3) rrur d un d rsitif de f rme- tu,e c’ u tier. _e vapot, r;t, fcrrletule fixé à iia par+i- sul éirf,ur_ de ia r que arriàre rGr r,ieui, pivv s sur, un r rr,_ aAe décentré ar rOipr rt –1 s ri cer,tre de aravitF, (_ errne.te.n+, t r S UIèVement , 2r ,J, _r de i-r,cepti ,.ri. - –irri r pa!! d uA r ,vr t s i u .r i- r x ntr PG’ ,- p- r+ c;, r _ _ j ra v i t e ( m e t _ I t !! a b a c I r; _!! L ( f j r ,; r i F - r r –r, r’ + ai r;ti_ rl IrvG,F_ndi ,L_il.aire C Ia plc juc? crri-re. !!ua !!2. iJ ît , i t r. ,i_ e-st er ositi n ouverte (fi v). Erl cr_cédairlt i- 12i !!evéi_ d l cnp.- t. vr , - Ir, i ‘’s iosit f da re+LFnue , j pr Or !!e !! ta r, l!!e e!! !! c v t, - .r, ‘J ‘L n- ît’er dr- r; r,p+ jor . f w ‘ ;’ ‘’ ret’‘ Ji_V_ r, rV rl’ +- nL i c’ ‘2 ldi , -G v: r dans s– r Js ,n u J r res !! d-, r,t 3l’ r rrj 2l r“ j, _j r r r. ri !! S , !! C t _ î + e r v e !! s I r r r- rf j l - u .- – r!!j U , rr ci, s Fi i s, . f - - r . } re i , r cie r ,che!! t .r sl r,r jr –<r-r . v d r-r-ptio, jr ,_ (tj /lr). Toutes ces op r2 tii-,r:s ,r t s eff r r u r de !!j?j r,. j -n? rj Irl : scj, r (Ii3. i0-1 i- i2), ‘ ; -I s t l l !! u s l r i o !! l s , . r e s
Available patent text data lies somewhere on a spectrum of OCR-messiness, ranging from perfectly legible to incomprehensible gibberish. Somewhere in between those two extremes exists text for which even a human would struggle to determine if there’s meaningful information present or not. In cases like that, I suppose the best answer is one that is tied to the downstream purpose of the language detection. If we’re trying to teach somebody French, then no, the above text probably should not count as French. If we’re trying to extract every ounce of information we can get from this database for generating the most comprehensive vector database possible, then perhaps the above text should count as French.
Something something judgement, accountability, something.
I’ll send over my code and results to CIPO and update here if anything ever comes of it. Thanks for reading!