Alex Strick van Linschoten committed

Commit: 5002a2b
Parent(s): 02f3e99

update text

Files changed:
- app.py (+3 −1)
- article.md (+37 −3)
app.py CHANGED

```diff
@@ -206,7 +206,9 @@ demo = gr.Interface(
             label="Confidence",
             optional=False,
         ),
-        gr.inputs.Checkbox(
+        gr.inputs.Checkbox(
+            label="Analyse and extract redacted images", default=True
+        ),
     ],
     outputs=[
         gr.outputs.Textbox(label="Document Analysis"),
```
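The checkbox added in this commit toggles the second (object detection) stage of the app. A minimal sketch of how the predict function might branch on that flag; the function and field names here (`analyse_document`, `is_redacted`, `boxes`) are illustrative placeholders, not the app's actual code:

```python
def analyse_document(pages, run_extraction: bool):
    """Mimic the two-stage pipeline: classify every page, then
    optionally run object detection on the redacted subset."""
    # Stage 1: classification -- a stub standing in for the fastai model.
    redacted = [p for p in pages if p.get("is_redacted")]
    report = f"{len(redacted)} of {len(pages)} pages contain redactions"
    if not run_extraction:
        # Checkbox unticked: skip the object detection stage entirely.
        return report, redacted, None
    # Stage 2: object detection -- a stub standing in for the IceVision model.
    boxes = {p["page"]: p.get("boxes", []) for p in redacted}
    return report, redacted, boxes


pages = [
    {"page": 1, "is_redacted": False},
    {"page": 2, "is_redacted": True, "boxes": [(10, 10, 50, 30)]},
]
report, subset, boxes = analyse_document(pages, run_extraction=True)
print(report)         # 1 of 2 pages contain redactions
print(sorted(boxes))  # [2]
```

Gradio passes the checkbox's boolean straight through as an argument to the wrapped function, which is why adding the input only required the three lines in the diff above.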
article.md CHANGED

```diff
@@ -6,7 +6,34 @@ models out in the world as some kind of demo or application.
 [Huggingface Spaces](https://huggingface.co/spaces) makes it super easy to get a
 prototype of your model on the internet.
 
-This
+This MVP app runs two models to mimic the experience of what a final deployed
+version of the project might look like.
+
+- The first model (a classification model trained with fastai, available on the
+  Huggingface Hub
+  [here](https://huggingface.co/strickvl/redaction-classifier-fastai) and
+  testable as a standalone demo
+  [here](https://huggingface.co/spaces/strickvl/fastai_redaction_classifier)),
+  classifies and determines which pages of the PDF are redacted. I've written
+  about how I trained this model [here](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).
+- The second model (an object detection model trained using [IceVision](https://airctic.com/) (itself
+  built partly on top of fastai)) detects which parts of the image are redacted.
+  This is a model I've been working on for a while and I described my process in
+  a series of blog posts (see below).
+
+This MVP app does several things:
+
+- it extracts any pages it considers to contain redactions and displays that
+  subset as an [image carousel](https://gradio.app/docs/#o_carousel). It also
+  displays some text alerting you to which specific pages were redacted.
+- if you click the "Analyse and extract redacted images" checkbox, it will:
+  - pass the pages it considered redacted through the object detection model
+  - calculate what proportion of the total area of the image was redacted as
+    well as what proportion of the actual content (i.e. excluding margins etc
+    where there is no content)
+  - create a PDF that you can download that contains only the redacted images,
+    with an overlay of the redactions that it was able to identify along with
+    the confidence score for each item.
 
 ## The Dataset
 
@@ -14,14 +41,21 @@ I downloaded a few thousand publicly-available FOIA documents from a government
 website. I split the PDFs up into individual `.jpg` files and then used
 [Prodigy](https://prodi.gy/) to annotate the data. (This process was described
 in
-[a blogpost written last
+[a blogpost written last
+year](https://mlops.systems/fastai/redactionmodel/computervision/datalabelling/2021/09/06/redaction-classification-chapter-2.html).)
+For the object detection model, the process was quite a bit more involved and I
+direct you to the series of articles referenced below in the 'Further Reading' section.
 
 ## Training the model
 
-I trained the model with fastai's flexible `vision_learner`, fine-tuning
+I trained the classification model with fastai's flexible `vision_learner`, fine-tuning
 `resnet18` which was both smaller than `resnet34` (no surprises there) and less
 liable to early overfitting. I trained the model for 10 epochs.
 
+The object detection model is trained using IceVision, with VFNet as the
+model and `resnet50` as the backbone. I trained the model for 50 epochs and
+reached 89% accuracy on the validation data.
+
 ## Further Reading
 
 This initial dataset spurred an ongoing interest in the domain and I've since
```
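The article text above describes computing what proportion of the page, and of its actual content region, is redacted. A minimal sketch of that calculation, assuming axis-aligned `(x0, y0, x1, y1)` boxes and a fixed symmetric margin for the content region (both assumptions for illustration; overlapping boxes are not de-duplicated here):

```python
def redacted_proportions(page_w, page_h, boxes, content_margin=0.1):
    """Return the fraction of the whole page and the fraction of the
    content region (page minus a margin on each side) covered by
    redaction boxes given as (x0, y0, x1, y1) rectangles."""
    redacted_area = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
    page_area = page_w * page_h
    # Content region: shrink the page by the margin on all four sides.
    content_area = (page_w * (1 - 2 * content_margin)) * (
        page_h * (1 - 2 * content_margin)
    )
    return redacted_area / page_area, redacted_area / content_area


# One 20x50 box on a 100x100 page: 10% of the page,
# but a larger share of the 80x80 content region.
of_page, of_content = redacted_proportions(100, 100, [(0, 0, 20, 50)])
print(round(of_page, 2))     # 0.1
print(round(of_content, 3))  # 0.156
```

The content-region figure is always at least as large as the whole-page figure, which matches the article's point that excluding empty margins gives a more honest sense of how much of the document is hidden.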